Data Lake and Practice on AWS

In the software industry, automation and innovation are two of the biggest sources of competitive advantage. Nowadays we are fascinated by the ambitions of AI and Machine Learning, but on the other side, the reality in large enterprise organisations is that a lot of the work still consists of moving data from here to there, digital transformation, and refactoring legacy systems. So building all kinds of pipelines is the key to helping an enterprise implement its business strategy faster. Today's topic is only the data pipeline; remembering the time I was using Oracle Exadata at Shanghai Telecom, it makes sense to summarise my thinking on data pipelines.

1. Challenges

Variety, Velocity and Volume are the 3Vs of the growing data challenge. I remember five years with the Hadoop system at my company, where a volume of 3 TB/day for 5 million Shanghai mobile users was the most difficult task. Now big data has evolved into both batch processing and stream processing, and even into the question of how to integrate Artificial Intelligence components into one pipeline. With cloud technology developing rapidly (VM -> Microservice -> Serverless), building a big data platform is much easier than it was five years ago. Buying hardware and installing all the Hadoop components on an on-premises system was a terrible experience; with AWS EMR, anyone can get a Hadoop or Spark cluster in 5 minutes (see the sketch below). However, there are too many tools on the market, whether cloud solutions or open source solutions.
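
As a rough illustration of how quickly a cluster can be provisioned today, here is a minimal boto3 sketch that requests a small EMR cluster with Spark installed. The region, log bucket, instance sizes and EMR release are assumptions for illustration only, not values from the talk.

    import boto3

    # Assumed region; replace with your own.
    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="demo-spark-cluster",
        ReleaseLabel="emr-6.10.0",               # assumed EMR release
        Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
        LogUri="s3://my-demo-logs/emr/",         # hypothetical log bucket
        Instances={
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",       # default EMR roles
        ServiceRole="EMR_DefaultRole",
    )
    print("Cluster id:", response["JobFlowId"])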

2. Architectural Principles

Build decoupled systems

Dataset -> Store -> Process -> Store -> Analyze -> Answers
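
To make the decoupling concrete, here is a minimal sketch of two independent stages that communicate only through S3: one stage lands the raw dataset, the other reads it, processes it and writes the result to a different prefix. The bucket and key names are hypothetical.

    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-data-lake"          # hypothetical bucket name

    def ingest(events):
        """Stage 1: store the raw dataset, untouched."""
        s3.put_object(Bucket=BUCKET, Key="raw/events.json",
                      Body=json.dumps(events).encode("utf-8"))

    def process():
        """Stage 2: read from the store, transform, write to a new store."""
        raw = s3.get_object(Bucket=BUCKET, Key="raw/events.json")
        events = json.loads(raw["Body"].read())
        answers = {"event_count": len(events)}   # trivial "analysis"
        s3.put_object(Bucket=BUCKET, Key="processed/answers.json",
                      Body=json.dumps(answers).encode("utf-8"))

Because each stage only knows about the storage layer, either side can be swapped (Lambda, EMR, Glue) without touching the other.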

Choose the right tool or library for the job

  • The data structure of the storage
  • Acceptable latency
  • Throughput requirements

Log-centric and secure patterns

  • Immutable logs or a data lake (see the sketch after this list)
  • Data protection of user logs under GDPR
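
A minimal sketch of the immutable-log idea, assuming S3 as the data lake: every event is written as a new, never-overwritten object, partitioned by date so any day can be replayed later. The bucket name and key layout are assumptions.

    import json
    import uuid
    import datetime
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-data-lake"                      # hypothetical bucket

    def append_log(event: dict) -> str:
        """Append one event as a new date-partitioned object; never overwrite."""
        now = datetime.datetime.utcnow()
        key = (f"logs/year={now:%Y}/month={now:%m}/day={now:%d}/"
               f"{uuid.uuid4()}.json")
        s3.put_object(Bucket=BUCKET, Key=key,
                      Body=json.dumps(event).encode("utf-8"))
        return key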

Be cost-conscious

  • Pay as you go, no hardware
  • Big data != big cost

Integration with AI/ML

  • Using AI to answer questions (see the sketch after this list)
  • An AI/ML-based data platform
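
One common way to use AI to answer questions on AWS is to call a deployed model from within the pipeline. The sketch below sends one record to a hypothetical SageMaker endpoint; the endpoint name and the CSV payload format are assumptions.

    import boto3

    runtime = boto3.client("sagemaker-runtime")

    def score(features):
        """Send one CSV row to a deployed model and return its prediction."""
        payload = ",".join(str(f) for f in features)
        response = runtime.invoke_endpoint(
            EndpointName="churn-model-endpoint",   # hypothetical endpoint
            ContentType="text/csv",
            Body=payload,
        )
        return response["Body"].read().decode("utf-8")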

3. Simplify Big Data Processing


4. Data Temperature Characteristics

What data store should we use?

  • Data structure: fixed schema, schema-free, JSON, key/value
  • Access pattern: store data in the format you will access it in
  • Data characteristics: hot -> warm -> cold (see the lifecycle sketch after this list)
  • Cost: the right cost for each temperature tier
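
On S3 the hot -> warm -> cold progression can be expressed as a lifecycle rule that moves ageing objects from S3 Standard to Standard-IA and then to Glacier. The bucket, prefix and day thresholds below are assumptions for illustration.

    import boto3

    s3 = boto3.client("s3")

    # Hot (Standard) -> warm (Standard-IA) -> cold (Glacier) as data ages.
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-data-lake",                   # hypothetical bucket
        LifecycleConfiguration={
            "Rules": [{
                "ID": "hot-warm-cold",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }],
        },
    )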

Comparing Amazon data store components

Building a real-time analysis system on AWS
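
The minimal building block of such a real-time system is a producer that pushes events into a Kinesis stream, from which consumers (Kinesis Data Analytics, Lambda, Spark Streaming) can read. The stream name below is an assumption.

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    def publish(event: dict):
        """Push one event into a Kinesis stream for real-time consumers."""
        kinesis.put_record(
            StreamName="clickstream",              # hypothetical stream name
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=str(event.get("user_id", "anonymous")),
        )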

Interactive and batch analysis
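
For the interactive side, Amazon Athena can query data sitting in the lake directly with SQL. The sketch below starts one query; the database, table and result bucket are assumptions.

    import boto3

    athena = boto3.client("athena")

    # Database, table and result location are hypothetical.
    query = athena.start_query_execution(
        QueryString="SELECT COUNT(*) FROM events WHERE year = '2018'",
        QueryExecutionContext={"Database": "datalake"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    print("Query id:", query["QueryExecutionId"])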

Putting it all together in an Amazon data lake

Data Lake Reference Architecture

Summary

  • Build a decoupled system: Data -> Store -> Process -> Store -> Analyze -> Answers
  • Use the right tool for the job
    • Data Structure
    • Latency
    • Throughput
    • Access pattern
  • Use Log-centric design patterns
    • Immutable logs, data lake, materialised views
  • Be cost-conscious
    • Big data != Big cost
  • Enable your applications with AI/ML

References: AWS 2017, Big Data Architectural Patterns and Best Practices on AWS

https://www.youtube.com/watch?v=a3713oGB6Zk
