In the software industry, automation and innovation are the two biggest areas of competition between companies. Nowadays we are fascinated by the ambitions of AI and machine learning, but the reality is that a lot of jobs and tasks are still about moving data from here to there, digital transformation, and refactoring legacy systems in big enterprise organisations. So building pipelines of all kinds is the key to accelerating how an enterprise implements its business strategy. Today's topic is only the data pipeline; remembering the time I spent with Oracle Exadata at Shanghai Telecom, it makes sense to summarise my thinking on data pipelines.
1. Challenges
Variety, velocity, and volume are the 3Vs of the growing data challenge. I remember five years on my company's Hadoop system, where a volume of 3 TB/day for 5 million Shanghai mobile users was the most difficult task. Since then big data has evolved towards both batch processing and stream processing, and even towards integrating AI components into a single pipeline. With cloud technology developing rapidly (VM -> microservice -> serverless), building a big data platform is much easier than it was five years ago: buying hardware and installing all the Hadoop components on an on-premises system was a terrible experience, whereas with AWS EMR anyone can get a Hadoop or Spark cluster in 5 minutes (a sketch follows below). However, there are too many tools on the market, whether cloud solutions or open-source solutions.
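For example, spinning up a transient Spark cluster on EMR is a single API call. A minimal sketch with boto3, using the default EMR roles; the cluster name, region, release label, and instance sizes are placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Request a small, transient cluster with Spark and Hadoop installed.
response = emr.run_job_flow(
    Name="demo-spark-cluster",                 # placeholder name
    ReleaseLabel="emr-5.10.0",                 # pick a current EMR release
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the steps finish
    },
    JobFlowRole="EMR_EC2_DefaultRole",         # default EC2 instance profile
    ServiceRole="EMR_DefaultRole",             # default service role
    VisibleToAllUsers=True,
)
print("Cluster id:", response["JobFlowId"])
```

Compared with racking servers and installing Hadoop by hand, the whole cluster lifecycle becomes an API-driven, pay-as-you-go resource.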
2. Architectural Principles
Build decoupled systems
Data set -> Store -> Process -> Store -> Analyse -> Answers (a minimal sketch follows this section).
Choose the right tool or library for the job
- the data structure of the storage
- acceptable latency
- throughput requirements
Use log-centric and secure design patterns
- immutable logs or a data lake
- protection of user data in logs, in line with the GDPR
Be cost-conscious
- Pay as you go, no hardware
- Big data != big cost
Integration with AI/ML
- using AI to answer questions
- AI/ML-based data platform
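To make the "build decoupled systems" principle concrete, here is a rough Python sketch in which every stage talks only to durable storage (an S3 bucket), never directly to another stage; the bucket name and object keys are placeholders:

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-lake"  # hypothetical bucket name

def store_raw(records):
    """Ingest: land immutable raw data; downstream stages never talk to the producer."""
    body = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    s3.put_object(Bucket=BUCKET, Key="raw/events/batch-0001.json", Body=body)

def process():
    """Process: read the raw store, derive a view, write it back to a processed store."""
    raw = s3.get_object(Bucket=BUCKET, Key="raw/events/batch-0001.json")["Body"].read()
    events = [json.loads(line) for line in raw.decode("utf-8").splitlines()]
    counts = {}
    for event in events:
        counts[event["event"]] = counts.get(event["event"], 0) + 1
    s3.put_object(Bucket=BUCKET, Key="processed/event_counts.json",
                  Body=json.dumps(counts).encode("utf-8"))

def analyze():
    """Analyse: the consumer only knows about the processed store, not the pipeline."""
    body = s3.get_object(Bucket=BUCKET, Key="processed/event_counts.json")["Body"].read()
    return json.loads(body)

store_raw([{"event": "page_view"}, {"event": "click"}, {"event": "page_view"}])
process()
print(analyze())  # answers: {"page_view": 2, "click": 1}
```

Because each stage depends only on the store in front of it, you can swap the processing engine (say, replace the Python loop with Spark on EMR) without touching the ingestion or analysis side.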
3. Simplify Big Data Processing

4. Data Temperature Characteristics

What data store should we use?
- Data structure: fixed schema, schema-free, JSON, key/value
- Access pattern: store data in the format in which you will access it (see the sketch after this list)
- Data characteristics: hot -> warm -> cold
- Cost: the right cost for the workload
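As a small illustration of the access-pattern and temperature points above, hot data that needs low-latency key/value lookups fits a store like DynamoDB, while append-only warm or cold data can sit cheaply in S3 and be scanned later by Athena or EMR; the table, bucket, and keys below are hypothetical:

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

# Hot data: low-latency key/value access, e.g. the latest session state per user.
# "user_sessions" is a hypothetical table with partition key "user_id".
sessions = dynamodb.Table("user_sessions")
sessions.put_item(Item={"user_id": "u-1001",
                        "last_seen": "2017-11-28T10:15:00Z",
                        "page": "/checkout"})
current = sessions.get_item(Key={"user_id": "u-1001"}).get("Item")

# Warm/cold data: append-only events stored cheaply in S3, partitioned by date,
# and scanned later in bulk rather than looked up one record at a time.
event = {"user_id": "u-1001", "event": "page_view", "ts": "2017-11-28T10:15:00Z"}
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/events/dt=2017-11-28/part-0001.json",
    Body=json.dumps(event).encode("utf-8"),
)
```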

Comparing AWS data store components
Building a real-time analytics system on AWS
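On AWS the usual entry point for a real-time pipeline is a streaming service such as Amazon Kinesis: producers append events to a stream, and consumers (Kinesis Analytics, Lambda, or Spark Streaming on EMR) read from it. A minimal producer sketch; the stream name and event fields are placeholders:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Every click event is appended to the stream; the partition key keeps a
# single user's events ordered on one shard. "clickstream" is a placeholder name.
event = {"user_id": "u-1001", "event": "page_view", "ts": "2017-11-28T10:15:00Z"}
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)
```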

Interactive and batch analysis
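Interactive analysis on AWS typically means ad-hoc SQL run straight over the files in S3 (Amazon Athena), while batch analysis is the scheduled, heavier processing on EMR/Spark. A minimal Athena sketch; the database, table, and output location are placeholders:

```python
import boto3

athena = boto3.client("athena")

# Ad-hoc, interactive query over data already in S3; results land in the
# given output location. "analytics" and "events" are hypothetical names.
query = athena.start_query_execution(
    QueryString="SELECT event, count(*) AS n FROM events GROUP BY event",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"},
)
print("Query execution id:", query["QueryExecutionId"])
```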

Putting it all together in an AWS data lake

Data Lake Reference Architecture
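One recurring element of the S3-centric data lake architecture is registering the raw zone as an external, partitioned table in a shared catalogue, so that Athena, EMR, and Redshift Spectrum can all read the same immutable files in place. A sketch using an Athena DDL statement; the database, table, columns, and bucket are illustrative only:

```python
import boto3

athena = boto3.client("athena")

# Register the raw zone as an external table over JSON files in S3.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.events (
  user_id string,
  event   string,
  ts      string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-data-lake/raw/events/'
"""
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"},
)
```

After new partitions are added (for example with `MSCK REPAIR TABLE` or `ALTER TABLE ... ADD PARTITION`), every engine pointed at the catalogue sees them immediately, without copying the data.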

Summary
- Build decoupled systems: data -> store -> process -> store -> analyse -> answers
- Use the right tool for the job
  - data structure
  - latency
  - throughput
  - access pattern
- Use log-centric design patterns
  - immutable logs, data lake, materialised views
- Be cost-conscious
  - big data != big cost
- Let AI/ML enable your applications
Reference: AWS (2017), Big Data Architectural Patterns and Best Practices on AWS.
