Data Lake and Practice on AWS

In the software industry, automation and innovation are two of the biggest sources of competitive advantage. Nowadays we are fascinated by the ambitions of AI and Machine Learning, but on the other side, the reality in large enterprise organisations is that a lot of the work still consists of moving data from here to there, digital transformation, and refactoring legacy systems. So building all kinds of pipelines is the key to helping an enterprise implement its business strategy faster. Today's topic is only the data pipeline; remembering the time I was using Oracle Exadata at Shanghai Telecom, it makes sense to summarise my thinking on data pipelines.

1. Challenges

Variety, Velocity and Volume are the 3Vs of the growing data challenge. I remember five years with the Hadoop system at my company, where a volume of 3 TB/day for 5 million Shanghai mobile users was the most difficult task. Now big data has evolved into both batch processing and stream processing, and even into the question of how to integrate Artificial Intelligence components into one pipeline. With cloud technology developing rapidly (VM -> Microservice -> Serverless), building a big data platform is much easier than it was five years ago. Buying hardware and installing all the Hadoop components on an on-premises system was a terrible experience; with AWS EMR, anyone can get a Hadoop or Spark cluster in 5 minutes (see the sketch below). However, there are too many tools on the market, whether cloud solutions or open source solutions.
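
As a rough illustration of how quickly a cluster can be provisioned today, here is a minimal boto3 sketch that requests a small EMR cluster with Spark installed. The region, log bucket, instance sizes and EMR release are assumptions for illustration only, not values from the talk.

    import boto3

    # Assumed region; replace with your own.
    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="demo-spark-cluster",
        ReleaseLabel="emr-6.10.0",               # assumed EMR release
        Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
        LogUri="s3://my-demo-logs/emr/",         # hypothetical log bucket
        Instances={
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",       # default EMR roles
        ServiceRole="EMR_DefaultRole",
    )
    print("Cluster id:", response["JobFlowId"])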

2. Architectural Principles

Build decoupled systems

Dataset -> Store -> Process -> Store -> Analyze -> Answers
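
To make the decoupling concrete, here is a minimal sketch of two independent stages that communicate only through S3: one stage lands the raw dataset, the other reads it, processes it and writes the result to a different prefix. The bucket and key names are hypothetical.

    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-data-lake"          # hypothetical bucket name

    def ingest(events):
        """Stage 1: store the raw dataset, untouched."""
        s3.put_object(Bucket=BUCKET, Key="raw/events.json",
                      Body=json.dumps(events).encode("utf-8"))

    def process():
        """Stage 2: read from the store, transform, write to a new store."""
        raw = s3.get_object(Bucket=BUCKET, Key="raw/events.json")
        events = json.loads(raw["Body"].read())
        answers = {"event_count": len(events)}   # trivial "analysis"
        s3.put_object(Bucket=BUCKET, Key="processed/answers.json",
                      Body=json.dumps(answers).encode("utf-8"))

Because each stage only knows about the storage layer, either side can be swapped (Lambda, EMR, Glue) without touching the other.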

Choose the right tool or library for the job

  • The data structure of the storage
  • Acceptable latency
  • Throughput requirements

Log-centric and secure patterns

  • Immutable logs or a data lake (see the sketch after this list)
  • Data protection of user logs under GDPR
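
A minimal sketch of the immutable-log idea, assuming S3 as the data lake: every event is written as a new, never-overwritten object, partitioned by date so any day can be replayed later. The bucket name and key layout are assumptions.

    import json
    import uuid
    import datetime
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-data-lake"                      # hypothetical bucket

    def append_log(event: dict) -> str:
        """Append one event as a new date-partitioned object; never overwrite."""
        now = datetime.datetime.utcnow()
        key = (f"logs/year={now:%Y}/month={now:%m}/day={now:%d}/"
               f"{uuid.uuid4()}.json")
        s3.put_object(Bucket=BUCKET, Key=key,
                      Body=json.dumps(event).encode("utf-8"))
        return key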

Be cost-conscious

  • Pay as you go, no hardware
  • Big data != big cost

Integration with AI/ML

  • Using AI to answer questions (see the sketch after this list)
  • An AI/ML-based data platform
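
One common way to use AI to answer questions on AWS is to call a deployed model from within the pipeline. The sketch below sends one record to a hypothetical SageMaker endpoint; the endpoint name and the CSV payload format are assumptions.

    import boto3

    runtime = boto3.client("sagemaker-runtime")

    def score(features):
        """Send one CSV row to a deployed model and return its prediction."""
        payload = ",".join(str(f) for f in features)
        response = runtime.invoke_endpoint(
            EndpointName="churn-model-endpoint",   # hypothetical endpoint
            ContentType="text/csv",
            Body=payload,
        )
        return response["Body"].read().decode("utf-8")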

3. Simplify Big Data Processing


4. Data Temperature Characteristics

What data store should we use?

  • Data structure: fixed schema, schema-free, JSON, key/value
  • Access pattern: store data in the format you will access it in
  • Data characteristics: hot -> warm -> cold (see the lifecycle sketch after this list)
  • Cost: the right cost for each temperature tier
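
On S3 the hot -> warm -> cold progression can be expressed as a lifecycle rule that moves ageing objects from S3 Standard to Standard-IA and then to Glacier. The bucket, prefix and day thresholds below are assumptions for illustration.

    import boto3

    s3 = boto3.client("s3")

    # Hot (Standard) -> warm (Standard-IA) -> cold (Glacier) as data ages.
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-data-lake",                   # hypothetical bucket
        LifecycleConfiguration={
            "Rules": [{
                "ID": "hot-warm-cold",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }],
        },
    )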

Comparing Amazon data store components

Building a real-time analysis system on AWS
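
The minimal building block of such a real-time system is a producer that pushes events into a Kinesis stream, from which consumers (Kinesis Data Analytics, Lambda, Spark Streaming) can read. The stream name below is an assumption.

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    def publish(event: dict):
        """Push one event into a Kinesis stream for real-time consumers."""
        kinesis.put_record(
            StreamName="clickstream",              # hypothetical stream name
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=str(event.get("user_id", "anonymous")),
        )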

Interactive and batch analysis
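
For the interactive side, Amazon Athena can query data sitting in the lake directly with SQL. The sketch below starts one query; the database, table and result bucket are assumptions.

    import boto3

    athena = boto3.client("athena")

    # Database, table and result location are hypothetical.
    query = athena.start_query_execution(
        QueryString="SELECT COUNT(*) FROM events WHERE year = '2018'",
        QueryExecutionContext={"Database": "datalake"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    print("Query id:", query["QueryExecutionId"])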

Putting it all together in an Amazon data lake

Data Lake Reference Architecture

Summary

  • Build a decoupled system: Data -> Store -> Process -> Store -> Analyze -> Answers
  • Use the right tool for the job
    • Data Structure
    • Latency
    • Throughput
    • Access pattern
  • Use Log-centric design patterns
    • Immutable logs, data lake, materialised views
  • Be cost-conscious
    • Big data != Big cost
  • Enable your applications with AI/ML

References: AWS 2017, Big Data Architectural Patterns and Best Practices on AWS

https://www.youtube.com/watch?v=a3713oGB6Zk
