Data Engineering - GDELT Analytics ELT

As part of a certification, I built a detailed ELT (Extract, Load, Transform) project for analyzing world events across various parameters.

  • Data Engineering
  • Software Development
  • Data Analytics
  • Solution Design
  • Architecture
  • DevOps

The Problem

The goal was to analyze the GDELT global event database: events by country, the sentiment those events evoke in people, and the insights the available data provides. The approach is ELT: pull raw data from the GDELT source, load it, and analyze it after transformation.

Solution

GDELT data is loaded in batches, scaled out through the Dask task scheduler. The data is then cleaned and stored in object storage, and later loaded into an OLAP database. Once the data is loaded, it is transformed with dbt to create materialized views. Finally, I ran a range of analyses on the event data in the data visualization tool Metabase.
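
The order of the stages above (clean, load, then transform inside the database) can be sketched in a few lines. This is only an illustration under loud assumptions: SQLite stands in for ClickHouse, a plain SQL view stands in for a dbt materialized view, and the rows and column names (event_id, country, avg_tone) are made up for the example rather than taken from the GDELT schema or from the project.

```python
import sqlite3

# Hypothetical raw rows (NOT real GDELT data): (event_id, country, avg_tone)
RAW_ROWS = [
    ("1", "IN", "2.5"),
    ("2", "US", "-1.0"),
    ("2", "US", "-1.0"),  # duplicate event
    ("3", "", "4.0"),     # missing country
    ("4", "IN", "-0.5"),
]


def clean(rows):
    """Drop duplicates and rows without a country, and cast the types."""
    seen = set()
    cleaned = []
    for event_id, country, tone in rows:
        if not country or event_id in seen:
            continue
        seen.add(event_id)
        cleaned.append((int(event_id), country, float(tone)))
    return cleaned


def load_and_transform(rows):
    """Load cleaned rows, then build the aggregate inside the database."""
    conn = sqlite3.connect(":memory:")  # stand-in for the OLAP warehouse
    conn.execute(
        "CREATE TABLE events (event_id INTEGER PRIMARY KEY, country TEXT, avg_tone REAL)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    # Stand-in for a dbt model: the transformation is expressed as SQL on loaded data.
    conn.execute(
        "CREATE VIEW tone_by_country AS "
        "SELECT country, COUNT(*) AS event_count, AVG(avg_tone) AS mean_tone "
        "FROM events GROUP BY country"
    )
    return conn


conn = load_and_transform(clean(RAW_ROWS))
summary = conn.execute(
    "SELECT country, event_count, mean_tone FROM tone_by_country ORDER BY country"
).fetchall()
```

The point of the sketch is the ordering: the aggregate is defined in SQL after loading, so a dashboard tool only has to query the resulting view.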

Tech Stack

  • Cloud: Azure - an alternative to AWS; its Visual Studio subscription benefits were one of the reasons for choosing it
  • Infrastructure as code (IaC): Terraform - a very mature IaC tool that supports a lot of providers
  • Workflow orchestration: Prefect - workflows are easy and natural to set up
  • Data Lake: MinIO - an open-source alternative to S3 and one of the best open-source tools; it also performs well on latency and scale
  • Data Warehouse: ClickHouse - an OLAP database that is extremely space-efficient and performant
  • Batch Processing: Dask - a concurrent task scheduler that is well integrated with Prefect; an alternative to Spark
  • Infrastructure Orchestrator: Nomad - a Kubernetes alternative
  • Data Analytics Tool: Metabase - another open-source tool, used to visualize the data

Functional Features

  • Data can be loaded for a particular time frame
  • Metabase allows intuitive analysis of the data with SQL queries
  • Data is cleaned, and analysis can be done with multi-join queries
  • Query response time is low
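
Loading a particular time frame comes down to turning a start and end time into a list of source files. The sketch below assumes the GDELT 2.0 event exports, which are published every 15 minutes under a timestamped file name; the project may well have used a different feed, and the function name batch_urls is hypothetical.

```python
from datetime import datetime, timedelta
from typing import List

# Assumption: GDELT 2.0 event exports, one file per 15-minute slot.
BASE_URL = "http://data.gdeltproject.org/gdeltv2"
STEP = timedelta(minutes=15)


def batch_urls(start: datetime, end: datetime) -> List[str]:
    """Return one export URL per 15-minute slot in the half-open range [start, end)."""
    # Floor to the slot boundary, then move forward if that falls before `start`.
    slot = start.replace(minute=start.minute - start.minute % 15, second=0, microsecond=0)
    if slot < start:
        slot += STEP
    urls = []
    while slot < end:
        urls.append(f"{BASE_URL}/{slot:%Y%m%d%H%M%S}.export.CSV.zip")
        slot += STEP
    return urls


hour_of_files = batch_urls(datetime(2023, 1, 1, 0, 0), datetime(2023, 1, 1, 1, 0))
```

One hour maps to four files, so a full day is 96 files; that count is what the batch size and parallelism settings have to be tuned against.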

Non-Functional Features

  • Data can be loaded in parallel with multiple Dask instances
  • The project is enterprise grade: CI/CD automation, infrastructure as code, scalability, and a carefully chosen stack
  • Cost is taken into consideration: the DB and object storage are both open source and reside in the same network, so there are no egress or ingress costs, while response times stay fast
  • All tools are open source and among the best in class
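
The parallel load described above follows a common pattern: split the file list into batches and hand each batch to a worker. The sketch below uses the standard library's ThreadPoolExecutor purely as a stand-in for Dask workers, and load_batch is a hypothetical placeholder that only counts files instead of downloading, cleaning, and storing them.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def chunked(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def load_batch(batch: List[str]) -> int:
    """Hypothetical placeholder for download + clean + store; returns files handled."""
    return len(batch)


def load_in_parallel(files: Sequence[str], batch_size: int, workers: int) -> int:
    """Process batches concurrently and return the total number of files handled."""
    batches = chunked(files, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:  # stand-in for Dask workers
        return sum(pool.map(load_batch, batches))


files = [f"file-{i}" for i in range(10)]
total = load_in_parallel(files, batch_size=4, workers=3)
```

Batch size is the main tuning knob here: batches that are too small add scheduling overhead, while batches that are too large leave workers idle near the end of a run.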

Project Outcome

I learned a lot about the positives as well as the drawbacks of handling data at this scale, including choosing the batch size and accounting for network speed and data load speed. Analysis was fine, though the throughput needed to pull the actual data from the source was also a bottleneck. The project was a great success, as it let me build the whole setup from the ground up and work across every aspect of the project: engineering, product, DevOps, and data analytics.
