← Back to Projects

Market Data Pipeline (ETL System)

  • A production-grade real-time market data ingestion and processing pipeline, designed to demonstrate strong capabilities in data engineering, streaming systems, and automated ETL workflows used in modern trading platforms.

This project complements my full HFT trading engine by providing the underlying data flow, storage, and monitoring infrastructure that a real trading system depends on.

🎯 Goal

Build a robust, scalable data infrastructure that can continuously collect, normalize, store, and monitor financial market data in real time β€” following the same architecture used by professional trading firms.

🧰 Tech Stack

  • Python β€” ETL logic, data normalization, ingestion workers
  • Airflow or Prefect β€” task scheduling, retry logic, observability
  • Kafka or Redis Streams β€” high-throughput real-time data queues
  • PostgreSQL β€” relational storage for historical archiving
  • Parquet β€” columnar storage for analytics
  • Docker β€” production-like isolated environment

πŸš€ Key Features

1. Real-Time Streaming from Market APIs

Built streaming connectors for:

  • Polygon.io β€” equities & options feed
  • Binance β€” crypto tick-level market data
  • Alpaca β€” equities quotes, trades, bars

Data is fetched continuously using async workers and pushed into a unified streaming layer (Kafka or Redis Streams).

2. Unified Normalization Layer

Every provider outputs data in a different shape.
I created a uniform schema inspired by my HFT engine’s protobuf structure:

symbol, timestamp, bid, ask, last_price, volume

Additional metadata (exchange, source, latency) is attached for audit and debugging.

Normalization ensures:

  • consistent timestamps
  • consistent key/value naming
  • uniform precision
  • uniform data types

3. ETL Workflows with Airflow / Prefect

ETL jobs run on a fully orchestrated scheduling system:

  • Real-time sync tasks (sub-second or minute intervals)
  • Backfill jobs for historical data
  • Retry logic with exponential backoff
  • SLA alerts and pipeline status monitoring
  • Automatic failure recovery

The DAGs mirror the operational workflow used in my HFT system.

4. High-Throughput Stream Processing

All normalized data is published to:

  • Kafka topics (market-data.raw, market-data.normalized)
    or
  • Redis Streams for lighter deployments

This enables:

  • consumer fan-out
  • auditing
  • backpressure handling
  • decoupling between ingestion and storage

This architecture mirrors the event sourcing model used in my HFT system (Kafka topics for orders, executions, market-data).

5. Storage Layer

Two storage engines are supported:

PostgreSQL

Ideal for relational queries, history lookups, and TimescaleDB integration.

Typical table schema:

nginx
ticks (id, symbol, ts, bid, ask, last, volume, provider)

Parquet Files

Optimized for:

  • analytics
  • backtesting
  • long-term storage
  • machine learning pipelines

Parquet files can be loaded directly by Python, Pandas, Spark, or DuckDB.

6. Monitoring and Observability

A lightweight monitoring dashboard was built to track:

  • pipeline uptime
  • ingestion rate (ticks/sec)
  • API latency
  • error counts
  • job status (Airflow/Prefect)
  • backlog depth in Kafka/Redis

This aligns with the monitoring stack in my HFT system (Grafana + Prometheus).

               Public Market APIs
       (Polygon, Binance, Alpaca, etc.)
                       β”‚
                       β–Ό
               Ingestion Workers (Python)
                       β”‚
                       β–Ό
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€Streaming Layer─────────┐
         β”‚     Kafka / Redis Streams        β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                       β”‚
                       β–Ό
           Normalization + ETL Processor
                       β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β–Ό                              β–Ό
   PostgreSQL (Timescale)         Parquet Data Lake
                       β”‚
                       β–Ό
            Monitoring + Dashboard

πŸ”— Connection to My HFT System

This Market Data Pipeline is designed to integrate seamlessly with my full HFT trading architecture:

  • Uses the same normalized schema structure as my Protobuf definitions (Order, ExecutionReport, MarketData)
  • Feeds directly into Kafka topics used by the trading engine
  • Provides historical data for the Python strategy layer and backtesting framework
  • Supports real-time updates to the Next.js frontend dashboard
  • Reuses infrastructure components (PostgreSQL, Redis, Docker, Grafana, Prometheus)

In short, this ETL pipeline is the data backbone for the HFT platform.

πŸ“ˆ Performance & Scalability

  • Handles 10k+ ticks per second on commodity hardware
  • Sub-10ms ingestion latency
  • Horizontal scaling via Kafka partitions or Redis consumer groups
  • Parquet storage enables multi-GB datasets to load in milliseconds with DuckDB

πŸ›  Example Components Built

βœ” Async data ingestion workers

βœ” Normalization engine

βœ” Kafka/Redis streaming producers

βœ” Airflow/Prefect DAGs

βœ” PostgreSQL/TimescaleDB storage model

βœ” Parquet export system

βœ” Monitoring dashboard

βœ” Dockerized deployment

πŸŽ“ What This Project Demonstrates

  • Real-world data engineering and pipeline design
  • Understanding of trading data sources
  • Ability to work with streaming systems
  • Experience building fault-tolerant, scalable ingestion services
  • Integration with full trading system architecture

Gallery