Market Data Pipeline (ETL System)

A production-grade real-time market data ingestion and processing pipeline, designed to demonstrate strong capabilities in data engineering, streaming systems, and automated ETL workflows used in modern trading platforms.

This project complements my full HFT trading engine by providing the underlying data flow, storage, and monitoring infrastructure that a real trading system depends on.

🎯 Goal

Build a robust, scalable data infrastructure that can continuously collect, normalize, store, and monitor financial market data in real time — following the same architecture used by professional trading firms.

🧰 Tech Stack

Python — ETL logic, data normalization, ingestion workers
Airflow or Prefect — task scheduling, retry logic, observability
Kafka or Redis Streams — high-throughput real-time data queues
PostgreSQL — relational storage for historical archiving
Parquet — columnar storage for analytics
Docker — production-like isolated environment

🚀 Key Features

1. Real-Time Streaming from Market APIs

Built streaming connectors for:

Polygon.io — equities & options feed
Binance — crypto tick-level market data
Alpaca — equities quotes, trades, bars

Data is fetched continuously using async workers and pushed into a unified streaming layer (Kafka or Redis Streams).

2. Unified Normalization Layer

Every provider outputs data in a different shape.
I created a uniform schema inspired by my HFT engine’s protobuf structure:

symbol, timestamp, bid, ask, last_price, volume

Additional metadata (exchange, source, latency) is attached for audit and debugging.

Normalization ensures:

consistent timestamps
consistent key/value naming
uniform precision
uniform data types

3. ETL Workflows with Airflow / Prefect

ETL jobs run on a fully orchestrated scheduling system:

Real-time sync tasks (sub-second or minute intervals)
Backfill jobs for historical data
Retry logic with exponential backoff
SLA alerts and pipeline status monitoring
Automatic failure recovery

The DAGs mirror the operational workflow used in my HFT system.

4. High-Throughput Stream Processing

All normalized data is published to:

Kafka topics (market-data.raw, market-data.normalized)
or
Redis Streams for lighter deployments

This enables:

consumer fan-out
auditing
backpressure handling
decoupling between ingestion and storage

This architecture mirrors the event sourcing model used in my HFT system (Kafka topics for orders, executions, market-data).

5. Storage Layer

Two storage engines are supported:

PostgreSQL

Ideal for relational queries, history lookups, and TimescaleDB integration.

Typical table schema:

nginx

ticks (id, symbol, ts, bid, ask, last, volume, provider)

Parquet Files

Optimized for:

analytics
backtesting
long-term storage
machine learning pipelines

Parquet files can be loaded directly by Python, Pandas, Spark, or DuckDB.

6. Monitoring and Observability

A lightweight monitoring dashboard was built to track:

pipeline uptime
ingestion rate (ticks/sec)
API latency
error counts
job status (Airflow/Prefect)
backlog depth in Kafka/Redis

This aligns with the monitoring stack in my HFT system (Grafana + Prometheus).

               Public Market APIs
       (Polygon, Binance, Alpaca, etc.)
                       │
                       ▼
               Ingestion Workers (Python)
                       │
                       ▼
         ┌──────────Streaming Layer─────────┐
         │     Kafka / Redis Streams        │
         └──────────────────────────────────┘
                       │
                       ▼
           Normalization + ETL Processor
                       │
         ┌─────────────┴────────────────┐
         ▼                              ▼
   PostgreSQL (Timescale)         Parquet Data Lake
                       │
                       ▼
            Monitoring + Dashboard

🔗 Connection to My HFT System

This Market Data Pipeline is designed to integrate seamlessly with my full HFT trading architecture:

Uses the same normalized schema structure as my Protobuf definitions (Order, ExecutionReport, MarketData)
Feeds directly into Kafka topics used by the trading engine
Provides historical data for the Python strategy layer and backtesting framework
Supports real-time updates to the Next.js frontend dashboard
Reuses infrastructure components (PostgreSQL, Redis, Docker, Grafana, Prometheus)

In short, this ETL pipeline is the data backbone for the HFT platform.