
Optimizing Tick Data Storage: From SQLite to Parquet

Peter Bieda


Every engineer who works with market data eventually hits a wall. For me, it happened at around 50 GB — the moment my laptop started sounding like a jet engine and my disk I/O turned into a slow-motion torture test.

I had built a simple tick data collector using SQLite. It worked fine at first: the schema was clean, inserts were fast enough, and querying the database for backtests was convenient.

Then I let it run for three weeks.

Suddenly:

  • Write speeds slowed to a crawl
  • Backtests were taking minutes instead of seconds
  • The database file ballooned to ~50 GB
  • Any VACUUM or index rebuild locked the entire database

It wasn’t sustainable. I needed something more efficient — not only in storage, but in read speed, compression, and scalability.

That’s when I switched to columnar data storage, specifically Apache Parquet. That single decision turned 50 GB of raw tick data into roughly 3 GB, with drastically faster read times. And more importantly, it changed how I think about data engineering for trading.

Here’s the breakdown of that journey: what went wrong with SQLite, why Parquet works so much better for market data, and how I redesigned the entire pipeline around it.

Why SQLite Was My First Choice

SQLite gets used in more trading projects than people admit.

It’s attractive because:

  • It requires no server
  • It supports fast ACID-compliant writes
  • It stores everything in one file
  • It integrates nicely with Python
  • It’s perfect for prototyping or local analysis

My schema looked roughly like this:

CREATE TABLE ticks (
    ts REAL,
    symbol TEXT,
    bid REAL,
    ask REAL,
    bid_size REAL,
    ask_size REAL
);

And writing rows from Python was straightforward:

cursor.execute("INSERT INTO ticks VALUES (?, ?, ?, ?, ?, ?)", row)

For the first few million rows, it was gloriously simple.

Then the problems started.

Problem #1: SQLite Is Not Designed for High-Throughput Tick Logs

Financial tick data grows absurdly fast:

  • 10,000 ticks/second
  • 1 trading day → millions of rows
  • Multiple instruments → multiply all of that by every symbol you track

SQLite is optimized for:

  • Moderate read/write loads
  • Small-to-medium datasets
  • Simple transaction patterns

But high-frequency tick-level inserts exposed its bottlenecks:

Row-by-row inserts hammered I/O

Each insert forces SQLite to update internal structures. Even with optimizations like batched transactions, the disk overhead became significant.
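
Roughly, the batching looked like this; a minimal sketch, with the batch itself standing in for whatever your feed handler buffers:

import sqlite3

conn = sqlite3.connect("ticks.db")
batch = []  # fill with (ts, symbol, bid, ask, bid_size, ask_size) tuples from the feed

with conn:  # one commit per batch instead of one per tick
    conn.executemany(
        "INSERT INTO ticks VALUES (?, ?, ?, ?, ?, ?)",
        batch,
    )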

Indexes made inserts slower

To query efficiently, I needed indexes on ts and symbol.
But indexing millions of rows made inserts painfully slow.

File size grew uncontrollably

Even plain, uncompressed storage of floats and timestamps bloated the file quickly.

VACUUM locked the database

Reclaiming empty pages required exclusive locks. On a 50 GB file, this took hours.

SQLite wasn’t broken — I was simply using it for a workload it wasn’t meant to handle.

Problem #2: Backtesting Performance Was Unacceptable

When I tried to load several months of tick data into Pandas for backtesting, the performance was terrible.

A simple query like:

SELECT * FROM ticks WHERE symbol = 'ES' ORDER BY ts;

became slow because:

  • SQLite had to read entire rows (row-major layout)
  • Text-based timestamp storage added overhead
  • I/O latency increased as the file grew
  • Python’s DB-API added serialization/deserialization cost

Loading a month of data into a DataFrame took 20–25 seconds.

That might be fine for occasional analysis, but not for iterative research where I may run dozens of backtests per hour.

Discovering Parquet: The Perfect Match for Tick Data

At some point, every quant engineer reaches the same conclusion:

Tick data isn’t row-oriented — it’s column-oriented.
We query by:

  • time filtering
  • symbol filtering
  • specific fields (bid, ask, midprice, volume)
  • time windows

This is exactly what columnar formats like Parquet were designed for.

Why Parquet Rocks for Market Data

Highly compressed
Compression ratios of 10x–20x are normal due to repeated values and columnar layout.

Columnar reads
If you only need bid/ask, Parquet reads just those columns, while SQLite has to read entire rows (see the short sketch after this list).

Vectorized I/O
Parquet is optimized for high-throughput scanning, perfect for backtesting.

Metadata-rich
Column statistics, schema evolution, and file partitioning are all first-class features.

Works seamlessly with Pandas, Polars, PyArrow, DuckDB
Load whole datasets in seconds, not minutes.

Immutable & append-friendly
Great for incremental tick ingestion.
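
To make the columnar-read point concrete, here is a minimal PyArrow sketch; the file name assumes the per-symbol export described in the next section:

import pyarrow.parquet as pq

# Only the bid/ask column chunks are read; ts and the size columns never leave disk.
quotes = pq.read_table("data/ES.parquet", columns=["bid", "ask"]).to_pandas()
quotes["mid"] = (quotes["bid"] + quotes["ask"]) / 2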

The Migration: From SQLite Rows to Parquet Columns

I exported my SQLite database symbol by symbol, like this:

import sqlite3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

conn = sqlite3.connect("ticks.db")

symbols = ["ES", "NQ", "CL", "GC"]

for symbol in symbols:
    # Pull one symbol at a time to keep memory bounded, and use a
    # parameterized query instead of string formatting.
    df = pd.read_sql_query(
        "SELECT * FROM ticks WHERE symbol = ? ORDER BY ts",
        conn,
        params=(symbol,),
    )

    # Write one Snappy-compressed Parquet file per symbol.
    table = pa.Table.from_pandas(df)
    pq.write_table(table, f"data/{symbol}.parquet", compression="snappy")

conn.close()

This simple script gave me:

  • much smaller files
  • faster reads
  • per-symbol segmentation
  • more efficient backtesting

The result blew my mind:

Compression stats

Format                   Size
SQLite (uncompressed)    ~50 GB
Parquet (Snappy)         ~3.1 GB
Parquet (ZSTD)           ~2.6 GB

Going from 50 GB → 3 GB was not just a storage win. It fundamentally transformed the workflow.

Building a Better Pipeline: Partitioning + Appending

Once I switched to Parquet, I redesigned the entire ingestion pipeline with the following structure:

/ticks/
    /ES/
        2024-01-01.parquet
        2024-01-02.parquet
        2024-01-03.parquet
    /NQ/
        2024-01-01.parquet

Each file contained:

  • One symbol
  • One trading day
  • Columns:
    • timestamp (int64 ns or float64)
    • bid
    • ask
    • bid_size
    • ask_size
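
Producing that layout takes only a few lines of PyArrow. This is a hedged sketch rather than my exact ingestion code; the helper name and the epoch-seconds assumption for ts are illustrative:

from pathlib import Path

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def write_daily_partitions(df: pd.DataFrame, root: str = "ticks") -> None:
    # Assumes df carries the original columns: ts (epoch seconds), symbol,
    # bid, ask, bid_size, ask_size.
    df = df.copy()
    df["date"] = pd.to_datetime(df["ts"], unit="s").dt.date
    for (symbol, day), chunk in df.groupby(["symbol", "date"]):
        out_dir = Path(root) / symbol
        out_dir.mkdir(parents=True, exist_ok=True)
        table = pa.Table.from_pandas(chunk.drop(columns=["symbol", "date"]))
        pq.write_table(table, str(out_dir / f"{day}.parquet"), compression="snappy")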

Why this structure matters

✔ Efficient time filtering
✔ Easy parallel backtesting
✔ No single monolithic file
✔ Faster reads
✔ Append-only model
✔ Compatible with DuckDB and Polars query engines

It turned the pipeline into something far closer to real institutional tick data stores like:

  • KDB+
  • QuestDB
  • ClickHouse
  • Elastic-backed time-series systems

Obviously Parquet isn’t as fast as KDB+, but for Python-based research, it’s perfect.

Benchmarking: SQLite vs Parquet vs DuckDB on Parquet

I ran performance tests to compare the formats for a common operation:

Load all ticks for ES on a given day and compute the midprice.

SQLite Query → Pandas

Avg: 21.4 seconds

Parquet → Pandas

Avg: 2.8 seconds

DuckDB Query Directly on Parquet

Avg: 0.42 seconds

Polars Scan

Avg: 0.19 seconds

Polars and DuckDB were clear winners due to their vectorized engines and ability to query Parquet without loading everything into memory.
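
For reference, the DuckDB variant of the benchmark query looked roughly like this (timing harness omitted; the file name is one day from the layout above):

import duckdb

con = duckdb.connect()
# Query the day's Parquet file directly and compute the midprice in-engine.
mid = con.execute(
    """
    SELECT ts, (bid + ask) / 2.0 AS mid
    FROM 'ticks/ES/2024-01-02.parquet'
    ORDER BY ts
    """
).df()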

Engineering Lessons Learned (the hard way)

This project taught me several lessons that now shape how I design data pipelines for trading.

1. Row stores break under tick-level data

SQLite and MySQL are perfect for many workloads — but not millions of rapidly appended tiny rows. Tick data is inherently columnar.

2. Compression is not optional

Snappy, ZSTD, and dictionary encoding work wonders on repetitive financial data.
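
In PyArrow this is a write-time switch. The stand-in table and the compression level below are illustrative, not tuned values:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"bid": [4500.25, 4500.25], "ask": [4500.50, 4500.50]})  # stand-in data

pq.write_table(
    table,
    "data/ES.zstd.parquet",
    compression="zstd",
    compression_level=9,    # smaller files, slower writes
    use_dictionary=True,    # dictionary-encode repetitive columns
)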

3. Schema design matters

Storing timestamps as ISO strings wastes gigabytes.
Storing sizes as floating-point values wastes CPU.
Columnar formats with explicit schemas push you toward cleaner types.
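
An explicit Arrow schema bakes those decisions in. The integer size columns here are an assumption about the feed, not something my original collector enforced:

import pyarrow as pa

tick_schema = pa.schema([
    ("ts", pa.timestamp("ns")),   # native timestamps, not ISO strings
    ("bid", pa.float64()),
    ("ask", pa.float64()),
    ("bid_size", pa.int32()),     # sizes as integers rather than floats
    ("ask_size", pa.int32()),
])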

4. Backtesting performance is everything

If loading data takes too long, iteration speed dies — and so does research velocity.

With Parquet + DuckDB, I can now run 100+ backtests per hour without bottlenecks.

5. Small inefficiencies compound at scale

Originally, I ignored:

  • timestamp precision
  • unnecessary string fields
  • row alignment
  • redundant columns

At 10,000 rows, they don’t matter.
At 500 million rows, they explode.

6. Columnar formats are the standard for market data for a reason

Almost every modern quant stack uses:

  • Parquet
  • Arrow
  • KDB+
  • ClickHouse
  • Custom C++ column stores

The reason is simple: columns match how we actually use the data.

Future Improvements: Where This System Goes Next

Now that I’ve built a stable Parquet-based pipeline, the next steps are clear:

✔ Real-time compression in the ingestion process

Compress ticks as they arrive instead of writing raw rows first and compressing later.
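
One hypothetical shape for this is a long-lived ParquetWriter per symbol and day that flushes buffered ticks as compressed row groups; the schema and the dict-based buffer are assumptions for illustration:

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ("ts", pa.timestamp("ns")), ("bid", pa.float64()), ("ask", pa.float64()),
    ("bid_size", pa.int32()), ("ask_size", pa.int32()),
])

writer = pq.ParquetWriter("ticks/ES/2024-01-02.parquet", schema, compression="zstd")

def flush_batch(batch: list[dict]) -> None:
    # Each dict in the batch holds one tick keyed by the schema's field names.
    writer.write_table(pa.Table.from_pylist(batch, schema=schema))

# Call flush_batch periodically during the session, then writer.close() at shutdown.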

✔ Using Arrow datasets for massive scalability

Arrow supports partition discovery, schema merging, and fast metadata operations.
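
A sketch of what that could look like on the layout above, treating the first directory level under ticks/ as a symbol partition (column names assumed to match the schema earlier):

import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset(
    "ticks/",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("symbol", pa.string())])),
)

# Partition pruning skips every non-ES directory before any file is opened.
es = dataset.to_table(filter=ds.field("symbol") == "ES", columns=["ts", "bid", "ask"])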

✔ Integrating DuckDB for on-disk SQL

DuckDB enables queries like:

SELECT *
FROM 'ticks/ES/*.parquet'
WHERE ts BETWEEN ...

without loading anything into memory.
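
From Python the same query can be parameterized; the epoch bounds here are placeholders and assume ts is stored as epoch seconds:

import duckdb

start_ts, end_ts = 1704153600.0, 1704239999.0  # placeholder window bounds

window = duckdb.connect().execute(
    "SELECT * FROM 'ticks/ES/*.parquet' WHERE ts BETWEEN ? AND ?",
    [start_ts, end_ts],
).df()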

✔ Using Polars for ultra-fast scanning

Polars scans are 10–30x faster than Pandas for many workloads.
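
A minimal lazy-scan sketch, assuming the per-symbol, per-day layout from earlier; nothing is read until .collect(), and projection pushdown means only the selected columns touch disk:

import polars as pl

mid = (
    pl.scan_parquet("ticks/ES/*.parquet")
    .select(
        pl.col("ts"),
        ((pl.col("bid") + pl.col("ask")) / 2).alias("mid"),
    )
    .collect()
)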

✔ Testing ZSTD(9) for even better compression

This could bring 3 GB down toward ~2 GB depending on entropy.

Conclusion: SQLite Was a Great Start — But Parquet Made It a Real System

My first 50 GB of tick data nearly killed my laptop.
Parquet saved it.

But beyond the compression win, switching formats changed how I think about building trading data systems:

  • Data layout affects research speed
  • Storage decisions affect strategy performance
  • Compression affects cost and scalability
  • The right tools eliminate entire classes of bottlenecks

Today, my tick data pipeline is faster, smaller, and far easier to work with. It feels like the difference between using a USB drive and a real database.

If you’re collecting market data or running backtests, do yourself a favor:

Start with Parquet.
Start with Arrow.
Start with columnar.

Your disk — and your sanity — will thank you.