Why I Switched from Python to C++ for My Execution Engine
Peter Bieda
I’ve spent a lot of time building small execution engines, order-flow simulators, and microstructure tools to get a better feel for how markets actually behave. Like most people, I started everything in Python. It’s easy, fast to prototype, and has all the tools you need to test an idea in a few minutes.
But after building my own 300-line matching engine, I hit the limits of Python faster than I expected. What started as curiosity eventually pushed me into rewriting the entire execution engine in C++. This wasn’t about preferences or hype—it was about real performance, real latency, and real problems that Python simply can’t solve when you're trying to simulate or execute at speed.
This article explains exactly why I switched, what I learned, and the practical differences I saw by replacing Python components with C++.
Why Python Wasn’t Enough
Python is perfect for:
- exploring an idea
- building scrappy prototypes
- writing analysis and plotting tools
- testing market logic
- building research pipelines
But when you try to run a real-time execution engine in Python, you hit the wall instantly.
The Problems I Hit
1. Python is too slow for per-event matching
A real execution engine may process tens of thousands of events per second:
- incoming orders
- cancellations
- quotes
- market data updates
- internal routing logic
Python struggles as soon as you increase order flow.
2. The GIL kills parallelism
Execution engines are naturally multi-threaded:
- one thread for market data
- one for order routing
- one for risk checks
- one for internal timers
Python can’t actually run these threads in parallel because of the GIL.
3. Heap operations in Python are too slow
Matching engines use heaps constantly:
- best bid
- best ask
- cancel/update
- reinsert nodes
Python heapq is convenient—but slow.
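For comparison, this is the kind of fixed-layout heap C++ gives you out of the box. A minimal sketch, assuming a simple `PriceLevel` aggregate rather than the engine's real types:

```cpp
#include <queue>
#include <vector>

// Hypothetical price level: just a price and an aggregate quantity.
struct PriceLevel {
    int price;
    int qty;
};

// Max-heap on price: the best (highest) bid sits on top.
struct BidCmp {
    bool operator()(const PriceLevel& a, const PriceLevel& b) const {
        return a.price < b.price;
    }
};

// Min-heap on price: the best (lowest) ask sits on top.
struct AskCmp {
    bool operator()(const PriceLevel& a, const PriceLevel& b) const {
        return a.price > b.price;
    }
};

using BidHeap = std::priority_queue<PriceLevel, std::vector<PriceLevel>, BidCmp>;
using AskHeap = std::priority_queue<PriceLevel, std::vector<PriceLevel>, AskCmp>;
```

Push and top are the same O(log n) operations heapq gives you, but with no interpreter dispatch and no boxed objects per node.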
4. Latency is unpredictable
Garbage collection alone can add jitter that destroys timing consistency in execution logic.
Even a 200-microsecond spike is enough to hurt you in real markets.
The Exact Moment I Decided to Switch
I was running a simple simulation:
- 20,000 synthetic orders/second
- 500 FIFO queues
- random cancels
- random sweeps
- top-of-book recalculation
Python started lagging, choking, and drifting away from real-time behavior.
Even PyPy and Cython didn’t help enough.
I realized:
You can’t simulate microsecond behavior if your language can’t operate in microseconds.
C++ was the only logical step.
The C++ Rewrite
I didn’t rewrite everything at once. I rewrote the exact components that Python couldn’t handle:
1. Order object
2. Price levels
3. Heap/priority queue logic
4. Matching loop
5. Execution routing
This is the core C++ version of the matching loop:
void match_buy(Order& order) {
    while (!asks.empty() && order.qty > 0) {
        Order best = asks.top();  // copy: priority_queue::top() returns a const reference
        if (best.price > order.price) break;
        asks.pop();
        int traded = std::min(order.qty, best.qty);
        order.qty -= traded;
        best.qty -= traded;
        if (best.qty > 0) {
            asks.push(best);  // reinsert the partially filled ask
        }
    }
}
It’s simple, tight, and extremely fast.
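The loop above assumes an `asks` structure defined elsewhere in the engine. For anyone who wants to run it on its own, here is a self-contained sketch; the min-heap comparator and the global `asks` are illustrative assumptions, not the engine's actual layout:

```cpp
#include <queue>
#include <vector>
#include <algorithm>

struct Order {
    int id;
    bool is_buy;
    int qty;
    int price;
    long timestamp;
};

// Min-heap on price: the cheapest ask is always on top.
struct AskCmp {
    bool operator()(const Order& a, const Order& b) const { return a.price > b.price; }
};

std::priority_queue<Order, std::vector<Order>, AskCmp> asks;

void match_buy(Order& order) {
    while (!asks.empty() && order.qty > 0) {
        Order best = asks.top();            // copy: top() is const
        if (best.price > order.price) break;
        asks.pop();
        int traded = std::min(order.qty, best.qty);
        order.qty -= traded;
        best.qty -= traded;
        if (best.qty > 0) asks.push(best);  // reinsert partial fills
    }
}
```

A buy for 7 against asks of 5 @ 100 and 5 @ 101, with a 100 limit, fills 5 and leaves 2 resting, exactly the behavior you want from a price-limited sweep.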
Performance Difference
I benchmarked Python vs. C++ on the same workload:
Test: 5,000,000 order events with random matching
Language            Total Time      Avg Event Latency   Memory Jitter
Python              11.2 seconds    ~2 µs               High
Python w/ PyPy      7.4 seconds     ~1.2 µs             Medium
C++ (-O3)           0.41 seconds    80 ns               Extremely low
This is not a small difference.
This is the difference between:
- missing fills vs. getting fills
- modeling reality vs. modeling something “close”
- reacting in 80 nanoseconds vs. reacting in 2000+ nanoseconds
If you’re evaluating execution logic, this difference matters in every way.
Better Memory Behavior
Python objects are big and unpredictable.
C++ lets me define an exact memory layout:
struct Order {
int id;
bool is_buy;
int qty;
int price;
long timestamp;
};
This allowed:
- contiguous memory
- zero GC
- predictable cache-line behavior
- zero allocation overhead for hot loops
This alone increased throughput significantly.
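To make "zero allocation overhead for hot loops" concrete, here is a sketch of the kind of preallocated pool I mean; the pool size and the slot-reuse policy are assumptions for illustration, not the engine's exact code:

```cpp
#include <vector>
#include <cstddef>

struct Order {
    int id;
    bool is_buy;
    int qty;
    int price;
    long timestamp;
};

// Preallocate a contiguous block of Orders once, before the hot loop
// starts. Reusing slots means no allocations while matching.
class OrderPool {
public:
    explicit OrderPool(std::size_t capacity) { pool_.reserve(capacity); }

    Order& acquire() {
        if (used_ < pool_.size()) return pool_[used_++];  // reuse an existing slot
        pool_.emplace_back();                             // grows only during warm-up
        return pool_[used_++];
    }

    void reset() { used_ = 0; }  // release every slot at once between runs

private:
    std::vector<Order> pool_;
    std::size_t used_ = 0;
};
```

After a `reset()`, `acquire()` hands back the same contiguous slots in the same order, so the hot loop touches memory the cache has already seen.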
Deterministic Latency
One of the biggest differences after switching to C++:
There are no random pauses.
Python stops when GC wakes up.
Threads block on GIL contention.
Objects allocate memory unpredictably.
C++ does none of that.
When you’re measuring microsecond behavior inside execution logic, predictable latency is the only thing that matters.
Real Example: Cancel/Replace Logic
Cancel/replace is one of the most common operations in an execution engine.
In Python, it took ~0.8 microseconds per operation.
In C++:
inline void cancel_order(int id) {
auto it = order_map.find(id);
if (it != order_map.end()) {
it->second.active = false;
}
}
This runs in ~40 nanoseconds.
Multiply that by thousands of events per second and the difference becomes massive.
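The snippet depends on an `order_map` defined elsewhere. Here is a self-contained sketch of the same idea; the `active` flag and the lazy-removal policy are assumptions, not necessarily the engine's exact mechanics:

```cpp
#include <unordered_map>

struct Order {
    int id;
    int qty;
    int price;
    bool active;  // flipped off on cancel; dead orders are skipped at match time
};

std::unordered_map<int, Order> order_map;

// Cancel: one O(1) hash lookup and a flag flip. No heap surgery.
inline void cancel_order(int id) {
    auto it = order_map.find(id);
    if (it != order_map.end()) {
        it->second.active = false;
    }
}

// Replace: cancel the old order and insert the new one under its own id.
inline void replace_order(int old_id, const Order& fresh) {
    cancel_order(old_id);
    order_map[fresh.id] = fresh;
}
```

Leaving dead orders in the book and filtering them during matching trades a little memory for a much cheaper cancel path, which is the right trade when cancels dominate your event flow.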
Multithreading Actually Works
This was another point where C++ completely beat Python.
I run market data on one thread:
void market_data_loop() {
while (running) {
feed.update();
}
}
Execution logic on another:
void execution_loop() {
while (running) {
router.handle();
}
}
Risk checks on another:
void risk_loop() {
while (running) {
risk.check();
}
}
Python can spawn these threads, but the GIL keeps them from executing Python code in parallel.
C++ runs all of them in parallel without GIL blocking.
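A minimal sketch of that thread layout, with atomic counters standing in for `feed.update()`, `router.handle()`, and `risk.check()` (which live elsewhere in the engine):

```cpp
#include <atomic>
#include <thread>
#include <chrono>

std::atomic<bool> running{true};
std::atomic<long> md_events{0}, exec_events{0}, risk_events{0};

// Counter increments stand in for the real per-loop work.
void market_data_loop() { while (running) ++md_events; }
void execution_loop()   { while (running) ++exec_events; }
void risk_loop()        { while (running) ++risk_events; }

// Spawn all three loops, let them run briefly, then shut down cleanly.
void run_engine_for(std::chrono::milliseconds dur) {
    std::thread md(market_data_loop);
    std::thread ex(execution_loop);
    std::thread rk(risk_loop);

    std::this_thread::sleep_for(dur);
    running = false;  // every loop observes this flag and exits

    md.join();
    ex.join();
    rk.join();
}
```

All three counters advance simultaneously on separate cores; the only coordination point is the single atomic shutdown flag.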
Why This Matters for Execution Engines
You can’t build a realistic execution engine unless the underlying technology can:
- run without jitter
- handle millions of events
- operate inside 100ns–10µs windows
- guarantee predictable operations
- handle real-time multi-threading
- scale with CPU cores
Python simply isn’t designed for this.
C++ is.
What I Still Use Python For
I didn’t abandon Python.
I use it all the time for:
- data analysis
- pipeline tooling
- plotting PnL and fill behavior
- feature engineering for strategies
- notebook-based research
- simulation analysis
- optimized batch jobs
Python is still perfect for research.
C++ is for the hot path.
The combination of both is ideal.
Sample Test Script I Used to Validate the C++ Engine
I wrote a simple test runner to measure throughput and latency.
#include <chrono>
#include <iostream>
#include "engine.h"
int main() {
OrderBook book;
int total_events = 5'000'000;
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < total_events; i++) {
bool is_buy = (i % 2 == 0);
int price = 100 + (i % 5);
int qty = 1;
Order o{i, is_buy, qty, price, i};
book.add_order(o);
}
auto end = std::chrono::high_resolution_clock::now();
auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
std::cout << "Processed " << total_events << " events in "
<< ns / 1e9 << " seconds\n";
std::cout << "Avg event latency: " << (ns / total_events) << " ns\n";
}
This test showed:
- C++ stays stable even after millions of operations
- latency stays predictable
- no event drift
- no memory bloating
- no random latency spikes
Running the same workload in Python wasn't even close.
Python got me started, but the interpreter became the bottleneck.
Python is honestly what made the whole project possible in the first place. It let me experiment, write logic quickly, visualize behavior, and iterate fast. But once I started working inside sub-millisecond loops, especially during matching, cancellation, and queue updates, the interpreter itself became the bottleneck.
It wasn't the logic.
It wasn't the data structures.
It was the interpreter overhead per operation.
At that point, no amount of optimization or trickery could push Python down to the level where execution engines actually operate.
What I Learned After the Switch
1. Python is great for thinking. C++ is great for doing.
Python lets me build the idea.
C++ lets me build the real engine.
2. I understood microstructure better after rewriting it in C++.
Low-level behavior forces you to understand:
- cache lines
- memory alignment
- branch prediction
- heap operations
- atomic operations
These concepts matter in trading far more than people think.
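One concrete example from that list: padding hot counters to cache-line size so two threads never contend on the same line (false sharing). The 64-byte line size is the common x86 value, an assumption here:

```cpp
// Two counters written by different threads. Without padding they can
// land on the same cache line and ping-pong between cores.
struct alignas(64) PaddedCounter {
    long value = 0;
    // alignas(64) pads each instance out to a full 64-byte line.
};

struct Counters {
    PaddedCounter fills;    // written by the execution thread
    PaddedCounter cancels;  // written by the risk thread
};

static_assert(alignof(PaddedCounter) == 64, "one counter per cache line");
static_assert(sizeof(PaddedCounter) == 64, "padded to a full line");
```

The `static_assert`s document the layout contract at compile time, which is exactly the kind of guarantee Python objects can't give you.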
3. Latency is a real engineering problem
Not an academic one.
If your engine stutters, your fills suffer.
4. Execution logic is extremely sensitive to implementation
Your code becomes your edge.
5. The best systems mix both languages
Python for research.
C++ for execution.
What’s Next
My next steps are:
- Using lock-free queues
- Adding network-level microburst behavior
- Building a custom FIX parser in C++
- Integrating the engine into a Python research pipeline
- Benchmarking different heap structures
- Porting the engine to run on multiple cores
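For the lock-free queue item, the natural starting point is a single-producer/single-consumer ring buffer. This is a sketch under the usual SPSC assumptions (exactly one producer thread, one consumer thread, power-of-two capacity), not a finished implementation:

```cpp
#include <atomic>
#include <array>
#include <cstddef>

// Minimal SPSC ring buffer: one thread calls push(), one calls pop().
// No locks, no allocation after construction.
template <typename T, std::size_t N>
class SpscQueue {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
public:
    bool push(const T& v) {
        auto h = head_.load(std::memory_order_relaxed);
        auto t = tail_.load(std::memory_order_acquire);
        if (h - t == N) return false;            // full
        buf_[h & (N - 1)] = v;
        head_.store(h + 1, std::memory_order_release);
        return true;
    }

    bool pop(T& out) {
        auto t = tail_.load(std::memory_order_relaxed);
        auto h = head_.load(std::memory_order_acquire);
        if (t == h) return false;                // empty
        out = buf_[t & (N - 1)];
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }

private:
    std::array<T, N> buf_{};
    std::atomic<std::size_t> head_{0};  // written only by the producer
    std::atomic<std::size_t> tail_{0};  // written only by the consumer
};
```

The acquire/release pairing is what makes the handoff safe: the consumer never observes an advanced `head_` before the element write it publishes.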
The goal is simple:
Build execution tools that actually behave like real markets.