Why I Switched from Python to C++ for My Execution Engine
Peter Bieda
I’ve spent a lot of time building small execution engines, order-flow simulators, and microstructure tools to get a better feel for how markets actually behave. Like most people, I started everything in Python. It’s easy, fast to prototype, and has all the tools you need to test an idea in a few minutes.
But after building my own 300-line matching engine, I hit the limits of Python faster than I expected. What started as curiosity eventually pushed me into rewriting the entire execution engine in C++. This wasn’t about preferences or hype—it was about real performance, real latency, and real problems that Python simply can’t solve when you're trying to simulate or execute at speed.
This article explains exactly why I switched, what I learned, and the practical differences I saw by replacing Python components with C++.
Why Python Wasn’t Enough
Python is perfect for:
- exploring an idea
- building scrappy prototypes
- writing analysis and plotting tools
- testing market logic
- building research pipelines
But when you try to run a real-time execution engine in Python, you hit the wall instantly.
The Problems I Hit
1. Python is too slow for per-event matching
A real execution engine may process tens of thousands of events per second:
- incoming orders
- cancellations
- quotes
- market data updates
- internal routing logic
Python struggles as soon as you increase order flow.
2. The GIL kills parallelism
Execution engines are naturally multi-threaded:
- one thread for market data
- one for order routing
- one for risk checks
- one for internal timers
Python can’t actually run these threads in parallel because of the GIL.
3. Heap operations in Python are too slow
Matching engines use heaps constantly:
- best bid
- best ask
- cancel/update
- reinsert nodes
Python heapq is convenient—but slow.
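For comparison, this is the kind of fixed-layout heap C++ gives you out of the box. A minimal sketch, assuming a simple `PriceLevel` aggregate rather than the engine's real types:

```cpp
#include <queue>
#include <vector>

// Hypothetical price level: just a price and an aggregate quantity.
struct PriceLevel {
    int price;
    int qty;
};

// Max-heap on price: the best (highest) bid sits on top.
struct BidCmp {
    bool operator()(const PriceLevel& a, const PriceLevel& b) const {
        return a.price < b.price;
    }
};

// Min-heap on price: the best (lowest) ask sits on top.
struct AskCmp {
    bool operator()(const PriceLevel& a, const PriceLevel& b) const {
        return a.price > b.price;
    }
};

using BidHeap = std::priority_queue<PriceLevel, std::vector<PriceLevel>, BidCmp>;
using AskHeap = std::priority_queue<PriceLevel, std::vector<PriceLevel>, AskCmp>;
```

Push and top are the same O(log n) operations heapq gives you, but with no interpreter dispatch and no boxed objects per node.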
4. Latency is unpredictable
Garbage collection alone can add jitter that destroys timing consistency in execution logic.
Even a 200-microsecond spike is enough to hurt you in real markets.
The Exact Moment I Decided to Switch
I was running a simple simulation:
- 20,000 synthetic orders/second
- 500 FIFO queues
- random cancels
- random sweeps
- top-of-book recalculation
Python started lagging, choking, and drifting away from real-time behavior.
Even PyPy and Cython didn’t help enough.
I realized:
You can’t simulate microsecond behavior if your language can’t operate in microseconds.
C++ was the only logical step.
The C++ Rewrite
I didn’t rewrite everything at once. I rewrote the exact components that Python couldn’t handle:
1. Order object
2. Price levels
3. Heap/priority queue logic
4. Matching loop
5. Execution routing
This is the core C++ version of the matching loop:
void match_buy(Order& order) {
    while (!asks.empty() && order.qty > 0) {
        Order best = asks.top();  // copy: priority_queue::top() returns a const reference
        if (best.price > order.price) break;
        asks.pop();
        int traded = std::min(order.qty, best.qty);
        order.qty -= traded;
        best.qty -= traded;
        if (best.qty > 0) {
            asks.push(best);  // reinsert the partially filled ask
        }
    }
}
It’s simple, tight, and extremely fast.
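The loop above assumes an `asks` structure defined elsewhere in the engine. For anyone who wants to run it on its own, here is a self-contained sketch; the min-heap comparator and the global `asks` are illustrative assumptions, not the engine's actual layout:

```cpp
#include <queue>
#include <vector>
#include <algorithm>

struct Order {
    int id;
    bool is_buy;
    int qty;
    int price;
    long timestamp;
};

// Min-heap on price: the cheapest ask is always on top.
struct AskCmp {
    bool operator()(const Order& a, const Order& b) const { return a.price > b.price; }
};

std::priority_queue<Order, std::vector<Order>, AskCmp> asks;

void match_buy(Order& order) {
    while (!asks.empty() && order.qty > 0) {
        Order best = asks.top();            // copy: top() is const
        if (best.price > order.price) break;
        asks.pop();
        int traded = std::min(order.qty, best.qty);
        order.qty -= traded;
        best.qty -= traded;
        if (best.qty > 0) asks.push(best);  // reinsert partial fills
    }
}
```

A buy for 7 against asks of 5 @ 100 and 5 @ 101, with a 100 limit, fills 5 and leaves 2 resting, exactly the behavior you want from a price-limited sweep.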
Performance Difference
I benchmarked Python vs. C++ on the same workload:
Test: 5,000,000 order events with random matching
Language            Total Time      Avg Event Latency   Memory Jitter
Python              11.2 seconds    ~2 µs               High
Python w/ PyPy      7.4 seconds     ~1.2 µs             Medium
C++ (-O3)           0.41 seconds    80 ns               Extremely low
This is not a small difference.
This is the difference between:
- missing fills vs. getting fills
- modeling reality vs. modeling something “close”
- reacting in 80 nanoseconds vs. reacting in 2000+ nanoseconds
If you’re evaluating execution logic, this difference matters in every way.
Better Memory Behavior
Python objects are big and unpredictable.
C++ lets me define an exact memory layout:
struct Order {
int id;
bool is_buy;
int qty;
int price;
long timestamp;
};
This allowed:
- contiguous memory
- zero GC
- predictable cache-line behavior
- zero allocation overhead for hot loops
This alone increased throughput significantly.
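To make "zero allocation overhead for hot loops" concrete, here is a sketch of the kind of preallocated pool I mean; the pool size and the slot-reuse policy are assumptions for illustration, not the engine's exact code:

```cpp
#include <vector>
#include <cstddef>

struct Order {
    int id;
    bool is_buy;
    int qty;
    int price;
    long timestamp;
};

// Preallocate a contiguous block of Orders once, before the hot loop
// starts. Reusing slots means no allocations while matching.
class OrderPool {
public:
    explicit OrderPool(std::size_t capacity) { pool_.reserve(capacity); }

    Order& acquire() {
        if (used_ < pool_.size()) return pool_[used_++];  // reuse an existing slot
        pool_.emplace_back();                             // grows only during warm-up
        return pool_[used_++];
    }

    void reset() { used_ = 0; }  // release every slot at once between runs

private:
    std::vector<Order> pool_;
    std::size_t used_ = 0;
};
```

After a `reset()`, `acquire()` hands back the same contiguous slots in the same order, so the hot loop touches memory the cache has already seen.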
Deterministic Latency
One of the biggest differences after switching to C++:
There are no random pauses.
Python stops when GC wakes up.
Threads block on GIL contention.
Objects allocate memory unpredictably.
C++ does none of that.
When you’re measuring microsecond behavior inside execution logic, predictable latency is the only thing that matters.
Real Example: Cancel/Replace Logic
Cancel/replace is one of the most common operations in an execution engine.
In Python, it took ~0.8 microseconds per operation.
In C++:
inline void cancel_order(int id) {
auto it = order_map.find(id);
if (it != order_map.end()) {
it->second.active = false;
}
}
This runs in ~40 nanoseconds.
Multiply that by thousands of events per second and the difference becomes massive.
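The snippet depends on an `order_map` defined elsewhere. Here is a self-contained sketch of the same idea; the `active` flag and the lazy-removal policy are assumptions, not necessarily the engine's exact mechanics:

```cpp
#include <unordered_map>

struct Order {
    int id;
    int qty;
    int price;
    bool active;  // flipped off on cancel; dead orders are skipped at match time
};

std::unordered_map<int, Order> order_map;

// Cancel: one O(1) hash lookup and a flag flip. No heap surgery.
inline void cancel_order(int id) {
    auto it = order_map.find(id);
    if (it != order_map.end()) {
        it->second.active = false;
    }
}

// Replace: cancel the old order and insert the new one under its own id.
inline void replace_order(int old_id, const Order& fresh) {
    cancel_order(old_id);
    order_map[fresh.id] = fresh;
}
```

Leaving dead orders in the book and filtering them during matching trades a little memory for a much cheaper cancel path, which is the right trade when cancels dominate your event flow.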
Multithreading Actually Works
This was another point where C++ completely beat Python.
I run market data on one thread:
void market_data_loop() {
while (running) {
feed.update();
}
}
Execution logic on another:
void execution_loop() {
while (running) {
router.handle();
}
}
Risk checks on another:
void risk_loop() {
while (running) {
risk.check();
}
}
Python can spawn these threads, but the GIL keeps them from executing Python code in parallel.
C++ runs all of them in parallel without GIL blocking.
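A minimal sketch of that thread layout, with atomic counters standing in for `feed.update()`, `router.handle()`, and `risk.check()` (which live elsewhere in the engine):

```cpp
#include <atomic>
#include <thread>
#include <chrono>

std::atomic<bool> running{true};
std::atomic<long> md_events{0}, exec_events{0}, risk_events{0};

// Counter increments stand in for the real per-loop work.
void market_data_loop() { while (running) ++md_events; }
void execution_loop()   { while (running) ++exec_events; }
void risk_loop()        { while (running) ++risk_events; }

// Spawn all three loops, let them run briefly, then shut down cleanly.
void run_engine_for(std::chrono::milliseconds dur) {
    std::thread md(market_data_loop);
    std::thread ex(execution_loop);
    std::thread rk(risk_loop);

    std::this_thread::sleep_for(dur);
    running = false;  // every loop observes this flag and exits

    md.join();
    ex.join();
    rk.join();
}
```

All three counters advance simultaneously on separate cores; the only coordination point is the single atomic shutdown flag.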
Why This Matters for Execution Engines
You can’t build a realistic execution engine unless the underlying technology can:
- run without jitter
- handle millions of events
- operate inside 100ns–10µs windows
- guarantee predictable operations
- handle real-time multi-threading
- scale with CPU cores
Python simply isn’t designed for this.
C++ is.
What I Still Use Python For
I didn’t abandon Python.
I use it all the time for:
- data analysis
- pipeline tooling
- plotting PnL and fill behavior
- feature engineering for strategies
- notebook-based research
- simulation analysis
- optimized batch jobs
Python is still perfect for research.
C++ is for the hot path.
The combination of both is ideal.
Sample Test Script I Used to Validate the C++ Engine
I wrote a simple test runner to measure throughput and latency.
#include <chrono>
#include <iostream>
#include "engine.h"
int main() {
OrderBook book;
int total_events = 5'000'000;
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < total_events; i++) {
bool is_buy = (i % 2 == 0);
int price = 100 + (i % 5);
int qty = 1;
Order o{i, is_buy, qty, price, i};
book.add_order(o);
}
auto end = std::chrono::high_resolution_clock::now();
auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
std::cout << "Processed " << total_events << " events in "
<< ns / 1e9 << " seconds\n";
std::cout << "Avg event latency: " << (ns / total_events) << " ns\n";
}
This test showed:
- C++ stays stable even after millions of operations
- latency stays predictable
- no event drift
- no memory bloating
- no random latency spikes
Running the same workload in Python wasn't even close.
Python got me started, but the interpreter became the bottleneck.
Python is honestly what made the whole project possible in the first place. It let me experiment, write logic quickly, visualize behavior, and iterate fast. But once I started working inside sub-millisecond loops, especially during matching, cancellation, and queue updates, the interpreter itself became the bottleneck.
It wasn't the logic.
It wasn't the data structures.
It was the interpreter overhead per operation.
At that point, no amount of optimization or trickery could push Python down to the level where execution engines actually operate.
What I Learned After the Switch
1. Python is great for thinking. C++ is great for doing.
Python lets me build the idea.
C++ lets me build the real engine.
2. I understood microstructure better after rewriting it in C++.
Low-level behavior forces you to understand:
- cache lines
- memory alignment
- branch prediction
- heap operations
- atomic operations
These concepts matter in trading far more than people think.
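One concrete example from that list: padding hot counters to cache-line size so two threads never contend on the same line (false sharing). The 64-byte line size is the common x86 value, an assumption here:

```cpp
// Two counters written by different threads. Without padding they can
// land on the same cache line and ping-pong between cores.
struct alignas(64) PaddedCounter {
    long value = 0;
    // alignas(64) pads each instance out to a full 64-byte line.
};

struct Counters {
    PaddedCounter fills;    // written by the execution thread
    PaddedCounter cancels;  // written by the risk thread
};

static_assert(alignof(PaddedCounter) == 64, "one counter per cache line");
static_assert(sizeof(PaddedCounter) == 64, "padded to a full line");
```

The `static_assert`s document the layout contract at compile time, which is exactly the kind of guarantee Python objects can't give you.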
3. Latency is a real engineering problem
Not an academic one.
If your engine stutters, your fills suffer.
4. Execution logic is extremely sensitive to implementation
Your code becomes your edge.
5. The best systems mix both languages
Python for research.
C++ for execution.
What’s Next
My next steps are:
- Using lock-free queues
- Adding network-level microburst behavior
- Building a custom FIX parser in C++
- Integrating the engine into a Python research pipeline
- Benchmarking different heap structures
- Porting the engine to run on multiple cores
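For the lock-free queue item, the natural starting point is a single-producer/single-consumer ring buffer. This is a sketch under the usual SPSC assumptions (exactly one producer thread, one consumer thread, power-of-two capacity), not a finished implementation:

```cpp
#include <atomic>
#include <array>
#include <cstddef>

// Minimal SPSC ring buffer: one thread calls push(), one calls pop().
// No locks, no allocation after construction.
template <typename T, std::size_t N>
class SpscQueue {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
public:
    bool push(const T& v) {
        auto h = head_.load(std::memory_order_relaxed);
        auto t = tail_.load(std::memory_order_acquire);
        if (h - t == N) return false;            // full
        buf_[h & (N - 1)] = v;
        head_.store(h + 1, std::memory_order_release);
        return true;
    }

    bool pop(T& out) {
        auto t = tail_.load(std::memory_order_relaxed);
        auto h = head_.load(std::memory_order_acquire);
        if (t == h) return false;                // empty
        out = buf_[t & (N - 1)];
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }

private:
    std::array<T, N> buf_{};
    std::atomic<std::size_t> head_{0};  // written only by the producer
    std::atomic<std::size_t> tail_{0};  // written only by the consumer
};
```

The acquire/release pairing is what makes the handoff safe: the consumer never observes an advanced `head_` before the element write it publishes.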
The goal is simple:
Build execution tools that actually behave like real markets.