I Used My Gaming GPU to Revolutionize Trading Strategy

Peter Bieda

The Problem Every Algorithmic Trader Eventually Hits

Parameter optimization in algorithmic trading is brutal.

Buy thresholds, sell thresholds, compounding modes, position sizing—every strategy quickly turns into a multi-dimensional search problem. The standard approach is sequential backtesting combined with Bayesian optimization tools like Optuna.
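
Concretely, the baseline looks something like this (a sketch; backtest and the parameter ranges are placeholders, not the exact strategy from this post):

import optuna

def objective(trial):
    # Hypothetical parameter ranges, for illustration only
    buy_threshold = trial.suggest_float("buy_threshold", 0.0, 0.10)
    sell_threshold = trial.suggest_float("sell_threshold", 0.0, 0.10)
    return backtest(buy_threshold, sell_threshold)  # one full pass over the data

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=200)  # trials run strictly one after another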

It works.
But it’s painfully slow.

On my setup:

  • ~200 parameter combinations for a single stock took 40+ seconds
  • A portfolio of 50 stocks meant 30+ minutes per optimization run
  • Real-time iteration during market hours was completely unrealistic

Meanwhile, my workstation had an NVIDIA RTX 3090 sitting idle.

10,496 CUDA cores.
24 GB of VRAM.
~2% utilization.

My CPU was screaming at 100%, and the GPU was doing nothing.

That’s when the obvious question hit me:

Why am I running backtests sequentially on 8 CPU cores when I have a small supercomputer sitting next to me?

The Key Insight: Backtesting Is Embarrassingly Parallel

Trading strategy backtests are a textbook example of an embarrassingly parallel problem:

  • Each parameter combination is completely independent
  • All strategies use the same price data
  • The workload is pure numerical computation

This is exactly what GPUs are designed for.

So I moved my backtester to the GPU.

And… it wasn’t faster.

Sometimes it was actually slower than the CPU.

The Real Bottleneck: Data Transfer, Not Compute

Most GPU tutorials obsess over raw compute power:

“The RTX 3090 has 35.58 TFLOPS!”

That number is meaningless if you move data incorrectly.

The biggest performance killer in GPU workloads is data transfer, not math.

Every CPU → GPU copy costs:

  • PCIe latency
  • Memory allocation overhead
  • Synchronization barriers

My first attempt looked like this:

# ❌ The wrong way: transferring data to the GPU on every iteration
import cupy as cp

results = []
for params in all_parameter_combinations:
    gpu_data = cp.array(price_data)          # CPU → GPU copy, every single time
    result = backtest_gpu(gpu_data, params)  # the kernel itself is fast...
    results.append(result)                   # ...but the copy above dominates

Even though each backtest was fast, I paid the transfer cost thousands of times. The overhead completely erased the GPU advantage.
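
You can see this overhead directly with CUDA events. A minimal sketch (the array size is illustrative):

import numpy as np
import cupy as cp

data = np.random.rand(252).astype(np.float32)  # roughly one year of daily bars

start, stop = cp.cuda.Event(), cp.cuda.Event()
start.record()
for _ in range(10_000):
    _ = cp.asarray(data)  # pays PCIe latency and allocation on every call
stop.record()
stop.synchronize()
print(f"{cp.cuda.get_elapsed_time(start, stop):.1f} ms for 10,000 transfers")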

The Fix: Three Rules That Changed Everything

1. Load Data to the GPU Once — and Never Again

Instead of transferring price data per strategy, I load the entire dataset once during initialization and keep it in GPU memory:

import cupy as cp

class GPUBatchBacktester:
    def __init__(self, bars, capital):
        self.capital = capital
        self.n_bars = len(bars)

        # One-time CPU → GPU transfer; the arrays stay resident on the device
        self.open_prices = cp.array([b["o"] for b in bars], dtype=cp.float32)
        self.high_prices = cp.array([b["h"] for b in bars], dtype=cp.float32)
        self.close_prices = cp.array([b["c"] for b in bars], dtype=cp.float32)

        # Previous close, shifted by one bar (index 0 has no predecessor)
        self.prev_close = cp.zeros_like(self.close_prices)
        self.prev_close[1:] = self.close_prices[:-1]

A full year of daily bars is only a few kilobytes—trivial for a 24 GB GPU.
But removing repeated transfers was huge.
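
Usage then pays the transfer exactly once (a sketch; parameter_chunks is a hypothetical iterable of trigger grids, and run_batch is shown in rule 3 below):

bt = GPUBatchBacktester(bars, capital=10_000.0)  # the only CPU → GPU transfer

for buy_triggers, sell_triggers in parameter_chunks:
    equity = bt.run_batch(buy_triggers, sell_triggers)  # no further copies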

2. Use Pinned (Page-Locked) Memory for Faster Transfers

Even the initial load can be optimized.

CuPy supports CUDA pinned memory, which enables direct DMA transfers:

# ✅ Faster CPU → GPU transfer: stage the data in pinned (page-locked) memory
pinned_mem = cp.cuda.alloc_pinned_memory(len(data) * 4)  # 4 bytes per float32
pinned_arr = np.frombuffer(pinned_mem, dtype=np.float32, count=len(data))
pinned_arr[:] = data                # fill the page-locked staging buffer
gpu_array = cp.asarray(pinned_arr)  # DMA copy at full PCIe bandwidth

Pinned memory cannot be paged out by the OS, so the GPU's DMA engine can read it directly at full PCIe bandwidth.

This alone gave me a 2–3× speedup during initialization.

3. Vectorize Everything (This Is the Breakthrough)

The real leap came when I stopped thinking in terms of “one strategy at a time.”

Instead of running 10,000 backtests sequentially, I run 10,000 strategies simultaneously.

Conceptually:

# ❌ Sequential
for i in range(10000):
    results[i] = backtest(params[i])

# ✅ Vectorized
results = backtest_vectorized(all_params)

Instead of tracking one position, I track thousands of positions as vectors:

def run_batch(self, buy_triggers, sell_triggers):
    # xp is the array backend: cupy on the GPU, numpy on the CPU fallback
    n = len(buy_triggers)

    # One vector slot per strategy: 10,000 strategies, vectors of length 10,000
    positions = xp.zeros(n, dtype=xp.float32)          # shares held
    cash = xp.full(n, self.capital, dtype=xp.float32)  # uninvested capital
    entry_prices = xp.zeros(n, dtype=xp.float32)

    for bar_idx in range(1, self.n_bars):  # start at 1: bar 0 has no prev close
        price = self.high_prices[bar_idx]
        pct_change = (price - self.prev_close[bar_idx]) / self.prev_close[bar_idx]

        # One boolean mask covers every strategy at once
        should_buy = (positions == 0) & (pct_change >= buy_triggers)

        entry_prices = xp.where(should_buy, price, entry_prices)
        positions = xp.where(should_buy, cash / price, positions)
        cash = xp.where(should_buy, 0.0, cash)

        # (sell logic, driven by sell_triggers, elided here)

    # Final equity per strategy: cash plus open positions at the last close
    return cash + positions * self.close_prices[-1]

No Python loops over strategies.
No thread management.
No task scheduling.

Just massive SIMD execution and NumPy-style broadcasting.
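
Building the exhaustive grid is itself just broadcasting. A two-dimensional sketch (the real 13,775-combination grid spans more parameters than this):

import cupy as cp

buy_range = cp.arange(0.005, 0.105, 0.001, dtype=cp.float32)   # 100 buy triggers
sell_range = cp.arange(0.005, 0.105, 0.001, dtype=cp.float32)  # 100 sell triggers

# Cartesian product, flattened to one element per strategy
buy_grid, sell_grid = cp.meshgrid(buy_range, sell_range)
buy_triggers = buy_grid.ravel()    # 10,000 strategies,
sell_triggers = sell_grid.ravel()  # all evaluated in a single run_batch() call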

The Results: 695× Throughput Increase

Here’s what the RTX 3090 delivered after applying all three optimizations:

Metric                   CPU (Optuna)   GPU (Grid Search)   Gain
Parameter combinations   200            13,775              69×
Time per stock           ~40 s          ~4 s                10×
Strategies per second    ~5             3,475               695×
Search method            Bayesian       Exhaustive          Exact

The most important improvement wasn’t just speed.

It was certainty.

Bayesian optimization is heuristic—it can miss the global optimum.
GPU-powered grid search tests every single combination.

No guesses.
No blind spots.
No “maybe there’s a better region.”
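
In code, "exact" is nothing fancier than an argmax over the full results vector (assuming, as in the run_batch sketch above, one final equity value per strategy):

equity = bt.run_batch(buy_triggers, sell_triggers)

best = int(xp.argmax(equity))          # index of the single best strategy
best_buy = float(buy_triggers[best])   # best within the tested grid, by construction
best_sell = float(sell_triggers[best])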

What This Changed in Practice

Before:

  • 50 stocks × 40 seconds ≈ 33 minutes
  • 200 parameter combinations
  • Approximate solutions

After:

  • 50 stocks × 4 seconds ≈ 3.3 minutes
  • 13,775 combinations per stock
  • The exact optimum over the full tested grid

That’s:

  • 10× faster wall-clock time
  • 69× deeper parameter coverage

The Code & Setup

Project structure:

optimizer/
├── gpu_backtest.py    # GPU batch backtester
├── Dockerfile.gpu     # CUDA 12.2 + CuPy
├── server.py          # FastAPI (CPU/GPU toggle)
└── static/            # Web UI with live progress
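
That CPU/GPU toggle in server.py is also why the backtester uses an xp alias instead of importing CuPy directly. A minimal sketch of the pattern (the exact flag name is illustrative):

import os

if os.environ.get("USE_GPU", "1") == "1":  # hypothetical toggle flag
    import cupy as xp                      # GPU backend
else:
    import numpy as xp                     # CPU fallback, identical array API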

Tech stack:

  • CuPy (cupy-cuda12x)
  • NVIDIA CUDA 12.2
  • Docker + NVIDIA Container Toolkit

Quick Start

docker build -f Dockerfile.gpu -t optimizer-gpu .
docker run -d --gpus all -p 8082:8000 optimizer-gpu

Key Takeaways

  1. GPU optimization is about data movement first, compute second
  2. Vectorization beats parallel loops every time
  3. A gaming GPU is a serious quantitative research tool
  4. Exhaustive search becomes practical with GPU acceleration
  5. Your first GPU version will be slow—profiling is mandatory (see the sketch below)
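
On that last point: wrapping GPU calls in time.time() lies, because kernels execute asynchronously. CuPy ships a benchmark helper that synchronizes correctly and reports CPU and GPU time separately. A minimal sketch:

from cupyx.profiler import benchmark

# Reports CPU vs. GPU time per call, which is exactly where a hidden
# transfer bottleneck shows up
print(benchmark(bt.run_batch, (buy_triggers, sell_triggers), n_repeat=20))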

What’s Next

  • Multi-GPU scaling for massive parameter spaces
  • Custom CUDA kernels for complex strategy logic
  • Real-time optimization during market hours
  • Portfolio-level, multi-asset optimization

If this was useful, the full source code is available in the PPOAlgo repository on GitHub. Contributions and PRs are welcome.