
Integrating Reinforcement Learning for Adaptive Market-Making

Peter Bieda


The first time I tried using reinforcement learning (RL) for market-making, I expected fireworks—some kind of emergent intelligence capable of exploiting tiny microstructure inefficiencies. What I actually got was an agent that spammed quotes into toxic flow, chased every price move, and died almost instantly during volatility bursts.

It taught me a simple lesson:

RL doesn’t automatically create intelligence.
It amplifies whatever structure you design into the environment.

At first, I treated it like a prediction tool. The more I experimented, the more I realized it’s fundamentally a behavior-shaping framework. And the most important behaviors in market-making are defensive:

  • knowing when spreads are fake,
  • when order flow is toxic,
  • when latency is drifting,
  • when imbalance is predictive,
  • when volatility will make any quote dangerous.

Humans are very good at not trading in the wrong places. RL can learn that too—if the environment forces it to.

Over time, integrating RL into my market-making research stack became less about “learning optimal quoting” and more about “learning the negative space of the market.” That reframing made the entire effort viable.

This article goes deep into that process: the environment, the reward design, the simulation framework, the experiments, the failures, and ultimately where RL actually helps.

1. The Market-Making Problem Through the RL Lens

Market-making is a constant negotiation between:

  • providing liquidity (earn the spread)
  • avoiding toxic flow (protect inventory)

Most developers try to hand-engineer rules:

  • “If volatility > threshold: widen.”
  • “If imbalance rises: reduce size.”
  • “If mid-price momentum increases: cancel.”

These rules work—until they don’t.
Markets shift regimes constantly, and threshold-based systems fall apart.

RL reframes market-making as a transition problem:

State → Action → Market Reaction → Updated State

Where the agent continuously adjusts behavior based on observed conditions.

This doesn’t mean the agent becomes a genius.
Instead:

It becomes very good at reacting to microstructure patterns you define—especially patterns associated with danger.

That was the mental shift that allowed me to apply RL productively.

2. Building the RL Environment From Scratch

I didn’t want a black-box environment.
I wanted full control over:

  • price dynamics,
  • order arrival models,
  • fill mechanics,
  • latency,
  • imbalance generation,
  • inventory risk.

So I built a full proprietary simulation environment with the following modules:

2.1 Price Process

I started with a discrete mid-price model:

mid[t+1] = mid[t] + drift + volatility * shock

But I added microstructure patterns:

  • volatility clustering,
  • jump processes during imbalance sweeps,
  • spread compression during high-liquidity periods,
  • stochastic latency regimes.

These patterns matter greatly because RL trains on distributions—not on one clean scenario.
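
To make that concrete, here is a toy sketch of such a mid-price generator with clustered volatility and rare jumps. The function name and parameter values are illustrative, not the model I actually used.

import numpy as np

def simulate_mid(n_steps, mid0=100.0, drift=0.0, base_vol=0.02,
                 jump_prob=0.01, jump_scale=0.5, seed=0):
    """Toy mid-price path with clustered volatility and rare jumps."""
    rng = np.random.default_rng(seed)
    mids = np.empty(n_steps)
    mids[0] = mid0
    var = base_vol ** 2
    prev_shock = 0.0
    for t in range(1, n_steps):
        # GARCH(1,1)-style variance update: recent large shocks raise
        # near-term variance, which is what produces clustering.
        var = 0.05 * base_vol ** 2 + 0.90 * var + 0.05 * prev_shock ** 2
        shock = np.sqrt(var) * rng.normal()
        # Occasional jump, standing in for imbalance-driven sweeps.
        if rng.random() < jump_prob:
            shock += jump_scale * rng.normal()
        mids[t] = mids[t - 1] + drift + shock
        prev_shock = shock
    return mids

mid_path = simulate_mid(10_000)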

2.2 Order Flow Model

I added three types of incoming flow:

  1. Passive noise flow – benign liquidity takers
  2. Toxic momentum flow – informed orders during jumps
  3. Spoof-like flow – ephemeral depth that disappears quickly

This mixture is essential.
If your environment only has noise, your RL agent becomes way too aggressive.
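
A rough sketch of how one step of that mixed flow could be sampled; the weights and sizes are illustrative placeholders, not my production values.

import random

rng = random.Random(0)

FLOW_TYPES = ["noise", "toxic", "spoof"]

def sample_flow(volatility, imbalance):
    """Sample one incoming order; toxic flow becomes more likely when
    volatility and book imbalance are elevated."""
    toxic_weight = 0.05 + 0.5 * min(volatility, 1.0) + 0.3 * abs(imbalance)
    weights = [1.0, toxic_weight, 0.1]          # noise, toxic, spoof-like
    kind = rng.choices(FLOW_TYPES, weights=weights, k=1)[0]
    side = rng.choice(["buy", "sell"])
    size = rng.randint(1, 10) * (3 if kind == "toxic" else 1)
    return {"kind": kind, "side": side, "size": size}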

2.3 Fill Mechanics

Quotes have probabilities of:

  • being hit
  • being partially filled
  • being swept

Depending on:

  • distance from mid,
  • regime,
  • imbalance,
  • the “toxicity factor.”

This is where the agent learns fear.
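
As a sketch, the fill logic can be expressed as a logistic score over distance-from-mid, regime, imbalance, and toxicity. The coefficients and dictionary keys below are illustrative, not the exact mechanics I used.

import math

def fill_probability(distance_from_mid, volatility, imbalance, toxicity):
    """Crude logistic fill model: closer quotes fill more often, and
    toxic or one-sided conditions raise the chance of being swept."""
    score = (-4.0 * distance_from_mid    # farther from mid -> fewer fills
             + 2.0 * volatility          # busy tape -> more fills
             + 1.5 * abs(imbalance)      # one-sided book -> sweep risk
             + 2.5 * toxicity)           # informed flow -> adverse fills
    return 1.0 / (1.0 + math.exp(-score))

def simulate_fill(quote, market, rng):
    """Return the filled quantity for one quote this step (possibly partial)."""
    p = fill_probability(quote["distance"], market["vol"],
                         market["imbalance"], market["toxicity"])
    if rng.random() >= p:
        return 0.0
    # Toxic conditions tend to sweep the whole quote; benign flow
    # often takes only part of it.
    frac = 1.0 if rng.random() < market["toxicity"] else rng.uniform(0.2, 1.0)
    return frac * quote["size"]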

2.4 Latency and Queue Position

Real market-making is not instantaneous.
So I added:

  • decision latency,
  • cancellation latency,
  • queue position decay.

If the agent quotes too aggressively during a regime shift, it suffers.
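
A minimal sketch of how those effects can be modelled, assuming per-regime latency parameters and a simple queue-position rule; all numbers are illustrative.

LATENCY_REGIMES = {
    "calm":      (0.2, 0.05),   # (mean ms, jitter ms)
    "busy":      (0.8, 0.30),
    "congested": (3.0, 1.50),
}

def decision_latency_ms(regime, rng):
    """Delay between the agent's decision and the simulated exchange
    acting on it, drawn per step from the current latency regime."""
    mean, jitter = LATENCY_REGIMES[regime]
    return max(0.0, rng.gauss(mean, jitter))

def decay_queue_position(position, volume_traded_ahead):
    """Queue position only improves as the size ahead of us trades out."""
    return max(0, position - volume_traded_ahead)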

3. The State Representation That Actually Worked

Early on, my state vector was too large and noisy (20+ features).
The agent behaved randomly.

The best-performing representation ended up being just:

[spread,
 volatility,
 short_term_trend,
 inventory,
 order_book_imbalance,
 latency_regime]

Why this works:

  • spread affects how attractive quoting is
  • volatility affects adverse selection
  • trend predicts momentum bursts
  • inventory is critical for risk
  • imbalance warns of directional sweeps
  • latency regime informs tail-risk exposure

This state space hit the sweet spot: informative but stable.
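
For reference, here is a sketch of how that state vector might be assembled from raw simulator fields. The snapshot field names are assumptions, not my exact schema.

import numpy as np

def build_state(snapshot, inventory, latency_regime_id):
    """Pack the six features into a fixed-order vector for the policy."""
    return np.array([
        snapshot["best_ask"] - snapshot["best_bid"],   # spread
        snapshot["realized_vol"],                      # short-horizon volatility
        snapshot["short_term_trend"],                  # signed trend estimate
        inventory,                                     # current position
        snapshot["order_book_imbalance"],              # bid/ask depth imbalance
        float(latency_regime_id),                      # discrete latency regime
    ], dtype=np.float32)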

4. Designing a Reward Function That Doesn’t Explode

Reward is the heart of RL.
The wrong reward creates pathological behavior.

My final reward formula looked like:

reward = realized_pnl \
         - 0.15 * abs(inventory) \
         - 0.5 * adverse_selection_penalty \
         - 0.1 * toxic_fill_penalty \
         - quote_costs \
         + liquidity_rebate

Let me break this down:

4.1 realized_pnl

Standard. Encourages profitable quoting.

4.2 inventory penalty

Essential. Prevents runaway exposure.

4.3 adverse_selection_penalty

This is the magic.
It triggers when the mid-price jumps right after the agent trades.

This is how RL learns “don’t quote into momentum.”

4.4 toxic_fill_penalty

A separate penalty for fills that coincide with abnormal imbalance.

4.5 quote_costs

Discourages over-quoting.

4.6 liquidity rebates

Optional positive reward to promote providing stable liquidity.

This mixture allowed the agent to learn caution—without disabling profitable behavior.
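
Written as code, the same reward is straightforward. The weights are exactly the ones from the formula above; the penalty inputs are whatever your simulator exposes.

def compute_reward(realized_pnl, inventory, adverse_selection_penalty,
                   toxic_fill_penalty, quote_costs, liquidity_rebate):
    """Reward for one step, mirroring the formula in section 4."""
    return (realized_pnl
            - 0.15 * abs(inventory)
            - 0.5 * adverse_selection_penalty
            - 0.1 * toxic_fill_penalty
            - quote_costs
            + liquidity_rebate)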

5. The First Working Prototype (Simplified Python Example)

Here’s a simplified version of the training loop:

# MarketMakingEnv and PPO here are classes from my own stack,
# not a public library.
env = MarketMakingEnv()
agent = PPO(
    state_dim=env.state_size,
    action_dim=env.action_size,
    hidden_sizes=[128, 128],
)

for epoch in range(500):
    # Roll out one episode against the simulator.
    state = env.reset()
    for t in range(env.max_steps):
        action = agent.act(state)
        next_state, reward, done = env.step(action)

        # Store the transition in the agent's rollout buffer.
        agent.remember(state, action, reward, next_state, done)

        state = next_state
        if done:
            break

    # Update the policy on the collected transitions.
    agent.train()

Not production code, but it captures the flow.
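
Since MarketMakingEnv and PPO are custom classes rather than a public library, here is a minimal sketch of the environment interface the loop assumes; the sizes and action meanings are placeholders.

import numpy as np

class MarketMakingEnv:
    """Skeleton of the gym-style interface the training loop expects."""
    state_size = 6        # spread, vol, trend, inventory, imbalance, latency regime
    action_size = 5       # e.g. widen / tighten / lean bid / lean ask / pull quotes
    max_steps = 1_000

    def reset(self):
        # Re-initialise the simulated book, inventory, and regime state.
        self._t = 0
        return np.zeros(self.state_size, dtype=np.float32)

    def step(self, action):
        # Apply the quoting action, advance the simulator one tick,
        # and return (next_state, reward, done).
        self._t += 1
        next_state = np.zeros(self.state_size, dtype=np.float32)
        reward = 0.0
        done = self._t >= self.max_steps
        return next_state, reward, done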

6. What the RL Agent Actually Learned

After several hundred episodes, I observed recurring behaviors.

6.1 It learned to cancel before volatility spikes.

When volatility started clustering, the agent often:

  • widened its quotes,
  • then cancelled both sides,
  • then waited for the burst to end.

This is exactly what a human trader would do.

6.2 It leaned quotes to unwind inventory.

If inventory drifted:

  • long → agent quoted more aggressively on the ask
  • short → more aggressively on the bid

It wasn’t perfect, but the intention was obvious.
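
Distilled into a hand-written rule, that leaning behaviour is just an inventory-dependent skew of both quotes. A rough sketch, with an illustrative skew coefficient:

def skewed_quotes(mid, half_spread, inventory, skew_per_unit=0.01):
    """Shift both quotes against the inventory so fills tend to unwind it."""
    skew = skew_per_unit * inventory
    bid = mid - half_spread - skew   # long inventory pulls the bid away from mid
    ask = mid + half_spread - skew   # ...and pushes the ask toward mid
    return bid, ask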

6.3 It developed “fear” of imbalance surges.

When imbalance suddenly spiked to one side (predicting sweeps):

  • it avoided quoting inside the spread
  • it reduced size
  • sometimes it exited completely

This is the single most important market-making behavior.

6.4 It didn’t chase micro-momentum.

One of the biggest beginner mistakes is:

“Momentum up? Quote up!”

That’s how you get run over.

The RL agent stopped doing that after proper reward shaping.

6.5 It developed patience.

Instead of quoting constantly, it learned when silence was strategy.

7. Extending the Framework: Multi-Agent and Adversarial Conditions

Once the single-agent RL environment stabilized, I expanded further.

7.1 Multi-Agent Competition

I simulated rival market-makers with different behaviors:

  • fast but dumb
  • slow but stable
  • latency-spiky
  • aggressive leaners
  • fading liquidity providers

The RL agent learned how to behave under competition pressure.

7.2 Adversarial Order Flow

I generated synthetic adversarial patterns:

  • fake liquidity bursts
  • spoof-like depth
  • nanosecond-level quote flickering
  • momentum ignition

Then I retrained the agent.

It learned to stay out of these traps.

7.3 Regime Shifts

I added transitions between:

  • calm periods,
  • news-like spikes,
  • low-latency congestion,
  • asymmetric spread compression,
  • slow-trending regimes.

Over time, the RL agent developed a distinct behavioral style for each regime.

This is where RL shines:
adaptive behavior under non-stationary conditions.
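
One simple way to drive those transitions in a simulator is a Markov chain over regime labels. A hedged sketch, with illustrative labels and probabilities:

import random

REGIMES = ["calm", "news_spike", "congestion", "spread_compression", "slow_trend"]

# Row i: probabilities of moving from REGIMES[i] to each regime next step.
TRANSITIONS = {
    "calm":               [0.95, 0.02, 0.01, 0.01, 0.01],
    "news_spike":         [0.30, 0.60, 0.05, 0.00, 0.05],
    "congestion":         [0.20, 0.05, 0.70, 0.02, 0.03],
    "spread_compression": [0.15, 0.02, 0.03, 0.75, 0.05],
    "slow_trend":         [0.10, 0.03, 0.02, 0.05, 0.80],
}

def next_regime(current, rng=random.Random(0)):
    """Sample the next regime label from the current row of the chain."""
    return rng.choices(REGIMES, weights=TRANSITIONS[current], k=1)[0]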

8. Integrating RL Into a Real Trading System Architecture

I didn’t use RL as a live decision-maker.
Instead, the architecture looked like this:

Training Sandbox
        ↓
RL Agent Learns Behaviors
        ↓
Behavioral Snapshots / Policy Vectors
        ↓
Feature Extraction → Risk Model → Execution Engine

The RL agent produces:

  • learned “avoidance maps,”
  • risk heuristics,
  • regime-detection patterns,
  • quoting sensitivities.

These get distilled into:

  • offline risk tests,
  • behavior heuristics embedded in C++ engines,
  • early warning signals,
  • quoting constraints.

RL informs the strategy. It doesn’t run it.

This hybrid approach is what most real HFT shops prefer.
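
As one example of that distillation step, you can sweep the trained policy over a coarse grid of states offline and export the cells where it pulls quotes as static constraints for the production engine. The sketch below assumes the policy exposes the same act() interface as in section 5; the action index, grid, and file name are all illustrative.

import itertools
import json

import numpy as np

PULL_QUOTES_ACTION = 4   # assumed index of the "pull both quotes" action

def export_no_quote_map(policy, path="no_quote_map.json"):
    """Sweep the policy over a coarse state grid and record the cells
    where it chooses to pull quotes, as a static constraint table."""
    grid = itertools.product(
        np.linspace(0.01, 0.20, 5),    # spread
        np.linspace(0.0, 1.0, 5),      # volatility
        np.linspace(-1.0, 1.0, 5),     # imbalance
    )
    no_quote_cells = []
    for spread, vol, imbalance in grid:
        state = np.array([spread, vol, 0.0, 0.0, imbalance, 0.0], dtype=np.float32)
        if policy.act(state) == PULL_QUOTES_ACTION:
            no_quote_cells.append({
                "spread": float(spread),
                "vol": float(vol),
                "imbalance": float(imbalance),
            })
    with open(path, "w") as f:
        json.dump(no_quote_cells, f, indent=2)
    return no_quote_cells

The resulting table is easy to review offline and cheap to check in the hot path, which is the point of keeping RL out of live execution.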

9. The Key Insight: RL Learns Absences

The biggest misconception:

“RL will learn optimal trading.”

False.
Markets are too noisy and too adversarial.

Instead:

RL learns where not to engage.

It identifies zones of:

  • hidden toxicity,
  • order-flow imbalance,
  • latency divergence,
  • regime transitions,
  • structural microstructure fragility.

These “no-trade regions” are the real edge.

Because:

  • Humans notice them inconsistently.
  • Hard-coded rules capture only a slice.
  • ML classifiers detect them with delay.
  • A properly shaped RL agent learns them naturally.

This is the entire value of RL for market-making.

10. Example Diagram for Portfolio Presentation

You can paste this directly into any Markdown editor.

┌───────────────────────────────────────────┐
│        Synthetic Market Simulator         │
│        (microstructure + latency)         │
└─────────────────────┬─────────────────────┘
                      │
                      ▼
┌───────────────────────────────────────────┐
│            RL Training Engine             │
│         (PPO / A2C / Q-Learning)          │
└─────────────────────┬─────────────────────┘
                      │
                      ▼
┌───────────────────────────────────────────┐
│       Learned Behaviors & Policies        │
│        - toxicity patterns                │
│        - no-trade regimes                 │
│        - inventory handling               │
└─────────────────────┬─────────────────────┘
                      │
                      ▼
┌───────────────────────────────────────────┐
│  Production Execution System (C++/FPGA)   │
│        - Risk Constraints from RL         │
│        - Behavior Heuristics from RL      │
│        - Regime Flags from RL             │
└───────────────────────────────────────────┘

This conveys the hybrid “RL assists, but does not control execution” architecture.

11. Final Takeaway: RL Is Not Alpha—It’s Insight

Most people think RL will find trade signals.
It won’t.

What it will find is:

  • boundary conditions,
  • risk cliffs,
  • microstructure traps,
  • where liquidity is fake,
  • which volatility patterns precede adverse selection,
  • when to stop quoting,
  • when spreads look normal but danger is rising.

And that’s where most edges actually live.

A good market-maker thrives not by trading more, but by avoiding bad trades with discipline.

RL, when properly integrated, is the first system I've seen that can learn discipline automatically.

That’s the real innovation.