
Distributed Bandits in Production: Theory, Architecture, and Trade-offs

Mustafa Eskin · March 2026 · 18 min read

The single-machine illusion

A multi-armed bandit algorithm is deceptively simple. You have $K$ options — arms — each with an unknown reward distribution. You pull one, observe a reward, update your beliefs, and repeat. The goal is to maximize cumulative reward, or equivalently, minimize regret: the gap between what you earned and what you would have earned if you had always pulled the best arm.

On a single machine, this works beautifully. You run Thompson Sampling or UCB1, watch it converge, and the math lines up with the textbook. The algorithm sees every outcome the moment it happens. It updates immediately. Every decision reflects the full history of observations.
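
This loop is short enough to write out. Below is a minimal single-machine sketch of Beta-Bernoulli Thompson Sampling (function name and arm means are illustrative, not from any particular library). Note that every update lands before the next decision, which is precisely the assumption production will violate:

```python
import random

def thompson_sampling(true_means, rounds, seed=0):
    """Single-machine Beta-Bernoulli Thompson Sampling; returns cumulative regret."""
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1.0] * k  # Beta(1, 1) uniform prior: alpha counts successes + 1
    beta = [1.0] * k   # beta counts failures + 1
    best = max(true_means)
    regret = 0.0
    for _ in range(rounds):
        # Sample a plausible mean per arm from its posterior; pull the argmax.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        # The update is applied immediately, before the next decision.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        regret += best - true_means[arm]
    return regret
```

On a few thousand simulated rounds this converges toward the best arm, and the cumulative regret stays a small fraction of what uniformly random selection would incur.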

Then you try to put it into production.

Your service handles thousands of requests per second. It runs on multiple replicas behind a load balancer. Users don't give feedback instantly — some click after seconds, some after hours, some never. Training can't happen on every single event in real time. Suddenly, every assumption of the classical bandit model is violated.

This article is about what happens next — why distributed bandits are hard, what the theory says about the cost, and how to build systems that keep that cost small.

What is a distributed bandit?

A distributed bandit is what you get when a bandit algorithm operates across multiple machines that don't share state instantaneously. Instead of one agent making sequential decisions with immediate feedback, you have:

- many stateless selection replicas reading shared, possibly stale, parameters;
- training that runs in batches rather than after every event;
- feedback that arrives late, or not at all.

None of this exists in the textbook. The classical $K$-armed bandit assumes a single agent, sequential decisions, and immediate reward observation. A distributed bandit relaxes all three.

Why bother? Because the alternative — funneling every decision through a single synchronized process — doesn't scale. If your system serves 10,000 selections per second, a single-threaded bandit with a database round-trip per decision is a bottleneck, not a solution. Distribution is the price of operating at production scale.

The question is whether that price is affordable.

A brief theory primer

To understand what distribution costs, we need to define what "optimal" means in the first place.

The performance of a bandit algorithm is measured by regret — the cumulative difference between the best arm's expected reward and what the algorithm actually earned:

$$R_T = \sum_{t=1}^{T} \left(\mu^* - \mu_{A_t}\right)$$

where $\mu^*$ is the expected reward of the best arm, $A_t$ is the arm chosen at time $t$, and $T$ is the total number of rounds. Lower regret is better. Zero regret means you always picked the best arm, which is impossible without knowing the answer in advance.

Lai and Robbins [1] proved in 1985 that no algorithm can achieve regret better than:

$$\liminf_{T \to \infty} \frac{R_T}{\ln T} \geq \sum_{i:\, \Delta_i > 0} \frac{\Delta_i}{\mathrm{KL}(\nu_i, \nu^*)}$$

This is a fundamental lower bound. The $\Delta_i = \mu^* - \mu_i$ terms are the suboptimality gaps (how much worse each arm is), and $\mathrm{KL}$ is the Kullback-Leibler divergence between reward distributions. The bound says you must try suboptimal arms at least logarithmically many times to identify the best one. Exploration is not optional — it's information-theoretically required.
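
To make the bound concrete, here is the coefficient of $\ln T$ evaluated for Bernoulli arms (a sketch; the helper names are mine, not from any library):

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lai_robbins_constant(means):
    """Coefficient of ln T in the lower bound: sum of gap / KL over suboptimal arms."""
    best = max(means)
    return sum((best - m) / bernoulli_kl(m, best) for m in means if m < best)

# Two arms with means 0.50 and 0.45: regret must eventually grow at least
# lai_robbins_constant([...]) * ln T, roughly 10 * ln T here.
c = lai_robbins_constant([0.50, 0.45])
```

Smaller gaps make the constant larger: close arms are harder to distinguish, so the bound forces more exploration of them.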

The good news: algorithms like Thompson Sampling [2, 3] and UCB1 [4] match this bound. They are optimal in the single-agent, immediate-feedback setting. The question for distributed systems is: how much regret do we add on top?

Three ways distribution breaks the model

When you distribute a bandit, three classical assumptions break simultaneously. Each creates its own form of information loss, and understanding them separately is essential before reasoning about how they compose.

1. Stale parameters

In the textbook, the agent updates after every observation. In production, selection servers read parameters from a cache — Redis, an in-memory store, whatever — that refreshes on a schedule. Between refreshes, every decision is based on parameters that don't reflect the latest training.

If the training process handles $\lambda$ events per second and the cache TTL is $\tau$ seconds, the maximum staleness is roughly $\lambda \cdot \tau$ missed observations. With $\lambda = 50$ and $\tau = 60$, that's 3,000 observations the selection server hasn't seen.

Three thousand sounds like a lot. But staleness is really just a special case of the next problem.

2. Batched training

Even without caching, real training systems don't update after every single event. They accumulate a batch — maybe 5 seconds' worth of feedback — train on all of it at once, and write the updated parameters. This means the time horizon $T$ is effectively divided into $J$ batches, and within each batch the policy is frozen.

This is the batched bandit model. The key question: how many batches $J$ do you need to preserve optimal regret? If you need $J$ proportional to $T$ (one batch per round), you haven't gained anything. If you need far fewer, batching is essentially free.
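
The model is easy to sketch: within a batch the posterior is frozen and feedback is only queued; training happens at batch boundaries. A toy Beta-Bernoulli version (names are illustrative):

```python
import random

def batched_thompson(true_means, batch_sizes, seed=0):
    """Thompson Sampling in the batched model: the posterior is frozen within
    each batch, and accumulated feedback is applied only at batch boundaries."""
    rng = random.Random(seed)
    k = len(true_means)
    alpha, beta = [1.0] * k, [1.0] * k
    best = max(true_means)
    regret = 0.0
    for size in batch_sizes:
        pending = []  # feedback observed during the batch, not yet trained on
        for _ in range(size):
            samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
            arm = max(range(k), key=lambda i: samples[i])
            reward = 1 if rng.random() < true_means[arm] else 0
            pending.append((arm, reward))
            regret += best - true_means[arm]
        for arm, reward in pending:  # train on the whole batch at once
            alpha[arm] += reward
            beta[arm] += 1 - reward
    return regret
```

Running this with polynomially growing batch sizes, regret stays well below random selection, which is the batched-Thompson result in miniature.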

3. Delayed feedback

This is the most fundamental of the three. A user receives a recommendation at time $t$, but their feedback — a click, a purchase, a bounce — arrives at time $t + d_t$. The delay $d_t$ could be milliseconds or hours. During that window, the system keeps making decisions without knowing the outcome of round $t$.

Unlike staleness and batching, which are engineering choices you control, delay is an inherent property of the problem. You can't make users click faster. You can only decide how to handle the information gap.

What the theory says

The surprising result — and the reason distributed bandits work at all — is that the cost of these violations is bounded, quantifiable, and for the right algorithm, remarkably small.

Thompson Sampling shrugs off staleness

Kalkanli and Ozgur [5] studied Thompson Sampling in the batched setting and proved something striking:

Thompson Sampling achieves the same asymptotic regret in the batched setting as with instantaneous feedback, provided batch sizes grow subexponentially.

That means if batch sizes grow as any polynomial — $B_j = j^c$ for any constant $c$ — Thompson Sampling is still optimal. In production, where the training interval is fixed (say 5 seconds), batch sizes grow linearly with request rate, which is far slower than exponential. You're well within the safe zone.

The intuition: Thompson Sampling works by sampling from a posterior distribution. Even when the posterior is stale, the randomness in the sampling provides exploration. The algorithm doesn't need exact state to make good decisions — it needs uncertainty, and staleness provides plenty of it.

UCB is more fragile

UCB algorithms don't enjoy the same free pass. The exploration bonus in UCB1:

$$A_t = \arg\max_i \left( \hat{\mu}_i + \sqrt{\frac{2 \ln t}{N_i(t)}} \right)$$

depends on $N_i(t)$, the exact number of times arm $i$ has been pulled. When parameters are stale, the agent underestimates $N_i$ for recently-explored arms, inflating the confidence bound and causing redundant exploration.
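
The selection rule itself is a one-liner per arm. A sketch (not any library's API), including the usual convention that untried arms are pulled first:

```python
import math

def ucb1_select(means_hat, counts, t):
    """UCB1: choose the arm maximizing empirical mean plus exploration bonus.
    Arms with no pulls yet get infinite priority."""
    def score(i):
        if counts[i] == 0:
            return float("inf")
        return means_hat[i] + math.sqrt(2 * math.log(t) / counts[i])
    return max(range(len(counts)), key=score)
```

A stale, too-small `counts[i]` directly inflates the bonus term, which is exactly the redundant-exploration failure described above.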

UCB can achieve optimal regret in the batched setting, but it requires careful batch scheduling. Gao et al. [6] showed their BaSE (Batched Successive Elimination) algorithm needs $O(\log \log T)$ adaptively-sized batches. Esfandiari et al. [7] showed that $O(\log T)$ fixed batches give near-optimal regret with an overhead factor of $1 + K/J$. With 10 arms and 17,000 batches per day, that factor is 1.0006 — negligible, but the point stands: UCB needs more batches than Thompson Sampling to achieve the same guarantees.

Delay adds a bounded, additive cost

For Thompson Sampling with $K$ arms and average delay $d$, Agrawal and Goyal [3] showed the regret becomes:

$$R_T = O\left(\sqrt{KT \ln K} + Kd\right)$$

The $Kd$ term is additive and constant with respect to $T$. Ten arms, 30-second average delay: that's a one-time cost of ~300 in cumulative regret over millions of rounds. It doesn't grow.

For adversarial policies like EXP3, the story is worse. Li et al. [9] found a multiplicative cost hidden inside the square root of the delay bound. Adversarial algorithms pay more for the same latency — they need tighter feedback loops.

How the costs compose

In a real system, all three violations happen at once. For Thompson Sampling, the total regret decomposes as:

$$R_T^{\text{dist}} = \underbrace{R_T^{\text{oracle}}}_{\text{ideal algorithm}} + \underbrace{R_T^{\text{batch}}}_{\text{batching cost}} + \underbrace{R_T^{\text{delay}}}_{\text{delay cost}}$$

With concrete numbers — 10 arms, $\Delta_{\min} = 0.01$, 10M rounds, 5s training interval, 30s average delay:

| Component | Regret | % of total |
| --- | --- | --- |
| Oracle (Lai-Robbins) | ~16,000 | 98.1% |
| Batching overhead | ~8 | 0.05% |
| Delay overhead | ~300 | 1.8% |
| Total distributed | ~16,308 | 100% |
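
The arithmetic behind the table is just composition of the component values quoted in the text; a quick check:

```python
# Component regrets from the worked example in the text.
oracle, batching, delay = 16000, 8, 300
total = oracle + batching + delay

oracle_share = oracle / total
overhead_share = (batching + delay) / total  # the "cost of distribution"
```

The overhead share comes out just under 2%, with the oracle term dominating at roughly 98% of the total.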

The overhead of distribution is less than 2%. The dominant cost is the information-theoretic minimum — the regret you'd pay even with a perfect, instantaneous, single-machine implementation.

Common pitfalls in production

Theory gives you bounds. Production gives you surprises. Here are the failure modes that don't show up in the math but will show up in your metrics.

The lost-update problem

If two training workers process the same experiment concurrently, you get the classic read-modify-write race: worker A reads parameters, worker B reads the same (now stale) parameters, both train, and whichever writes last silently overwrites the other's work. One batch of training vanishes.

This doesn't just add regret — it makes regret unpredictable. The system appears to work, metrics look plausible, but convergence is slower than it should be, and no error is ever raised. It's the worst kind of bug: the silent kind.
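
If you can't serialize writers, the standard defense is optimistic concurrency: each write carries the version it was read at, and a mismatched version is rejected instead of silently clobbered. A toy sketch (class and method names are mine, not qbrix's API):

```python
class ParamStore:
    """Toy parameter store with optimistic concurrency control: a
    version-checked write rejects the losing half of a read-modify-write race."""
    def __init__(self, params):
        self.params = params
        self.version = 0

    def read(self):
        return dict(self.params), self.version

    def write(self, params, expected_version):
        if expected_version != self.version:
            return False  # someone else wrote first; caller must re-read and retry
        self.params = params
        self.version += 1
        return True

store = ParamStore({"alpha": 1.0, "beta": 1.0})

# Workers A and B both read version 0...
params_a, v_a = store.read()
params_b, v_b = store.read()

# A trains and writes; B's stale write is rejected instead of silently
# overwriting A's training.
params_a["alpha"] += 5
assert store.write(params_a, v_a)
params_b["alpha"] += 3
assert not store.write(params_b, v_b)
```

The losing worker must re-read and retrain on its batch, which costs throughput; serializing writers per experiment avoids the retry entirely.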

Feedback loss and double-counting

Feedback events flow through a pipeline — a message queue, a consumer, a database write. If the consumer crashes mid-batch and events aren't acknowledged properly, you either lose feedback (regret goes up) or reprocess it (the algorithm double-counts observations, biasing parameter estimates).

For Thompson Sampling with a Beta posterior, double-counting a success inflates the alpha parameter, making the posterior overconfident. The algorithm stops exploring an arm it should still be testing. For UCB, double-counting inflates pull counts, deflating the confidence bound — the opposite problem, causing under-exploration.

Both are bad. Both are silent. Both require exactly-once or at-least-once-with-idempotency semantics in your event pipeline.
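
Idempotency is cheap to sketch: deduplicate by event ID before applying the posterior update, so an at-least-once pipeline behaves effectively-once. A toy version for Beta posteriors (names are illustrative):

```python
def apply_feedback(posterior, seen_ids, event_id, arm, reward):
    """Idempotent update: deduplicate by event ID so redelivered events
    (at-least-once delivery) are counted exactly once."""
    if event_id in seen_ids:
        return False  # a redelivery, not new information
    seen_ids.add(event_id)
    alpha, beta = posterior[arm]
    posterior[arm] = (alpha + reward, beta + (1 - reward))
    return True

posterior = {"a": (1.0, 1.0)}
seen = set()
apply_feedback(posterior, seen, "evt-1", "a", 1)
apply_feedback(posterior, seen, "evt-1", "a", 1)  # duplicate: ignored
```

In a real pipeline the seen-ID set would be bounded (a TTL'd store or a sliding window keyed by stream offset), but the invariant is the same: one event ID, one posterior update.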

Cold-start dead zones

A new experiment starts with uninformative priors — uniform Beta(1,1) for Thompson Sampling. Until the first training cycle completes (events selected → feedback arrives → batch accumulates → training runs → parameters written → cache refreshes), every selection is essentially random.

This cold-start window is typically 1–2 minutes. For high-traffic experiments it's fine — you burn through the random phase quickly. For low-traffic experiments, the cold-start period can dominate the experiment's entire lifetime. If an experiment gets 10 events per day, even a 2-minute random phase is irrelevant — but reaching statistical significance takes so long that the cold start is the least of your problems.

Non-stationary rewards

The theoretical results above assume stationary reward distributions — arm means don't change over time. In practice, they do. User preferences shift, inventory changes, seasonality kicks in. When $\mu_i$ drifts, stale parameters track the wrong distribution, and the regret scales with the product of staleness and drift rate.

Sliding-window variants help (only train on recent data), but they introduce a new tradeoff: shorter windows adapt faster to drift but have higher variance from fewer samples. There's no free lunch here — you're trading staleness-robustness for drift-sensitivity, and the right balance depends on your domain.
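
A sliding-window posterior makes the trade-off explicit: the window length is the single knob trading staleness-robustness against drift-sensitivity. A minimal sketch for one Bernoulli arm (class name is mine):

```python
from collections import deque

class SlidingWindowBeta:
    """Beta posterior over only the last `window` observations: adapts to
    drift at the cost of higher variance from fewer samples."""
    def __init__(self, window):
        self.window = window
        self.recent = deque()
        self.successes = 0

    def update(self, reward):
        self.recent.append(reward)
        self.successes += reward
        if len(self.recent) > self.window:
            self.successes -= self.recent.popleft()  # oldest observation expires

    def params(self):
        n = len(self.recent)
        return 1 + self.successes, 1 + (n - self.successes)  # Beta(1, 1) prior
```

With `window=3`, after observing 1, 1, 1, 0, 0 only the last three outcomes count, so the posterior is Beta(2, 3) — the early successes have already expired.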

Contextual bandits at scale

LinUCB and LinTS maintain per-arm design matrices $D_i \in \mathbb{R}^{d \times d}$. These are $O(d^2)$ parameters per arm, and stale design matrices mean the confidence ellipsoid doesn't reflect recent observations. The impact scales with feature dimension $d$, and tight bounds for distributed contextual bandits remain an open theoretical problem.

In practice, contextual bandits work — the diversity of contexts provides a form of natural regularization — but you should expect higher sensitivity to staleness as dimensionality increases.

How qbrix handles this

Qbrix was designed around these constraints from the start. The architecture separates the hot path (selection) from the cold path (training), keeping the theoretical assumptions approximately satisfied while operating at production scale.

Hot path: stateless, cached selection

Selection servers are stateless replicas that read parameters from an in-memory TTL cache. When the cache is fresh, selection is a local memory read plus a posterior sample — microseconds. When the TTL expires, the server returns a result from stale parameters immediately (no blocking) and refreshes asynchronously in the background. This is the stale-while-revalidate pattern, applied to bandit parameters.

The only blocking I/O happens on cold start — the first request for a new experiment on a new replica. Every subsequent request is served from memory. This means selection latency is decoupled from training latency, which is exactly what you want: slow training should not slow down decisions.
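
The hot-path cache can be modeled in a few lines. This toy version (not qbrix's actual implementation) serves stale entries immediately and merely records that a background refresh is due; only the very first read of a key blocks:

```python
import time

class StaleWhileRevalidateCache:
    """Toy stale-while-revalidate cache: expired entries are still served
    immediately, and a refresh is flagged for a background task to pick up."""
    def __init__(self, loader, ttl_seconds):
        self.loader = loader          # blocking fetch, used only on cold start
        self.ttl = ttl_seconds
        self.entries = {}             # key -> (value, fetched_at)
        self.needs_refresh = set()

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        if key not in self.entries:
            # Cold start: the only blocking load.
            self.entries[key] = (self.loader(key), now)
        value, fetched_at = self.entries[key]
        if now - fetched_at > self.ttl:
            self.needs_refresh.add(key)  # serve stale, refresh out of band
        return value

    def refresh(self, key, now=None):
        now = time.monotonic() if now is None else now
        self.entries[key] = (self.loader(key), now)
        self.needs_refresh.discard(key)
```

In a real service the `needs_refresh` set would be drained by a background task; here it just makes the non-blocking behavior observable.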

Cold path: single-writer training

The training service runs as a single instance consuming from a durable event stream (Redis Streams with consumer groups and explicit acknowledgment). A dispatcher routes events to per-experiment queues, and a pool of async workers trains experiments in parallel — but each experiment is trained by at most one worker at a time.

This single-writer constraint eliminates the lost-update problem entirely. It's a correctness guarantee, not a performance limitation — the bottleneck in training is almost never CPU, it's waiting for feedback to arrive.

Feedback integrity: signed tokens

At selection time, the full context — tenant, experiment, arm, features, timestamp — is serialized and signed with HMAC-SHA256. This token is returned to the client as the request ID. When feedback arrives, the token is decoded and verified, recovering everything needed to create a training event.

No server-side session storage. No cross-replica state. No database lookup to correlate selections with feedback. The token is the state. This eliminates an entire class of feedback-loss bugs — if the token verifies, the feedback event is complete and correct.
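
The token scheme is standard HMAC practice and easy to sketch with the Python standard library (the key and payload fields here are placeholders, not qbrix's actual wire format):

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # illustration only; a real key comes from config

def sign_context(context, secret=SECRET):
    """Serialize the selection context and attach an HMAC-SHA256 tag."""
    payload = json.dumps(context, sort_keys=True).encode()
    tag = hmac.new(secret, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode() + "." +
            base64.urlsafe_b64encode(tag).decode())

def verify_token(token, secret=SECRET):
    """Recover the context if and only if the tag verifies; None otherwise."""
    payload_b64, tag_b64 = token.split(".")
    payload = base64.urlsafe_b64decode(payload_b64)
    tag = base64.urlsafe_b64decode(tag_b64)
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None  # tampered, or signed with a different key
    return json.loads(payload)

token = sign_context({"experiment": "exp-1", "arm": 2, "ts": 1700000000})
```

Verification recovers the full selection context with no server-side lookup; a token altered in transit, or signed under a different key, simply fails to verify.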

Durable event pipeline

Feedback flows through Redis Streams with consumer groups. Events are explicitly acknowledged after processing. If the training service crashes, unacknowledged events are redelivered to another consumer in the group. Combined with idempotent training (deduplication by event ID), this gives effectively-once processing semantics.

Policy-aware defaults

Because Thompson Sampling is the most staleness-robust algorithm (per the theoretical results above), qbrix defaults to it for stochastic settings. For adversarial environments where EXP3 is appropriate, the system automatically applies shorter cache TTLs and faster training cycles — encoding the theoretical sensitivity into operational defaults so you don't have to tune it yourself.

The gap is smaller than you think

The central takeaway is quantitative: for Thompson Sampling in a well-architected distributed system, the gap between the theoretical optimum and what you actually get is less than 2% of the oracle regret. Distribution doesn't break bandits — it adds a small, bounded, and predictable cost.

This isn't an accident. Thompson Sampling was designed in 1933 as a Bayesian algorithm that naturally handles uncertainty. Distribution adds uncertainty about the current state of the world — stale parameters, missing feedback, batched updates — and Thompson Sampling's response to uncertainty is to explore. That's exactly what you want when your information is incomplete.

The engineering challenge is keeping the theoretical assumptions approximately true: bound staleness with cache TTLs, prevent feedback loss with durable streams, avoid lost updates with single-writer training, and eliminate coordination overhead with signed tokens. Each architectural choice maps to a specific theoretical requirement.

The math works. The systems work. The gap is smaller than you think.


References

[1] T. L. Lai and H. Robbins, "Asymptotically efficient adaptive allocation rules," Advances in Applied Mathematics, vol. 6, no. 1, pp. 4–22, 1985.

[2] W. R. Thompson, "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples," Biometrika, vol. 25, no. 3–4, pp. 285–294, 1933.

[3] S. Agrawal and N. Goyal, "Analysis of Thompson Sampling for the multi-armed bandit problem," in Proc. COLT, 2012. arXiv:1111.1797.

[4] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multiarmed bandit problem," Machine Learning, vol. 47, no. 2–3, pp. 235–256, 2002.

[5] C. Kalkanli and A. Ozgur, "Asymptotic performance of Thompson Sampling in the batched multi-armed bandits," in Proc. IEEE ISIT, 2021. arXiv:2110.00158.

[6] Z. Gao, Y. Han, Z. Ren, and Z. Zhou, "Batched multi-armed bandits problem," in Proc. NeurIPS, 2019. arXiv:1904.01763.

[7] H. Esfandiari, A. Karbasi, A. Mehrabian, and V. Mirrokni, "Regret bounds for batched bandits," in Proc. AAAI, 2021. arXiv:1910.04959.

[8] S. Guha, K. Munagala, and M. Pal, "Multi-armed bandit problems with delayed feedback," arXiv:1011.1161, 2010.

[9] Y. Li, J. Guo, Y. Li, T. Wang, and W. Jia, "Adversarial bandits with multi-user delayed feedback: Theory and application," arXiv:2310.11188, 2023.