There is a story I wanted to be able to tell honestly before I told it publicly, and it goes something like this: a production multi-armed bandit, running over a real distributed system with all of its staleness and batching and network hops, can outperform a traditional A/B test by a margin that is not only statistically significant, but operationally consequential.
I believed this was true. The literature says it should be. But the qbrix architecture is an interesting test case, because it pays more than one form of overhead. Selection happens on one fleet of stateless services, training happens on another, and the parameters that connect them live in Redis — cached, refreshed on a schedule, occasionally stale by several seconds. Feedback travels through a message queue and is processed in batches. None of that is free in regret terms, and I have spent enough time with the theoretical bounds to know exactly how it is supposed to cost me.
On top of that, qbrix's default policy for an experiment where you do not specify otherwise is not a single textbook algorithm. It is auto, which inside the system resolves to a meta-bandit — EXP3 running at the meta-level, adaptively routing traffic across a portfolio of underlying learners. The design is intentional: users should not have to know whether their problem is stationary, whether the reward distribution is well-behaved, or whether they need protection against drift. The meta-bandit figures it out. But that flexibility costs regret too — the meta-level selection has its own exploration cost on top of whatever the chosen learner is already paying.
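qbrix's internals are not shown in this post, so what follows is a minimal sketch of EXP3 at the meta-level, under assumed names (`learners`, `pull`) rather than the production implementation. The importance-weighted update is what gives EXP3 its guarantee, and the `gamma` mixing term is exactly the meta-level exploration tax described above:

```python
import math
import random

def exp3_meta(learners, pull, T, gamma=0.1, seed=0):
    """Minimal EXP3 sketch at the meta-level: each round, route the
    request to one underlying learner; pull(name) must return a
    reward in [0, 1]."""
    rng = random.Random(seed)
    K = len(learners)
    weights = [1.0] * K
    total = 0.0
    for _ in range(T):
        wsum = sum(weights)
        # Mix the exponential weights with uniform exploration.
        probs = [(1 - gamma) * w / wsum + gamma / K for w in weights]
        i = rng.choices(range(K), weights=probs)[0]
        r = pull(learners[i])
        total += r
        # Importance-weighted estimate: only the routed learner's
        # weight moves, scaled by 1/probs[i] to stay unbiased.
        weights[i] *= math.exp(gamma * (r / probs[i]) / K)
        # Renormalize so weights stay in floating-point range.
        m = max(weights)
        weights = [w / m for w in weights]
    return total
```

The `1/probs[i]` scaling is the part that costs regret: rarely-chosen learners get large updates when they are chosen, which keeps every learner's estimate honest but forces the meta-level to keep paying for exploration.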
So this is the claim I wanted to test: qbrix's default configuration, running through the real distributed stack, with the meta-level tax included, still beats a clean A/B test by enough to matter. No shortcuts. No in-process bandit. No assumption that the right algorithm was chosen in advance.
I built a simulator to find out. The results are surprising in the direction of the prior: the gap is larger than I expected, which is not usually how these things go.
The setup is deliberately unglamorous. A binary-reward environment with three arms: one genuinely better at 12% conversion, one genuinely worse at 9.5%, and one at the 10% baseline. This is the shape of an optimization problem that is common and hard in equal measure: a 20% relative lift buried in Bernoulli noise, where roughly nine out of ten users do not convert on any arm. Sample size was derived from a two-proportion z-test at α = 0.05 and 80% power, which lands at 4,648 users per arm, or 13,944 users per full run. Twenty independent runs per strategy, each with its own seed, each fully reproducible.
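The per-arm figure can be sanity-checked with the standard closed-form two-proportion calculation. Different variants of the test (pooled versus unpooled variance, continuity corrections) land at somewhat different numbers, so this sketch reproduces the method rather than the exact 4,648:

```python
import math
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided two-proportion z-test,
    using the unpooled-variance approximation. Pooled-variance and
    continuity-corrected variants give somewhat larger n."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_a + z_b) ** 2 * var / (p1 - p2) ** 2)

n = n_per_arm(0.10, 0.12)  # baseline vs the 12% arm
```

The unpooled formula gives roughly 3,800 here; the benchmark's 4,648 presumably comes from a more conservative variant of the same test. Either way, the order of magnitude is the point: detecting a two-point absolute difference in a ~10% conversion rate takes thousands of users per arm.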
Three strategies competed. A uniform random baseline, which is what decision-making looks like with no learning at all — it is there to calibrate what the noise floor of the metric actually is. A fixed-split A/B test, which is the industry convention: allocate evenly across all three arms for the full duration of the experiment, then declare a winner based on observed means. And qbrix, configured with the auto policy and nothing else, running through a real deployment — every selection an HTTP call to the proxy service, routed via gRPC to the motor service, served from the Redis parameter cache, with every feedback event published to a stream and consumed by the cortex training service in batches. This is not a lab-bench run; it is the production stack under simulated load.
What came back was this.
| Strategy | Mean conversions | 95% CI | Mean regret | Best arm found |
|---|---|---|---|---|
| Uniform random | 1,436.2 | [1,421.5, 1,450.9] | 231.9 | 50% |
| Fixed-split A/B | 1,447.5 | [1,432.4, 1,462.5] | 232.4 | 100% |
| qbrix auto | 1,522.8 | [1,502.4, 1,543.2] | 153.4 | 95% |
The confidence intervals on reward do not touch. A Welch's t-test on the difference between qbrix and A/B returns t = 5.82 at 35 degrees of freedom, which places the probability of observing this gap by chance somewhere in the region of one in a million. In practical terms, the result is unambiguous: qbrix delivered 75 more conversions per run, a 5.21% relative lift in total reward, with enough statistical separation that no amount of re-running the benchmark is going to change the answer.
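The per-run totals behind the table are not reproduced here, but the test itself is compact enough to sketch. Welch's version does not assume equal variances, which is why the degrees of freedom come out below the pooled 38; the df is the Welch-Satterthwaite estimate:

```python
from statistics import mean, variance

def welch_t(xs, ys):
    """Welch's t-statistic and Welch-Satterthwaite degrees of
    freedom for two independent samples with unequal variances."""
    nx, ny = len(xs), len(ys)
    vx, vy = variance(xs), variance(ys)  # sample variances (n - 1)
    se2 = vx / nx + vy / ny              # squared standard error
    t = (mean(xs) - mean(ys)) / se2 ** 0.5
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df
```

Feeding in the twenty per-run totals for each strategy returns the statistic directly; whenever the two variances differ, the df lands strictly between 19 and 38.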
The more interesting number, though, is regret — the cumulative gap between what each strategy earned and what it would have earned if it had known the best arm from the start and committed to it immediately. Regret is the quantity bandit theory is built around, because it is the quantity that ties an algorithm's behavior to what it actually costs you to run.
A/B testing paid 232.4 regret per run. Not on average — exactly. The standard deviation across twenty independent runs was on the order of 10⁻¹⁴, which is the numerical floor of double-precision floating point. Every run paid the same number to three decimal places.
This is not a property of the simulator. It is the defining property of fixed-split A/B testing, and I think it deserves more attention than it usually gets. When you commit to an equal allocation for the duration of the experiment, regret is a deterministic function of the arm means and the sample size. It does not depend on the data you observe midway through. The allocation is frozen at the start and the regret is paid in full regardless — because that is what the protocol requires.
A/B testing does not have variance on regret because A/B testing does not learn from itself during the experiment. It runs the plan until the plan is done, and the cost is the cost.
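That determinism is visible directly in the arithmetic: the expected regret of a fixed equal split is a function of arm means and counts alone, and no observed data appears anywhere in it. A sketch using the benchmark's arm means; note that this back-of-envelope definition gives about 209 rather than the benchmark's 232.4, which uses its own regret accounting, and the point here is the zero variance, not the level:

```python
def fixed_split_regret(means, n_per_arm):
    """Expected cumulative regret of an equal fixed allocation.
    Nothing observed during the experiment enters the formula,
    so every run pays exactly the same amount."""
    best = max(means)
    return sum(n_per_arm * (best - m) for m in means)

# Identical on every seed, every run: the allocation never reacts.
r = fixed_split_regret([0.12, 0.10, 0.095], 4648)
```

Run this under any seed you like: the output never changes, because there is no randomness for it to depend on. That is the 10⁻¹⁴ standard deviation, explained.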
qbrix, on the same twenty runs, paid 153.4 regret on average, 95% CI [143.7, 163.0]. That is a 34% reduction, with a t-statistic of -16.0 on the difference. The variance around qbrix's regret is real and it is informative — some runs pay 114, others pay 195, depending on whether the early draws were kind. That variance is the algorithm noticing things: occasionally it catches a bad early sequence and spends a few more trials before committing, which it then recovers from because the meta-bandit is continuously shifting traffic away from whichever underlying learner is underperforming in the moment. On most runs it moves quickly. On a few it moves slowly. On all of them it moves, which is the part that matters.
The 34% gap is what I care about, because it is the gap that survived the distribution tax, the meta-level exploration tax, and the decision to not tell the system in advance which algorithm was right for the problem. It is what is left after the theory's worst-case overheads have already been priced in.
Translating that regret number into business terms is straightforward, and I think it is worth doing explicitly, because percentages are easy to discount and dollars are not.
Seventy-five extra conversions per 13,944 users, at a modest $50 average order value, is $3,767 of additional revenue per experiment cycle. That is a single 13,944-user run. But product teams do not run a single experiment and stop; optimization is continuous, and the relevant frame is monthly traffic.
Here is what the same 0.54-point conversion lift looks like at different scales, assuming the same $50 AOV:
| Monthly decisions | Extra conversions / month | Extra revenue / month | Extra revenue / year |
|---|---|---|---|
| 100,000 | 540 | $27,000 | $324,000 |
| 1,000,000 | 5,400 | $270,000 | $3.24M |
| 10,000,000 | 54,000 | $2.7M | $32.4M |
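The table rows are two multiplications. A sketch, assuming the 0.54-point lift and the $50 AOV transfer to your traffic (both are assumptions: the lift was measured on this benchmark, not on your funnel):

```python
def revenue_lift(monthly_decisions, lift_pp=0.0054, aov=50.0):
    """Extra conversions and revenue implied by a conversion-rate
    lift of `lift_pp`, expressed as a fraction (0.0054 = 0.54
    percentage points), at a given average order value."""
    extra_conversions = monthly_decisions * lift_pp
    monthly_revenue = extra_conversions * aov
    return extra_conversions, monthly_revenue, 12 * monthly_revenue

# At 1M decisions/month: ~5,400 extra conversions, ~$270k/month.
```

Substituting your own monthly volume and AOV is the whole exercise; the lift parameter is the only number the benchmark supplies.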
These are not projections. They are arithmetic on the measured conversion gap. A company doing a million decisions a month picks up $270,000 of additional revenue — extracted from decisions the A/B test was already making, just worse. At ten million decisions a month, which is consumer-scale traffic, it is $2.7M monthly or $32.4M annualized. Same users, same funnel, same engineering team, same product. The only thing that changed was how traffic was allocated during the learning period.
These numbers assume nothing heroic about the qbrix deployment. The benchmark included every production overhead I could think to include: network latency between services, Redis cache staleness bounded by TTL, asynchronous feedback processing with a batching interval, and the meta-level EXP3 exploration cost stacked on top of whichever underlying learner was being tried. The 34% regret reduction is the net of all of it.
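For intuition on what batched, delayed feedback does to a learner, here is a toy harness, with hypothetical names rather than qbrix's actual code, that withholds rewards from the learner until a batch boundary, so its view of the world is always slightly stale:

```python
import random
from collections import deque

def run_with_batched_feedback(select, update, arm_probs, T, batch=50, seed=0):
    """Toy harness: rewards reach the learner only at batch
    boundaries, so between flushes it decides on stale parameters.
    A crude stand-in for the queue + batch-trainer + cached-parameter
    path described above."""
    rng = random.Random(seed)
    pending = deque()
    total = 0
    for t in range(T):
        arm = select(t)                    # decided on stale state
        r = 1 if rng.random() < arm_probs[arm] else 0
        total += r
        pending.append((arm, r))
        if len(pending) >= batch:          # flush: training catches up
            while pending:
                update(*pending.popleft())
    return total
```

Any select/update pair can be plugged in. The learner still converges; it just pays some extra regret while its parameters lag the stream, which is exactly the overhead the benchmark prices in.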
There is one counter-argument people reach for when confronted with numbers like these, and it shows up in the benchmark too: the fixed A/B split identified the correct best arm in 100% of its runs, while qbrix identified it in 95%. In one run out of twenty, qbrix ended the experiment slightly favoring the wrong variant. The A/B test did not.
This is usually framed as a reliability argument, and in a narrow sense it is accurate. But I think it is the wrong frame. The A/B test's 100% identification rate comes bundled with the guaranteed 232.4 regret cost, paid in full on every run — including every run where it "correctly" identified the winner. What you are paying for, with A/B testing, is not reliability of outcome; it is uniformity of outcome, at a price that is uniformly high. qbrix's 95% identification rate comes bundled with one-third less cumulative regret on every run, which is a very different trade.
The framing I prefer is that A/B testing buys certainty about the declared label at the cost of certainty about the regret budget. qbrix buys a small amount of label uncertainty in exchange for a meaningful reduction in waste. If the question is "which tool should make a regulatory decision that has to be defensible in court," A/B testing is the honest answer. If the question is "which tool should allocate traffic through a product funnel that is serving users continuously," the trade is, I think, not close.
And 95% identification, on a problem with an effect this small, is a level of accuracy most real experiments would be happy to achieve at any regret cost.
The part of this benchmark that was genuinely instructive for me, separate from the headline numbers, was watching the qbrix regret distribution and realizing how much of the story it tells. A/B testing's zero-variance regret is not a quirk — it is the consequence of a protocol that was designed in a world where experiments were discrete, expensive events run sequentially, one at a time, and the goal was to produce a single clean answer. In that world, paying a fixed regret budget in exchange for a crisp p-value was a sensible trade.
The world in which qbrix is deployed is not that world. Decisions arrive one user at a time. Every request is simultaneously an opportunity to learn and an opportunity to waste. A protocol that treats request 13,000 the same as request 1, regardless of everything it has observed in between, is failing to use signal it has already paid for. That is the cost. It shows up deterministically every time. And it compounds every month the system runs.
The qbrix architecture was built on the premise that continuous learning, under all the real constraints of a distributed system, would produce better outcomes than discrete testing under idealized ones. The benchmark numbers I have shown here — 75 more conversions per run, a 34% reduction in regret, a 5.21% lift in total reward, confidence intervals that do not touch — are the empirical form of that premise. The theory said the gap should be there. The data says it is larger than I expected, and it survives every overhead the system imposes.
If you are running a product experiment right now with a fixed 50/50 or 33/33/33 allocation, multiply your monthly volume by a half-percentage-point of additional conversion rate, multiply that by your average order value, and look at the number. It is usually a number that changes what you do next quarter.
The gap is not a claim; it is a measurement.