
A Developer's Guide to Multi-Armed Bandits

Mustafa Eskin · March 2026 · 14 min read

You have five checkout button variants. You need to figure out which one converts best. The traditional approach — an A/B test — splits traffic equally across all five, waits for statistical significance, and then ships the winner. Simple, but wasteful: during the entire experiment, 80% of your traffic is being shown suboptimal variants by design.

Multi-armed bandits are a family of algorithms that solve this differently. Instead of fixed traffic splits, they learn while they serve — dynamically shifting traffic toward better-performing options as evidence accumulates, without ever formally "ending" the experiment.

This guide walks through the core algorithms, their mathematical intuition, and Python implementations. By the end, you'll understand not just what each algorithm does, but when to reach for it.


The problem, formally

Imagine a row of slot machines — "one-armed bandits" — each with an unknown payout distribution. You have a limited budget of pulls. Your goal: maximize your total reward.

At each step, you face a choice. Stick with the arm that currently looks best (exploit) and you might miss a better option you haven't sampled enough. Try other arms too often (explore) and you waste pulls on options you've already identified as bad. This is the exploration-exploitation tradeoff — the central tension in sequential decision-making under uncertainty.

The formal objective is to minimize cumulative regret: the difference between the total reward you could have earned (by always pulling the best arm) and what you actually earned with your strategy.

R_T = T \cdot \mu^* - \sum_{t=1}^{T} \mu_{a_t}

where \mu^* is the true mean of the best arm, and \mu_{a_t} is the true mean of the arm chosen at time t.

A good bandit algorithm keeps regret growing sublinearly — meaning the per-step penalty shrinks toward zero as the algorithm learns.
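To make the definition concrete, here is a small sketch (the arm means are made up) computing the regret of two extreme policies: a uniform-random policy, whose regret grows linearly, and an oracle that always pulls the best arm, whose regret is zero.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.05, 0.08, 0.12])  # hypothetical true arm means
best = true_means.max()
T = 10_000

# Uniform-random policy: never learns, so per-step regret stays constant
arms = rng.integers(0, len(true_means), size=T)
regret_random = T * best - true_means[arms].sum()

# Oracle policy: always pulls the best arm, so regret is exactly zero
regret_oracle = T * best - T * best

print(regret_random / T)  # per-step regret stays near best - mean(true_means)
print(regret_oracle)
```

A good algorithm sits between these extremes: its per-step regret R_T / T shrinks toward zero as it learns.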


Three families of bandit algorithms

Bandit algorithms fall into three broad categories, defined by their assumptions about how rewards are generated:

Stochastic bandits assume each arm's rewards come from a fixed (but unknown) probability distribution. The environment is stationary — arm quality doesn't change over time. This covers most product optimization scenarios: conversion rates, click-through rates, revenue per session.

Contextual bandits extend stochastic bandits by incorporating features about the current decision. Instead of a single global "best arm," the optimal choice depends on who the user is, what page they're on, or what time of day it is. This is personalization.

Adversarial bandits make no assumptions about reward generation at all. Rewards can be chosen by an adversary trying to maximize your regret. These algorithms are robust to non-stationary environments, concept drift, and worst-case scenarios.

Let's build up from the simplest to the most sophisticated.


Stochastic bandits

Epsilon-Greedy: the baseline

The simplest algorithm that actually works. With probability \epsilon, pick a random arm (explore). Otherwise, pick the arm with the highest observed mean reward (exploit).

import numpy as np
import random
 
class EpsilonGreedy:
    def __init__(self, n_arms: int, epsilon: float = 0.1, decay: float = 0.0):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.decay = decay
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
 
    def select(self) -> int:
        if random.random() < self.epsilon:
            return random.randrange(self.n_arms)
        return int(np.argmax(self.values))
 
    def update(self, arm: int, reward: float):
        self.counts[arm] += 1
        # incremental mean update
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
        # decay exploration rate
        self.epsilon *= (1 - self.decay)

When to use it: Epsilon-Greedy is the algorithm you reach for when you need something working in ten minutes. It handles any reward type (binary, bounded, continuous), has one parameter to tune (\epsilon), and is trivial to implement.

The catch: exploration is undirected. It explores arms uniformly at random, regardless of how much or how little is known about each one. An arm with 10,000 observations gets explored just as often as one with 3. This makes it sample-inefficient compared to smarter alternatives.

The optional decay parameter (\gamma) lets you anneal exploration over time — starting broad and narrowing as confidence grows.
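A quick simulated run (the conversion rates are invented for illustration) shows the behavior: despite undirected exploration, the running means steer the exploit step toward the best arm, which ends up receiving the bulk of the pulls.

```python
import numpy as np

rng = np.random.default_rng(42)
true_rates = np.array([0.03, 0.05, 0.11])  # hypothetical conversion rates
counts = np.zeros(3)
values = np.zeros(3)
epsilon = 0.1

for _ in range(20_000):
    # explore with probability epsilon, otherwise exploit the best estimate
    if rng.random() < epsilon:
        arm = int(rng.integers(3))
    else:
        arm = int(np.argmax(values))
    reward = float(rng.random() < true_rates[arm])  # Bernoulli conversion
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

print(counts / counts.sum())  # most pulls end up on the 0.11 arm
```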


Thompson Sampling: the Bayesian approach

Thompson Sampling is elegant in both concept and practice. For each arm, maintain a probability distribution representing your belief about its true reward rate. At each step, sample from each arm's distribution and pick the arm whose sample is highest.

For binary rewards (click / no-click), the natural choice is the Beta distribution:

class ThompsonSampling:
    def __init__(self, n_arms: int, alpha_prior: float = 1.0, beta_prior: float = 1.0):
        self.n_arms = n_arms
        # Beta(1, 1) is a uniform prior; alpha accumulates successes, beta failures
        self.alpha = np.full(n_arms, alpha_prior)
        self.beta = np.full(n_arms, beta_prior)
 
    def select(self) -> int:
        # sample from each arm's posterior
        samples = np.random.beta(self.alpha, self.beta)
        return int(np.argmax(samples))
 
    def update(self, arm: int, reward: float):
        # binary: reward > 0.5 counts as success
        if reward > 0.5:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

The beauty of Thompson Sampling is that exploration happens automatically through the variance of the posterior distribution. An arm with few observations has a wide distribution — its samples will occasionally be very high, causing it to be explored. An arm with many observations has a narrow distribution concentrated around its true mean — it gets exploited when it's genuinely good and ignored when it's not.

No exploration parameter to tune. No schedule to design. The posterior does the work.
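You can see this mechanism directly by sampling from two Beta posteriors with the same observed success rate but very different amounts of evidence (the counts below are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Both arms show a ~20% success rate, but with very different sample sizes
few = rng.beta(1 + 2, 1 + 8, size=100_000)       # 2 successes in 10 trials
many = rng.beta(1 + 200, 1 + 800, size=100_000)  # 200 successes in 1000 trials

print(f"posterior std, 10 trials:   {few.std():.3f}")
print(f"posterior std, 1000 trials: {many.std():.3f}")
print(f"P(sample > 0.3): few={np.mean(few > 0.3):.2f}, many={np.mean(many > 0.3):.4f}")
```

The under-sampled arm still produces optimistic draws often enough to earn more trials; the well-sampled one almost never does.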

For continuous rewards, swap the Beta distribution for a Gaussian posterior:

class GaussianThompsonSampling:
    def __init__(self, n_arms: int, prior_mean: float = 0.0,
                 prior_precision: float = 1.0, noise_precision: float = 1.0):
        self.n_arms = n_arms
        self.noise_precision = noise_precision
        self.mean = np.full(n_arms, prior_mean)
        self.precision = np.full(n_arms, prior_precision)
 
    def select(self) -> int:
        samples = [
            np.random.normal(self.mean[i], 1.0 / np.sqrt(self.precision[i]))
            for i in range(self.n_arms)
        ]
        return int(np.argmax(samples))
 
    def update(self, arm: int, reward: float):
        new_precision = self.precision[arm] + self.noise_precision
        self.mean[arm] = (
            self.precision[arm] * self.mean[arm] + self.noise_precision * reward
        ) / new_precision
        self.precision[arm] = new_precision

When to use it: Thompson Sampling is the default recommendation for most stochastic bandit problems. It achieves near-optimal regret bounds, is straightforward to implement, and requires no hyperparameter tuning beyond the choice of prior. It's particularly strong for binary and bounded rewards.


UCB1-Tuned: the frequentist approach

Upper Confidence Bound algorithms take a different philosophy: instead of sampling randomly, they compute a deterministic confidence interval for each arm and always select the arm with the highest upper bound.

The intuition: be optimistic in the face of uncertainty. If you haven't tried an arm much, its confidence interval is wide, so its upper bound is high — forcing exploration. As you observe more, the interval shrinks, and only genuinely good arms maintain high bounds.

UCB1-Tuned improves on the classic UCB1 by incorporating the observed variance of each arm's rewards:

import math
 
class UCB1Tuned:
    def __init__(self, n_arms: int, alpha: float = 2.0):
        self.n_arms = n_arms
        self.alpha = alpha
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
        self.sq_values = np.zeros(n_arms)  # sum of squared rewards
        self.total = 0
 
    def select(self) -> int:
        # ensure each arm is tried at least once
        for i in range(self.n_arms):
            if self.counts[i] == 0:
                return i
 
        ucb_values = []
        for i in range(self.n_arms):
            mean = self.values[i]
            variance = max(0.0, self.sq_values[i] / self.counts[i] - mean ** 2)
            log_term = math.log(self.total + 1)
            delta = math.sqrt(self.alpha * log_term / self.counts[i])
            # 0.25 is the maximum variance of a reward bounded in [0, 1]
            bound = min(0.25, variance + delta)
            ucb = mean + math.sqrt(bound * log_term / self.counts[i])
            ucb_values.append(ucb)
 
        return int(np.argmax(ucb_values))
 
    def update(self, arm: int, reward: float):
        self.counts[arm] += 1
        self.total += 1
        self.sq_values[arm] += reward ** 2
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

When to use it: UCB algorithms are deterministic — given the same state, they always make the same choice. This makes them easier to reason about, debug, and reproduce. UCB1-Tuned is a strong choice when you want consistent behavior and bounded rewards, and when the variance across arms differs significantly (the variance-awareness helps it adapt).

Thompson Sampling vs. UCB: In practice, Thompson Sampling tends to outperform UCB on regret metrics, but UCB offers determinism and cleaner theoretical guarantees. Both are production-viable.
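As a sanity check, here is a compact head-to-head on simulated Bernoulli arms. Note it uses plain UCB1 rather than UCB1-Tuned to keep the snippet short, and the arm rates are invented; treat it as a sketch, not a benchmark.

```python
import math
import numpy as np

true_rates = np.array([0.1, 0.2, 0.5])  # hypothetical arm rates
T = 5_000

def run_thompson(seed: int) -> float:
    rng = np.random.default_rng(seed)
    alpha, beta = np.ones(3), np.ones(3)
    total = 0.0
    for _ in range(T):
        arm = int(np.argmax(rng.beta(alpha, beta)))
        r = float(rng.random() < true_rates[arm])
        alpha[arm] += r
        beta[arm] += 1 - r
        total += r
    return total

def run_ucb1(seed: int) -> float:
    rng = np.random.default_rng(seed)
    counts, values = np.zeros(3), np.zeros(3)
    total = 0.0
    for t in range(T):
        if t < 3:
            arm = t  # play each arm once first
        else:
            arm = int(np.argmax(values + np.sqrt(2 * math.log(t) / counts)))
        r = float(rng.random() < true_rates[arm])
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]
        total += r
    return total

print("Thompson:", run_thompson(7), "UCB1:", run_ucb1(7))
```

Across repeated runs you would typically see Thompson Sampling collect somewhat more reward on this setup, though any single seed can go either way.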


Contextual bandits

Standard bandits treat every decision as identical: "which button color is best?" Contextual bandits ask a richer question: "which button color is best for this user, on this device, at this time?"

The key idea: the expected reward of each arm is a function of a context vector — a set of features describing the current situation.

LinUCB: linear upper confidence bounds

LinUCB models the reward of each arm as a linear function of the context: r \approx \theta_a^\top x, where \theta_a is a parameter vector per arm and x is the context vector.

It uses ridge regression to estimate θa\theta_a and constructs confidence bounds using the uncertainty in the estimate:

class LinUCB:
    def __init__(self, n_arms: int, dim: int, alpha: float = 1.5):
        self.n_arms = n_arms
        self.dim = dim
        self.alpha = alpha
        # design matrix (identity = ridge regularization)
        self.A = [np.eye(dim) for _ in range(n_arms)]
        # reward-weighted context accumulator
        self.b = [np.zeros((dim, 1)) for _ in range(n_arms)]
 
    def select(self, context: np.ndarray) -> int:
        x = context.reshape(-1, 1)
        ucb_values = []
 
        for arm in range(self.n_arms):
            A_inv = np.linalg.inv(self.A[arm])
            theta = A_inv @ self.b[arm]               # ridge regression estimate
            mean = float(theta.T @ x)                  # predicted reward
            confidence = self.alpha * float(
                np.sqrt(x.T @ A_inv @ x)               # context-weighted uncertainty
            )
            ucb_values.append(mean + confidence)
 
        return int(np.argmax(ucb_values))
 
    def update(self, arm: int, context: np.ndarray, reward: float):
        x = context.reshape(-1, 1)
        self.A[arm] += x @ x.T         # rank-1 design matrix update
        self.b[arm] += reward * x       # reward-weighted context

The context vector might include: user segment, device type, time of day, page category, or any feature you believe influences the reward. The algorithm learns which features matter and how they interact with each arm.
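For concreteness, here is one way such a context vector might be built: one-hot encodings for categorical features plus a cyclic encoding for hour of day (the feature names and values are invented for illustration):

```python
import numpy as np

DEVICES = ["mobile", "desktop", "tablet"]
SEGMENTS = ["new", "returning"]

def encode_context(device: str, segment: str, hour: int) -> np.ndarray:
    ctx = np.zeros(len(DEVICES) + len(SEGMENTS) + 2)
    ctx[DEVICES.index(device)] = 1.0                    # one-hot device
    ctx[len(DEVICES) + SEGMENTS.index(segment)] = 1.0   # one-hot segment
    # sine/cosine pair keeps the hour cyclic: 23:00 stays close to 01:00
    ctx[-2] = np.sin(2 * np.pi * hour / 24)
    ctx[-1] = np.cos(2 * np.pi * hour / 24)
    return ctx

print(encode_context("mobile", "new", 14))  # 7-dimensional context vector
```

The resulting vector is what you would pass as `context` to LinUCB's `select` and `update`, with `dim` set to its length.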

When to use it: When the "best" option genuinely depends on who's asking. Personalized recommendations, content targeting, dynamic pricing per segment. LinUCB is the workhorse of contextual bandits — simple, efficient, and well-understood. It scales linearly with the context dimension and number of arms.


Adversarial bandits

Stochastic and contextual bandits assume rewards come from fixed distributions. But what if the environment is actively changing? Seasonal behavior shifts, competitor actions, trending content — all violate the stationarity assumption.

Adversarial bandit algorithms make no assumptions about how rewards are generated. They guarantee bounded regret even in the worst case.

EXP3: exponential-weight algorithm for exploration and exploitation

EXP3 maintains a weight for each arm and selects arms probabilistically, mixing the weighted distribution with uniform exploration:

class EXP3:
    def __init__(self, n_arms: int, gamma: float = 0.1):
        self.n_arms = n_arms
        self.gamma = gamma
        self.weights = np.ones(n_arms)
 
    def _probabilities(self) -> np.ndarray:
        w_sum = self.weights.sum()
        return (1 - self.gamma) * (self.weights / w_sum) + (self.gamma / self.n_arms)
 
    def select(self) -> int:
        probs = self._probabilities()
        return int(np.random.choice(self.n_arms, p=probs))
 
    def update(self, arm: int, reward: float):
        probs = self._probabilities()
        # importance-weighted reward estimate
        estimated_reward = reward / probs[arm]
        # exponential weight update
        self.weights[arm] *= np.exp(estimated_reward * self.gamma / self.n_arms)
        # normalize to prevent overflow
        self.weights /= self.weights.sum()

The importance weighting (reward / probability) corrects for the fact that we only observe the reward for the chosen arm. Arms that are selected less frequently get larger weight updates when they do well — ensuring they're not forgotten.
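A quick numerical check of that claim (the probabilities and means below are arbitrary): dividing each observed reward by the probability of having chosen that arm recovers every arm's true mean, even though each round reveals only one arm's reward.

```python
import numpy as np

rng = np.random.default_rng(1)
probs = np.array([0.7, 0.2, 0.1])        # fixed selection probabilities
true_means = np.array([0.3, 0.6, 0.5])   # hypothetical expected rewards
T = 200_000

arms = rng.choice(3, size=T, p=probs)
rewards = (rng.random(T) < true_means[arms]).astype(float)
weighted = rewards / probs[arms]          # importance weighting

# average over ALL rounds, counting unplayed rounds as zero
estimates = np.array([weighted[arms == i].sum() for i in range(3)]) / T
print(estimates)  # close to true_means for every arm
```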

When to use it: Non-stationary environments where reward distributions shift over time. Content recommendations where popularity changes rapidly, pricing in competitive markets, or any scenario where you suspect the "best" option changes.

The tradeoff: adversarial algorithms pay for their robustness with higher regret in stationary environments. If your problem is genuinely stationary, a stochastic algorithm will learn faster.


Choosing the right algorithm

The choice depends on three factors: your reward type, whether context matters, and whether the environment is stationary.

| Scenario | Algorithm | Why |
|---|---|---|
| Binary rewards (click/no-click), stationary | Thompson Sampling (Beta) | Near-optimal, no tuning needed |
| Continuous rewards, stationary | Thompson Sampling (Gaussian) | Natural extension to unbounded rewards |
| Bounded rewards, need determinism | UCB1-Tuned | Deterministic, variance-aware |
| Need something simple, any reward type | Epsilon-Greedy | Works everywhere, easy to understand |
| Best option depends on user features | LinUCB or LinTS | Personalization with context vectors |
| Non-stationary or adversarial environment | EXP3 | Worst-case guarantees, no stationarity assumed |

If you're unsure: start with Thompson Sampling. It's the closest thing to a universal default — strong empirical performance, minimal configuration, and it handles the exploration-exploitation tradeoff automatically through its Bayesian posterior.


Real-world applications

Multi-armed bandits aren't a theoretical curiosity. They power decisions at scale across industries:

Conversion optimization. Which headline, CTA, or layout converts best? Bandits find the winner faster than A/B tests by reducing traffic to losing variants early. The savings compound: a bandit running across 10 variants recovers most of its "exploration cost" within the first few hundred observations.

Recommendation systems. Which product, article, or video should we surface next? Contextual bandits personalize recommendations using user features — browsing history, device, location — without requiring a full collaborative filtering pipeline.

Dynamic pricing. What price point maximizes revenue for this product, in this market, right now? Bandits continuously adapt to demand elasticity changes, competitive moves, and seasonal shifts.

Feature rollouts. Should we show the new feature to this user? Bandits can gradually roll out features, automatically increasing exposure when engagement signals are positive and pulling back when they're not.

Ad placement. Which ad creative performs best for this impression? The non-stationary nature of ad performance (creative fatigue, audience shifts) makes adversarial bandits particularly valuable here.


Going further

The algorithms in this guide are the building blocks. Production bandit systems add layers on top: batched updates for high-throughput systems, delayed reward handling for long conversion windows, and multi-objective optimization when you're balancing clicks, revenue, and engagement simultaneously.
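As one illustration, batched updates might look like this hypothetical pattern: selections are served from a snapshot of the Beta posterior while rewards accumulate in a buffer that is folded in periodically. This is a sketch of the idea, not any particular system's implementation.

```python
import numpy as np

class BatchedThompson:
    """Thompson Sampling with batched posterior updates (illustrative sketch)."""

    def __init__(self, n_arms: int):
        self.alpha = np.ones(n_arms)   # posterior snapshot used for serving
        self.beta = np.ones(n_arms)
        self.pending_success = np.zeros(n_arms)
        self.pending_failure = np.zeros(n_arms)

    def select(self, rng: np.random.Generator) -> int:
        # serve from the snapshot; no per-request posterior writes
        return int(np.argmax(rng.beta(self.alpha, self.beta)))

    def record(self, arm: int, reward: float) -> None:
        # buffer the observation instead of updating immediately
        if reward > 0.5:
            self.pending_success[arm] += 1
        else:
            self.pending_failure[arm] += 1

    def flush(self) -> None:
        # fold the buffered batch into the posterior, e.g. once per minute
        self.alpha += self.pending_success
        self.beta += self.pending_failure
        self.pending_success[:] = 0
        self.pending_failure[:] = 0

bandit = BatchedThompson(3)
bandit.record(0, 1.0)
bandit.record(0, 0.0)
bandit.flush()
print(bandit.alpha, bandit.beta)  # arm 0 now carries one success and one failure
```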

The core insight remains the same: instead of splitting your decisions into "learning" and "doing" phases, let the algorithm learn by doing — continuously, adaptively, and with mathematical guarantees on how much learning costs you along the way.


qbrix provides production-grade implementations of all the algorithms discussed here — Thompson Sampling, UCB, Epsilon-Greedy, LinUCB, EXP3, and more — as a managed optimization layer with sub-millisecond selection latency and continuous parameter updates.