Policies

qbrix ships 16 multi-armed bandit policies organized into three categories: stochastic, contextual, and adversarial. Each policy implements a different exploration-exploitation strategy. If you don't want to pick one yourself, the auto meta-bandit does it for you.

Auto (Meta-Bandit)

The auto policy is the recommended default. Instead of committing to a single algorithm, qbrix launches a portfolio of learners in parallel and uses a meta-level EXP3 controller to adaptively route traffic toward whichever learner is performing best on your actual data. The portfolio is scoped automatically to your reward type and context settings.

Under the hood:

  • A parent meta experiment runs MetaBanditPolicy (EXP3 at the meta level).
  • Several learner experiments run concrete policies from the appropriate category (e.g. BetaTSPolicy, UCB1TunedPolicy, EpsilonPolicy for binary rewards; LinTSPolicy, LogisticTSPolicy, GLMUCBPolicy for contextual settings).
  • Every select call is first routed by the meta controller to one of the learners, which then picks an arm. Feedback is credited to both levels.
  • As rewards come in, EXP3 shifts meta-level weight toward the learners that convert best — so the portfolio self-tunes to your environment.

Parameter    Type    Default  Description
reward_type  string  -        binary, bounded, or continuous. Scopes the learner portfolio.
use_context  bool    false    If true, contextual learners (LinTS, LogisticTS, GLMUCB) are included.
dim          int     -        Context vector dimension. Required when use_context=true.

  • Best for: Most production use cases. When you don't know which algorithm is right, or when reward structure may shift over time.
  • Pros: Zero-choice operation; robust across regimes; learners that underperform automatically get less traffic.
  • Cons: Slightly more infrastructure (one parent plus M learner experiments, so M+1 experiments in total); meta-level exploration adds a small amount of regret vs. the single best algorithm in hindsight.
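
The two-level routing described above can be sketched in plain Python. This is a minimal illustration and not the qbrix implementation: the class names `EpsilonGreedy` and `MetaExp3`, the function `run_auto`, and the two-learner portfolio are assumptions made for the example. It only shows the flow of one round: the meta controller picks a learner, the learner picks an arm, and the reward is credited to both levels.

```python
import math
import random


class EpsilonGreedy:
    """Toy learner standing in for a concrete policy (hypothetical)."""

    def __init__(self, n_arms, epsilon, rng):
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms
        self.epsilon = epsilon
        self.rng = rng

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.means))
        return max(range(len(self.means)), key=lambda a: self.means[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


class MetaExp3:
    """EXP3 over learners: exponential weights mixed with uniform exploration."""

    def __init__(self, n_learners, gamma, rng):
        self.weights = [1.0] * n_learners
        self.gamma = gamma
        self.rng = rng

    def probabilities(self):
        total = sum(self.weights)
        k = len(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / k for w in self.weights]

    def select(self):
        probs = self.probabilities()
        return self.rng.choices(range(len(probs)), weights=probs)[0]

    def update(self, learner, reward):
        k = len(self.weights)
        estimate = reward / self.probabilities()[learner]  # importance weighting
        self.weights[learner] *= math.exp(self.gamma * estimate / k)
        top = max(self.weights)  # rescale; selection probabilities are unchanged
        self.weights = [w / top for w in self.weights]


def run_auto(true_rates, rounds=20000, seed=0):
    rng = random.Random(seed)
    # Assumed portfolio: a mostly greedy learner and a purely random one.
    learners = [EpsilonGreedy(len(true_rates), eps, rng) for eps in (0.05, 1.0)]
    meta = MetaExp3(len(learners), gamma=0.1, rng=rng)
    for _ in range(rounds):
        i = meta.select()                # meta level picks a learner
        arm = learners[i].select()       # that learner picks an arm
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        learners[i].update(arm, reward)  # feedback credited to the learner
        meta.update(i, reward)           # and to the meta controller
    return meta.probabilities()
```

Calling `run_auto([0.05, 0.95])` returns the final meta-level probabilities. In this toy setup the greedy learner should end up with most of the traffic, while the uniform mixing term keeps every learner at a floor of `gamma / k`.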

Example: Auto Policy

from qbrix import Qbrix
 
client = Qbrix()
 
# `pool` is assumed to be an existing pool object created earlier
# binary reward, no context: qbrix picks a stochastic portfolio
experiment = client.experiment.create(
    name="checkout-banner",
    pool_id=pool.id,
    policy="auto",
    policy_params={"reward_type": "binary"},
)
 
# contextual auto — adds LinTS / LogisticTS / GLMUCB to the portfolio
personalized = client.experiment.create(
    name="personalized-hero",
    pool_id=pool.id,
    policy="auto",
    policy_params={
        "reward_type": "binary",
        "use_context": True,
        "dim": 10,
    },
)

Info: You can always override auto with a specific policy name if you want full control. The concrete policies are documented below, and the Choosing a Policy section can help you pick one.

Choosing a Policy

Prefer to pick a concrete algorithm yourself? Answer a few questions to find the right one, starting with: do you have per-request user features?
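
As a rough stand-in for that decision process, the "Best for" notes on this page can be condensed into a small helper. The function `suggest_policy` and its arguments are hypothetical and not part of qbrix. The branching is one reading of the recommendations documented below, not an official selection rule.

```python
def suggest_policy(has_context, reward_type, drifting=False, deterministic=False):
    """Map a few yes/no answers to a policy name from this page (illustrative only)."""
    if has_context:
        if reward_type == "binary":
            return "GLMUCBPolicy" if deterministic else "LogisticTSPolicy"
        return "LinUCBPolicy" if deterministic else "LinTSPolicy"
    if drifting:
        # DiscountedTSPolicy is documented for binary or bounded rewards only.
        if reward_type in ("binary", "bounded"):
            return "DiscountedTSPolicy"
        return "auto"
    if deterministic:
        return "UCB1TunedPolicy"
    return "BetaTSPolicy" if reward_type == "binary" else "GaussianTSPolicy"
```

For example, `suggest_policy(False, "binary")` returns `"BetaTSPolicy"`, matching the recommended default for binary rewards.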

Stochastic Policies

These assume rewards are drawn from a stationary distribution. Best for standard A/B testing and optimization scenarios.

BetaTSPolicy

Thompson Sampling with Beta priors. The recommended default for binary rewards.

Parameter    Type   Default  Description
alpha_prior  float  1.0      Beta prior alpha (successes)
beta_prior   float  1.0      Beta prior beta (failures)

  • Reward type: Binary (0 or 1)
  • Best for: Click-through optimization, conversion rate testing
  • Pros: Naturally balances exploration/exploitation, fast convergence
  • Cons: Only supports binary rewards
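
A minimal sketch of Beta Thompson Sampling, assuming the textbook formulation: each arm keeps success and failure counts, one sample is drawn from every arm's Beta posterior, and the arm with the largest sample is played. The function names are illustrative and not qbrix APIs.

```python
import random


def beta_ts_select(successes, failures, rng, alpha_prior=1.0, beta_prior=1.0):
    """Draw one sample per arm from Beta(prior + counts) and pick the largest."""
    draws = [
        rng.betavariate(alpha_prior + s, beta_prior + f)
        for s, f in zip(successes, failures)
    ]
    return max(range(len(draws)), key=draws.__getitem__)


def simulate_beta_ts(true_rates, rounds=2000, seed=0):
    """Run the policy against simulated Bernoulli arms and return the counts."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(rounds):
        arm = beta_ts_select(successes, failures, rng)
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures
```

With clearly separated arms, such as `simulate_beta_ts([0.1, 0.9])`, nearly all pulls should concentrate on the better arm after a short exploration phase.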

DiscountedTSPolicy

Discounted Thompson Sampling for non-stationary environments. Geometrically decays old observations by gamma on each update. Effective memory window ≈ 1 / (1 - gamma).

Parameter    Type   Default  Description
alpha_prior  float  1.0      Beta prior alpha (successes)
beta_prior   float  1.0      Beta prior beta (failures)
gamma        float  -        Discount factor, required (0 < gamma < 1)

  • Reward type: Binary or Bounded
  • Best for: Environments where reward distributions shift over time (seasonal trends, changing user behaviour)
  • Pros: Adapts to distribution shifts without resetting; tunable memory window via gamma
  • Cons: Requires choosing gamma; too-low gamma forgets useful history, too-high gamma is slow to adapt
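
The memory-window rule of thumb and the decay step can be illustrated as follows. This sketch assumes that every arm's pseudo-counts are decayed on each update and that the priors are added at sampling time; the page does not specify either detail, and the function names are hypothetical.

```python
import random


def effective_window(gamma):
    """Approximate number of recent observations that still carry weight."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must satisfy 0 < gamma < 1")
    return 1.0 / (1.0 - gamma)


def discounted_update(successes, failures, arm, reward, gamma):
    """Decay all pseudo-counts by gamma, then credit the played arm (assumed rule)."""
    for i in range(len(successes)):
        successes[i] *= gamma
        failures[i] *= gamma
    successes[arm] += reward
    failures[arm] += 1.0 - reward


def sample_arm(successes, failures, rng, alpha_prior=1.0, beta_prior=1.0):
    """Thompson step on the discounted counts."""
    draws = [
        rng.betavariate(alpha_prior + s, beta_prior + f)
        for s, f in zip(successes, failures)
    ]
    return max(range(len(draws)), key=draws.__getitem__)
```

With `gamma=0.99` the window is roughly 100 observations: an arm that keeps returning reward 1 sees its success count level off near 100 instead of growing without bound, which is what lets the posterior move again when the underlying rate changes.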

GaussianTSPolicy

Thompson Sampling with Gaussian priors. For continuous reward values.

Parameter    Type   Default  Description
mu_prior     float  0.0      Prior mean
sigma_prior  float  1.0      Prior standard deviation

  • Reward type: Continuous (any float)
  • Best for: Revenue optimization, time-on-page, engagement scores
  • Pros: Handles continuous rewards, principled Bayesian updates
  • Cons: Assumes Gaussian reward distribution
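
A sketch of the Bayesian update behind Gaussian Thompson Sampling, assuming the standard conjugate model with a known observation noise. The `noise_sigma` argument and the function names are assumptions for illustration; the page does not say how qbrix treats the noise variance.

```python
import math
import random


def gaussian_posterior(n, reward_sum, mu_prior=0.0, sigma_prior=1.0, noise_sigma=1.0):
    """Posterior mean and standard deviation of an arm's mean reward."""
    precision = 1.0 / sigma_prior**2 + n / noise_sigma**2
    mean = (mu_prior / sigma_prior**2 + reward_sum / noise_sigma**2) / precision
    return mean, math.sqrt(1.0 / precision)


def gaussian_ts_select(counts, sums, rng):
    """Sample each arm's posterior once and play the arm with the largest draw."""
    draws = [rng.gauss(*gaussian_posterior(n, s)) for n, s in zip(counts, sums)]
    return max(range(len(draws)), key=draws.__getitem__)
```

With the default priors, three observations summing to 6.0 give a posterior mean of 1.5 and a standard deviation of 0.5: the prior acts like one pseudo-observation at 0.0.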

UCB1TunedPolicy

UCB1-Tuned: an Upper Confidence Bound policy whose exploration bonus uses each arm's empirical variance. Selection is deterministic given the observed history.

No configurable parameters.

  • Reward type: Continuous
  • Best for: When you want deterministic, reproducible selections
  • Pros: Strong theoretical guarantees, no randomness
  • Cons: Can over-explore in practice
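
For reference, the UCB1-Tuned index from Auer, Cesa-Bianchi and Fischer (2002) can be written as below. The sketch assumes rewards scaled to [0, 1], which is where the 1/4 variance cap comes from; the function names and the `stats` layout are illustrative, not qbrix internals.

```python
import math


def ucb1_tuned_index(mean, mean_sq, n, t):
    """Empirical mean plus a bonus that shrinks with pulls and low variance."""
    variance = max(mean_sq - mean * mean, 0.0)
    v = variance + math.sqrt(2.0 * math.log(t) / n)
    return mean + math.sqrt(math.log(t) / n * min(0.25, v))


def select(stats, t):
    """stats holds one (reward_sum, reward_sq_sum, pulls) tuple per arm."""
    for arm, (_, _, n) in enumerate(stats):
        if n == 0:
            return arm  # play every arm once before using the index
    indices = [ucb1_tuned_index(s / n, sq / n, n, t) for s, sq, n in stats]
    return max(range(len(indices)), key=indices.__getitem__)
```

Because nothing here is sampled, the same history always produces the same selection, which is the reproducibility property noted above.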

KLUCBPolicy

KL-divergence-based Upper Confidence Bound. Asymptotically optimal for Bernoulli rewards.

Parameter  Type   Default  Description
c          float  0.0      Exploration constant

  • Reward type: Binary
  • Best for: Binary rewards when you want asymptotically optimal regret
  • Pros: Asymptotically optimal for Bernoulli bandits
  • Cons: More computationally expensive than BetaTS
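
The extra cost comes from the index itself: KL-UCB has no closed form and is usually found by a one-dimensional search. The sketch below follows the standard Bernoulli formulation (largest q with n * KL(mean, q) at most log t + c * log log t) and solves it by bisection. The function names and the guard on small t are assumptions for illustration.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def kl_ucb_index(mean, n, t, c=0.0, iters=50):
    """Largest q in [mean, 1] whose divergence from the mean fits the budget."""
    log_t = math.log(t)
    # max(..., 1.0) keeps the log-log term defined for small t (assumed guard).
    budget = (log_t + c * math.log(max(log_t, 1.0))) / n
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```

Each index evaluation runs a fixed number of bisection steps per arm, whereas Beta Thompson Sampling needs a single random draw per arm.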

EpsilonPolicy

Epsilon-greedy. The simplest bandit algorithm. Explores with probability epsilon, exploits otherwise.

Parameter  Type   Default  Description
epsilon    float  0.1      Exploration probability (0-1)

  • Reward type: Any
  • Best for: Baselines, simple scenarios, when you want explicit control over exploration rate
  • Pros: Dead simple, easy to reason about
  • Cons: Wastes exploration budget on known-bad arms

MOSSPolicy

Minimax Optimal Strategy in the Stochastic case. Requires knowing the time horizon in advance.

Parameter  Type  Default  Description
n_horizon  int   1000     Total number of rounds

  • Reward type: Continuous
  • Best for: Fixed-duration campaigns where the total rounds are known
  • Pros: Minimax-optimal regret bound
  • Cons: Requires specifying horizon upfront
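
The horizon enters through the exploration bonus. In the formulation of Audibert and Bubeck (2009), an arm with n pulls gets the index below, where the bonus drops to zero once n exceeds n_horizon divided by the number of arms. The function name is illustrative, and whether qbrix uses exactly this form is an assumption.

```python
import math


def moss_index(mean, n, n_arms, n_horizon=1000):
    """MOSS index: empirical mean plus a horizon-aware exploration bonus."""
    bonus = max(math.log(n_horizon / (n_arms * n)), 0.0)
    return mean + math.sqrt(bonus / n)
```

With two arms and `n_horizon=1000`, an arm pulled 10 times still carries a large bonus, while an arm pulled 500 times or more is judged on its empirical mean alone.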

MOSSAnyTimePolicy

Anytime variant of MOSS. No need to specify the horizon.

Parameter  Type   Default  Description
alpha      float  2.0      Exploration parameter

  • Reward type: Continuous
  • Best for: Open-ended optimization without a known end date
  • Pros: No horizon needed, near-optimal regret
  • Cons: Slightly worse constant than MOSS with known horizon
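
One common anytime formulation (Degenne and Perchet, 2016) replaces the fixed horizon with the current round t and scales the bonus by (1 + alpha) / 2. The sketch below assumes that form; the page does not confirm which variant qbrix implements, and the function name is hypothetical.

```python
import math


def moss_anytime_index(mean, n, n_arms, t, alpha=2.0):
    """Anytime MOSS index: the current round t stands in for the horizon."""
    bonus = max(math.log(t / (n_arms * n)), 0.0)
    return mean + math.sqrt((1.0 + alpha) / 2.0 * bonus / n)
```

Larger values of `alpha` widen the bonus and therefore increase exploration.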

RandomPolicy

Uniform random arm selection. No learning. Used as an A/B testing baseline, holdout control, or warm-start data collection phase.

No configurable parameters.

  • Reward type: Binary, Bounded, or Continuous
  • Best for: Establishing a baseline, pure random holdout groups, collecting initial data before switching to a learning policy
  • Pros: Zero bias, trivial to reason about, compatible with all reward types
  • Cons: No learning — use only as a short-term baseline or control group

Contextual Policies

These use per-request feature vectors to personalize selections. The context vector is passed with each select request.

LinUCBPolicy

Linear Upper Confidence Bound. Models reward as a linear function of context features.

Parameter    Type   Default  Description
alpha        float  1.0      Exploration parameter
context_dim  int    -        Dimension of context vector (required)

  • Reward type: Continuous
  • Best for: Personalized recommendations with user features
  • Pros: Deterministic, strong theoretical guarantees
  • Cons: Assumes linear reward model
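
A dependency-free sketch of the per-arm LinUCB computation, assuming the usual ridge-regression form with an identity prior. The inverse of the design matrix is maintained with a Sherman-Morrison rank-one update, so no matrix inversion is needed. The class `LinUCBArm` is illustrative and not the qbrix implementation.

```python
import math


class LinUCBArm:
    def __init__(self, context_dim, alpha=1.0):
        self.alpha = alpha
        # A starts as the identity matrix, so its inverse is the identity too.
        self.a_inv = [
            [1.0 if i == j else 0.0 for j in range(context_dim)]
            for i in range(context_dim)
        ]
        self.b = [0.0] * context_dim

    def _a_inv_times(self, x):
        return [sum(row[j] * x[j] for j in range(len(x))) for row in self.a_inv]

    def ucb(self, x):
        """Predicted reward plus alpha times the confidence width for context x."""
        theta = self._a_inv_times(self.b)
        a_inv_x = self._a_inv_times(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(sum(xi * v for xi, v in zip(x, a_inv_x)))
        return mean + self.alpha * width

    def update(self, x, reward):
        """Rank-one update of A^-1 (Sherman-Morrison) and of b."""
        a_inv_x = self._a_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        d = len(x)
        for i in range(d):
            for j in range(d):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        for i in range(d):
            self.b[i] += reward * x[i]
```

At selection time, one such object per arm would be scored with `ucb(context)` and the highest score played. The width term shrinks along directions of the context space that the arm has already observed.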

LinTSPolicy

Linear Thompson Sampling. Bayesian approach to contextual bandits.

Parameter    Type   Default  Description
sigma        float  1.0      Prior variance
context_dim  int    -        Dimension of context vector (required)

  • Reward type: Continuous
  • Best for: Personalized recommendations when you want randomized exploration
  • Pros: Better empirical performance than LinUCB in many settings
  • Cons: Assumes linear reward model, more computation per selection

LogisticTSPolicy

Laplace-approximated Logistic Thompson Sampling. Maintains per-arm weight vectors and a diagonal Hessian approximation. At selection time, samples from the posterior N(w, diag(1/h)) and scores each arm via the logistic function.

Parameter  Type   Default  Description
dim        int    -        Dimension of context vector (required)
lambda_    float  1.0      L2 regularization strength
lr         float  0.1      Learning rate for weight updates

  • Reward type: Binary (0 or 1)
  • Best for: Ads, recommendations, personalization — any binary outcome with contextual features
  • Pros: Widely used approach for contextual bandits with binary rewards; principled Bayesian exploration via posterior sampling
  • Cons: Diagonal Hessian is an approximation; may under-explore when posterior is poorly calibrated early on
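
The sampling step described above can be sketched for a single arm as follows. The selection logic follows the description (sample from N(w, diag(1/h)), score with the logistic function). The update rule, a gradient step with learning rate `lr` plus a running diagonal Hessian, is an assumption about how `lambda_` and `lr` are used, and the class name is hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class LogisticTSArm:
    def __init__(self, dim, lambda_=1.0, lr=0.1):
        self.w = [0.0] * dim
        self.h = [lambda_] * dim  # diagonal Hessian starts at the L2 strength
        self.lambda_ = lambda_
        self.lr = lr

    def sample_score(self, x, rng):
        """Draw weights from N(w, diag(1/h)) and return the logistic score."""
        w_tilde = [rng.gauss(w, math.sqrt(1.0 / h)) for w, h in zip(self.w, self.h)]
        return sigmoid(sum(wi * xi for wi, xi in zip(w_tilde, x)))

    def update(self, x, reward):
        """One regularized gradient step, plus curvature accumulation (assumed rule)."""
        p = sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)))
        for i, xi in enumerate(x):
            grad = (p - reward) * xi + self.lambda_ * self.w[i]
            self.w[i] -= self.lr * grad
            self.h[i] += p * (1.0 - p) * xi * xi
```

To select, call `sample_score` on every arm with the request's context and play the arm with the highest score. As `h` grows along observed feature directions, the samples tighten around `w` and exploration fades.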

GLMUCBPolicy

GLM-UCB (Logistic UCB). Fits a logistic regression model per arm and selects using an upper confidence bound based on the logistic mean plus a scaled confidence width from the inverse Hessian.

Parameter  Type   Default  Description
dim        int    -        Dimension of context vector (required)
alpha      float  1.5      Exploration coefficient for UCB term
lambda_    float  1.0      L2 regularization strength
lr         float  0.1      Learning rate for weight updates

  • Reward type: Binary (0 or 1)
  • Best for: Binary rewards where deterministic, auditable selections are preferred over randomized sampling
  • Pros: Deterministic and easy to monitor; strong theoretical regret bounds for logistic models
  • Cons: Can over-explore when alpha is too large; tuning alpha requires care in production

Adversarial Policies

These make no assumptions about how rewards are generated. Use when rewards may be non-stationary or adversarially chosen.

EXP3Policy

Exponential-weight algorithm for Exploration and Exploitation. Uses multiplicative weight updates.

Parameter  Type   Default  Description
gamma      float  0.1      Exploration mixing parameter (0-1)

  • Reward type: Any (bounded)
  • Best for: Non-stationary environments, game-theoretic settings
  • Pros: Works against any reward sequence
  • Cons: Higher regret than stochastic methods when rewards are actually stationary

EXP3IXPolicy

EXP3 with Implicit Exploration (Neu, 2015). Uses reward / (p + gamma) instead of reward / p for importance-weighted updates. Because the estimate can never exceed reward / gamma, this avoids the numerical instability of standard EXP3 when selection probabilities are small.

Parameter  Type   Default  Description
gamma      float  0.1      Implicit exploration parameter
eta        float  0.1      Learning rate for weight updates

  • Reward type: Binary or Bounded
  • Best for: Adversarial or non-stationary settings where standard EXP3 shows numerical instability; production environments with many arms
  • Pros: Same worst-case regret guarantees as EXP3 but numerically stable; no reward clipping needed
  • Cons: Two hyperparameters to tune instead of one; marginal overhead vs. EXP3
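
The difference between the two estimators is easy to see numerically. The sketch below uses the reward-based form given on this page; the function names and the log-weight bookkeeping in `exp3_ix_step` are assumptions for illustration.

```python
import math


def exp3_estimate(reward, p):
    """Standard importance-weighted estimate: unbounded as p approaches 0."""
    return reward / p


def exp3_ix_estimate(reward, p, gamma=0.1):
    """Implicit-exploration estimate: never larger than reward / gamma."""
    return reward / (p + gamma)


def exp3_ix_step(log_weights, arm, reward, gamma=0.1, eta=0.1):
    """Compute selection probabilities, then update the played arm in place."""
    top = max(log_weights)
    exps = [math.exp(w - top) for w in log_weights]  # shift for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    log_weights[arm] += eta * exp3_ix_estimate(reward, probs[arm], gamma)
    return probs
```

For a reward of 1 on an arm selected with probability 1e-6, the standard estimate is about one million, while the implicit-exploration estimate stays just below 10 with `gamma=0.1`.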

FPLPolicy

Follow the Perturbed Leader. Adds random perturbations to cumulative rewards.

Parameter  Type   Default  Description
eta        float  1.0      Perturbation scale

  • Reward type: Any (bounded)
  • Best for: Adversarial settings when you want a perturbation-based approach
  • Pros: Simple implementation, competitive with EXP3
  • Cons: Requires tuning eta for best performance
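
The selection step of Follow the Perturbed Leader fits in a few lines. This sketch assumes exponential perturbations scaled by `eta` and takes the cumulative reward estimates as given; how qbrix builds those estimates from bandit feedback is not described here, and the function name is hypothetical.

```python
import random


def fpl_select(cumulative_rewards, eta, rng):
    """Add eta-scaled exponential noise to each arm's total and pick the leader."""
    perturbed = [r + eta * rng.expovariate(1.0) for r in cumulative_rewards]
    return max(range(len(perturbed)), key=perturbed.__getitem__)
```

With `eta=0.0` this reduces to always following the current leader. Larger values let trailing arms win more often, which is where the exploration comes from.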

Example: Creating an Experiment with Policy Params

from qbrix import Qbrix
 
client = Qbrix()
 
experiment = client.experiment.create(
    name="personalized-pricing",
    pool_id="<pool-id>",
    policy="LinTSPolicy",
    policy_params={"sigma": 0.5, "context_dim": 10},
)

Listing Available Policies

curl $QBRIX_URL/api/v1/policies \
  -H "X-API-Key: $QBRIX_API_KEY" | jq .

Returns all policies with their configurable parameters and defaults.