qbrix

Policies

qbrix ships 11 multi-armed bandit policies organized into three categories: stochastic, contextual, and adversarial. Each policy implements a different exploration-exploitation strategy.

Choosing a Policy

Answer a few questions to find the right policy for your use case:

  • Do you have per-request user features? If yes, use a contextual policy (LinUCBPolicy or LinTSPolicy).
  • Are rewards non-stationary or potentially adversarial? If yes, use an adversarial policy (EXP3Policy or FPLPolicy).
  • Otherwise, use a stochastic policy: BetaTSPolicy is the recommended default for binary rewards, and GaussianTSPolicy handles continuous rewards.

Stochastic Policies

These assume rewards are drawn from a stationary distribution. Best for standard A/B testing and optimization scenarios.

BetaTSPolicy

Thompson Sampling with Beta priors. The recommended default for binary rewards.

Parameter    Type   Default  Description
alpha_prior  float  1.0      Beta prior alpha (successes)
beta_prior   float  1.0      Beta prior beta (failures)
  • Reward type: Binary (0 or 1)
  • Best for: Click-through optimization, conversion rate testing
  • Pros: Naturally balances exploration/exploitation, fast convergence
  • Cons: Only supports binary rewards
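
For example, a click-through experiment can keep the uniform Beta(1, 1) priors. A minimal sketch using the experiment-creation call shown at the end of this page; the experiment name and pool_id are placeholders:

from qbrix import Qbrix

client = Qbrix()

# Uniform Beta(1, 1) prior: every arm starts with one pseudo-success and one pseudo-failure
experiment = client.experiment.create(
    name="cta-click-test",  # placeholder experiment name
    pool_id="<pool-id>",
    policy="BetaTSPolicy",
    policy_params={"alpha_prior": 1.0, "beta_prior": 1.0},
)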

GaussianTSPolicy

Thompson Sampling with Gaussian priors. For continuous reward values.

Parameter    Type   Default  Description
mu_prior     float  0.0      Prior mean
sigma_prior  float  1.0      Prior standard deviation
  • Reward type: Continuous (any float)
  • Best for: Revenue optimization, time-on-page, engagement scores
  • Pros: Handles continuous rewards, principled Bayesian updates
  • Cons: Assumes Gaussian reward distribution
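
A minimal sketch for a continuous-reward experiment such as revenue per session; the name, pool_id, and the widened sigma_prior are illustrative:

from qbrix import Qbrix

client = Qbrix()

# Prior belief: rewards centered at 0.0 with standard deviation 2.0
experiment = client.experiment.create(
    name="revenue-per-session",  # placeholder experiment name
    pool_id="<pool-id>",
    policy="GaussianTSPolicy",
    policy_params={"mu_prior": 0.0, "sigma_prior": 2.0},
)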

UCB1TunedPolicy

Upper Confidence Bound with variance-aware confidence bounds (UCB1-Tuned). Deterministic: no randomness in selection.

No configurable parameters.
  • Reward type: Continuous
  • Best for: When you want deterministic, reproducible selections
  • Pros: Accounts for observed reward variance, deterministic and reproducible selections
  • Cons: Can over-explore in practice
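
Since there is nothing to configure, a sketch of creating a UCB1-Tuned experiment simply passes an empty policy_params (omitting the argument entirely may also work); the name and pool_id are placeholders:

from qbrix import Qbrix

client = Qbrix()

# UCB1-Tuned takes no parameters; an empty dict is passed for clarity
experiment = client.experiment.create(
    name="deterministic-rollout",  # placeholder experiment name
    pool_id="<pool-id>",
    policy="UCB1TunedPolicy",
    policy_params={},
)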

KLUCBPolicy

KL-divergence-based Upper Confidence Bound. Asymptotically optimal for Bernoulli rewards.

Parameter  Type   Default  Description
c          float  0.0      Exploration constant
  • Reward type: Binary
  • Best for: Binary rewards when you want asymptotically optimal regret
  • Pros: Asymptotically optimal for Bernoulli bandits
  • Cons: More computationally expensive than BetaTS
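
A minimal sketch of a KL-UCB experiment with the default exploration constant; the name and pool_id are placeholders:

from qbrix import Qbrix

client = Qbrix()

# c is the exploration constant from the table above; 0.0 is the default
experiment = client.experiment.create(
    name="signup-conversion",  # placeholder experiment name
    pool_id="<pool-id>",
    policy="KLUCBPolicy",
    policy_params={"c": 0.0},
)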

EpsilonPolicy

Epsilon-greedy, the simplest bandit algorithm. With probability epsilon it selects a uniformly random arm (exploration); otherwise it selects the arm with the best observed reward (exploitation).

Parameter  Type   Default  Description
epsilon    float  0.1      Exploration probability (0-1)
  • Reward type: Any
  • Best for: Baselines, simple scenarios, when you want explicit control over exploration rate
  • Pros: Dead simple, easy to reason about
  • Cons: Wastes exploration budget on known-bad arms
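
A sketch of an epsilon-greedy experiment that explores on 5% of requests rather than the 10% default; the name and pool_id are placeholders:

from qbrix import Qbrix

client = Qbrix()

# 5% of selections are random exploration, 95% exploit the best-known arm
experiment = client.experiment.create(
    name="baseline-comparison",  # placeholder experiment name
    pool_id="<pool-id>",
    policy="EpsilonPolicy",
    policy_params={"epsilon": 0.05},
)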

MOSSPolicy

Minimax Optimal Strategy in the Stochastic case. Requires knowing the time horizon in advance.

Parameter  Type  Default  Description
n_horizon  int   1000     Total number of rounds
  • Reward type: Continuous
  • Best for: Fixed-duration campaigns where the total rounds are known
  • Pros: Minimax-optimal regret bound
  • Cons: Requires specifying horizon upfront
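
A sketch of a fixed-duration campaign where the total number of rounds is known up front; the 50,000-round horizon, name, and pool_id are placeholders:

from qbrix import Qbrix

client = Qbrix()

# The horizon should cover the campaign's total expected number of selections
experiment = client.experiment.create(
    name="holiday-campaign",  # placeholder experiment name
    pool_id="<pool-id>",
    policy="MOSSPolicy",
    policy_params={"n_horizon": 50000},
)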

MOSSAnyTimePolicy

Anytime variant of MOSS. No need to specify the horizon.

Parameter  Type   Default  Description
alpha      float  2.0      Exploration parameter
  • Reward type: Continuous
  • Best for: Open-ended optimization without a known end date
  • Pros: No horizon needed, near-optimal regret
  • Cons: Slightly worse constant than MOSS with known horizon
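
A sketch of an open-ended experiment using the default exploration parameter; the name and pool_id are placeholders:

from qbrix import Qbrix

client = Qbrix()

# No horizon is needed; alpha controls the exploration strength
experiment = client.experiment.create(
    name="ongoing-ranking-test",  # placeholder experiment name
    pool_id="<pool-id>",
    policy="MOSSAnyTimePolicy",
    policy_params={"alpha": 2.0},
)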

Contextual Policies

These use per-request feature vectors to personalize selections. The context vector is passed with each select request.

LinUCBPolicy

Linear Upper Confidence Bound. Models the expected reward as a linear function of the context features.

Parameter    Type   Default     Description
alpha        float  1.0         Exploration parameter
context_dim  int    (required)  Dimension of the context vector
  • Reward type: Continuous
  • Best for: Personalized recommendations with user features
  • Pros: Deterministic, strong theoretical guarantees
  • Cons: Assumes linear reward model
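
A sketch of a LinUCB experiment; context_dim must match the length of the feature vector sent with each select request. The 8-dimensional context, name, and pool_id are placeholders:

from qbrix import Qbrix

client = Qbrix()

# context_dim must equal the length of the per-request feature vector
experiment = client.experiment.create(
    name="personalized-recs",  # placeholder experiment name
    pool_id="<pool-id>",
    policy="LinUCBPolicy",
    policy_params={"alpha": 1.0, "context_dim": 8},
)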

LinTSPolicy

Linear Thompson Sampling. Bayesian approach to contextual bandits.

Parameter    Type   Default     Description
sigma        float  1.0         Prior variance
context_dim  int    (required)  Dimension of the context vector
  • Reward type: Continuous
  • Best for: Personalized recommendations when you want randomized exploration
  • Pros: Better empirical performance than LinUCB in many settings
  • Cons: Assumes linear reward model, more computation per selection

Adversarial Policies

These make no assumptions about how rewards are generated. Use when rewards may be non-stationary or adversarially chosen.

EXP3Policy

Exponential-weight algorithm for Exploration and Exploitation. Uses multiplicative weight updates.

Parameter  Type   Default  Description
gamma      float  0.1      Exploration mixing parameter (0-1)
  • Reward type: Any (bounded)
  • Best for: Non-stationary environments, game-theoretic settings
  • Pros: Works against any reward sequence
  • Cons: Higher regret than stochastic methods when rewards are actually stationary
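
A sketch of an EXP3 experiment using the default mixing parameter; the name and pool_id are placeholders:

from qbrix import Qbrix

client = Qbrix()

# gamma mixes uniform exploration into the exponential-weight distribution
experiment = client.experiment.create(
    name="nonstationary-traffic",  # placeholder experiment name
    pool_id="<pool-id>",
    policy="EXP3Policy",
    policy_params={"gamma": 0.1},
)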

FPLPolicy

Follow the Perturbed Leader. Adds random perturbations to each arm's cumulative reward and selects the arm that looks best after perturbation.

Parameter  Type   Default  Description
eta        float  1.0      Perturbation scale
  • Reward type: Any (bounded)
  • Best for: Adversarial settings when you want a perturbation-based approach
  • Pros: Simple implementation, competitive with EXP3
  • Cons: Requires tuning eta for best performance
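
A sketch of an FPL experiment with the default perturbation scale (eta typically needs tuning, as noted above); the name and pool_id are placeholders:

from qbrix import Qbrix

client = Qbrix()

# eta scales the random perturbation added to each arm's cumulative reward
experiment = client.experiment.create(
    name="adversarial-test",  # placeholder experiment name
    pool_id="<pool-id>",
    policy="FPLPolicy",
    policy_params={"eta": 1.0},
)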

Example: Creating an Experiment with Policy Params

from qbrix import Qbrix

client = Qbrix()

# LinTSPolicy requires context_dim; sigma sets the prior variance
experiment = client.experiment.create(
    name="personalized-pricing",
    pool_id="<pool-id>",
    policy="LinTSPolicy",
    policy_params={"sigma": 0.5, "context_dim": 10},
)

Listing Available Policies

curl $QBRIX_URL/api/v1/policies \
  -H "X-API-Key: $QBRIX_API_KEY" | jq .

Returns all policies with their configurable parameters and defaults.
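
The same listing can be fetched from Python. A minimal sketch using the requests library, assuming the same QBRIX_URL and QBRIX_API_KEY environment variables as the curl example; the response shape is whatever the endpoint returns:

import json
import os

import requests

# Same endpoint and auth header as the curl example above
response = requests.get(
    f"{os.environ['QBRIX_URL']}/api/v1/policies",
    headers={"X-API-Key": os.environ["QBRIX_API_KEY"]},
)
response.raise_for_status()

# Pretty-print the policies with their parameters and defaults
print(json.dumps(response.json(), indent=2))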