Policies

qbrix ships 16 multi-armed bandit policies organized into three categories: stochastic, contextual, and adversarial. Each policy implements a different exploration-exploitation strategy. If you don't want to pick one yourself, the auto meta-bandit does it for you.

Auto (Meta-Bandit)

The auto policy is the recommended default. Instead of committing to a single algorithm, qbrix launches a portfolio of learners in parallel and uses a meta-level EXP3 controller to adaptively route traffic toward whichever learner is performing best on your actual data. The portfolio is scoped automatically to your reward type and context settings.

Under the hood:

A parent meta experiment runs MetaBanditPolicy (EXP3 at the meta level).
Several learner experiments run concrete policies from the appropriate category (e.g. BetaTSPolicy, UCB1TunedPolicy, EpsilonPolicy for binary rewards; LinTSPolicy, LogisticTSPolicy, GLMUCBPolicy for contextual settings).
Every select call is first routed by the meta controller to one of the learners, which then picks an arm. Feedback is credited to both levels.
As rewards come in, EXP3 shifts meta-level weight toward the learners that convert best — so the portfolio self-tunes to your environment.
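
The two-level routing described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not qbrix's implementation: `EpsilonLearner`, `MetaRouter`, the `gamma` value, and the simulated conversion rates are all hypothetical names and numbers chosen for the example.

```python
import math
import random


class EpsilonLearner:
    """Toy learner (stand-in for a concrete policy): epsilon-greedy over arms."""

    def __init__(self, n_arms, epsilon, rng):
        self.epsilon = epsilon
        self.rng = rng
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.means))
        return max(range(len(self.means)), key=lambda a: self.means[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


class MetaRouter:
    """EXP3 over learners: decides which learner serves each select call."""

    def __init__(self, learners, gamma, rng):
        self.learners = learners
        self.gamma = gamma
        self.rng = rng
        self.log_weights = [0.0] * len(learners)

    def probabilities(self):
        top = max(self.log_weights)  # subtract the max for numerical stability
        weights = [math.exp(lw - top) for lw in self.log_weights]
        total = sum(weights)
        k = len(weights)
        return [(1 - self.gamma) * w / total + self.gamma / k for w in weights]

    def select(self):
        probs = self.probabilities()
        idx = self.rng.choices(range(len(probs)), weights=probs)[0]
        arm = self.learners[idx].select()
        return idx, arm, probs[idx]

    def update(self, idx, arm, reward, prob):
        # Feedback is credited to the learner that picked the arm ...
        self.learners[idx].update(arm, reward)
        # ... and to the meta level via an importance-weighted EXP3 update.
        self.log_weights[idx] += self.gamma * (reward / prob) / len(self.learners)


rng = random.Random(0)
router = MetaRouter(
    [EpsilonLearner(3, 0.5, rng), EpsilonLearner(3, 0.05, rng)], gamma=0.1, rng=rng
)
true_rates = [0.1, 0.2, 0.6]  # hypothetical per-arm conversion rates
for _ in range(2000):
    idx, arm, prob = router.select()
    reward = 1.0 if rng.random() < true_rates[arm] else 0.0
    router.update(idx, arm, reward, prob)
```

The `gamma / k` term keeps every learner's selection probability above a floor, which is why an underperforming learner gets less traffic but never none.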

Parameter	Type	Default	Description
`reward_type`	string	—	`binary`, `bounded`, or `continuous`. Scopes the learner portfolio.
`use_context`	bool	`false`	If `true`, contextual learners (LinTS, LogisticTS, GLMUCB) are included.
`dim`	int	—	Context vector dimension. Required when `use_context=true`.

Best for: Most production use cases. When you don't know which algorithm is right, or when reward structure may shift over time.
Pros: Zero-choice operation; robust across regimes; learners that underperform automatically get less traffic.
Cons: Slightly more infrastructure (M+1 experiments: one meta experiment plus M learner experiments); meta-level exploration adds a small amount of regret vs. the single best algorithm in hindsight.

Example: Auto Policy

from qbrix import Qbrix
 
client = Qbrix()
 
# `pool` is assumed to be an existing pool object created earlier (not shown here)
# binary reward, no context: qbrix picks a stochastic portfolio
experiment = client.experiment.create(
    name="checkout-banner",
    pool_id=pool.id,
    policy="auto",
    policy_params={"reward_type": "binary"},
)
 
# contextual auto — adds LinTS / LogisticTS / GLMUCB to the portfolio
personalized = client.experiment.create(
    name="personalized-hero",
    pool_id=pool.id,
    policy="auto",
    policy_params={
        "reward_type": "binary",
        "use_context": True,
        "dim": 10,
    },
)

curl -X POST $QBRIX_URL/api/v1/experiments \
  -H "X-API-Key: $QBRIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "checkout-banner",
    "pool_id": "<pool-id>",
    "policy": "auto",
    "policy_params": {"reward_type": "binary"}
  }'

Info

You can always override auto with a specific policy name if you want full control. The concrete policies are documented below, and the flowchart can help you choose.

Choosing a Policy

Prefer to pick a concrete algorithm yourself? Answer a few questions to find the right one:

Do you have per-request user features?

Stochastic Policies

These assume rewards are drawn from a stationary distribution. Best for standard A/B testing and optimization scenarios.

BetaTSPolicy

Thompson Sampling with Beta priors. The recommended default for binary rewards.

Parameter	Type	Default	Description
`alpha_prior`	float	1.0	Beta prior alpha (successes)
`beta_prior`	float	1.0	Beta prior beta (failures)

Reward type: Binary (0 or 1)
Best for: Click-through optimization, conversion rate testing
Pros: Naturally balances exploration/exploitation, fast convergence
Cons: Only supports binary rewards
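
As a rough illustration of how Thompson Sampling with Beta priors selects an arm, the sketch below draws one sample per arm from its Beta posterior and picks the largest. It is a standalone toy, not qbrix's code: `beta_ts_select`, the simulated conversion rates, and the round count are assumptions made for the example; only `alpha_prior` and `beta_prior` mirror the parameters in the table above.

```python
import random


def beta_ts_select(successes, failures, alpha_prior=1.0, beta_prior=1.0, rng=random):
    """Sample a conversion rate per arm from its Beta posterior; pick the largest."""
    samples = [
        rng.betavariate(alpha_prior + s, beta_prior + f)
        for s, f in zip(successes, failures)
    ]
    return max(range(len(samples)), key=lambda a: samples[a])


rng = random.Random(0)
true_rates = [0.05, 0.10, 0.30]  # hypothetical conversion rates
successes = [0, 0, 0]
failures = [0, 0, 0]
for _ in range(3000):
    arm = beta_ts_select(successes, failures, rng=rng)
    if rng.random() < true_rates[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1

# Number of times each arm was served; the best arm should dominate.
pulls = [s + f for s, f in zip(successes, failures)]
```

Arms with few observations have wide posteriors and so occasionally produce the largest sample, which is how exploration happens without an explicit exploration parameter.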

DiscountedTSPolicy

Discounted Thompson Sampling for non-stationary environments. Geometrically decays old observations by gamma on each update. Effective memory window ≈ 1 / (1 - gamma).

Parameter	Type	Default	Description
`alpha_prior`	float	1.0	Beta prior alpha (successes)
`beta_prior`	float	1.0	Beta prior beta (failures)
`gamma`	float	—	Discount factor, required (0 < gamma < 1)

Reward type: Binary or Bounded
Best for: Environments where reward distributions shift over time (seasonal trends, changing user behaviour)
Pros: Adapts to distribution shifts without resetting; tunable memory window via gamma
Cons: Requires choosing gamma; too-low gamma forgets useful history, too-high gamma is slow to adapt
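
To make the memory-window formula concrete, here is a minimal sketch of a discounted update. It assumes every arm's pseudo-counts are decayed on each update and that rewards lie in [0, 1]; whether qbrix decays all arms or only the pulled one is not stated here, and `discounted_update` and `effective_window` are hypothetical helper names.

```python
def discounted_update(successes, failures, arm, reward, gamma):
    """Decay every arm's pseudo-counts by gamma, then credit the pulled arm."""
    for a in range(len(successes)):
        successes[a] *= gamma
        failures[a] *= gamma
    successes[arm] += reward
    failures[arm] += 1.0 - reward


def effective_window(gamma):
    """Approximate number of recent observations that dominate the posterior."""
    return 1.0 / (1.0 - gamma)


# Feed one arm a constant reward of 1: its pseudo-count does not grow without
# bound but saturates near the effective window, 1 / (1 - 0.9) = 10.
successes, failures = [0.0, 0.0], [0.0, 0.0]
for _ in range(500):
    discounted_update(successes, failures, arm=0, reward=1.0, gamma=0.9)
```

With `gamma = 0.99` the window is about 100 observations, and with `gamma = 0.999` about 1000, which gives a practical way to pick gamma from how quickly you expect rewards to drift.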

GaussianTSPolicy

Thompson Sampling with Gaussian priors. For continuous reward values.

Parameter	Type	Default	Description
`mu_prior`	float	0.0	Prior mean
`sigma_prior`	float	1.0	Prior standard deviation

Reward type: Continuous (any float)
Best for: Revenue optimization, time-on-page, engagement scores
Pros: Handles continuous rewards, principled Bayesian updates
Cons: Assumes Gaussian reward distribution
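
The Bayesian update behind Gaussian Thompson Sampling can be sketched as a conjugate normal update. This sketch assumes a known observation noise (`noise_sigma`, a hypothetical parameter not listed in the table above) and stores raw reward lists for clarity; a real implementation would more likely keep running sums.

```python
import random


def gaussian_posterior(mu_prior, sigma_prior, rewards, noise_sigma=1.0):
    """Posterior mean and std of an arm's mean reward under a Normal prior."""
    prior_precision = 1.0 / sigma_prior**2
    data_precision = len(rewards) / noise_sigma**2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * mu_prior + sum(rewards) / noise_sigma**2)
    return post_mean, post_var**0.5


def gaussian_ts_select(history, mu_prior=0.0, sigma_prior=1.0, rng=random):
    """Draw one sample per arm from its posterior and pick the largest."""
    samples = []
    for rewards in history:
        mean, std = gaussian_posterior(mu_prior, sigma_prior, rewards)
        samples.append(rng.gauss(mean, std))
    return max(range(len(samples)), key=lambda a: samples[a])
```

With no data the posterior equals the prior, and each observation shrinks the posterior standard deviation, so sampling concentrates on the arms with the highest estimated mean.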

UCB1TunedPolicy

UCB1-Tuned: an Upper Confidence Bound policy whose exploration bonus uses each arm's estimated reward variance. Deterministic, no randomness in selection.

Parameter	Type	Default	Description
—	—	—	No configurable parameters

Reward type: Continuous
Best for: When you want deterministic, reproducible selections
Pros: Strong theoretical guarantees, no randomness
Cons: Can over-explore in practice
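
For reference, the textbook UCB1-Tuned index can be sketched as below. It assumes rewards scaled to [0, 1] (the 0.25 cap is the largest possible variance of such a variable); the function names are hypothetical and qbrix's internals may differ.

```python
import math


def ucb1_tuned_index(mean, mean_sq, n, t):
    """Index for one arm with n pulls, given t total pulls (rewards in [0, 1]).

    mean and mean_sq are the sample means of the rewards and squared rewards.
    """
    log_t = math.log(t)
    variance_bound = max(0.0, mean_sq - mean * mean) + math.sqrt(2.0 * log_t / n)
    return mean + math.sqrt((log_t / n) * min(0.25, variance_bound))


def ucb1_tuned_select(sums, sq_sums, counts):
    """Pull each arm once, then pick the largest index (ties go to the lowest arm)."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    t = sum(counts)
    indices = [
        ucb1_tuned_index(sums[a] / counts[a], sq_sums[a] / counts[a], counts[a], t)
        for a in range(len(counts))
    ]
    return max(range(len(indices)), key=lambda a: indices[a])
```

Because nothing is sampled, the same statistics always yield the same arm, which is what makes selections reproducible.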

KLUCBPolicy

KL-divergence based Upper Confidence Bound. Optimal for Bernoulli rewards.

Parameter	Type	Default	Description
`c`	float	0.0	Exploration constant

Reward type: Binary
Best for: Binary rewards when you want asymptotically optimal regret
Pros: Asymptotically optimal for Bernoulli bandits
Cons: More computationally expensive than BetaTS
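
The extra computation comes from solving for the index numerically. A minimal sketch of the standard KL-UCB index for Bernoulli rewards, found by bisection, is shown below; `bernoulli_kl` and `kl_ucb_index` are hypothetical names, and the guard on the `c` term is a choice made for this example rather than documented qbrix behaviour.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def kl_ucb_index(mean, n, t, c=0.0, tol=1e-6):
    """Largest q in [mean, 1] with n * KL(mean, q) <= log(t) + c * log(log(t))."""
    budget = math.log(t)
    if c > 0 and t > math.e:  # log(log(t)) is only positive for t > e
        budget += c * math.log(math.log(t))
    budget /= n
    lo, hi = mean, 1.0
    while hi - lo > tol:  # bisection: KL(mean, q) increases with q on [mean, 1]
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```

Each index costs a few dozen KL evaluations per arm per round, compared with a single Beta draw per arm for BetaTSPolicy.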

EpsilonPolicy

Epsilon-greedy. The simplest bandit algorithm. Explores with probability epsilon, exploits otherwise.

Parameter	Type	Default	Description
`epsilon`	float	0.1	Exploration probability (0-1)

Reward type: Any
Best for: Baselines, simple scenarios, when you want explicit control over exploration rate
Pros: Dead simple, easy to reason about
Cons: Wastes exploration budget on known-bad arms
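
A minimal sketch of the selection rule, which also shows where the wasted budget comes from: exploration is uniform over all arms, so the worst arm keeps receiving about `epsilon / K` of traffic no matter how much evidence accumulates. The function name and the fixed mean estimates are assumptions for illustration.

```python
import random


def epsilon_greedy_select(means, epsilon=0.1, rng=random):
    """Explore uniformly with probability epsilon, otherwise exploit the best mean."""
    if rng.random() < epsilon:
        return rng.randrange(len(means))  # uniform over ALL arms, good or bad
    return max(range(len(means)), key=lambda a: means[a])


rng = random.Random(0)
means = [0.1, 0.9, 0.5]  # hypothetical, already well-estimated arm means
draws = [epsilon_greedy_select(means, epsilon=0.1, rng=rng) for _ in range(30000)]
# Fraction of traffic still sent to the clearly worst arm (index 0).
worst_share = draws.count(0) / len(draws)
```

Here `worst_share` stays near `0.1 / 3` indefinitely, whereas posterior-sampling or UCB-style policies reduce traffic to a known-bad arm as its estimate tightens.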

MOSSPolicy

Minimax Optimal Strategy in the Stochastic case. Requires knowing the time horizon in advance.

Parameter	Type	Default	Description
`n_horizon`	int	1000	Total number of rounds

Reward type: Continuous
Best for: Fixed-duration campaigns where the total rounds are known
Pros: Minimax-optimal regret bound
Cons: Requires specifying horizon upfront
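
To show why the horizon is needed, here is a sketch of the MOSS index in its original published form. The horizon appears inside the logarithm, so the exploration bonus of an arm drops to zero once it has been pulled at least `n_horizon / n_arms` times. `moss_index` is a hypothetical helper, and some presentations scale the bonus by an extra constant.

```python
import math


def moss_index(mean, n_pulls, n_arms, n_horizon=1000):
    """Empirical mean plus a bonus that vanishes once n_pulls >= n_horizon / n_arms."""
    log_plus = max(0.0, math.log(n_horizon / (n_arms * n_pulls)))
    return mean + math.sqrt(log_plus / n_pulls)
```

For example, with two arms and `n_horizon=1000`, an arm pulled 10 times still carries a sizeable bonus, while an arm pulled 500 times or more is judged on its empirical mean alone.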

MOSSAnyTimePolicy

Anytime variant of MOSS. No need to specify the horizon.

Parameter	Type	Default	Description
`alpha`	float	2.0	Exploration parameter

Reward type: Continuous
Best for: Open-ended optimization without a known end date
Pros: No horizon needed, near-optimal regret
Cons: Slightly worse constant than MOSS with known horizon

RandomPolicy

Uniform random arm selection. No learning. Used as an A/B testing baseline, holdout control, or warm-start data collection phase.

Parameter	Type	Default	Description
—	—	—	No configurable parameters

Reward type: Binary, Bounded, or Continuous
Best for: Establishing a baseline, pure random holdout groups, collecting initial data before switching to a learning policy
Pros: Zero bias, trivial to reason about, compatible with all reward types
Cons: No learning — use only as a short-term baseline or control group

Contextual Policies

These use per-request feature vectors to personalize selections. The context vector is passed with each select request.

LinUCBPolicy

Linear Upper Confidence Bound. Models reward as a linear function of context features.

Parameter	Type	Default	Description
`alpha`	float	1.0	Exploration parameter
`context_dim`	int	—	Dimension of context vector (required)

Reward type: Continuous
Best for: Personalized recommendations with user features
Pros: Deterministic, strong theoretical guarantees
Cons: Assumes linear reward model
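
The linear model and its confidence width can be sketched per arm as below. This is a dependency-free illustration of the standard disjoint LinUCB formulation, assuming an identity ridge prior; the class `LinUCBArm` and its methods are hypothetical, and only `alpha` and `context_dim` correspond to the parameters in the table above.

```python
import math


class LinUCBArm:
    """One arm of disjoint LinUCB, keeping A^-1 directly via Sherman-Morrison."""

    def __init__(self, context_dim, alpha=1.0):
        self.alpha = alpha
        # A starts as the identity (ridge prior), so A^-1 does too; b starts at zero.
        self.A_inv = [
            [1.0 if i == j else 0.0 for j in range(context_dim)]
            for i in range(context_dim)
        ]
        self.b = [0.0] * context_dim

    def _matvec(self, v):
        return [sum(row[j] * v[j] for j in range(len(v))) for row in self.A_inv]

    def score(self, x):
        theta = self._matvec(self.b)  # ridge estimate: A^-1 b
        mean = sum(t * xi for t, xi in zip(theta, x))
        a_inv_x = self._matvec(x)
        width = math.sqrt(sum(xi * ai for xi, ai in zip(x, a_inv_x)))
        return mean + self.alpha * width  # predicted reward + exploration bonus

    def update(self, x, reward):
        # Rank-one update of A^-1 after A += x x^T (A^-1 is symmetric).
        a_inv_x = self._matvec(x)
        denom = 1.0 + sum(xi * ai for xi, ai in zip(x, a_inv_x))
        d = len(x)
        for i in range(d):
            for j in range(d):
                self.A_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        for i in range(d):
            self.b[i] += reward * x[i]
```

At selection time you would call `score(x)` on every arm with the request's context vector and serve the arm with the highest value; directions of context space that an arm has rarely seen keep a large width and therefore a large bonus.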

LinTSPolicy

Linear Thompson Sampling. Bayesian approach to contextual bandits.

Parameter	Type	Default	Description
`sigma`	float	1.0	Prior variance
`context_dim`	int	—	Dimension of context vector (required)

Reward type: Continuous
Best for: Personalized recommendations when you want randomized exploration
Pros: Better empirical performance than LinUCB in many settings
Cons: Assumes linear reward model, more computation per selection

LogisticTSPolicy

Laplace-approximated Logistic Thompson Sampling. Maintains per-arm weight vectors and a diagonal Hessian approximation. At selection time, samples from the posterior N(w, diag(1/h)) and scores each arm via the logistic function.

Parameter	Type	Default	Description
`dim`	int	—	Dimension of context vector (required)
`lambda_`	float	1.0	L2 regularization strength
`lr`	float	0.1	Learning rate for weight updates

Reward type: Binary (0 or 1)
Best for: Ads, recommendations, personalization — any binary outcome with contextual features
Pros: Well-established approach for binary rewards with context; principled Bayesian exploration via posterior sampling
Cons: Diagonal Hessian is an approximation; may under-explore when posterior is poorly calibrated early on
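
The description above can be sketched for a single arm as follows. This is one plausible reading, not qbrix's implementation: it assumes the diagonal precision `h` is initialised to `lambda_`, that the weights take one gradient step of size `lr` per observation, and that `h` accumulates `p * (1 - p) * x_i^2`. The class and method names are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


class LogisticTSArm:
    def __init__(self, dim, lambda_=1.0, lr=0.1):
        self.w = [0.0] * dim      # posterior mean of the weights
        self.h = [lambda_] * dim  # diagonal Hessian (precision), seeded by the L2 prior
        self.lambda_ = lambda_
        self.lr = lr

    def sample_score(self, x, rng=random):
        # Draw weights from N(w, diag(1 / h)), then score with the logistic function.
        sampled = [rng.gauss(wi, 1.0 / math.sqrt(hi)) for wi, hi in zip(self.w, self.h)]
        return sigmoid(sum(si * xi for si, xi in zip(sampled, x)))

    def update(self, x, reward):
        p = sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)))
        for i, xi in enumerate(x):
            grad = (p - reward) * xi + self.lambda_ * self.w[i]
            self.w[i] -= self.lr * grad          # gradient step on the regularised log-loss
            self.h[i] += p * (1.0 - p) * xi * xi  # curvature shrinks the sampling variance
```

Early on `h` is close to `lambda_`, so sampled weights vary widely and arms are explored; as `h` grows along frequently seen features, samples concentrate around `w`.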

GLMUCBPolicy

GLM-UCB (Logistic UCB). Fits a logistic regression model per arm and selects using an upper confidence bound based on the logistic mean plus a scaled confidence width from the inverse Hessian.

Parameter	Type	Default	Description
`dim`	int	—	Dimension of context vector (required)
`alpha`	float	1.5	Exploration coefficient for UCB term
`lambda_`	float	1.0	L2 regularization strength
`lr`	float	0.1	Learning rate for weight updates

Reward type: Binary (0 or 1)
Best for: Binary rewards where deterministic, auditable selections are preferred over randomized sampling
Pros: Deterministic and easy to monitor; strong theoretical regret bounds for logistic models
Cons: Can over-explore when alpha is too large; tuning alpha requires care in production

Adversarial Policies

These make no assumptions about how rewards are generated. Use when rewards may be non-stationary or adversarially chosen.

EXP3Policy

Exponential-weight algorithm for Exploration and Exploitation. Uses multiplicative weight updates.

Parameter	Type	Default	Description
`gamma`	float	0.1	Exploration mixing parameter (0-1)

Reward type: Any (bounded)
Best for: Non-stationary environments, game-theoretic settings
Pros: Works against any reward sequence
Cons: Higher regret than stochastic methods when rewards are actually stationary

EXP3IXPolicy

EXP3 with Implicit Exploration (Neu, 2015). Uses reward / (p + gamma) instead of reward / p for importance-weighted updates, eliminating the numerical instability of standard EXP3 when selection probabilities are small.

Parameter	Type	Default	Description
`gamma`	float	0.1	Implicit exploration parameter
`eta`	float	0.1	Learning rate for weight updates

Reward type: Binary or Bounded
Best for: Adversarial or non-stationary settings where standard EXP3 shows numerical instability; production environments with many arms
Pros: Same worst-case regret guarantees as EXP3 but numerically stable; no reward clipping needed
Cons: Two hyperparameters to tune instead of one; marginal overhead vs. EXP3
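
The difference between the two estimators is easy to see numerically. The helper names and the probability value below are chosen for illustration; the formulas are the ones quoted in the description above.

```python
def exp3_estimate(reward, p):
    """Standard EXP3 importance-weighted reward estimate."""
    return reward / p


def exp3ix_estimate(reward, p, gamma=0.1):
    """EXP3-IX estimate: gamma in the denominator caps the value at reward / gamma."""
    return reward / (p + gamma)


# A rarely selected arm (p = 0.0001) finally pays out a reward of 1.
tiny_p = 1e-4
standard = exp3_estimate(1.0, tiny_p)    # roughly 10,000: one update swamps the weights
implicit = exp3ix_estimate(1.0, tiny_p)  # just under 1 / gamma = 10
```

The price of the cap is a small downward bias in the estimate, which `gamma` controls; the learning rate `eta` then scales how strongly each estimate moves the weights.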

FPLPolicy

Follow the Perturbed Leader. Adds random perturbations to cumulative rewards.

Parameter	Type	Default	Description
`eta`	float	1.0	Perturbation scale

Reward type: Any (bounded)
Best for: Adversarial settings when you want a perturbation-based approach
Pros: Simple implementation, competitive with EXP3
Cons: Requires tuning eta for best performance
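
A minimal sketch of the perturb-then-follow step is shown below. It assumes exponential noise scaled by `eta`, which is one common choice; the perturbation distribution qbrix uses is not stated here, and `fpl_select` is a hypothetical name. `cumulative_rewards` stands for whatever per-arm reward totals or estimates the policy maintains.

```python
import random


def fpl_select(cumulative_rewards, eta=1.0, rng=random):
    """Add fresh random noise to each arm's cumulative reward, then follow the leader."""
    perturbed = [r + eta * rng.expovariate(1.0) for r in cumulative_rewards]
    return max(range(len(perturbed)), key=lambda a: perturbed[a])
```

A larger `eta` lets the noise overturn bigger gaps in cumulative reward (more exploration), while `eta = 0` reduces to plain follow-the-leader, which an adversary can exploit.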

Example: Creating an Experiment with Policy Params

from qbrix import Qbrix
 
client = Qbrix()
 
experiment = client.experiment.create(
    name="personalized-pricing",
    pool_id="<pool-id>",
    policy="LinTSPolicy",
    policy_params={"sigma": 0.5, "context_dim": 10},
)

curl -X POST $QBRIX_URL/api/v1/experiments \
  -H "X-API-Key: $QBRIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "personalized-pricing",
    "pool_id": "<pool-id>",
    "policy": "LinTSPolicy",
    "policy_params": {
      "sigma": 0.5,
      "context_dim": 10
    }
  }'

Listing Available Policies

curl $QBRIX_URL/api/v1/policies \
  -H "X-API-Key: $QBRIX_API_KEY" | jq .

Returns all policies with their configurable parameters and defaults.