# Policies

qbrix ships eleven multi-armed bandit policies organized into three categories: stochastic, contextual, and adversarial. Each policy implements a different exploration-exploitation strategy.
## Choosing a Policy

Answer a few questions to find the right policy for your use case:

1. Do you have per-request user features? If yes, use a contextual policy (LinUCBPolicy or LinTSPolicy).
2. Could rewards be non-stationary or adversarially chosen? If yes, use an adversarial policy (EXP3Policy or FPLPolicy).
3. Otherwise, use a stochastic policy: BetaTSPolicy for binary rewards, GaussianTSPolicy for continuous rewards.
## Stochastic Policies

These assume rewards are drawn from a stationary distribution. Best for standard A/B testing and optimization scenarios.
### BetaTSPolicy

Thompson Sampling with Beta priors. The recommended default for binary rewards.

| Parameter | Type | Default | Description |
|---|---|---|---|
| alpha_prior | float | 1.0 | Beta prior alpha (successes) |
| beta_prior | float | 1.0 | Beta prior beta (failures) |
- Reward type: Binary (0 or 1)
- Best for: Click-through optimization, conversion rate testing
- Pros: Naturally balances exploration/exploitation, fast convergence
- Cons: Only supports binary rewards
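The per-request selection step amounts to one Beta draw per arm. A minimal standalone sketch (not qbrix's implementation; it assumes the caller tracks per-arm success and failure counts):

```python
import random

def beta_ts_select(successes, failures, alpha_prior=1.0, beta_prior=1.0):
    """Draw one sample from each arm's Beta posterior; play the largest draw."""
    draws = [
        random.betavariate(alpha_prior + s, beta_prior + f)
        for s, f in zip(successes, failures)
    ]
    return max(range(len(draws)), key=draws.__getitem__)

# Arm 1 has far stronger evidence (90/100 vs. 2/10), so it wins almost every draw.
arm = beta_ts_select(successes=[2, 90], failures=[8, 10])
```

Because exploration comes from the posterior sampling itself, under-observed arms still get picked occasionally, and the exploration rate decays automatically as evidence accumulates.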
### GaussianTSPolicy

Thompson Sampling with Gaussian priors. For continuous reward values.

| Parameter | Type | Default | Description |
|---|---|---|---|
| mu_prior | float | 0.0 | Prior mean |
| sigma_prior | float | 1.0 | Prior standard deviation |
- Reward type: Continuous (any float)
- Best for: Revenue optimization, time-on-page, engagement scores
- Pros: Handles continuous rewards, principled Bayesian updates
- Cons: Assumes Gaussian reward distribution
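Selection works the same way as BetaTS, only with Gaussian posteriors. An illustrative sketch, not qbrix's code; the `noise_sd` observation-noise parameter is an assumption of this sketch, not a documented qbrix parameter:

```python
import random

def gaussian_ts_select(sums, counts, mu_prior=0.0, sigma_prior=1.0,
                       noise_sd=1.0, rng=random):
    """Sample from each arm's conjugate Gaussian posterior; play the argmax."""
    draws = []
    for s, n in zip(sums, counts):
        # Conjugate update: prior N(mu_prior, sigma_prior^2), known noise variance.
        prec = 1.0 / sigma_prior ** 2 + n / noise_sd ** 2
        post_mean = (mu_prior / sigma_prior ** 2 + s / noise_sd ** 2) / prec
        draws.append(rng.gauss(post_mean, prec ** -0.5))
    return max(range(len(draws)), key=draws.__getitem__)
```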
### UCB1TunedPolicy
Upper Confidence Bound with tuned variance. Deterministic, no randomness in selection.
| Parameter | Type | Default | Description |
|---|---|---|---|
| — | — | — | No configurable parameters |
- Reward type: Continuous
- Best for: When you want deterministic, reproducible selections
- Pros: Strong theoretical guarantees, no randomness
- Cons: Can over-explore in practice
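The index is the empirical mean plus a variance-aware confidence radius. A sketch of the standard UCB1-Tuned formula (not qbrix's code; the caller supplies per-arm means, sample variances, and pull counts):

```python
import math

def ucb1_tuned_index(mean, var, n_arm, t):
    """Mean plus sqrt((ln t / n) * min(1/4, V)), where V pads the sample variance."""
    v = var + math.sqrt(2.0 * math.log(t) / n_arm)
    return mean + math.sqrt((math.log(t) / n_arm) * min(0.25, v))

def ucb1_tuned_select(means, variances, counts):
    # Play every arm once before trusting the index.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    t = sum(counts)
    scores = [ucb1_tuned_index(m, v, n, t)
              for m, v, n in zip(means, variances, counts)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Given identical inputs, the same arm is always returned, which is what makes selections reproducible.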
### KLUCBPolicy

KL-divergence based Upper Confidence Bound. Optimal for Bernoulli rewards.

| Parameter | Type | Default | Description |
|---|---|---|---|
| c | float | 0.0 | Exploration constant |
- Reward type: Binary
- Best for: Binary rewards when you want asymptotically optimal regret
- Pros: Asymptotically optimal for Bernoulli bandits
- Cons: More computationally expensive than BetaTS
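The index has no closed form: it is the largest `q` whose KL divergence from the empirical mean stays within a log-time budget, found by bisection. A sketch of the standard Bernoulli KL-UCB computation (not qbrix's implementation; the `c * log log t` role of the exploration constant is an assumption, and `t >= 2` is assumed):

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped for stability."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def klucb_index(mean, n_arm, t, c=0.0, iters=32):
    """Largest q in [mean, 1] with n * KL(mean, q) <= log t + c * log log t."""
    bound = (math.log(t) + c * math.log(math.log(t))) / n_arm
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mean, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo
```

The bisection loop is where the extra cost over BetaTS comes from: BetaTS needs only one Beta sample per arm per selection.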
### EpsilonPolicy

Epsilon-greedy. The simplest bandit algorithm. Explores with probability epsilon, exploits otherwise.

| Parameter | Type | Default | Description |
|---|---|---|---|
| epsilon | float | 0.1 | Exploration probability (0-1) |
- Reward type: Any
- Best for: Baselines, simple scenarios, when you want explicit control over exploration rate
- Pros: Dead simple, easy to reason about
- Cons: Wastes exploration budget on known-bad arms
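The whole algorithm fits in a few lines. A standalone sketch (not qbrix's code; `means` holds each arm's empirical mean reward):

```python
import random

def epsilon_greedy_select(means, epsilon=0.1, rng=random):
    """With probability epsilon explore uniformly; otherwise exploit the best mean."""
    if rng.random() < epsilon:
        return rng.randrange(len(means))
    return max(range(len(means)), key=means.__getitem__)
```

The "wasted exploration" con is visible here: the uniform branch is just as likely to re-try the worst arm as the runner-up.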
### MOSSPolicy

Minimax Optimal Strategy in the Stochastic case. Requires knowing the time horizon in advance.

| Parameter | Type | Default | Description |
|---|---|---|---|
| n_horizon | int | 1000 | Total number of rounds |
- Reward type: Continuous
- Best for: Fixed-duration campaigns where the total rounds are known
- Pros: Minimax-optimal regret bound
- Cons: Requires specifying horizon upfront
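The MOSS index uses the horizon to shut off exploration for arms that have already had their share of pulls. An illustrative sketch of the standard index (not qbrix's code):

```python
import math

def moss_index(mean, n_arm, horizon, n_arms):
    """Mean plus sqrt(max(0, ln(horizon / (K * n))) / n).

    The bonus reaches zero once an arm has been pulled horizon / K times,
    which is why the horizon must be known upfront.
    """
    bonus = math.sqrt(max(0.0, math.log(horizon / (n_arms * n_arm))) / n_arm)
    return mean + bonus
```

MOSSAnyTimePolicy keeps the same shape but substitutes the current round for the fixed horizon.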
### MOSSAnyTimePolicy

Anytime variant of MOSS. No need to specify the horizon.

| Parameter | Type | Default | Description |
|---|---|---|---|
| alpha | float | 2.0 | Exploration parameter |
- Reward type: Continuous
- Best for: Open-ended optimization without a known end date
- Pros: No horizon needed, near-optimal regret
- Cons: Slightly worse constant than MOSS with known horizon
## Contextual Policies
These use per-request feature vectors to personalize selections. The context vector is passed with each select request.
### LinUCBPolicy

Linear Upper Confidence Bound. Models reward as a linear function of context features.

| Parameter | Type | Default | Description |
|---|---|---|---|
| alpha | float | 1.0 | Exploration parameter |
| context_dim | int | — | Dimension of context vector (required) |
- Reward type: Continuous
- Best for: Personalized recommendations with user features
- Pros: Deterministic, strong theoretical guarantees
- Cons: Assumes linear reward model
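Standard disjoint-model LinUCB keeps a design matrix `A` and reward vector `b` per arm, scoring each arm by its predicted reward plus an uncertainty width. A numpy sketch of that textbook formulation (not qbrix's implementation):

```python
import numpy as np

def linucb_scores(A_list, b_list, x, alpha=1.0):
    """Score each arm: theta_hat . x + alpha * sqrt(x^T A^-1 x)."""
    scores = []
    for A, b in zip(A_list, b_list):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b  # ridge-regression estimate of the arm's coefficients
        scores.append(float(theta_hat @ x + alpha * np.sqrt(x @ A_inv @ x)))
    return scores

def linucb_update(A, b, x, reward):
    """Rank-one update after observing a reward for the played arm."""
    return A + np.outer(x, x), b + reward * x
```

Larger `alpha` widens the confidence term and increases exploration; the selection itself is deterministic given the state.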
### LinTSPolicy

Linear Thompson Sampling. Bayesian approach to contextual bandits.

| Parameter | Type | Default | Description |
|---|---|---|---|
| sigma | float | 1.0 | Prior variance |
| context_dim | int | — | Dimension of context vector (required) |
- Reward type: Continuous
- Best for: Personalized recommendations when you want randomized exploration
- Pros: Better empirical performance than LinUCB in many settings
- Cons: Assumes linear reward model, more computation per selection
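Instead of adding a deterministic confidence width, LinTS samples the coefficient vector from its posterior. A sketch under the same per-arm `A`/`b` bookkeeping as textbook LinUCB (illustrative only, not qbrix's code):

```python
import numpy as np

def lints_select(A_list, b_list, x, sigma=1.0, rng=None):
    """Sample theta ~ N(A^-1 b, sigma^2 A^-1) per arm; play the argmax of theta . x."""
    rng = rng or np.random.default_rng()
    scores = []
    for A, b in zip(A_list, b_list):
        A_inv = np.linalg.inv(A)
        theta = rng.multivariate_normal(A_inv @ b, sigma ** 2 * A_inv)
        scores.append(float(theta @ x))
    return int(np.argmax(scores))
```

The extra per-selection computation relative to LinUCB comes from the multivariate normal draw on top of the matrix inverse.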
## Adversarial Policies
These make no assumptions about how rewards are generated. Use when rewards may be non-stationary or adversarially chosen.
### EXP3Policy

Exponential-weight algorithm for Exploration and Exploitation. Uses multiplicative weight updates.

| Parameter | Type | Default | Description |
|---|---|---|---|
| gamma | float | 0.1 | Exploration mixing parameter (0-1) |
- Reward type: Any (bounded)
- Best for: Non-stationary environments, game-theoretic settings
- Pros: Works against any reward sequence
- Cons: Higher regret than stochastic methods when rewards are actually stationary
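EXP3 keeps one weight per arm, mixes the normalized weights with uniform exploration, and applies an importance-weighted update to whichever arm was played. A standalone sketch of the classic algorithm (not qbrix's internals; rewards are assumed scaled to [0, 1]):

```python
import math

def exp3_probs(weights, gamma=0.1):
    """Mix normalized exponential weights with a uniform exploration floor."""
    total = sum(weights)
    k = len(weights)
    return [(1.0 - gamma) * w / total + gamma / k for w in weights]

def exp3_update(weights, probs, arm, reward, gamma=0.1):
    """Importance-weight the reward so arm estimates stay unbiased."""
    k = len(weights)
    x_hat = reward / probs[arm]  # only the played arm's weight moves
    weights[arm] *= math.exp(gamma * x_hat / k)
    return weights
```

The uniform floor of `gamma / k` is what guarantees every arm keeps being sampled no matter how the reward sequence evolves.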
### FPLPolicy

Follow the Perturbed Leader. Adds random perturbations to cumulative rewards.

| Parameter | Type | Default | Description |
|---|---|---|---|
| eta | float | 1.0 | Perturbation scale |
- Reward type: Any (bounded)
- Best for: Adversarial settings when you want a perturbation-based approach
- Pros: Simple implementation, competitive with EXP3
- Cons: Requires tuning eta for best performance
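FPL's selection rule is one line: perturb each arm's cumulative reward and follow the leader. A sketch using exponential noise (one common choice; not necessarily the distribution qbrix uses):

```python
import random

def fpl_select(cumulative_rewards, eta=1.0, rng=random):
    """Add an independent exponential perturbation per arm, then take the argmax."""
    perturbed = [r + eta * rng.expovariate(1.0) for r in cumulative_rewards]
    return max(range(len(perturbed)), key=perturbed.__getitem__)
```

`eta` controls how often the perturbations overturn the current leader, which is why it needs tuning per workload.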
## Example: Creating an Experiment with Policy Params

```python
from qbrix import Qbrix

client = Qbrix()
experiment = client.experiment.create(
    name="personalized-pricing",
    pool_id="<pool-id>",
    policy="LinTSPolicy",
    policy_params={"sigma": 0.5, "context_dim": 10},
)
```

```bash
curl -X POST $QBRIX_URL/api/v1/experiments \
  -H "X-API-Key: $QBRIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "personalized-pricing",
    "pool_id": "<pool-id>",
    "policy": "LinTSPolicy",
    "policy_params": {
      "sigma": 0.5,
      "context_dim": 10
    }
  }'
```

## Listing Available Policies

```bash
curl $QBRIX_URL/api/v1/policies \
  -H "X-API-Key: $QBRIX_API_KEY" | jq .
```

Returns all policies with their configurable parameters and defaults.