Auto Policy
The auto policy is qbrix's recommended default. Instead of asking you to choose between Thompson Sampling, UCB, EXP3, LinTS, etc., it runs a portfolio of learners in parallel and uses a meta-level controller to adaptively route traffic toward whichever learner is performing best on your real data.
Why Use It
Picking the right bandit algorithm is a research problem in disguise — the "best" choice depends on reward stationarity, arm count, context dimension, and assumptions you don't always know upfront. The auto policy turns that decision into a runtime experiment:
- Zero-choice operation. Pass `policy="auto"` and a reward type; qbrix builds the portfolio.
- Adaptive. Learners that perform well get more traffic; learners that under-perform fade out.
- Robust to drift. If your reward distribution shifts and a different learner becomes optimal, the meta-controller catches up.
- Easy escape hatch. You can always switch to a concrete policy later.
How It Works
When you create an experiment with `policy="auto"`, qbrix actually creates N+1 experiments:

```
parent meta experiment → MetaBanditPolicy (EXP3 at the meta level)
├─ learner_0 → BetaTSPolicy
├─ learner_1 → UCB1TunedPolicy
├─ learner_2 → EpsilonPolicy
└─ ...
```
Every `select` call walks two layers:

1. Meta layer: EXP3 picks one learner according to its current weight.
2. Learner layer: that learner picks an arm using its own algorithm.
Feedback is credited to both layers. The learner updates its internal parameters, and the meta-controller updates the EXP3 weight for the learner that produced the selection.
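The two-layer loop can be sketched as follows. This is a minimal illustration of the mechanics described above, not qbrix internals: `MetaExp3` and `BernoulliTS` are stand-in names, and the portfolio size and hyperparameters are assumptions.

```python
import math
import random

class MetaExp3:
    """Sketch of an EXP3 controller over a portfolio of learners (meta layer)."""

    def __init__(self, n_learners, gamma=0.1):
        self.gamma = gamma
        self.weights = [1.0] * n_learners

    def probs(self):
        # Mix normalized weights with uniform exploration (standard EXP3).
        total = sum(self.weights)
        k = len(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / k
                for w in self.weights]

    def select(self):
        p = self.probs()
        i = random.choices(range(len(p)), weights=p)[0]
        return i, p[i]  # chosen learner index and its selection probability

    def update(self, i, p_i, reward):
        # Importance-weight the reward so the estimate stays unbiased,
        # then fold it into the chosen learner's weight.
        self.weights[i] *= math.exp(self.gamma * (reward / p_i) / len(self.weights))

class BernoulliTS:
    """Sketch of a Thompson Sampling learner for binary rewards (learner layer)."""

    def __init__(self, n_arms):
        self.a = [1.0] * n_arms  # Beta posterior: successes + 1
        self.b = [1.0] * n_arms  # Beta posterior: failures + 1

    def select(self):
        samples = [random.betavariate(a, b) for a, b in zip(self.a, self.b)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        self.a[arm] += reward
        self.b[arm] += 1 - reward

# One select/feedback round credits both layers:
meta = MetaExp3(n_learners=3)
learners = [BernoulliTS(n_arms=4) for _ in range(3)]

i, p_i = meta.select()             # meta layer picks a learner
arm = learners[i].select()         # that learner picks an arm
reward = 1.0                       # observed feedback for this selection
learners[i].update(arm, reward)    # learner-layer update
meta.update(i, p_i, reward)        # meta-layer EXP3 weight update
```

Over many rounds, learners that keep producing rewarded selections accumulate weight and receive proportionally more traffic, while the `gamma` floor guarantees every learner keeps a minimum share of exploration.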
The portfolio is scoped automatically to your reward type and context settings. For binary rewards without context, you'll get stochastic learners; for binary rewards with `use_context=true` and a `dim`, you'll get LinTS, LogisticTS, GLMUCB, etc.
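As a rough illustration of that scoping rule, the selection logic for binary rewards might look like the sketch below. The learner names come from this page, but `build_portfolio` is a hypothetical helper, not part of the qbrix API, and the exact portfolio contents are assumptions.

```python
def build_portfolio(reward_type, use_context=False, dim=None):
    """Hypothetical sketch: choose learner names for a binary-reward experiment."""
    if reward_type != "binary":
        raise NotImplementedError("this sketch covers binary rewards only")
    if use_context:
        # dim is required whenever contextual learners are in play.
        if dim is None:
            raise ValueError("dim is required when use_context=True")
        # Contextual portfolio (illustrative subset).
        return ["LinTSPolicy", "LogisticTSPolicy", "GLMUCBPolicy"]
    # Stochastic, non-contextual portfolio (illustrative subset).
    return ["BetaTSPolicy", "UCB1TunedPolicy", "EpsilonPolicy"]
```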
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `reward_type` | string | — | `binary`, `bounded`, or `continuous`. Scopes the learner portfolio. |
| `use_context` | bool | `false` | If `true`, contextual learners are added to the portfolio. |
| `dim` | int | — | Context vector dimension. Required when `use_context=true`. |
Examples

Binary rewards, no context:

```python
client.experiment.create(
    name="checkout-banner",
    pool_id=pool.id,
    policy="auto",
    policy_params={"reward_type": "binary"},
)
```

Binary rewards with a 16-dimensional context:

```python
client.experiment.create(
    name="personalized-hero",
    pool_id=pool.id,
    policy="auto",
    policy_params={
        "reward_type": "binary",
        "use_context": True,
        "dim": 16,
    },
)
```

Continuous rewards (e.g. revenue):

```python
client.experiment.create(
    name="revenue-test",
    pool_id=pool.id,
    policy="auto",
    policy_params={"reward_type": "continuous"},
)
```

When NOT to Use Auto
- You need full control over a single algorithm for reproducibility, audit, or research purposes.
- You need to minimize regret as aggressively as possible. The meta layer adds a small, constant exploration overhead compared with running the single best algorithm in hindsight.
- You have very few requests. With fewer than a few thousand selections per experiment, the meta-controller may not accumulate enough signal to confidently shift weights.
For all other cases, auto is the recommended starting point. You can always switch to a concrete policy later by creating a new experiment.
Inspecting the Portfolio
When you create an auto experiment in the console, you'll see the parent meta experiment in your experiment list. Open it to view the learner experiments, their individual reward histories, and the current EXP3 weight assigned to each.
Next Steps
- Policies — the full algorithm catalog
- Feedback & Rewards — how reward signals propagate through the meta layer
- Console Experiments — run and monitor auto experiments in the console