A/B Tests

Overview

ze.choose() enables A/B testing by making weighted random selections between variants (models, prompts, parameters, etc.), timeboxing each experiment, and automatically tracking the chosen variant on the active span/trace/session for downstream analytics. Key features:

Weighted random selection between variants
Experiment timeboxing via duration_days
Automatic tracking of choices within spans, traces, or sessions
Consistency caching — same entity always gets the same variant
Built-in validation of weights, variant keys, and defaults
Automatic fallback to a default variant once an experiment completes

Basic Usage

import zeroeval as ze

ze.init()

# Must be called within a span, trace, or session context
with ze.span("my_operation"):
    # Choose between two models with 70/30 split for 14 days
    model = ze.choose(
        "model_selection",
        variants={"fast": "gpt-4o-mini", "powerful": "gpt-4o"},
        weights={"fast": 0.7, "powerful": 0.3},
        duration_days=14,
        default_variant="fast"  # optional fallback after day 14
    )
    
    # Use the selected model
    # model will be either "gpt-4o-mini" (70% chance) or "gpt-4o" (30% chance)

Parameters

Parameter	Type	Required	Description
`name`	`str`	Yes	Name of the A/B test (e.g., "model_selection", "prompt_variant")
`variants`	`Dict[str, Any]`	Yes	Dictionary mapping variant keys to their values
`weights`	`Dict[str, float]`	Yes	Dictionary mapping variant keys to selection probabilities; should sum to 1.0 (a warning is issued if the sum falls outside 0.95-1.05)
`duration_days`	`int`	Yes	Number of days the experiment should run; must be > 0
`default_variant`	`str`	No	Variant key to use automatically once the experiment ends (defaults to the first key if omitted)

Returns

Returns the value from the selected variant (not the key).
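
To make the key/value distinction concrete, here is a minimal sketch of weighted selection in plain Python. It is not ZeroEval's implementation: `pick_variant` is a hypothetical helper written for illustration, and it relies only on the standard library's `random.choices`.

```python
import random
from typing import Any, Dict, Optional


def pick_variant(
    variants: Dict[str, Any],
    weights: Dict[str, float],
    rng: Optional[random.Random] = None,
) -> Any:
    """Hypothetical helper: pick a variant key by weight, return its value."""
    rng = rng or random.Random()
    keys = list(variants)
    # random.choices draws according to relative weights; k=1 gives one pick.
    chosen_key = rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]
    # The caller receives the value ("gpt-4o-mini"), not the key ("fast").
    return variants[chosen_key]
```

Calling `pick_variant({"fast": "gpt-4o-mini", "powerful": "gpt-4o"}, {"fast": 0.7, "powerful": 0.3})` returns one of the two model names, which can be passed straight to a client call.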

Experiment Lifecycle & Defaults

duration_days timeboxes the experiment. Once the backend marks it completed, ze.choose() automatically serves the default_variant.
If default_variant is omitted, the first key in variants becomes the fallback.
When an experiment is still active, the same entity (span/trace/session) receives a cached, consistent variant choice.
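
The lifecycle rules above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: `choose_sketch`, `entity_id`, `started_at`, and the in-memory `_cache` are all hypothetical, and completion is decided here by a local clock comparison, whereas the documentation says the backend marks the experiment as completed.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional, Tuple

# Hypothetical in-memory cache: (test name, entity id) -> chosen variant key.
_cache: Dict[Tuple[str, str], str] = {}


def choose_sketch(
    name: str,
    entity_id: str,
    variants: Dict[str, Any],
    weights: Dict[str, float],
    started_at: datetime,
    duration_days: int,
    default_variant: Optional[str] = None,
    now: Optional[datetime] = None,
) -> Any:
    now = now or datetime.now(timezone.utc)
    # If no default is given, the first key in variants is the fallback.
    fallback = default_variant if default_variant is not None else next(iter(variants))
    # Assumption: the experiment is complete once the time window has elapsed.
    if now >= started_at + timedelta(days=duration_days):
        return variants[fallback]
    cache_key = (name, entity_id)
    if cache_key not in _cache:
        keys = list(variants)
        _cache[cache_key] = random.choices(
            keys, weights=[weights[k] for k in keys], k=1
        )[0]
    # While the test is active, the same entity keeps its cached choice.
    return variants[_cache[cache_key]]
```

Repeated calls with the same `name` and `entity_id` during the window return the same value; once the window has passed, every call returns the fallback variant's value.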

Tracking Signals

Attach success metrics to the same span where ze.choose() runs so dashboards can correlate outcomes with variant performance:

with ze.span("recommendation_flow") as span:
    model = ze.choose(
        "reco_models_v2",
        variants={"mini": "gpt-4o-mini", "full": "gpt-4o"},
        weights={"mini": 0.6, "full": 0.4},
        duration_days=21,
        default_variant="mini",
    )
    
    # run_inference() stands in for your own inference and scoring code
    score = run_inference(model)
    ze.set_signal(span, {"conversion_success": score > 0.75})
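
For intuition about what correlating signals with variants involves, the following sketch computes a per-variant success rate offline. It is purely illustrative: `success_rate_by_variant` is a hypothetical helper, and the `(variant key, succeeded)` record shape is an assumption, not a ZeroEval export format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rate_by_variant(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Hypothetical helper: share of successful signals per variant key."""
    totals: Dict[str, int] = defaultdict(int)
    wins: Dict[str, int] = defaultdict(int)
    for variant_key, succeeded in records:
        totals[variant_key] += 1
        wins[variant_key] += int(succeeded)
    # Every key in totals has at least one record, so no division by zero.
    return {key: wins[key] / totals[key] for key in totals}
```

Each record pairs the variant key chosen on a span with the boolean signal attached to that same span, which is why the signal should be set on the span where `ze.choose()` ran.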

Complete Example

import zeroeval as ze
import openai

ze.init()
client = openai.OpenAI()

with ze.span("model_ab_test", tags={"feature": "model_comparison"}) as span:
    # A/B test between two models
    selected_model = ze.choose(
        "model_selection",
        variants={
            "mini": "gpt-4o-mini",
            "full": "gpt-4o"
        },
        weights={
            "mini": 0.7,  # 70% traffic
            "full": 0.3   # 30% traffic
        },
        duration_days=14,
        default_variant="mini"
    )
    
    # The selected model is automatically tracked
    response = client.chat.completions.create(
        model=selected_model,
        messages=[{"role": "user", "content": "Hello!"}]
    )
    
    # Attach a success signal tied to this span/choice
    # evaluate_response() is a placeholder for your own scoring function
    rating = evaluate_response(response)
    ze.set_signal(span, {"response_quality": rating >= 0.7})

Important Notes

Context Required: Must be called within an active ze.span(), trace, or session
Consistency: Same entity (span/trace/session) always receives the same variant while the test runs
Weight Validation: Weights should sum to 1.0 (warns if not within 0.95-1.05)
Duration Required: duration_days must be > 0; experiments stop after this window
Fallback Behavior: Once the backend reports the test as completed, default_variant is used automatically
Signal Analytics: Use ze.set_signal() on the same span to compare variant impact in the dashboard
Key Matching: Variant keys and weight keys must match exactly
