Reproducibility

How Pramana achieves deterministic outputs, and where it cannot.

Why reproducibility matters

Drift detection requires comparing outputs across time and users. If two runs of the same prompt produce different outputs due to sampling randomness (not a model change), we can't distinguish real drift from noise.
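This distinction can be sketched as a toy decision rule: if repeated baseline runs of the same prompt don't agree with each other, any later difference is indistinguishable from sampling noise. This is an illustrative sketch, not Pramana's actual logic; `output_hash` and `classify` are hypothetical helpers:

```python
import hashlib

def output_hash(text: str) -> str:
    """Stable fingerprint of a model output for run-to-run comparison."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def classify(baseline_runs: list[str], current_runs: list[str]) -> str:
    """Toy decision rule: a noisy baseline makes drift unattributable."""
    baseline = {output_hash(o) for o in baseline_runs}
    current = {output_hash(o) for o in current_runs}
    if len(baseline) > 1:
        return "noisy baseline"  # sampling randomness, not model change
    if baseline == current:
        return "no drift"
    return "drift"

print(classify(["same"], ["same"]))     # no drift
print(classify(["same"], ["changed"]))  # drift
print(classify(["a", "b"], ["c"]))      # noisy baseline
```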

Pramana defaults

| Parameter | Default | Purpose |
|---|---|---|
| `temperature` | 0.0 | Eliminates sampling randomness |
| `seed` | 42 | Fixed RNG state for providers that support it |

Provider comparison

| Provider | Temperature | Seed | Reproducibility | Recommended |
|---|---|---|---|---|
| OpenAI API | enforced | enforced | high | Yes — scientific drift detection |
| Anthropic API | enforced | ignored | low | No — non-deterministic even at temp=0 |
| Claude Code | hint only | N/A | low | No — uses temp=1.0 by default |
| Google Gemini | enforced | enforced | TBD | Under evaluation |

OpenAI

Best available reproducibility support. With the same seed and the same `system_fingerprint` in the response, outputs are expected to be identical (OpenAI describes seeded sampling as best-effort determinism, so fingerprints should be checked).

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

response = await client.chat.completions.create(
    model="gpt-5.2",
    messages=[{"role": "user", "content": "prompt"}],
    temperature=0.0,
    seed=42,
)
```

Source: OpenAI Cookbook

Anthropic

The API accepts a seed parameter but silently ignores it. Official documentation states: "even with temperature=0.0, results will not be fully deterministic."

```python
from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

response = await client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,  # required by the Messages API
    messages=[{"role": "user", "content": "prompt"}],
    temperature=0.0,
    # seed is not supported
)
```

Source: Anthropic API docs

Claude Code (subscription)

Uses temperature=1.0 by default. Parameters are passed as text hints in the prompt, not enforced by the API.

Recommendations

For scientific drift detection

```bash
export OPENAI_API_KEY="sk-..."
pramana run --tier comprehensive --model gpt-5.2 --temperature 0.0 --seed 42
```

For exploratory testing

```bash
# Any provider works, but results are not reproducible
pramana run --tier cheap --model claude-opus-4-6
```

Warning: Results from the Anthropic API and the Claude Code subscription are not reproducible and should not be used for drift detection research.

Consistent defaults ≠ reproducibility

Claude Code uses temperature=1.0 for all users (a consistent default). But consistent defaults do not mean reproducible outputs — the sampling process is still non-deterministic. Two runs with identical inputs and parameters can produce different outputs.

Verifying reproducibility

Run the test suite to verify assertion logic and provider wiring:

```bash
pytest tests/
```

For empirical variance measurement, run the same eval multiple times and compare result hashes.
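Such a comparison can be sketched by hashing each run's output and counting distinct results: one distinct output across N runs means empirically reproducible, more means run-to-run variance. `variance_report` is a hypothetical helper, not part of Pramana:

```python
import hashlib
from collections import Counter

def variance_report(outputs: list[str]) -> dict[str, int]:
    """How many distinct outputs did N nominally identical runs produce?"""
    hashes = [hashlib.sha256(o.encode("utf-8")).hexdigest() for o in outputs]
    return {"runs": len(outputs), "distinct_outputs": len(Counter(hashes))}

# `outputs` would be collected from repeated runs of the same eval:
print(variance_report(["same", "same", "same"]))  # {'runs': 3, 'distinct_outputs': 1}
print(variance_report(["same", "same", "diff"]))  # {'runs': 3, 'distinct_outputs': 2}
```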