Reproducibility

How Pramana achieves deterministic outputs, and where it cannot.

Why reproducibility matters

Drift detection requires comparing outputs across time and users. If two runs of the same prompt produce different outputs due to sampling randomness (not a model change), we can't distinguish real drift from noise.
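This distinction can be sketched as a toy decision rule: if repeated baseline runs of the same prompt don't agree with each other, any later difference is indistinguishable from sampling noise. This is an illustrative sketch, not Pramana's actual logic; `output_hash` and `classify` are hypothetical helpers:

```python
import hashlib

def output_hash(text: str) -> str:
    """Stable fingerprint of a model output for run-to-run comparison."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def classify(baseline_runs: list[str], current_runs: list[str]) -> str:
    """Toy decision rule: a noisy baseline makes drift unattributable."""
    baseline = {output_hash(o) for o in baseline_runs}
    current = {output_hash(o) for o in current_runs}
    if len(baseline) > 1:
        return "noisy baseline"  # sampling randomness, not model change
    if baseline == current:
        return "no drift"
    return "drift"

print(classify(["same"], ["same"]))     # no drift
print(classify(["same"], ["changed"]))  # drift
print(classify(["a", "b"], ["c"]))      # noisy baseline
```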

Pramana defaults

| Parameter | Default | Purpose |
|---|---|---|
| `temperature` | 0.0 | Eliminates sampling randomness |
| `seed` | 42 | Fixed RNG state for providers that support it |

Provider comparison

| Provider | Temperature | Seed | Reproducibility | Recommended |
|---|---|---|---|---|
| OpenAI API | enforced | enforced | high | Yes — scientific drift detection |
| Anthropic API | enforced | ignored | low | No — non-deterministic even at temp=0 |
| Claude Code | hint only | N/A | low | No — uses temp=1.0 by default |
| Google Gemini | enforced | enforced | TBD | Under evaluation |

OpenAI

Best available reproducibility support. With the same seed and the same `system_fingerprint` in the response, outputs are expected to be identical (OpenAI describes seeded sampling as best-effort determinism, so fingerprints should be checked).

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

response = await client.chat.completions.create(
    model="gpt-5.2",
    messages=[{"role": "user", "content": "prompt"}],
    temperature=0.0,
    seed=42,
)
```

Source: OpenAI Cookbook

Anthropic

The API accepts a seed parameter but silently ignores it. Official documentation states: "even with temperature=0.0, results will not be fully deterministic."

```python
from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

response = await client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,  # required by the Messages API
    messages=[{"role": "user", "content": "prompt"}],
    temperature=0.0,
    # seed is not supported
)
```

Source: Anthropic API docs

Claude Code (subscription)

Uses temperature=1.0 by default. Parameters are passed as text hints in the prompt, not enforced by the API.

Recommendations

For scientific drift detection

```bash
export OPENAI_API_KEY="sk-..."
pramana run --tier comprehensive --model gpt-5.2 --temperature 0.0 --seed 42
```

For exploratory testing

```bash
# Any provider works, but results are not reproducible
pramana run --tier cheap --model claude-opus-4-6
```

Warning: Results from the Anthropic API and the Claude Code subscription are not reproducible and should not be used for drift detection research.

Consistent defaults ≠ reproducibility

Claude Code uses temperature=1.0 for all users (a consistent default). But consistent defaults do not mean reproducible outputs — the sampling process is still non-deterministic. Two runs with identical inputs and parameters can produce different outputs.

Verifying reproducibility

Run the test suite to verify assertion logic and provider wiring:

```bash
pytest tests/
```

For empirical variance measurement, run the same eval multiple times and compare result hashes.
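Such a comparison can be sketched by hashing each run's output and counting distinct results: one distinct output across N runs means empirically reproducible, more means run-to-run variance. `variance_report` is a hypothetical helper, not part of Pramana:

```python
import hashlib
from collections import Counter

def variance_report(outputs: list[str]) -> dict[str, int]:
    """How many distinct outputs did N nominally identical runs produce?"""
    hashes = [hashlib.sha256(o.encode("utf-8")).hexdigest() for o in outputs]
    return {"runs": len(outputs), "distinct_outputs": len(Counter(hashes))}

# `outputs` would be collected from repeated runs of the same eval:
print(variance_report(["same", "same", "same"]))  # {'runs': 3, 'distinct_outputs': 1}
print(variance_report(["same", "same", "diff"]))  # {'runs': 3, 'distinct_outputs': 2}
```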