# Reproducibility

How Pramana achieves (and where it can't achieve) deterministic outputs.
## Why reproducibility matters
Drift detection requires comparing outputs across time and users. If two runs of the same prompt produce different outputs due to sampling randomness (not a model change), we can't distinguish real drift from noise.
## Pramana defaults

| Parameter | Default | Purpose |
|---|---|---|
| `temperature` | 0.0 | Eliminates sampling randomness |
| `seed` | 42 | Fixed RNG state for providers that support it |
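One way to picture these defaults is as a small frozen config object that gets merged into every provider request. This is an illustrative sketch only; `ReproDefaults` is a hypothetical name, not Pramana's actual config type.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ReproDefaults:
    """Hypothetical container for the defaults above."""
    temperature: float = 0.0  # eliminates sampling randomness
    seed: int = 42            # fixed RNG state where the provider supports it

# e.g. merged into the provider's request kwargs
params = asdict(ReproDefaults())
```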
## Provider comparison
| Provider | Temperature | Seed | Reproducibility | Recommended |
|---|---|---|---|---|
| OpenAI API | enforced | enforced | high | Yes — scientific drift detection |
| Anthropic API | enforced | ignored | low | No — non-deterministic even at temp=0 |
| Claude Code | hint only | N/A | low | No — uses temp=1.0 by default |
| Google Gemini | enforced | enforced | TBD | Under evaluation |
### OpenAI

Full reproducibility support: the same `seed` combined with the same `system_fingerprint` yields identical output.

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

# Inside an async function:
response = await client.chat.completions.create(
    model="gpt-5.2",
    messages=[{"role": "user", "content": "prompt"}],
    temperature=0.0,
    seed=42,
)
```
- `system_fingerprint` changes only on infrastructure updates (rare)
- Temperature and seed are enforced at the API level
- Some reasoning models (o1, o3) reject these parameters — Pramana handles this by retrying without them and logging a warning
Source: OpenAI Cookbook
### Anthropic

The API accepts a `seed` parameter but silently ignores it. Official documentation states: "even with temperature=0.0, results will not be fully deterministic."

```python
from anthropic import AsyncAnthropic

client = AsyncAnthropic()

# Inside an async function:
response = await client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,  # required by the Messages API
    messages=[{"role": "user", "content": "prompt"}],
    temperature=0.0,
    # seed not supported
)
```
- No `system_fingerprint` equivalent
- No reproducibility guarantees at any temperature
Source: Anthropic API docs
### Claude Code (subscription)
Uses temperature=1.0 by default. Parameters are passed as text hints in the prompt, not enforced by the API.
- No temperature or seed parameters in the SDK
- Non-deterministic by design
- Suitable for exploratory testing only
## Recommendations

### For scientific drift detection

```bash
export OPENAI_API_KEY="sk-..."
pramana run --tier comprehensive --model gpt-5.2 --temperature 0.0 --seed 42
```
### For exploratory testing

```bash
# Any provider works, but results are not reproducible
pramana run --tier cheap --model claude-opus-4-6
```
## Consistent defaults ≠ reproducibility
Claude Code uses temperature=1.0 for all users (a consistent default). But consistent defaults do not mean reproducible outputs — the sampling process is still non-deterministic. Two runs with identical inputs and parameters can produce different outputs.
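A toy illustration in pure Python (no LLM involved) makes the distinction concrete: identical call parameters do not imply identical outputs unless the underlying randomness is seeded.

```python
import random

def sample(seed=None, n=20):
    """Draw n digits; with seed=None the RNG state differs per call."""
    rng = random.Random(seed)
    return [rng.randint(0, 9) for _ in range(n)]

# Same "default parameters", almost certainly different outputs:
unseeded_a, unseeded_b = sample(), sample()

# Fixed seed: the property drift detection actually needs
assert sample(seed=42) == sample(seed=42)
```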
## Verifying reproducibility

Run the test suite to verify assertion logic and provider wiring:

```bash
pytest tests/
```
For empirical variance measurement, run the same eval multiple times and compare result hashes.
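The hash comparison can be sketched as below. The function names are illustrative, not part of Pramana's CLI; the idea is simply that a run is reproducible only if every repetition produces a byte-identical output.

```python
import hashlib

def result_hash(output: str) -> str:
    """Stable SHA-256 fingerprint of a model output."""
    return hashlib.sha256(output.encode("utf-8")).hexdigest()

def is_reproducible(outputs: list[str]) -> bool:
    """True only if every run produced byte-identical output."""
    return len({result_hash(o) for o in outputs}) == 1
```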