Test Suites
JSONL format, assertion types, tiers, and categories.
JSONL format
Each test suite is a JSONL file (one JSON object per line).
{
"id": "reasoning-001",
"category": "reasoning",
"input": "If 5 cats catch 5 mice in 5 minutes, how long for 100 cats to catch 100 mice?",
"ideal": ["5 minutes", "5 mins"],
"assertion": {
"type": "contains_any",
"case_sensitive": false
},
"metadata": {
"difficulty": "medium",
"tokens_est": 40,
"tags": ["math", "logic"]
}
}
Field reference
| Field | Type | Description |
|---|---|---|
id | string | Unique ID: {category}-{num} |
category | enum | One of 6 categories (see below) |
input | string | message[] | Prompt text or chat message array |
ideal | string | string[] | null | Expected output(s) for matching |
assertion.type | enum | Assertion type (see below) |
assertion.case_sensitive | bool | Default: true |
assertion.threshold | float | null | For semantic_similarity |
assertion.judge_prompt | string | null | For llm_judge |
metadata.difficulty | easy | medium | hard | Difficulty rating |
metadata.tokens_est | int | Estimated token count |
metadata.tags | string[] | Optional tags |
Assertion types
Source: assertions.py
| Type | Behavior | Status |
|---|---|---|
exact_match |
Output (stripped) must equal ideal. Respects case_sensitive. |
implemented |
contains |
Output must contain ideal as substring. |
implemented |
contains_any |
Output must contain at least one string from ideal[]. |
implemented |
is_json |
Output must be valid JSON. Strips markdown fences before parsing. | implemented |
llm_judge |
Uses the same provider as judge. Sends output + judge_prompt, expects YES/NO. |
implemented |
semantic_similarity |
Embedding-based similarity against ideal with threshold. |
not implemented |
Suite tiers
Data from manifest.toml:
| Tier | Tests | Est. cost | Est. tokens | Use case |
|---|---|---|---|---|
cheap |
10 | $0.05 | 350 | Smoke test, CI gates |
moderate |
25 | $0.25 | 950 | Regular monitoring |
comprehensive |
75 | $1.50 | 2,800 | Full evaluation |
Categories
| Category | Examples |
|---|---|
reasoning | Logic puzzles, math, deductive chains |
factual | Knowledge recall, date/number questions |
instruction_following | Format compliance, constraint adherence |
coding | Code generation, debugging, JSON output |
safety | Refusal of harmful requests, boundary testing |
creative | Open-ended generation (judged by LLM) |
Suite versioning
Suites are immutable. Once published, a suite version never changes. New tests go into a new version.
Each suite has a SHA-256 hash computed from sorted test case IDs and content. This hash is included in every submission for integrity verification.
from pramana.hash import hash_suite
print(hash_suite("src/pramana/suites/v1.0/cheap.jsonl"))
# sha256:a1b2c3...
File layout
src/pramana/suites/v1.0/
├── manifest.toml # Tier metadata, categories, assertion types
├── cheap.jsonl # 10 tests
├── moderate.jsonl # 25 tests
└── comprehensive.jsonl # 75 tests