Test Suites

JSONL format, assertion types, tiers, and categories.

JSONL format

Each test suite is a JSONL file (one JSON object per line).

{
  "id": "reasoning-001",
  "category": "reasoning",
  "input": "If 5 cats catch 5 mice in 5 minutes, how long for 100 cats to catch 100 mice?",
  "ideal": ["5 minutes", "5 mins"],
  "assertion": {
    "type": "contains_any",
    "case_sensitive": false
  },
  "metadata": {
    "difficulty": "medium",
    "tokens_est": 40,
    "tags": ["math", "logic"]
  }
}

Field reference

Field	Type	Description
`id`	`string`	Unique ID: `{category}-{num}`
`category`	`enum`	One of 6 categories (see below)
`input`	`string \| message[]`	Prompt text or chat message array
`ideal`	`string \| string[] \| null`	Expected output(s) for matching
`assertion.type`	`enum`	Assertion type (see below)
`assertion.case_sensitive`	`bool`	Default: `true`
`assertion.threshold`	`float \| null`	For `semantic_similarity`
`assertion.judge_prompt`	`string \| null`	For `llm_judge`
`metadata.difficulty`	`easy \| medium \| hard`	Difficulty rating
`metadata.tokens_est`	`int`	Estimated token count
`metadata.tags`	`string[]`	Optional tags

Assertion types

Source: assertions.py

Type	Behavior	Status
`exact_match`	Output (stripped) must equal `ideal`. Respects `case_sensitive`.	implemented
`contains`	Output must contain `ideal` as substring.	implemented
`contains_any`	Output must contain at least one string from `ideal[]`.	implemented
`is_json`	Output must be valid JSON. Strips markdown fences before parsing.	implemented
`llm_judge`	Uses the same provider as judge. Sends output + `judge_prompt`, expects YES/NO.	implemented
`semantic_similarity`	Embedding-based similarity against `ideal` with `threshold`.	not implemented

Suite tiers

Data from manifest.toml:

Tier	Tests	Est. cost	Est. tokens	Use case
`cheap`	10	$0.05	350	Smoke test, CI gates
`moderate`	25	$0.25	950	Regular monitoring
`comprehensive`	75	$1.50	2,800	Full evaluation

Category	Examples
`reasoning`	Logic puzzles, math, deductive chains
`factual`	Knowledge recall, date/number questions
`instruction_following`	Format compliance, constraint adherence
`coding`	Code generation, debugging, JSON output
`safety`	Refusal of harmful requests, boundary testing
`creative`	Open-ended generation (judged by LLM)

Suite versioning

Suites are immutable. Once published, a suite version never changes. New tests go into a new version.

Each suite has a SHA-256 hash computed from sorted test case IDs and content. This hash is included in every submission for integrity verification.

from pramana.hash import hash_suite
print(hash_suite("src/pramana/suites/v1.0/cheap.jsonl"))
# sha256:a1b2c3...

File layout

src/pramana/suites/v1.0/
├── manifest.toml     # Tier metadata, categories, assertion types
├── cheap.jsonl       # 10 tests
├── moderate.jsonl    # 25 tests
└── comprehensive.jsonl  # 75 tests