Test Suites

JSONL format, assertion types, tiers, and categories.

JSONL format

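Each test suite is a JSONL file: one JSON object per line. The example below is pretty-printed for readability; in the file, each test case occupies a single line.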

{
  "id": "reasoning-001",
  "category": "reasoning",
  "input": "If 5 cats catch 5 mice in 5 minutes, how long for 100 cats to catch 100 mice?",
  "ideal": ["5 minutes", "5 mins"],
  "assertion": {
    "type": "contains_any",
    "case_sensitive": false
  },
  "metadata": {
    "difficulty": "medium",
    "tokens_est": 40,
    "tags": ["math", "logic"]
  }
}
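As a rough illustration of the one-object-per-line layout, the sketch below writes the example test case as a single line and reads it back. It uses only Python's standard `json` module; `dump_jsonl` and `load_jsonl` are hypothetical helper names, not part of pramana.

```python
import json

def dump_jsonl(cases):
    """Serialize test cases as JSONL: one compact JSON object per line."""
    # json.dumps escapes newlines inside strings, so each record stays on one line.
    return "".join(json.dumps(case) + "\n" for case in cases)

def load_jsonl(text):
    """Parse JSONL text, skipping blank lines and naming the line that fails."""
    cases = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue
        try:
            cases.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: invalid JSON ({exc.msg})") from exc
    return cases

case = {
    "id": "reasoning-001",
    "category": "reasoning",
    "input": "If 5 cats catch 5 mice in 5 minutes, how long for 100 cats to catch 100 mice?",
    "ideal": ["5 minutes", "5 mins"],
    "assertion": {"type": "contains_any", "case_sensitive": False},
    "metadata": {"difficulty": "medium", "tokens_est": 40, "tags": ["math", "logic"]},
}

text = dump_jsonl([case])          # exactly one line of text per test case
round_tripped = load_jsonl(text)   # parses back to the same dict
```

A pretty-printed object pasted directly into a `.jsonl` file would fail in a line-by-line reader like this one, because no single line is a complete JSON value.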

Field reference

Field                     Type                      Description
id                        string                    Unique ID: {category}-{num}
category                  enum                      One of 6 categories (see below)
input                     string | message[]        Prompt text or chat message array
ideal                     string | string[] | null  Expected output(s) for matching
assertion.type            enum                      Assertion type (see below)
assertion.case_sensitive  bool                      Default: true
assertion.threshold       float | null              For semantic_similarity
assertion.judge_prompt    string | null             For llm_judge
metadata.difficulty       easy | medium | hard      Difficulty rating
metadata.tokens_est       int                       Estimated token count
metadata.tags             string[]                  Optional tags
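The field reference can be turned into a lightweight check before a suite is used. The sketch below is one possible reading of the table, not pramana's own validation: `validate_case` is a hypothetical helper, and treating `id`, `category`, `input`, and `assertion.type` as required is an assumption.

```python
import re

# Enum values copied from the Categories and Assertion types sections.
CATEGORIES = {"reasoning", "factual", "instruction_following",
              "coding", "safety", "creative"}
ASSERTION_TYPES = {"exact_match", "contains", "contains_any",
                   "is_json", "llm_judge", "semantic_similarity"}
DIFFICULTIES = {"easy", "medium", "hard"}

def validate_case(case):
    """Return a list of problems found in one test case dict (empty if none)."""
    problems = []

    category = case.get("category")
    if category not in CATEGORIES:
        problems.append(f"unknown category: {category!r}")

    # IDs follow {category}-{num}, e.g. reasoning-001.
    case_id = case.get("id")
    id_pattern = rf"{re.escape(str(category))}-\d+"
    if not isinstance(case_id, str) or not re.fullmatch(id_pattern, case_id):
        problems.append(f"id {case_id!r} does not match {{category}}-{{num}}")

    if not isinstance(case.get("input"), (str, list)):
        problems.append("input must be a string or a list of messages")

    assertion = case.get("assertion") or {}
    a_type = assertion.get("type")
    if a_type not in ASSERTION_TYPES:
        problems.append(f"unknown assertion.type: {a_type!r}")
    # case_sensitive is optional and defaults to true.
    if not isinstance(assertion.get("case_sensitive", True), bool):
        problems.append("assertion.case_sensitive must be a bool")

    difficulty = (case.get("metadata") or {}).get("difficulty")
    if difficulty is not None and difficulty not in DIFFICULTIES:
        problems.append(f"unknown metadata.difficulty: {difficulty!r}")

    return problems

# The id prefix disagrees with the category, so one problem is reported.
bad = {"id": "math-1", "category": "reasoning", "input": "2+2?",
       "assertion": {"type": "contains"}}
print(validate_case(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a suite file in a single pass.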

Assertion types

Source: assertions.py

Type                 Behavior                                                                       Status
exact_match          Output (stripped) must equal ideal. Respects case_sensitive.                   implemented
contains             Output must contain ideal as a substring.                                      implemented
contains_any         Output must contain at least one string from ideal[].                          implemented
is_json              Output must be valid JSON. Strips markdown fences before parsing.              implemented
llm_judge            Uses the same provider as judge. Sends output + judge_prompt, expects YES/NO.  implemented
semantic_similarity  Embedding-based similarity against ideal with threshold.                       not implemented
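To make the string-based assertion types concrete, here is a minimal sketch of how they could be evaluated. It is not the code in assertions.py: `check` and `strip_fences` are hypothetical names, and applying `case_sensitive` to `contains` and `contains_any` is an assumption (the table states it only for `exact_match`, though the example test case sets it alongside `contains_any`).

```python
import json

FENCE = "`" * 3  # a markdown code fence marker

def strip_fences(text):
    """Remove a surrounding markdown code fence and optional language tag."""
    text = text.strip()
    if len(text) >= 2 * len(FENCE) and text.startswith(FENCE) and text.endswith(FENCE):
        body = text[len(FENCE):-len(FENCE)]
        first_line, newline, rest = body.partition("\n")
        # With a newline present, the first line is the language tag (or empty).
        return (rest if newline else first_line).strip()
    return text

def check(output, ideal, assertion):
    """Return True if the output satisfies the assertion."""
    kind = assertion["type"]
    case_sensitive = assertion.get("case_sensitive", True)  # default: true

    def norm(s):
        return s if case_sensitive else s.lower()

    if kind == "exact_match":
        return norm(output.strip()) == norm(ideal)
    if kind == "contains":
        return norm(ideal) in norm(output)
    if kind == "contains_any":
        return any(norm(candidate) in norm(output) for candidate in ideal)
    if kind == "is_json":
        try:
            json.loads(strip_fences(output))
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            return False
        return True
    # llm_judge needs a provider call; semantic_similarity is not implemented.
    raise NotImplementedError(kind)

answer = "It still takes 5 Minutes."
print(check(answer, ["5 minutes", "5 mins"],
            {"type": "contains_any", "case_sensitive": False}))
```

With `case_sensitive` left at its default, the same answer would fail, since "5 Minutes" and "5 minutes" differ in case.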

Suite tiers

Data from manifest.toml:

Tier           Tests  Est. cost  Est. tokens  Use case
cheap          10     $0.05      350          Smoke test, CI gates
moderate       25     $0.25      950          Regular monitoring
comprehensive  75     $1.50      2,800        Full evaluation

Categories

Category               Examples
reasoning              Logic puzzles, math, deductive chains
factual                Knowledge recall, date/number questions
instruction_following  Format compliance, constraint adherence
coding                 Code generation, debugging, JSON output
safety                 Refusal of harmful requests, boundary testing
creative               Open-ended generation (judged by LLM)

Suite versioning

Suites are immutable. Once published, a suite version never changes. New tests go into a new version.

Each suite has a SHA-256 hash computed from sorted test case IDs and content. This hash is included in every submission for integrity verification.

from pramana.hash import hash_suite
print(hash_suite("src/pramana/suites/v1.0/cheap.jsonl"))
# sha256:a1b2c3...
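For intuition about why sorting by ID matters, the sketch below hashes test cases in ID order so that reordering lines in the file does not change the digest. This is an illustrative assumption, not the algorithm behind `hash_suite`: the function name `sketch_suite_hash` and the canonical JSON encoding are hypothetical, and its output should not be expected to match pramana's.

```python
import hashlib
import json

def sketch_suite_hash(cases):
    """Hash test cases in ID order so file ordering does not affect the result."""
    digest = hashlib.sha256()
    for case in sorted(cases, key=lambda c: c["id"]):
        # Canonical form (assumed): sorted keys, no insignificant whitespace.
        canonical = json.dumps(case, sort_keys=True, separators=(",", ":"))
        digest.update(canonical.encode("utf-8"))
        digest.update(b"\n")  # record separator
    return "sha256:" + digest.hexdigest()

a = {"id": "reasoning-001", "input": "first", "ideal": "x"}
b = {"id": "reasoning-002", "input": "second", "ideal": "y"}

# Same cases in a different order give the same hash; edited content does not.
same = sketch_suite_hash([a, b]) == sketch_suite_hash([b, a])
changed = sketch_suite_hash([a, b]) != sketch_suite_hash([a, {**b, "ideal": "z"}])
print(same, changed)
```

Because any change to a test's content changes the digest, a hash like this lets a submission be tied to one exact, immutable suite version.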

File layout

src/pramana/suites/v1.0/
├── manifest.toml        # Tier metadata, categories, assertion types
├── cheap.jsonl          # 10 tests
├── moderate.jsonl       # 25 tests
└── comprehensive.jsonl  # 75 tests