# Scorers

## What is a Scorer?

A scorer is the final stage of the evaluation pipeline: given a target response and a checklist, it evaluates the response by answering each yes/no question. The result is a `Score` object with aggregate metrics and per-item answers.

All scoring modes use structured JSON output with automatic fallback: if the LLM provider supports JSON schema enforcement, it is used; otherwise, the schema is embedded in the prompt and the response is parsed from the text.
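The pattern behind that fallback looks roughly like the sketch below. This is illustrative only, with hypothetical helper names (`client.supports_json_schema`, `client.complete`); the library's internals may differ.

```python
import json

def score_with_structured_output(client, prompt: str, schema: dict) -> dict:
    """Sketch of schema-enforced output with a prompt-embedded fallback."""
    if client.supports_json_schema:  # hypothetical capability flag
        # Provider can enforce the schema natively.
        raw = client.complete(
            prompt,
            response_format={"type": "json_schema", "json_schema": schema},
        )
    else:
        # Otherwise, describe the schema in the prompt and parse the reply.
        raw = client.complete(
            f"{prompt}\n\nRespond with JSON matching this schema:\n{json.dumps(schema)}"
        )
    return json.loads(raw)
```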
## The Score Object

Every scorer returns a `Score` with these key properties:

| Property | Type | Description |
|---|---|---|
| `primary_score` | `float` | Primary metric; aliases whichever metric the pipeline designated |
| `pass_rate` | `float` | Proportion of YES answers |
| `weighted_score` | `float` | Importance-weighted aggregation (always computed) |
| `normalized_score` | `float` | Logprob confidence average, or `pass_rate` if no logprobs |
| `item_scores` | `list[ItemScore]` | Per-question answers with optional reasoning/confidence |
| `scaled_score_1_5` | `float` | Convenience mapping: `pass_rate * 4 + 1` (range 1.0–5.0) |
| `primary_metric` | `str` | Which metric `primary_score` aliases: `"pass"`, `"weighted"`, or `"normalized"` |
All three aggregate metrics (`pass_rate`, `weighted_score`, `normalized_score`) are always computed. The `primary_score` property returns whichever one the pipeline designated as primary.
Each `ItemScore` contains:

- `answer`: `"yes"` or `"no"`
- `reasoning`: per-item explanation (when `capture_reasoning=True`)
- `confidence`: logprob-derived probability (when `use_logprobs=True`)
- `confidence_level`: categorical level like `yes_90`, `no_30`, or `unsure` (when logprobs are available)
## ChecklistScorer

There is a single, configurable `ChecklistScorer` class. The three key parameters are:

| Parameter | Values | Effect |
|---|---|---|
| `mode` | `"batch"` or `"item"` | Batch = 1 LLM call; item = N calls (one per question) |
| `primary_metric` | `"pass"`, `"weighted"`, or `"normalized"` | Which metric `Score.primary_score` aliases. `"normalized"` auto-enables logprobs. |
| `capture_reasoning` | `bool` | Include a per-item reasoning explanation with each answer (available in both modes) |
### Batch Mode

Evaluates all checklist items in a single LLM call. Questions are formatted as `Q1: ...`, `Q2: ...` (1-based indexing), and the LLM returns YES/NO for each.
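The numbering convention is easy to reproduce; here is a sketch of the format (not the library's actual prompt-building code):

```python
questions = ["Does the response cite a source?", "Is the tone formal?"]
# 1-based Q-numbering, one question per line
formatted = "\n".join(f"Q{i}: {q}" for i, q in enumerate(questions, start=1))
print(formatted)
# Q1: Does the response cite a source?
# Q2: Is the tone formal?
```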
Tradeoffs: Fastest and cheapest (1 API call), but may lose accuracy on very long checklists.

Default for: `tick`, `feedback`, `checkeval`, `interacteval`.
```python
from autochecklist import ChecklistScorer

scorer = ChecklistScorer(mode="batch", model="openai/gpt-4o-mini")
score = scorer.score(checklist, target="...")
print(f"Pass rate: {score.pass_rate:.0%}")
```
With reasoning (`capture_reasoning=True`): Each answer includes a `reasoning` explanation. Useful for debugging checklist quality.
scorer = ChecklistScorer(mode="batch", capture_reasoning=True, model="openai/gpt-4o-mini")
score = scorer.score(checklist, target="...")
for item_score in score.item_scores:
print(f"{item_score.answer}: {item_score.reasoning}")
### Item Mode

Evaluates one item per LLM call. Supports three configurations:

With reasoning (`capture_reasoning=True`): Each call returns YES/NO plus a `reasoning` explanation. Most faithful to the TICK paper methodology.
scorer = ChecklistScorer(mode="item", capture_reasoning=True, model="openai/gpt-4o-mini")
score = scorer.score(checklist, target="...")
for item_score in score.item_scores:
print(f"{item_score.answer}: {item_score.reasoning}")
Weighted metric (`primary_metric="weighted"`): Uses item weights (0–100) for importance-weighted scoring. Designed for RLCF checklists.

$$\text{weighted\_score} = \frac{\sum_{i} w_i \times s_i}{\sum_{i} w_i}$$

Default for: `rlcf_direct`, `rlcf_candidate`, `rlcf_candidates_only`.
scorer = ChecklistScorer(mode="item", primary_metric="weighted", model="openai/gpt-4o-mini")
score = scorer.score(checklist, target="...")
print(f"Weighted: {score.weighted_score:.2f}, Primary: {score.primary_score:.2f}")
With logprobs (`use_logprobs=True`): Captures logprobs for YES/NO tokens to produce confidence-calibrated scores.
$$\text{confidence} = \frac{P(\text{Yes})}{P(\text{Yes}) + P(\text{No})}$$
| Confidence Range | Answer | Level |
|---|---|---|
| < 0.2 | NO | `no_10` |
| 0.2 – 0.4 | NO | `no_30` |
| 0.4 – 0.6 | NO | `unsure` |
| 0.6 – 0.8 | YES | `yes_70` |
| > 0.8 | YES | `yes_90` |
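Taken together, the formula and the table amount to roughly the following. This is an illustrative sketch, not the library's code; the handling of the exact boundary values (0.2, 0.4, 0.6, 0.8) is an assumption.

```python
import math

def yes_confidence(logp_yes: float, logp_no: float) -> float:
    """Renormalize the YES/NO token probabilities into P(Yes)."""
    p_yes, p_no = math.exp(logp_yes), math.exp(logp_no)
    return p_yes / (p_yes + p_no)

def to_answer_and_level(confidence: float) -> tuple[str, str]:
    """Map a confidence in [0, 1] to (answer, level) per the table above."""
    if confidence < 0.2:
        return "no", "no_10"
    if confidence < 0.4:
        return "no", "no_30"
    if confidence < 0.6:
        return "no", "unsure"  # the mid band resolves to NO with an 'unsure' level
    if confidence < 0.8:
        return "yes", "yes_70"
    return "yes", "yes_90"
```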
Falls back to binary YES/NO when the model doesn't support logprobs. Default for: `rocketeval`.
scorer = ChecklistScorer(mode="item", use_logprobs=True, primary_metric="normalized", model="openai/gpt-4o-mini")
score = scorer.score(checklist, target="...")
print(f"Normalized: {score.normalized_score:.2f}")
## Choosing a Configuration

| Config | LLM Calls | Primary Metric | Per-Item Reasoning | Best For |
|---|---|---|---|---|
| `mode="batch"` | 1 | `pass_rate` | No | Large checklists, cost-sensitive |
| `mode="batch", capture_reasoning=True` | 1 | `pass_rate` | Yes | Batch with explanations |
| `mode="item"` | N | `pass_rate` | No | Per-item evaluation |
| `mode="item", capture_reasoning=True` | N | `pass_rate` | Yes | Debugging, interpretability |
| `mode="item", primary_metric="weighted"` | N | `weighted_score` | No | RLCF weighted criteria |
| `mode="item", primary_metric="normalized"` | N | `normalized_score` | No | Confidence-aware (RocketEval) |
## Overriding Defaults

Each built-in pipeline has a default scorer, but you can override it:
```python
from autochecklist import pipeline, ChecklistScorer

# Use item mode with TICK for per-item reasoning
pipe = pipeline("tick", generator_model="openai/gpt-4o-mini", scorer="item")

# Use batch mode with RLCF for speed
pipe = pipeline("rlcf_direct", generator_model="openai/gpt-4o-mini", scorer="batch")

# Use a fully custom scorer
scorer = ChecklistScorer(mode="item", capture_reasoning=True, model="openai/gpt-4o-mini")
pipe = pipeline("tick", generator_model="openai/gpt-4o-mini", scorer=scorer)
```
All scorers also accept a `custom_prompt` parameter. For custom scorer prompts, use `register_custom_scorer()` with optional config kwargs (`mode`, `primary_metric`, `capture_reasoning`). See Custom Prompts.
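A registration call might look like the sketch below. The positional arguments (a name and a prompt string) and the import path are assumptions; only the config kwargs are documented above, so check Custom Prompts for the authoritative signature.

```python
from autochecklist import register_custom_scorer  # import path is an assumption

# Hypothetical: register a named scorer prompt with item-mode config.
register_custom_scorer(
    "strict_item",  # assumed: a registry name
    "Answer YES or NO: {question}\n\nResponse to judge:\n{target}",
    mode="item",
    capture_reasoning=True,
)
```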
## Deprecated Class Names

The old scorer class names (`BatchScorer`, `ItemScorer`, `WeightedScorer`, `NormalizedScorer`) still work as factory functions that emit a `DeprecationWarning`. They create `ChecklistScorer` instances with the appropriate config:
| Old Name | Equivalent |
|---|---|
| `BatchScorer(...)` | `ChecklistScorer(mode="batch", ...)` |
| `ItemScorer(...)` | `ChecklistScorer(mode="item", capture_reasoning=True, ...)` |
| `WeightedScorer(...)` | `ChecklistScorer(mode="item", primary_metric="weighted", ...)` |
| `NormalizedScorer(...)` | `ChecklistScorer(mode="item", use_logprobs=True, primary_metric="normalized", ...)` |
## Scorer Prompts

The scorer uses different prompt templates depending on the pipeline:
| Prompt | File | Used By |
|---|---|---|
| Default batch | `prompts/scoring/batch.md` | `tick`, corpus-level presets |
| Default item | `prompts/scoring/item.md` | Default for item mode |
| RLCF | `prompts/scoring/rlcf.md` | `rlcf_direct`, `rlcf_candidate`, `rlcf_candidates_only` |
| RocketEval | `prompts/scoring/rocketeval.md` | `rocketeval` (includes `{history}` placeholder) |
Pipeline presets automatically select their scorer prompt via the `scorer_prompt` key. You can override with `custom_prompt`:
scorer = ChecklistScorer(mode="item", custom_prompt="Your custom prompt with {input}, {target}, {question}")
## Response Schemas

| Schema | Fields | Used When |
|---|---|---|
| `ItemScoringResponse` | `{answer}` | Item mode, no reasoning |
| `ItemScoringResponseReasoned` | `{answer, reasoning}` | Item mode + `capture_reasoning` |
| `BatchScoringResponse` | `{answers: [{question_index, answer}]}` | Batch mode |
| `BatchScoringResponseReasoned` | `{answers: [{question_index, answer, reasoning}]}` | Batch mode + `capture_reasoning` |
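As an illustration, a parsed `BatchScoringResponseReasoned` payload carries data shaped like this (the values are made up for the example):

```python
# Example of the parsed batch-with-reasoning payload (values are illustrative).
response = {
    "answers": [
        {"question_index": 1, "answer": "yes", "reasoning": "Cites two sources."},
        {"question_index": 2, "answer": "no", "reasoning": "The tone is informal."},
    ]
}
```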
The old `WeightedScoringResponse` name is a deprecated alias for `ItemScoringResponse`.