Supported Pipelines¶
This page documents each evaluation method implemented in autochecklist: the original paper's methodology, how it was evaluated, and how this library implements it.
Implementation Scope
These pipelines aim to capture the core algorithms of each paper (the generation strategies, scoring formulas, and refinement pipelines) in a composable, provider-agnostic way. Some paper-specific details (supervised score predictors, dataset-specific tuning, exact prompts) are simplified or omitted. Each pipeline section below notes the specific differences.
Overview¶
Pipelines are named presets that configure a generator with a specific prompt template and default scorer. Each pipeline implements a method from a research paper.
| Method | Level | Pipeline Name | Generator Class | Approach | Primary Metric |
|---|---|---|---|---|---|
| TICK | Instance | `tick` | `DirectGenerator` | Direct inference | `pass_rate` |
| RocketEval | Instance | `rocketeval` | `DirectGenerator` | Direct inference | `normalized_score` |
| RLCF Direct | Instance | `rlcf_direct` | `DirectGenerator` | Direct inference | `weighted_score` |
| RLCF Candidate | Instance | `rlcf_candidate` | `ContrastiveGenerator` | Counterfactual reasoning | `weighted_score` |
| RLCF Candidates Only | Instance | `rlcf_candidates_only` | `ContrastiveGenerator` | Counterfactual reasoning | `weighted_score` |
| OpenRubrics Pairwise | Instance | `openrubrics_pairwise` | `ContrastiveGenerator` | Contrastive rubric generation | `pass_rate` |
| OpenRubrics Listwise | Instance | `openrubrics_listwise` | `ContrastiveGenerator` | Contrastive rubric generation | `pass_rate` |
| Feedback | Corpus | `feedback` | `InductiveGenerator` | Inductive (bottom-up) | `pass_rate` |
| CheckEval | Corpus | `checkeval` | `DeductiveGenerator` | Deductive (top-down) | `pass_rate` |
| InteractEval | Corpus | `interacteval` | `InteractiveGenerator` | Protocol analysis | `pass_rate` |
Instance-Level Methods¶
TICK¶
Paper: arXiv:2410.03608
Original methodology: TICK generates checklists from the input instruction alone, using few-shot prompting. The LLM reads an instruction and produces 2–8 yes/no questions that capture the task requirements. Each question is scored individually with chain-of-thought reasoning. Evaluated on InFoBench. The key insight is that task-specific checklists provide more fine-grained evaluation than single-score rubrics.
Our implementation:
| Setting | Value |
|---|---|
| Pipeline | tick |
| Generator | DirectGenerator with tick/generate.md template |
| Input required | input only |
| Temperature | 0.7 |
| Max items | 8 (min 2) |
| Response schema | ChecklistResponse (unweighted) |
| Primary metric | pass_rate |
```python
from autochecklist import pipeline

pipe = pipeline("tick", generator_model="openai/gpt-5-mini", scorer_model="openai/gpt-5-mini")
result = pipe(input="Write a haiku about autumn.", target="Leaves fall gently...")
print(result.pass_rate)
```
Paper-Faithful Scoring
The original paper scores each item individually with reasoning. To match this behavior, override the scorer:
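```python
# scorer="item" is the documented paper-faithful option; passing it
# through pipeline() is assumed here (see the differences table below).
pipe = pipeline("tick", generator_model="openai/gpt-5-mini",
                scorer_model="openai/gpt-5-mini", scorer="item")
```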
Differences from paper:
| Paper feature | Our implementation |
|---|---|
| Per-item scoring with reasoning | Default is batch mode for efficiency; use scorer="item" for paper-faithful behavior |
| Few-shot prompt with hand-crafted examples | Structured JSON output (ChecklistResponse schema) replaces few-shot format guidance |
| Dataset-level DRFR aggregation metric | BatchResult.micro_pass_rate implements DRFR (micro-averaged pass rate) for dataset-level aggregation |
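DRFR is a micro-average: pooled passed items over pooled total items across the dataset, which is not the same as averaging per-sample pass rates. A quick illustration of the arithmetic (not the library's code):

```python
# Micro-averaged pass rate (DRFR-style): pool items across all samples.
per_sample = [(4, 5), (2, 8), (6, 6)]  # (items passed, items total) per sample
micro = sum(p for p, _ in per_sample) / sum(t for _, t in per_sample)
macro = sum(p / t for p, t in per_sample) / len(per_sample)
print(round(micro, 3), round(macro, 3))  # 0.632 vs 0.683; micro weights larger checklists more
```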
RocketEval¶
Paper: arXiv:2503.05142
Original methodology: RocketEval generates checklists from the input instruction, a reference response, and optionally conversation history. The key innovation is in scoring: instead of binary yes/no, it uses logprob confidence calibration — computing P(Yes) / (P(Yes) + P(No)) from the model's token probabilities and mapping to 5 confidence levels. This produces finer-grained scores than binary answers. Evaluated on MT-Bench, AlpacaEval, WildBench, and Arena-Hard.
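A minimal sketch of the calibration arithmetic (illustrative only; the bucketing into 5 levels is an assumption, not the paper's exact thresholds):

```python
import math

# Turn the judge's Yes/No token logprobs into a normalized confidence.
def normalized_confidence(logp_yes: float, logp_no: float) -> float:
    p_yes, p_no = math.exp(logp_yes), math.exp(logp_no)
    return p_yes / (p_yes + p_no)  # P(Yes) / (P(Yes) + P(No))

def confidence_level(conf: float) -> int:
    # Bucket [0, 1] into 5 discrete levels (thresholds here are assumed).
    return min(int(conf * 5), 4) + 1

conf = normalized_confidence(logp_yes=-0.1, logp_no=-2.5)
print(round(conf, 3), confidence_level(conf))  # 0.917 5
```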
Our implementation:
| Setting | Value |
|---|---|
| Pipeline | rocketeval |
| Generator | DirectGenerator with rocketeval/generate.md template |
| Input required | input + reference |
| Temperature | 0.7 |
| Max items | 10 (min 1) |
| Response schema | ChecklistResponse (unweighted) |
| Primary metric | normalized_score (logprobs with structured fallback) |
pipe = pipeline("rocketeval", generator_model="openai/gpt-5-mini", scorer_model="openai/gpt-5-mini")
result = pipe(input="Explain photosynthesis.", target="...", reference="...")
print(result.normalized_score)
Differences from paper:
| Paper feature | Our implementation |
|---|---|
| Supervised score predictor (fitted per-query from annotations) | Not implemented; uses unsupervised arithmetic mean of normalized scores only |
| KL-divergence reweighting (α_r) blending supervised + unsupervised scores | Not implemented (requires annotation data) |
| Conversation history as generation context | Supported via template placeholders but not enforced |
RLCF (Reinforcement Learning from Checklist Feedback)¶
Paper: arXiv:2507.18624
Original methodology: RLCF introduces weighted checklist items, where each question carries an importance weight from 0 to 100:
- 100 = critical requirement
- 75 = important quality factor
- 50 = good response indicator
- 25 = preference/style
- < 25 = nice-to-have
The key insight is that not all criteria are equally important — weighting produces more discriminative evaluation signals, particularly useful for RLHF training. Evaluated on WildChat. Three generation modes are defined:
RLCF Direct¶
Prompts the LLM to write a weighted checklist from the input instruction and reference response.
| Setting | Value |
|---|---|
| Pipeline | rlcf_direct |
| Generator | DirectGenerator with rlcf/direct.md template |
| Input required | input + reference |
| Temperature | 0.0 |
| Max items | 7 (min 1) |
| Response schema | WeightedChecklistResponse |
| Primary metric | weighted_score |
pipe = pipeline("rlcf_direct", generator_model="openai/gpt-5-mini", scorer_model="openai/gpt-5-mini")
result = pipe(input="...", target="...", reference="...")
print(result.weighted_score)
RLCF Candidate¶
First generates varied-quality candidate responses, then derives criteria by contrasting candidates against the reference.
| Setting | Value |
|---|---|
| Pipeline | rlcf_candidate |
| Generator | ContrastiveGenerator with rlcf/candidate_based.md template |
| Input required | input + reference + candidates (auto or manual) |
| Temperature | 0.0 |
| Max items | 7 (min 1) |
| Response schema | WeightedChecklistResponse |
| Primary metric | weighted_score |
```python
# Auto-generate candidates
pipe = pipeline("rlcf_candidate", generator_model="openai/gpt-5-mini", scorer_model="openai/gpt-5-mini")
result = pipe(input="...", target="...", reference="...")

# Or pass candidates manually
result = pipe(input="...", target="...", reference="...",
              candidates=["response A", "response B"])

# Or use a separate model for candidate generation
pipe = pipeline("rlcf_candidate", generator_model="openai/gpt-5-mini",
                generator_kwargs={"candidate_provider": "vllm",
                                  "candidate_base_url": "http://gpu:8000/v1"})
```
Candidate Generation Strategy
The original paper generates candidates from different "worker" models to get naturally diverse responses. The `ContrastiveGenerator` supports this, plus a convenience fallback:

- **Multiple `candidate_models` (paper-faithful):** each model generates exactly 1 response at temperature 0.7. Diversity comes from model differences. Use by instantiating `ContrastiveGenerator` directly and passing it to `ChecklistPipeline`.
- **Single model (convenience fallback):** generates `num_candidates` responses (default 4) at temperature 0.9. Diversity comes from high-temperature sampling. This is what `pipeline()` with `generator_kwargs={"candidate_provider": ...}` uses.
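A hypothetical sketch of the paper-faithful route (only `ContrastiveGenerator`, `ChecklistPipeline`, and the `candidate_models` concept come from the text above; every keyword argument and import path here is an assumption):

```python
from autochecklist import ChecklistPipeline
from autochecklist import ContrastiveGenerator  # import path is a guess

# Paper-faithful diversity: one response per worker model at temperature 0.7.
gen = ContrastiveGenerator(
    template="rlcf/candidate_based.md",                       # template name from the table above
    candidate_models=["openai/gpt-5-mini", "openai/gpt-5"],   # assumed keyword argument
)
pipe = ChecklistPipeline(generator=gen, scorer_model="openai/gpt-5-mini")  # assumed kwargs
result = pipe(input="...", target="...", reference="...")
```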
RLCF Candidates Only¶
Derives criteria from candidate diversity alone — no reference response needed.
| Setting | Value |
|---|---|
| Pipeline | rlcf_candidates_only |
| Generator | ContrastiveGenerator with rlcf/candidates_only.md template |
| Input required | input + candidates (auto or manual) |
| Temperature | 0.0 |
| Max items | 7 (min 1) |
| Response schema | WeightedChecklistResponse |
| Primary metric | weighted_score |
pipe = pipeline("rlcf_candidates_only", generator_model="openai/gpt-5-mini", scorer_model="openai/gpt-5-mini")
result = pipe(input="...", target="...", candidates=["resp1", "resp2"])
Differences from paper (all RLCF pipelines):
| Paper feature | Our implementation |
|---|---|
| Scoring via 0–100 continuous AI judge scores averaged across samples | Uses binary YES/NO per item with importance-weighted aggregation |
| Separability filtering (top 40% most discriminative pairs) | Not implemented; all generated items are kept |
| Direct Preference Optimization (DPO) training loop | Out of scope — this library focuses on checklist generation and scoring, not policy training |
| Exact verifier programs for discrete checks (e.g., format, length) | Not implemented; all scoring is LLM-based |
OpenRubrics CRG (Contrastive Rubric Generation)¶
Paper: arXiv:2510.07743
Original methodology: OpenRubrics generates universal scoring rubrics from preference data through a three-step process: (1) extract explicit requirements ("hard rules") from the input, (2) analyze concrete differences between better and worse responses, (3) abstract those differences into universal principles. The key insight is that contrasting responses of known quality reveals evaluation criteria that are both generalizable and grounded in real quality differences. Two modes are supported: pairwise (chosen vs rejected) and listwise (ranked list of responses).
Our implementation:
OpenRubrics pipelines do not auto-generate candidates; the preference data (chosen/rejected pairs or ranked responses) is provided as input. Items are categorized as `hard_rule` (from explicit requirements) or `principle` (abstracted from response differences).
OpenRubrics Pairwise¶
Generates rubric items by analyzing why a chosen response is superior to a rejected one.
| Setting | Value |
|---|---|
| Pipeline | openrubrics_pairwise |
| Generator | ContrastiveGenerator with openrubrics/pairwise.md template |
| Input required | input + candidates (dict with chosen/rejected keys) |
| Temperature | 0.0 |
| Max items | 15 (min 1) |
| Response schema | CategorizedChecklistResponse |
| Primary metric | pass_rate |
```python
from autochecklist import pipeline

pipe = pipeline("openrubrics_pairwise", generator_model="openai/gpt-5-mini", scorer_model="openai/gpt-5-mini")
result = pipe(
    input="Explain photosynthesis.",
    target="Photosynthesis involves chlorophyll...",
    candidates={"chosen": "Photosynthesis is the process by which plants convert...",
                "rejected": "Plants use sunlight to grow."},
)
print(result.pass_rate)
```
OpenRubrics Listwise¶
Generates rubric items by analyzing an ordered list of responses (best to worst).
| Setting | Value |
|---|---|
| Pipeline | openrubrics_listwise |
| Generator | ContrastiveGenerator with openrubrics/listwise.md template |
| Input required | input + candidates (list, ordered best→worst) |
| Temperature | 0.0 |
| Max items | 20 (min 1) |
| Response schema | CategorizedChecklistResponse |
| Primary metric | pass_rate |
pipe = pipeline("openrubrics_listwise", generator_model="openai/gpt-5-mini", scorer_model="openai/gpt-5-mini")
result = pipe(
input="Explain photosynthesis.",
target="Photosynthesis involves chlorophyll...",
candidates=["Best response...", "Good response...", "Weak response..."],
)
print(result.pass_rate)
Categorized Output
OpenRubrics pipelines produce `CategorizedChecklistResponse` — each item has a `category` field set to either `hard_rule` or `principle`. Use `checklist.by_category()` to split items by type.
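For example (a sketch: `by_category()` is documented above, but the `result.checklist` attribute and the mapping return type are assumptions):

```python
# Split generated rubric items by category (return shape is assumed).
groups = result.checklist.by_category()
hard_rules = groups.get("hard_rule", [])
principles = groups.get("principle", [])
print(len(hard_rules), "hard rules;", len(principles), "principles")
```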
Differences from paper:
| Paper feature | Our implementation |
|---|---|
| Rubrics condition a reward model for binary preference prediction (A vs B) | Rubric items scored individually as YES/NO with pass rate aggregation |
| Preference-label consistency filtering (rejection sampling to keep rubrics that predict the correct preference) | No filtering; all generated items are kept |
| Reward model SFT training pipeline (Rubric-RM) | Out of scope — this library focuses on rubric generation and scoring, not RM training |
Corpus-Level Methods¶
Feedback (From Feedback to Checklists)¶
Paper: arXiv:2507.17717
Original methodology: Transforms user/reviewer feedback into evaluation checklists through a 5-stage pipeline: (1) generate candidate questions from feedback batches, (2) merge redundant questions via embeddings, (3) filter by applicability and specificity, (4) validate enforceability via unit testing, (5) select a diverse subset via beam search. Evaluated on clinical note quality. The key insight is that real user feedback captures evaluation criteria that predefined rubrics miss.
Our implementation:
| Setting | Value |
|---|---|
| Registry name | feedback |
| Generator | InductiveGenerator |
| Input required | observations (list of strings) |
| Primary metric | pass_rate |
| Built-in refinement | Deduplicator → Tagger → Selector |
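A minimal invocation sketch, assuming the corpus-level preset is constructed like the instance-level ones and that `observations` is passed as a keyword argument:

```python
from autochecklist import pipeline

# Corpus-level generation from raw feedback comments.
pipe = pipeline("feedback", generator_model="openai/gpt-5-mini", scorer_model="openai/gpt-5-mini")
result = pipe(observations=[
    "The summary omitted the medication list.",
    "Too much jargon for a patient-facing note.",
    "The assessment repeats the history verbatim.",
])
```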
The generator has a built-in refinement pipeline that runs by default:
- Generate — LLM creates candidate questions from batches of feedback comments
- Deduplicate — merges similar questions via embedding similarity (threshold 0.85)
- Tag — filters by applicability and specificity
- Select — beam search for diverse subset (if more questions than `max_questions`)
Each stage can be selectively skipped. See InductiveGenerator usage for code examples.
Unit Testing Not Included by Default
The original paper includes enforceability validation (stage 4) in its pipeline. Our InductiveGenerator does not run the UnitTester by default because it requires pre-existing sample scores. You can add it as a standalone refiner after an initial scoring round.
Selection Simplification
The original paper's selection step optimizes on assignment matrices — maximizing coverage of input feedback by ensuring each comment maps to at least one selected question (via source_feedback_indices). Our Selector simplifies this to embedding diversity as a proxy for coverage.
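To make the proxy concrete, here is a generic diversity-selection sketch (a greedy max-min variant for illustration only; the library's Selector uses beam search, and none of its actual API appears here):

```python
import numpy as np

# Greedy max-min diversity selection over question embeddings:
# repeatedly pick the question farthest from everything chosen so far.
def select_diverse(embeddings: np.ndarray, k: int) -> list[int]:
    chosen = [0]  # seed with the first question
    while len(chosen) < k:
        # Distance from every candidate to its nearest already-chosen item
        dists = np.linalg.norm(embeddings[:, None, :] - embeddings[chosen][None, :, :], axis=-1)
        nearest = dists.min(axis=1)
        nearest[chosen] = -1.0  # never re-pick a selected question
        chosen.append(int(nearest.argmax()))
    return chosen

emb = np.random.default_rng(0).normal(size=(12, 8))  # 12 candidate questions, 8-dim embeddings
print(select_diverse(emb, k=4))
```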
Differences from paper:
| Paper feature | Our implementation |
|---|---|
| Enforceability unit testing (stage 4) in the default pipeline | Available as standalone UnitTester refiner, but not run by default (requires pre-existing sample scores) |
| Selection optimizes assignment matrices for feedback coverage | Simplified to embedding diversity via beam search; source_feedback_indices are tracked but not used in selection |
| Domain-specific (clinical note sections) | Generalized to any domain via the domain parameter |
CheckEval¶
Paper: arXiv:2403.18771
Original methodology: Generates checklists from human-written evaluation dimension definitions through: seed question generation → augmentation (elaboration for granularity, diversification for alternative framings) → filtering (alignment check, dimension consistency, redundancy removal). Evaluated on SummEval and Topical-Chat. The key insight is that structured dimensions produce more systematic and complete evaluation criteria.
Our implementation:
| Setting | Value |
|---|---|
| Registry name | checkeval |
| Generator | DeductiveGenerator |
| Input required | dimensions (list of DeductiveInput) |
| Primary metric | pass_rate |
| Built-in refinement | Augmentation + optional filtering (with dedup) |
Three augmentation modes control question volume:
| Mode | Questions per Sub-Dimension | Description |
|---|---|---|
| `seed` | 2 | Minimal seed questions |
| `elaboration` | 5 | Detailed, granular questions |
| `diversification` | 4 | Alternative framings of criteria |
Optional filtering (`apply_filtering=True`) runs: alignment check → dimension consistency check → deduplication.
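A hedged invocation sketch (the `DeductiveInput` fields, its import path, and the forwarding of `augmentation_mode` and `apply_filtering` through `generator_kwargs` are all assumptions):

```python
from autochecklist import pipeline
from autochecklist import DeductiveInput  # import path is a guess

pipe = pipeline(
    "checkeval",
    generator_model="openai/gpt-5-mini",
    scorer_model="openai/gpt-5-mini",
    generator_kwargs={"augmentation_mode": "combined", "apply_filtering": True},  # assumed forwarding
)
result = pipe(dimensions=[
    DeductiveInput(dimension="coherence", definition="Ideas connect logically across the text."),  # assumed fields
])
```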
See DeductiveGenerator usage for code examples.
Differences from paper:
| Paper feature | Our implementation |
|---|---|
| Diversification and elaboration run independently in parallel from the same seeds, then merged before filtering | Use augmentation_mode="combined" to run both elaboration and diversification from seeds (paper-faithful). Individual modes ("seed", "elaboration", "diversification") are also available. |
| Seed questions written by human experts | Seed questions are LLM-generated (from dimension + sub-dimension definitions) |
| Supervised weighting via linear regression on annotation data | Not implemented; all questions are equally weighted (uniform pass_rate) |
InteractEval¶
Paper: arXiv:2409.07355
Original methodology: Collects think-aloud data from both human evaluators and LLMs, then extracts evaluation criteria through a 5-stage pipeline: (1) component extraction (recurring themes), (2) attribute clustering under components, (3) key question generation (1 per component), (4) sub-question generation (2–3 per component), (5) validation and refinement. Pass rate is scaled to a 1–5 dimension score. Evaluated on SummEval and ELLIPSE. The key insight is that combining human and LLM evaluation perspectives produces more comprehensive criteria.
Our implementation:
| Setting | Value |
|---|---|
| Registry name | interacteval |
| Generator | InteractiveGenerator |
| Input required | inputs (list of InteractiveInput) |
| Primary metric | pass_rate |
| Built-in refinement | 5-stage pipeline with validation |
The 5-stage pipeline:
- Component extraction — identifies up to `max_components` (default 5) recurring themes
- Attribute clustering — groups attributes under each component
- Key question generation — 1 yes/no question per component
- Sub-question generation — 2–3 sub-questions per component
- Validation — refines and validates the final question set
See InteractiveGenerator usage for code examples.
InteractEval-style 1–5 scoring is available via `result.score.scaled_score_1_5`.
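A hedged invocation sketch (the `InteractiveInput` import path and its `text` field are assumptions; the `source` field is documented below):

```python
from autochecklist import pipeline
from autochecklist import InteractiveInput  # import path is a guess

pipe = pipeline("interacteval", generator_model="openai/gpt-5-mini", scorer_model="openai/gpt-5-mini")
result = pipe(inputs=[
    InteractiveInput(text="I kept checking whether each claim was supported.", source="human"),  # "text" is assumed
    InteractiveInput(text="The response drops the second argument entirely.", source="llm"),
])
print(result.score.scaled_score_1_5)  # pass rate scaled to a 1-5 dimension score
```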
Differences from paper:
| Paper feature | Our implementation |
|---|---|
| Think-aloud collection protocol (4 humans + 4 LLMs with rubrics and sample texts) | Think-aloud data is provided as input (InteractiveInput); the collection protocol is external to the library |
| Validation uses 7 explicit criteria (yes/no answerable, dimension concepts, minimizes subjectivity, semantically distinct, positive framing, dimension-relevant, actionable) | Validation is handled by a single LLM call (stage 5) that checks these criteria holistically rather than as separate classification passes |
| Deduplication via LLM judgment in validation step | LLM-only dedup within the validation prompt; does not use embedding-based Deduplicator refiner (available as standalone but not wired in) |
| Five think-aloud conditions tested (single-LLM, single-human, multi-LLM, multi-human, combined) | All inputs are merged; the source field on InteractiveInput tracks provenance but doesn't affect the generation pipeline |
Comparison Table¶
| Method | Level | Input Required | Reference? | Candidates? | Primary Metric | Built-in Refinement? |
|---|---|---|---|---|---|---|
| TICK | Instance | `input` | No | No | `pass_rate` | No |
| RocketEval | Instance | `input` | Yes | No | `normalized_score` | No |
| RLCF Direct | Instance | `input` | Yes | No | `weighted_score` | No |
| RLCF Candidate | Instance | `input` | Yes | Yes (auto) | `weighted_score` | No |
| RLCF Candidates Only | Instance | `input` | No | Yes (auto) | `weighted_score` | No |
| OpenRubrics Pairwise | Instance | `input` | No | Yes (user-provided) | `pass_rate` | No |
| OpenRubrics Listwise | Instance | `input` | No | Yes (user-provided) | `pass_rate` | No |
| Feedback | Corpus | `observations` list | No | No | `pass_rate` | Yes (dedup, tag, select) |
| CheckEval | Corpus | `dimensions` | No | No | `pass_rate` | Yes (augment, filter, dedup) |
| InteractEval | Corpus | think-aloud `inputs` | No | No | `pass_rate` (or 1–5) | Yes (5-stage pipeline) |