AutoChecklist

AutoChecklist is an open-source library that unifies LLM-based checklist evaluation into composable pipelines, in a pip-installable Python package (autochecklist) with CLI and UI features.

Features

  • Five checklist generator abstractions that organize published methods by the reasoning strategies they use to derive evaluation criteria
  • Composable pipelines with eight built-in configurations implementing published methods, all compatible with a unified scorer that consolidates three scoring strategies from the literature
  • CLI for off-the-shelf evaluation with pre-defined pipelines
  • Multi-provider LLM backend with support for OpenAI, OpenRouter, and vLLM

Concepts

Terminology

  • input: The instruction, query, or task given to the LLM being evaluated (e.g., "Write a haiku about autumn").
  • target: The output being evaluated against the checklist (e.g., the haiku the LLM produced).
  • reference: An optional gold-standard response used by some methods to improve checklist generation.
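As a sketch of how these three fields might appear together in one evaluation record (the field names here simply mirror the terminology above; the actual data schema may differ, so check the data format documentation):

```python
import json

# Hypothetical JSONL record; field names mirror the terminology above,
# not a confirmed schema.
record = {
    "input": "Write a haiku about autumn.",        # task given to the LLM
    "target": "Leaves drift through cool dusk...",  # output being evaluated
    "reference": None,                              # optional gold-standard response
}
line = json.dumps(record)  # one line of a .jsonl dataset file
```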

Checklist Generator Abstractions

The core of the library is five generator classes, each implementing a distinct approach to producing checklists:

Level     Generator             Approach                  Analogy
Instance  DirectGenerator       Prompt → checklist        Direct inference
Instance  ContrastiveGenerator  Candidates → checklist    Counterfactual reasoning
Corpus    InductiveGenerator    Observations → criteria   Inductive reasoning (bottom-up)
Corpus    DeductiveGenerator    Dimensions → criteria     Deductive reasoning (top-down)
Corpus    InteractiveGenerator  Eval sessions → criteria  Protocol analysis

Instance-level generators produce one checklist per input — criteria are tailored to each specific task. Corpus-level generators produce one checklist for an entire dataset — criteria capture general quality patterns derived from higher-level signals.

Each generator is customizable via prompt templates (.md files with {input}, {target} placeholders). You can use the built-in paper implementations, write your own prompts, or chain generators with different refiners and scorers to build custom evaluation pipelines.
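The placeholder mechanics can be sketched with plain string formatting (the template wording below is invented for illustration; only the {input}/{target} placeholder convention comes from the description above):

```python
# Illustrative prompt template using the {input}/{target} placeholders
# described above; the wording is invented, not a built-in prompt.
template = (
    "Generate a checklist of yes/no criteria for judging the response.\n\n"
    "Task: {input}\n"
    "Response: {target}\n"
)
prompt = template.format(
    input="Write a haiku about autumn.",
    target="Leaves drift through cool dusk...",
)
```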

Built-in Pipelines

The library includes built-in pipelines implementing methods from research papers (TICK, RocketEval, RLCF, CheckEval, InteractEval, and more). See Supported Pipelines for the full list and configuration details.

Scoring

A single configurable ChecklistScorer class supports all scoring modes:

Config                                  Description
mode="batch"                            All items in one LLM call (efficient)
mode="batch", capture_reasoning=True    Batch with per-item explanations
mode="item"                             One item per call
mode="item", capture_reasoning=True     One item per call with reasoning
mode="item", primary_metric="weighted"  Item weights (0-100) for importance
mode="item", use_logprobs=True          Logprob confidence calibration
Refiners

Refiners are pipeline stages that clean up raw checklists before scoring. They are used internally by corpus-level generators and can also be composed into custom pipelines:

  • Deduplicator — merges semantically similar items via embeddings
  • Tagger — filters by applicability and specificity
  • UnitTester — validates that items are enforceable
  • Selector — picks a diverse subset via beam search
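The deduplication idea can be sketched as a greedy cosine-similarity filter. This is a toy illustration with hand-made vectors, not the library's Deduplicator implementation, which uses real embeddings:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def deduplicate(items, embeddings, threshold=0.9):
    """Greedily keep an item only if no already-kept item is too similar."""
    kept, kept_vecs = [], []
    for item, vec in zip(items, embeddings):
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(item)
            kept_vecs.append(vec)
    return kept

items = [
    "Mentions autumn imagery",
    "References the fall season",  # near-duplicate of the first item
    "Has 5-7-5 syllable structure",
]
vecs = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0)]  # toy embeddings
unique = deduplicate(items, vecs)  # the near-duplicate second item is dropped
```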

Installation

uv pip install autochecklist

For full setup options (source install, editable mode, vLLM extra, .env keys), see Installation.

Start Here

  1. New users: Quick Start
  2. Composing custom evaluation flows: Pipeline Guide
  3. Running from terminal: CLI Guide
  4. Picking a method from papers: Supported Pipelines

Minimal Example

from autochecklist import pipeline

pipe = pipeline("tick", generator_model="openai/gpt-5-mini", scorer_model="openai/gpt-5-mini")
result = pipe(
    input="Write a haiku about autumn.",
    target="Leaves drift through cool dusk; amber fields breathe into night; geese stitch quiet skies.",
)
print(f"Pass rate: {result.pass_rate:.0%}")

CLI Example

# Full evaluation (generate + score)
autochecklist run --pipeline tick --data eval_data.jsonl -o results.jsonl \
  --generator-model openai/gpt-4o-mini --scorer-model openai/gpt-4o-mini

# Generate checklists only
autochecklist generate --pipeline tick --data inputs.jsonl -o checklists.jsonl \
  --generator-model openai/gpt-4o-mini

# Score with existing checklist
autochecklist score --data eval_data.jsonl --checklist checklist.json \
  -o results.jsonl --scorer-model openai/gpt-4o-mini

# List available pipelines
autochecklist list

For all CLI flags and resumable runs, see CLI Guide.

Examples

Detailed examples with runnable code are available in the documentation.

Citation

TBA

License

Apache-2.0 (see LICENSE)