Pipeline & Composability

The Composability Model

autochecklist is built around a three-stage pipeline where each component is independently configurable and replaceable:

Generator → Refiners → Scorer

Like UNIX pipes, each stage has a clear interface: generators produce Checklist objects, refiners take a Checklist in and return a refined Checklist out, and scorers take a Checklist plus a target response and return a Score. You can swap any component without affecting the others.
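Conceptually, a full pipeline call just chains these three stages. A minimal sketch of the flow (illustrative only, not the actual ChecklistPipeline implementation):

# Illustrative sketch of the three-stage flow; real pipelines also handle
# batching, refiner chains, and configuration on top of this.
def run_stages(generator, refiners, scorer, input, target):
    checklist = generator.generate(input=input)        # Generator -> Checklist
    for refiner in refiners:
        checklist = refiner.refine(checklist)           # Refiner: Checklist in, refined Checklist out
    return scorer.score(checklist, target=target)       # Scorer: Checklist + response -> Score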

Built-in pipelines provide sensible defaults — each pipeline pairs a generator with its recommended scorer — but you can override any component.

pipeline() vs ChecklistPipeline

Both create the same pipeline object. The difference is whether the default scorer is auto-attached:

| Construction | Generator | Default Scorer | Use Case |
|---|---|---|---|
| pipeline("tick") | Resolved | Auto-attached | Quick start with sensible defaults |
| ChecklistPipeline(from_preset="tick") | Resolved | Auto-attached | Same as pipeline() |
| ChecklistPipeline(generator="tick") | Resolved | None | Explicit composition — bring your own scorer |
| ChecklistPipeline(generator=my_gen) | Instance | None | Pre-configured components |

pipeline() is equivalent to ChecklistPipeline(from_preset=...). All three component types — generator, scorer, and refiners — accept name strings or pre-configured instances.
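For example, the four construction forms in the table look like this (model arguments omitted for brevity; my_gen and my_scorer stand for pre-configured component instances, as in Explicit Composition below):

from autochecklist import pipeline, ChecklistPipeline

pipe = pipeline("tick")                                        # default scorer auto-attached
pipe = ChecklistPipeline(from_preset="tick")                   # same as pipeline("tick")
pipe = ChecklistPipeline(generator="tick", scorer="item")      # explicit composition
pipe = ChecklistPipeline(generator=my_gen, scorer=my_scorer)   # pre-configured instances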

The pipeline() Factory

The pipeline() function is the recommended entry point. Pass a pipeline name and it auto-selects the generator class, prompt templates, and default scorer:

from autochecklist import pipeline

pipe = pipeline("tick", generator_model="openai/gpt-5-mini", scorer_model="openai/gpt-5-mini")

Key parameters:

| Parameter | Description |
|---|---|
| task | Pipeline name: "tick", "rocketeval", "rlcf_direct", "rlcf_candidate", "rlcf_candidates_only", "feedback", "checkeval", "interacteval" |
| generator_model | LLM model for checklist generation (e.g., "openai/gpt-5-mini") |
| scorer_model | LLM model for scoring (optional; inherits provider defaults if omitted) |
| scorer | Override default scorer: "batch", "item", "weighted", "normalized" |
| refiners | List of refiner names: ["deduplicator", "tagger", "selector"] |
| provider | LLM provider: "openrouter" (default), "openai", "vllm" |
| base_url | Custom API endpoint |
| client | Pre-configured LLM client (e.g., VLLMOfflineClient) |
| api_format | "chat" (default) or "responses" (OpenAI Responses API) |
| generator_kwargs | Extra keyword arguments passed to the generator (e.g., candidate_provider, candidate_base_url) |
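A call combining several of these parameters might look like this (the values are illustrative):

pipe = pipeline(
    "tick",
    generator_model="openai/gpt-5-mini",
    scorer_model="openai/gpt-5-mini",
    scorer="item",                          # override the default "batch" scorer
    refiners=["deduplicator", "tagger"],    # refinement stages between generation and scoring
    provider="openrouter",                  # default provider
)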

Single Evaluation

Call the pipeline directly with input and target to generate a checklist and score in one step:

result = pipe(input="Write a haiku about autumn.", target="Leaves fall gently...")

print(result.pass_rate)        # 0.83
print(result.checklist)        # the generated Checklist
print(result.score)            # the full Score object

The returned PipelineResult gives access to:

  • result.checklist — the generated (and refined, if applicable) checklist
  • result.score — the full Score object
  • result.score.primary_score — the primary metric value (aliases pass_rate, weighted_score, or normalized_score depending on pipeline)
  • result.pass_rate, result.weighted_score, result.normalized_score — convenience accessors

Generate Only

To inspect or save the checklist before scoring:

checklist = pipe.generate(input="Write a haiku about autumn.")
print(checklist.to_text())

# Score later with the same or different pipeline
score = pipe.score(checklist, target="Leaves fall gently...", input="Write a haiku about autumn.")

Batch Evaluation

Instance-Level: Per-Input Generation + Scoring

For instance-level generators, run_batch generates a unique checklist for each input and scores it:

data = [
    {"input": "Write a haiku", "target": "Leaves fall..."},
    {"input": "Write a limerick", "target": "There once was..."},
]
result = pipe.run_batch(data, show_progress=True)
print(f"Mean score: {result.mean_score:.0%}")
print(f"Macro pass rate: {result.macro_pass_rate:.0%}")
print(f"Micro pass rate (DRFR): {result.micro_pass_rate:.0%}")

mean_score averages Score.primary_score across all examples — it respects the pipeline's primary_metric setting. For pass-rate pipelines it equals macro_pass_rate; for weighted pipelines it averages weighted scores; for normalized pipelines it averages normalized scores.

micro_pass_rate computes the Decomposed Requirements Following Ratio (DRFR) from InFoBench — the total number of YES answers divided by the total number of applicable items across all examples. This differs from macro_pass_rate (mean of per-example pass rates) by giving equal weight to each checklist item rather than each example.
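To make the difference concrete, suppose (hypothetically) one example passes 4 of 4 applicable items and another passes 1 of 5:

# Hypothetical item counts for two examples
yes_counts        = [4, 1]    # YES answers per example
applicable_counts = [4, 5]    # applicable items per example

macro = sum(y / n for y, n in zip(yes_counts, applicable_counts)) / len(yes_counts)
micro = sum(yes_counts) / sum(applicable_counts)

print(macro)   # 0.60 -> mean of per-example pass rates (1.0 and 0.2)
print(micro)   # 0.56 -> total YES / total applicable (5/9), the DRFR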

Corpus-Level: Shared Checklist Scoring

For corpus-level generators, generate one checklist first, then score all responses against it:

pipe = pipeline("feedback", generator_model="openai/gpt-5-mini", scorer_model="openai/gpt-5-mini")
checklist = pipe.generate(observations=["Too verbose", "Missing examples", ...])

result = pipe.run_batch(data, checklist=checklist, show_progress=True)

BatchResult

The BatchResult object provides:

result.mean_score        # mean of Score.primary_score (respects primary_metric)
result.macro_pass_rate   # macro-average: mean of per-example pass rates
result.micro_pass_rate   # micro-average (DRFR): total_yes / total_applicable across all items
result.scores            # list of individual Score objects

# Export
df = result.to_dataframe()
result.to_jsonl("results.jsonl")

Adding Refiners

Pass refiner names to add refinement stages between generation and scoring:

pipe = pipeline("feedback", generator_model="openai/gpt-5-mini", scorer_model="openai/gpt-5-mini",
                refiners=["deduplicator", "tagger", "selector"])

See Refiners for details on each refiner and recommended chaining order.

Overriding Components

# Override scorer
pipe = pipeline("tick", generator_model="openai/gpt-5-mini", scorer="item")

# Use OpenAI directly instead of OpenRouter
pipe = pipeline("tick", provider="openai", generator_model="gpt-5-mini")

# Use vLLM server
pipe = pipeline("tick", provider="vllm", base_url="http://gpu:8000/v1",
                generator_model="meta-llama/Llama-3-8B")

# Separate provider for candidate generation (ContrastiveGenerator only)
pipe = pipeline("rlcf_candidate", generator_model="openai/gpt-5-mini",
                generator_kwargs={"candidate_provider": "vllm",
                                  "candidate_base_url": "http://gpu:8000/v1"})

Explicit Composition

Use ChecklistPipeline directly when you want to pick each component yourself. All three accept name strings:

from autochecklist import ChecklistPipeline

pipe = ChecklistPipeline(generator="tick", scorer="item",
                         generator_model="openai/gpt-5-mini", scorer_model="openai/gpt-5-mini")

For components that need configuration beyond what a name string provides (e.g., ContrastiveGenerator with candidate_models), pass a pre-configured instance:

from autochecklist import ChecklistPipeline, ContrastiveGenerator

gen = ContrastiveGenerator(
    method_name="rlcf_candidate",
    model="openai/gpt-5-mini",
    candidate_models=["openai/gpt-5-mini", "anthropic/claude-3-haiku"],
)
pipe = ChecklistPipeline(generator=gen, scorer="weighted", scorer_model="openai/gpt-5-mini")
result = pipe(input="...", target="...", reference="...")

You can also use components entirely independently, outside any pipeline:

from autochecklist import DirectGenerator, ChecklistScorer
from autochecklist.refiners import Deduplicator

gen = DirectGenerator(method_name="tick", model="openai/gpt-5-mini")
checklist = gen.generate(input="Write a haiku")
refined = Deduplicator(model="openai/gpt-5-mini").refine(checklist)
score = ChecklistScorer(mode="batch", model="openai/gpt-5-mini").score(refined, target="Leaves fall...")

Generator-Scorer Pairing

Each pipeline has a default scorer chosen to match the original paper's methodology. You can override any pairing.

| Pipeline | Level | Default Scorer | Primary Metric |
|---|---|---|---|
| tick | Instance | batch | pass_rate |
| rocketeval | Instance | normalized | normalized_score |
| rlcf_direct | Instance | weighted | weighted_score |
| rlcf_candidate | Instance | weighted | weighted_score |
| rlcf_candidates_only | Instance | weighted | weighted_score |
| feedback | Corpus | batch | pass_rate |
| checkeval | Corpus | batch | pass_rate |
| interacteval | Corpus | batch | pass_rate |

Override with: pipeline("tick", scorer="item")
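Since the primary metric follows the scorer, read the matching accessor for each pipeline. For example, rocketeval defaults to the normalized scorer:

pipe = pipeline("rocketeval", generator_model="openai/gpt-5-mini", scorer_model="openai/gpt-5-mini")
result = pipe(input="Write a haiku about autumn.", target="Leaves fall gently...")

print(result.normalized_score)       # primary metric for this pipeline
print(result.score.primary_score)    # same value via the generic accessor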

Custom Pipeline Registration

Register a custom pipeline (generator + scorer + prompts) as a reusable preset. Once registered, it works just like a built-in pipeline:

from autochecklist import register_custom_pipeline, pipeline

# Register from config values with scorer settings
register_custom_pipeline(
    "my_eval",
    generator_prompt="Generate yes/no questions for:\n\n{input}",
    scorer_mode="item",
    primary_metric="weighted",
)
pipe = pipeline("my_eval", generator_model="openai/gpt-5-mini")

# With logprobs-based scoring (auto-enabled by primary_metric="normalized")
register_custom_pipeline(
    "my_logprob_eval",
    generator_prompt="Generate yes/no questions for:\n\n{input}",
    scorer_mode="item",
    primary_metric="normalized",
)

# Or register from an existing pipeline instance
from autochecklist import ChecklistPipeline, DirectGenerator, ChecklistScorer
pipe = ChecklistPipeline(
    generator=DirectGenerator(custom_prompt="...", model="openai/gpt-5-mini"),
    scorer=ChecklistScorer(mode="item", primary_metric="weighted", model="openai/gpt-5-mini"),
)
register_custom_pipeline("my_eval_v2", pipe)

Scorer config parameters:

| Parameter | Values | Description |
|---|---|---|
| scorer_mode | "batch", "item" | All items in one call vs. one per call |
| primary_metric | "pass", "weighted", "normalized" | Which metric Score.primary_score aliases. "normalized" auto-enables logprobs. |
| capture_reasoning | True / False | Include per-item reasoning |
| scorer_prompt | text, built-in name ("rlcf", "rocketeval"), or Path | Custom scorer prompt. null/omitted = use default prompt for the mode. |
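A sketch using these parameters together (assuming register_custom_pipeline accepts scorer_prompt and capture_reasoning as keyword arguments, as the JSON config below suggests):

from autochecklist import register_custom_pipeline

register_custom_pipeline(
    "my_reasoned_eval",
    generator_prompt="Generate yes/no questions for:\n\n{input}",
    scorer_mode="item",
    primary_metric="pass",
    capture_reasoning=True,      # include per-item reasoning in the Score
    scorer_prompt="rlcf",        # built-in scorer prompt, referenced by name
)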

Save / Load Pipeline Configs

Export and share pipeline configs as JSON:

from autochecklist import save_pipeline_config, load_pipeline_config

save_pipeline_config("my_eval", "my_eval.json")

# Later, or on another machine:
name = load_pipeline_config("my_eval.json")
pipe = pipeline(name, generator_model="openai/gpt-5-mini")

JSON format:

{
  "name": "my_eval",
  "generator_class": "direct",
  "generator_prompt": "Generate yes/no questions for:\n\n{input}",
  "scorer_mode": "item",
  "scorer_prompt": null,
  "primary_metric": "weighted",
  "capture_reasoning": false
}

Field notes:

  • scorer_mode: null = no default scorer attached; "batch" or "item" = attach a scorer.
  • scorer_prompt: null = use the default prompt for the mode. Accepts a built-in name ("rlcf", "rocketeval") or inline prompt text; resolved automatically at pipeline construction time.
  • primary_metric: null = defaults to "pass". Other options: "weighted", "normalized". Setting "normalized" auto-enables logprobs for confidence-calibrated scoring.

Override Protection

Built-in pipelines (tick, rocketeval, etc.) are protected. Overriding raises ValueError unless you pass force=True:

register_custom_pipeline("tick", generator_prompt="...", force=True)  # warns but succeeds

Custom pipelines can be overridden silently.
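Without force=True, attempting to overwrite a built-in name raises the error (sketch):

try:
    register_custom_pipeline("tick", generator_prompt="...")
except ValueError:
    print("built-in pipeline names are protected; pass force=True to override")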

Component Discovery

from autochecklist import list_generators, get_generator

print(list_generators())  # ['tick', 'rocketeval', 'rlcf_direct', ..., 'my_eval']
gen_cls = get_generator("tick")