# Command-Line Interface
AutoChecklist provides a CLI for running evaluations directly from the terminal.
## Installation
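Assuming the package is published on PyPI under the same name as the CLI (check the project's install docs if it differs), installation is a single pip command:

```bash
# Assumed package name; adjust if the PyPI distribution is named differently
pip install autochecklist
```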
After installation, the autochecklist command is available globally.
## Fastest Working Flow
```bash
# 1) Generate + score with a built-in pipeline
autochecklist run --pipeline tick --data eval_data.jsonl -o results.jsonl \
    --generator-model openai/gpt-4o-mini --scorer-model openai/gpt-4o-mini

# 2) Inspect available built-ins and components
autochecklist list
```
## API Keys

AutoChecklist needs an API key for the LLM provider. Three options (in order of precedence):

1. `--api-key` flag — pass the key directly on the command line.
2. Environment variable — `export OPENROUTER_API_KEY=sk-or-...` (or `OPENAI_API_KEY` for the OpenAI provider).
3. `.env` file — create a `.env` file in your working directory with `OPENROUTER_API_KEY=sk-or-...`.
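The `.env` option, for example, can be set up with a single shell command (or any text editor):

```bash
# Create a .env file in the current directory holding the OpenRouter key
echo 'OPENROUTER_API_KEY=sk-or-...' > .env
```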
## Subcommands
### `autochecklist run` — Full evaluation
Generate checklists and score targets in one step.
```bash
autochecklist run --pipeline tick --data eval_data.jsonl -o results.jsonl \
    --generator-model openai/gpt-4o-mini --scorer-model openai/gpt-4o-mini
```
| Flag | Required | Description |
|---|---|---|
| `--pipeline` | No* | Pipeline name (`tick`, `rocketeval`, `rlcf_direct`, etc.) |
| `--config` | No | Path to pipeline config JSON file (see Pipeline Configs) |
| `--data` | Yes | Path to input JSONL file |
| `-o, --output` | No | Output JSONL path (enables resume on re-run) |
| `--overwrite` | No | Overwrite output instead of resuming |
| `--generator-model` | No | Model for generation (e.g. `openai/gpt-4o-mini`) |
| `--scorer-model` | No | Model for scoring |
| `--scorer` | No | Override default scorer (`batch`, `item`, `weighted`, `normalized`) |
| `--generator-prompt` | No | Path to custom generator prompt (`.md` file) |
| `--scorer-prompt` | No | Path to custom scorer prompt (`.md` file) |
| `--provider` | No | LLM provider: `openrouter` (default), `openai`, `vllm` |
| `--base-url` | No | Custom base URL (e.g. vLLM server) |
| `--api-key` | No | API key (default: from environment) |
| `--api-format` | No | API format: `chat` (default), `responses` |
| `--input-key` | No | JSONL key for input field (default: `input`) |
| `--target-key` | No | JSONL key for target field (default: `target`) |

*Provide one of: `--pipeline`, `--generator-prompt`, or `--config`.
### `autochecklist generate` — Checklists only
Generate checklists without scoring.
```bash
autochecklist generate --pipeline tick --data inputs.jsonl -o checklists.jsonl \
    --generator-model openai/gpt-4o-mini
```
Same flags as `run` (including `--config`), minus `--scorer-model`, `--scorer`, `--scorer-prompt`, and `--target-key`.
### `autochecklist score` — Score only
Score targets against a pre-existing checklist.
```bash
autochecklist score --data eval_data.jsonl --checklist checklist.json \
    -o results.jsonl --scorer-model openai/gpt-4o-mini
```
| Flag | Required | Description |
|---|---|---|
| `--data` | Yes | JSONL file with `target` field |
| `--checklist` | Yes | Path to checklist JSON file |
| `--scorer` | No | Scorer type (default: `batch`) |
| `--scorer-model` | No | Model for scoring |
| `--scorer-prompt` | No | Path to custom scorer prompt (`.md` file) |
| `-o, --output` | No | Output JSONL path |
| `--overwrite` | No | Overwrite output |
Plus the same provider flags (`--provider`, `--base-url`, `--api-key`, `--api-format`, `--input-key`, `--target-key`).
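As a sketch of those provider flags in use, scoring against a locally hosted vLLM server might look like the following (the base URL assumes vLLM's default OpenAI-compatible endpoint and the model name is a placeholder; adjust both to your deployment):

```bash
autochecklist score --data eval_data.jsonl --checklist checklist.json -o results.jsonl \
    --provider vllm --base-url http://localhost:8000/v1 \
    --scorer-model meta-llama/Llama-3.1-8B-Instruct
```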
### Choosing the Right Subcommand
| Goal | Command |
|---|---|
| Generate and score in one pass | `autochecklist run` |
| Generate checklists only | `autochecklist generate` |
| Score using an existing checklist | `autochecklist score` |
| List available components | `autochecklist list` |
### `autochecklist list` — Discover components
```bash
autochecklist list                        # list generators (default)
autochecklist list --component scorers    # list scorers
autochecklist list --component refiners   # list refiners
```
## Input Format
The input JSONL file should have one JSON object per line:
{"input": "Write a haiku about nature", "target": "Leaves fall gently down..."}
{"input": "Write a greeting", "target": "Hello! How are you?"}
Use `--input-key` and `--target-key` if your data uses different field names:
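For instance, if each record stores the prompt under `prompt` and the model output under `response` (hypothetical field names for illustration):

```bash
autochecklist run --pipeline tick --data data.jsonl -o results.jsonl \
    --input-key prompt --target-key response \
    --generator-model openai/gpt-4o-mini --scorer-model openai/gpt-4o-mini
```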
## Custom Prompts
Use `--generator-prompt` and `--scorer-prompt` to pass custom prompt templates without modifying code:
```bash
autochecklist run --generator-prompt my_eval.md --data data.jsonl \
    --generator-model openai/gpt-4o-mini --scorer-model openai/gpt-4o-mini
```
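As a minimal sketch of such a template (assuming the `.md` file uses the same `{input}` placeholder shown in the Pipeline Configs example below), `my_eval.md` might contain:

```markdown
Generate yes/no evaluation questions that check whether a response
satisfies the following request:

{input}
```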
See Custom Prompts for prompt template format details.
## Resumable Runs
When `--output` is set, results are written incrementally. If interrupted, re-running the same command resumes from where it left off. Use `--overwrite` to start fresh.
```bash
# First run (interrupted after 50/100 examples)
autochecklist run --pipeline tick --data data.jsonl -o results.jsonl --generator-model openai/gpt-4o-mini

# Resume (picks up at example 51)
autochecklist run --pipeline tick --data data.jsonl -o results.jsonl --generator-model openai/gpt-4o-mini

# Start over
autochecklist run --pipeline tick --data data.jsonl -o results.jsonl --overwrite --generator-model openai/gpt-4o-mini
```
## Pipeline Configs
Save and reuse custom pipeline configurations as JSON files. This is useful for sharing evaluation setups across teams or projects.
```bash
# Run with a pipeline config file
autochecklist run --config my_pipeline.json --data data.jsonl -o results.jsonl \
    --generator-model openai/gpt-4o-mini --scorer-model openai/gpt-4o-mini
```
Config JSON format:
```json
{
  "name": "my-eval",
  "generator_class": "direct",
  "generator_prompt": "Generate yes/no evaluation questions for:\n\n{input}",
  "scorer_mode": "item",
  "scorer_prompt": null,
  "primary_metric": "weighted",
  "capture_reasoning": false
}
```
See `register_custom_pipeline()` and `save_pipeline_config()` in the Python API for creating configs programmatically.