Refiners¶
What is a Refiner?¶
A refiner is an optional pipeline stage that sits between generation and scoring. It takes a Checklist in and returns a refined Checklist out — pruning low-quality questions, merging duplicates, or selecting a diverse subset.
Refiners are primarily designed for corpus-level checklists, where a generator may produce dozens or hundreds of candidate questions from broad input (feedback comments, evaluation dimensions, etc.) and you need to whittle them down to a high-quality, non-redundant set.
Standalone Refiners vs Built-in Refinement¶
There are two ways refinement happens in autochecklist:
Standalone refiners — the four refiner classes (Deduplicator, Tagger, UnitTester, Selector) that you can explicitly add to a pipeline or chain manually.
Built-in refinement — some corpus-level generators use these same components internally as part of their generation pipeline:
- InductiveGenerator: runs Deduplicator → Tagger → Selector internally. Control with the `skip_dedup`, `skip_tagging`, and `skip_selection` flags.
- DeductiveGenerator: has augmentation (seed/elaboration/diversification) and filtering (alignment check, dimension consistency, redundancy removal via Deduplicator) built in. Control with the `augmentation_mode` and `apply_filtering` params.
- InteractiveGenerator: runs a 5-stage pipeline with validation as the final stage. No skip flags; the stages are integral to the method.
When to Use Standalone Refiners
Use standalone refiners when you want to:
- Apply refinement to instance-level checklists
- Add refinement stages that a corpus generator doesn't include by default (e.g., adding `UnitTester` to `InductiveGenerator`, which doesn't run it by default)
- Override or replace a generator's built-in refinement with your own configuration
The Four Standalone Refiners¶
Deduplicator¶
Merges semantically similar questions using embedding-based similarity detection.
Algorithm:
- Compute embeddings for all questions (via OpenAI embeddings API)
- Build a similarity graph — add an edge between questions whose cosine similarity ≥ threshold
- Find connected components (clusters of similar questions)
- Single-question clusters are kept as-is
- Multi-question clusters are merged by the LLM into one representative question
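The graph-clustering step above can be sketched in plain Python. This is a toy illustration, not the library's implementation: `cosine` and `duplicate_clusters` are invented names, and the 2-d vectors stand in for the real embeddings the Deduplicator fetches from the OpenAI API.

```python
import itertools

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def duplicate_clusters(embeddings, threshold=0.85):
    """Group question indices whose pairwise cosine similarity meets the
    threshold, via union-find connected components."""
    parent = list(range(len(embeddings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Add an edge (union) for every pair above the similarity cutoff
    for i, j in itertools.combinations(range(len(embeddings)), 2):
        if cosine(embeddings[i], embeddings[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(embeddings)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Toy embeddings: questions 0 and 1 are near-duplicates, 2 is distinct
vecs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
print(duplicate_clusters(vecs))  # [[0, 1], [2]]
```

Multi-member clusters (like `[0, 1]` here) would then each be handed to the LLM to merge into one representative question.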
Key parameters:
| Parameter | Default | Description |
|---|---|---|
| `similarity_threshold` | `0.85` | Cosine similarity cutoff for considering questions duplicates |
Note: currently the Deduplicator uses the OpenAI Embeddings API directly and does not support other providers.
```python
from autochecklist.refiners import Deduplicator

dedup = Deduplicator(model="openai/gpt-5-mini", similarity_threshold=0.85)
refined = dedup.refine(checklist)
```
Tagger¶
Filters questions by two quality criteria using zero-shot chain-of-thought LLM classification. A question must pass both criteria to be kept:
- Generally applicable — the question can be answered Yes/No for any input (no N/A scenarios)
- Section specific — the question evaluates a single focused aspect (not compound or cross-referencing)
```python
from autochecklist.refiners import Tagger

tagger = Tagger(model="openai/gpt-5-mini")
refined = tagger.refine(checklist)
print(f"Filtered {len(checklist.items) - len(refined.items)} questions")
```
UnitTester¶
Validates enforceability — can an LLM scorer reliably distinguish responses that pass vs. fail each criterion?
Algorithm:
- For each question, find sample responses that scored YES (passing)
- LLM rewrites each passing sample to intentionally fail the criterion
- Score the rewritten sample — it should now score NO
- Compute `enforceability_rate = correct_failures / total_passing_samples`
- Filter out questions below the threshold
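The rate computation at the end is simple to sketch. This is an illustration of the formula only (the `enforceability_rate` helper is invented here; the actual rewriting and rescoring are LLM calls inside UnitTester):

```python
def enforceability_rate(rewritten_scores):
    """Fraction of adversarially rewritten samples the scorer correctly
    flips to NO. Each entry is the score a rewritten sample received."""
    if not rewritten_scores:
        return 0.0
    return sum(1 for s in rewritten_scores if s == "NO") / len(rewritten_scores)

# 10 passing samples were rewritten to fail; the scorer caught 8 of them
scores = ["NO"] * 8 + ["YES"] * 2
rate = enforceability_rate(scores)   # 0.8
keep_question = rate >= 0.7          # True: clears the default threshold
```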
Key parameters:
| Parameter | Default | Description |
|---|---|---|
| `enforceability_threshold` | `0.7` | Minimum enforceability rate to keep a question |
| `max_samples` | `10` | Maximum passing samples to test per question |
Required Inputs
UnitTester requires pre-existing sample data and scores — you need response texts and their checklist scores from a prior scoring run. This makes it a later-stage refiner, best used after you've done an initial round of scoring.
```python
from autochecklist.refiners import UnitTester

tester = UnitTester(
    model="openai/gpt-5-mini",
    enforceability_threshold=0.7,
    max_samples=10,
)
refined = tester.refine(checklist, samples=samples, sample_scores=scores)
```
Selector¶
Selects an optimally diverse subset of questions via beam search over embedding similarity.
Algorithm:
The objective function balances diversity against checklist length:

`score = Diversity − λ × num_questions`

where `Diversity = 1 - average_pairwise_cosine_similarity` and λ is the length penalty. Beam search explores candidate subsets, adding one question at a time and keeping the top `beam_width` candidates at each step.
If the checklist already has fewer questions than max_questions, it's returned unchanged.
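The search can be sketched as follows, under the objective described above. This is a toy illustration with invented names (`diversity`, `select_diverse`) and 2-d stand-in vectors; the real Selector scores subsets of API embeddings.

```python
import itertools

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5))

def diversity(subset, embeddings):
    """1 - average pairwise cosine similarity of the chosen questions."""
    if len(subset) < 2:
        return 1.0
    pairs = list(itertools.combinations(subset, 2))
    avg = sum(cosine(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)
    return 1.0 - avg

def select_diverse(embeddings, max_questions, beam_width=5, length_penalty=0.0005):
    """Beam search: grow subsets one question at a time, keeping the
    beam_width highest-scoring subsets at each step."""
    n = len(embeddings)
    if n <= max_questions:
        return list(range(n))  # already small enough: returned unchanged

    def score(subset):
        return diversity(subset, embeddings) - length_penalty * len(subset)

    beam = [()]
    for _ in range(max_questions):
        candidates = {
            tuple(sorted(subset + (i,)))
            for subset in beam
            for i in range(n)
            if i not in subset
        }
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
    return list(max(beam, key=score))

# Three near-duplicate questions plus one orthogonal one; picking 2
# keeps one from the cluster and the distinct question
vecs = [[1.0, 0.0], [0.99, 0.1], [0.98, 0.2], [0.0, 1.0]]
print(select_diverse(vecs, max_questions=2))  # [0, 3]
```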
Key parameters:
| Parameter | Default | Description |
|---|---|---|
| `max_questions` | `20` | Maximum questions to select |
| `beam_width` | `5` | Number of candidate subsets to track during search |
| `length_penalty` | `0.0005` | Penalty per additional question (λ) |
On Coverage vs Diversity
The original Feedback paper optimized selection on assignment matrices — maximizing coverage of input feedback by ensuring each comment maps to at least one selected question. Our Selector simplifies this to embedding diversity as a proxy for coverage. The source_feedback_indices metadata is tracked during generation but not used in the selection optimization.
```python
from autochecklist.refiners import Selector

selector = Selector(max_questions=15, beam_width=5)
refined = selector.refine(checklist)
```
Chaining Refiners¶
The recommended order for chaining refiners is:
- Deduplicator first — reduces volume, saving API calls in later stages
- Tagger second — removes inapplicable or unfocused questions
- UnitTester third — validates enforceability (expensive, so run after pruning)
- Selector last — picks the optimal diverse subset from the refined pool
Via Pipeline¶
```python
from autochecklist import pipeline

pipe = pipeline(
    "feedback",
    model="openai/gpt-5-mini",
    refiners=["deduplicator", "tagger", "selector"],
)
```
Manual Chaining¶
```python
from autochecklist.refiners import Deduplicator, Tagger, Selector

checklist = generator.generate(observations=[...])
checklist = Deduplicator(model="openai/gpt-5-mini").refine(checklist)
checklist = Tagger(model="openai/gpt-5-mini").refine(checklist)
checklist = Selector(max_questions=15).refine(checklist)
```