# OrderGuard
Permutation-marginalized LLM judging, reranking, and tool selection that reduces order sensitivity at inference time.
Candidate order is not a meaningful signal, yet many LLM pipelines accidentally treat it as one.
When you use an LLM to pick one option out of N (LLM-as-a-judge, RAG reranking, agent tool/action selection), merely reordering the exact same options can silently change the winner and make systems flaky.
OrderGuard is a training-free inference wrapper that reduces order sensitivity by (approximately) marginalizing over permutations using forced-choice logprob scoring plus an adaptive early-stop.
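As a sketch of what "forced-choice logprob scoring" means here: present the options under letter labels and read off the logprob the model assigns to each label as the single next token. The `next_token_logprobs` interface and the prompt format below are illustrative assumptions, not OrderGuard's actual API:

```python
import string

def forced_choice_logprobs(next_token_logprobs, question, choices):
    """Score each option by the logprob of its label as the forced next token.

    `next_token_logprobs(prompt)` is a hypothetical interface: it returns a
    dict mapping candidate next-token strings to logprobs. In practice this
    would be read from the model's final-position logits, and the exact label
    tokenization (e.g. "A" vs " A") is model-dependent.
    """
    labels = list(string.ascii_uppercase[: len(choices)])
    # Build a standard multiple-choice prompt: question, labeled options, cue.
    lines = [question] + [f"{lab}. {text}" for lab, text in zip(labels, choices)]
    prompt = "\n".join(lines) + "\nAnswer:"
    lp = next_token_logprobs(prompt)
    # One logprob per presented option, in presentation order.
    return [lp.get(lab, float("-inf")) for lab in labels]
```

Because the score is read from logits rather than sampled text, a single forward pass per permutation yields a full distribution over options.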
## Key ideas
- Permutation-group averaging (inference-time invariance): treat option order as a nuisance variable and marginalize it out by aggregating scores over permutations.
- Low-variance design (LaTIn): use a position-balanced cyclic schedule (Latin-square style) to reduce variance vs random shuffles.
- Adaptive stopping: stop when the aggregated distribution stabilizes (JS divergence threshold), allocating more test-time compute only to hard examples.
## Headline results (Qwen3, reproducible)
- Single-shot multiple-choice is extremely order-sensitive: with just 10 random shuffles, the predicted winner flips on 58–75% of examples (depending on the model).
- OrderGuard improves macro accuracy by +2.8 to +4.6 pp (and up to +7.6 pp on a single dataset).
Biggest per-dataset gains (accuracy, absolute pp vs. the single-shot baseline):
| Model | Dataset | Gain | Method |
|---|---|---|---|
| Qwen/Qwen3-0.6B | OpenBookQA | +7.6 pp | LaTIn |
| Qwen/Qwen3-0.6B | TruthfulQA(MC1) | +7.0 pp | PCons |
| Qwen/Qwen3-0.6B | CSQA | +6.6 pp | PCons |
| Qwen/Qwen3-1.7B | HellaSwag | +7.4 pp | LaTIn |
| Qwen/Qwen3-1.7B | OpenBookQA | +5.2 pp | LaTIn |
| Qwen/Qwen3-1.7B | MMLU(all) | +3.6 pp | LaTIn |
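The flip-rate numbers above come from exactly this kind of measurement. A hedged sketch of how one might estimate a winner flip rate under random shuffles, where `predict` is a hypothetical single-shot selector (not OrderGuard's evaluation code):

```python
import random

def order_flip_rate(predict, examples, n_shuffles=10, seed=0):
    """Fraction of examples whose predicted winner changes under random
    shuffles of the same choices.

    `predict(question, choices)` is a hypothetical single-shot selector
    returning an index into `choices`.
    """
    rng = random.Random(seed)
    flipped = 0
    for question, choices in examples:
        base = predict(question, choices)
        for _ in range(n_shuffles):
            order = rng.sample(range(len(choices)), len(choices))
            pred = predict(question, [choices[i] for i in order])
            if order[pred] != base:  # map back to the original index
                flipped += 1        # any disagreement marks the example
                break
    return flipped / len(examples)
```

A perfectly order-invariant selector scores 0.0 here; a selector with pure position bias flips on essentially every example.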
## Minimal API

```python
from orderguard.methods import latin_consensus
from orderguard.modeling import load_lm

lm = load_lm("Qwen/Qwen3-1.7B", torch_dtype=None)

question = "Pick the best next tool for: extract the answer from a table."
choices = [
    "WebSearch: use the browser to find information online.",
    "Calculator: do arithmetic precisely.",
    "TableParser: read structured tables and extract fields.",
    "WriteCode: write a short script to compute the result.",
]

res = latin_consensus(lm, question, choices, max_perms=7, min_perms=3,
                      js_eps=0.005, seed=0)
print("winner:", choices[res.pred_index], "perms_used:", res.meta["perms_used"])
```