Codifying the Judge: Scalable Evaluation via Program Distillation

Huang, Tzu-Heng; Qiu, Shengqi; Sala, Frederic

Tzu-Heng Huang^*, Shengqi Qiu^*, Frederic Sala

^*Equal Contribution

University of Wisconsin-Madison


LLM-as-a-Judge Challenges

LLM-as-a-judge has become the default for automated evaluation—but four fundamental challenges limit its scalability and reliability.

Inference Cost

Proprietary models like GPT-5 or Gemini-3-Pro make evaluation prohibitively expensive at scale. Open-weight deployments eliminate the API bill but remain expensive in latency and GPU demand.

Opaque Logic

LLMs can produce justifications, but their internal decision process is opaque: it is hard to verify whether a verdict reflects genuine reasoning or hallucination.

Systemic Bias

LLM judges favor verbosity, rich formatting, and emotionally charged language. These stylistic signals sway verdicts independently of actual response quality and so undermine reliability.

Re-inference Tax

Prompting pipelines are inflexible. Revising even one rubric criterion requires re-running inference over the entire dataset, incurring redundant cost and wasted cycles.

How do we address these? We propose a single shift: distill the judging logic into executable Python programs.

How PAJAMA Works

Figure. PAJAMA workflow: program synthesis, execution, aggregation, and LLM routing.

Given a query and two candidate responses, a diverse pool of programmatic judges, synthesized by an LLM from curated rubrics, produces initial evaluations. These program outputs are calibrated and selected, and their verdicts are aggregated into a combined decision. Uncertain cases are then routed to an LLM judge to produce the final preference.

1 Program Synthesis

An LLM (e.g., Claude Sonnet 4.6) is prompted once to generate candidate judge programs (we use 10 evaluation rubrics × 8 variants each). Rubrics can cover Relevance, Readability, Completeness, Coherence, Clarity, Structure, and more. Each program is a pure Python function judging_function(query, response) → float using only built-in libraries—no ML, no GPU. A text-similarity filter removes near-duplicate logic to ensure committee diversity.
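To make the shape of a synthesized judge concrete, here is a minimal sketch. Only the signature judging_function(query, response) → float comes from the description above; the "Structure" rubric heuristics, the point weights, and the helper drop_near_duplicates are hypothetical illustrations, not PAJAMA's actual programs. The filter uses difflib from the standard library as an assumed stand-in, since the exact text-similarity measure is not specified here.

```python
import difflib
import re


def judging_function(query: str, response: str) -> float:
    """Hypothetical 'Structure' judge: rewards paragraphs, list items, sentences.

    The query is unused by this particular rubric; other rubrics
    (e.g., Relevance) would compare the response against it.
    """
    if not response.strip():
        return 0.0
    paragraphs = [p for p in response.split("\n\n") if p.strip()]
    list_items = re.findall(r"^[ \t]*(?:[-*]|\d+\.)[ \t]+", response, flags=re.MULTILINE)
    sentences = [s for s in re.split(r"[.!?]+", response) if s.strip()]
    score = 0.0
    score += min(len(paragraphs), 3) * 1.0  # up to 3.0 points for paragraphing
    score += min(len(list_items), 5) * 0.5  # up to 2.5 points for list items
    score += min(len(sentences), 5) * 0.3   # up to 1.5 points for sentence count
    return score


def drop_near_duplicates(sources: list[str], max_ratio: float = 0.9) -> list[str]:
    """Keep a program's source only if it is not too similar to one already kept.

    max_ratio is an assumed cutoff on difflib's similarity ratio in [0, 1].
    """
    kept: list[str] = []
    for src in sources:
        if all(difflib.SequenceMatcher(None, src, k).ratio() < max_ratio for k in kept):
            kept.append(src)
    return kept
```

The raw score has no fixed scale on purpose: the calibration step described next is what maps each program's outputs onto a common range.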

2 Execution & Calibration

Each program scores both responses; min-max normalization maps outputs to [0, 1]. A per-program threshold τ_j converts the difference in quality scores into a discrete vote ∈ {−1, 0, +1}, where 0 signals abstention. Thresholds are tuned on a small held-out validation set. Programs scoring below random chance (50%) are dropped; the top-k by validation accuracy form the final committee.
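The normalization and voting step can be sketched as follows. This is an illustration under stated assumptions: scores are normalized over the pooled outputs of one program, +1 means the first response is preferred, and the function names are hypothetical. Threshold tuning and top-k selection are not shown.

```python
def min_max_normalize(scores: list[float]) -> list[float]:
    """Map one program's raw scores onto [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # A constant program carries no signal; place every score mid-range.
        return [0.5 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def vote(score_a: float, score_b: float, tau: float) -> int:
    """Turn a normalized score difference into a vote in {-1, 0, +1}."""
    diff = score_a - score_b
    if diff > tau:
        return 1   # assumed convention: first response preferred
    if diff < -tau:
        return -1  # second response preferred
    return 0       # difference within the threshold: abstain


def votes_for_pairs(raw_a: list[float], raw_b: list[float], tau: float) -> list[int]:
    """Normalize one program's scores over all responses, then vote per pair."""
    normalized = min_max_normalize(list(raw_a) + list(raw_b))
    n = len(raw_a)
    return [vote(normalized[i], normalized[n + i], tau) for i in range(n)]
```

For example, votes_for_pairs([4.0, 1.0, 2.0], [0.0, 3.0, 2.2], tau=0.1) yields [1, -1, 0]: the third pair differs by only 0.05 after normalization, so the program abstains.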

3 Aggregation via Weak Supervision

Committee votes are treated as noisy labeling functions. A Label Model (e.g., Snorkel framework) learns per-program accuracy weights from agreement and disagreement patterns, then combines votes into a single probabilistic preference label—robust to individual errors and conflicting signals.
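To give a feel for how noisy votes combine into a probabilistic label, here is a deliberately simplified sketch. It is not the Snorkel Label Model: it takes per-program accuracies as given rather than learning them from agreement patterns, and it assumes votes are conditionally independent with a uniform prior. The function name aggregate is hypothetical.

```python
import math


def aggregate(votes: list[int], accuracies: list[float]) -> float:
    """Return P(first response preferred) from accuracy-weighted votes.

    Each non-abstaining vote adds +/- log(acc / (1 - acc)) to the log-odds;
    abstentions (0) contribute nothing.
    """
    log_odds = 0.0
    for v, acc in zip(votes, accuracies):
        acc = min(max(acc, 1e-6), 1.0 - 1e-6)  # keep the logarithm finite
        log_odds += v * math.log(acc / (1.0 - acc))
    # Numerically stable sigmoid.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)
```

Under these assumptions, a single vote from a program with accuracy 0.8 yields a posterior of 0.8, two equally accurate programs that disagree cancel to 0.5, and a committee that abstains entirely also returns 0.5.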

4 Routing Uncertain Cases

When every program abstains or the aggregator's posterior is near 0.5, the pair is flagged as low-confidence and escalated to an LLM judge. Two routing signals are used: vote variance (disagreement among programs) and aggregator posterior (prediction confidence). This hybrid design trades a small fraction of LLM calls for improved accuracy on hard cases.
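The routing rule can be sketched as a small predicate. The two signals (vote variance and aggregator posterior) come from the description above; the cutoffs margin and max_variance, the choice to compute variance over non-abstaining votes only, and the function names are assumptions for illustration.

```python
def vote_variance(votes: list[int]) -> float:
    """Population variance of the non-abstaining votes (0.0 if none voted)."""
    active = [v for v in votes if v != 0]
    if not active:
        return 0.0
    mean = sum(active) / len(active)
    return sum((v - mean) ** 2 for v in active) / len(active)


def should_escalate(
    votes: list[int],
    posterior: float,
    margin: float = 0.1,        # assumed: how close to 0.5 counts as uncertain
    max_variance: float = 0.8,  # assumed: disagreement level that triggers fallback
) -> bool:
    """Decide whether a pair should be sent to the LLM judge."""
    if all(v == 0 for v in votes):
        return True  # every program abstained
    if abs(posterior - 0.5) < margin:
        return True  # aggregator is not confident
    return vote_variance(votes) > max_variance  # programs disagree strongly
```

Sweeping margin or max_variance from strict to permissive moves the system between pure-program and pure-LLM evaluation, which is one way to trace an accuracy versus throughput curve.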

In short, PAJAMA is a framework for building programmatic judges that evaluate the quality of text responses. We validate PAJAMA on five preference datasets across four model families and confirm three claims: (1) programmatic judges match mid-sized LLM judges on accuracy at far higher throughput, (2) hybrid routing strictly advances the accuracy–throughput Pareto frontier, and (3) programmatic labels train competitive reward models at 45–50× lower cost than proprietary supervision.

Results

Can Programmatic Judges Match LLM Judges?

LLM judges are accurate but slow and expensive—often taking seconds per sample and costing dollars per thousand calls. We ask: can a committee of synthesized Python programs match a mid-sized LLM judge on accuracy while running orders of magnitude faster?

Setup. We evaluate on five pairwise preference datasets (JudgeLM, PandaLM, MultiPref, Prometheus, and Preference-700K), sampling 5,000 test examples per dataset with 500 held out for calibration. Programs are synthesized once with Claude Sonnet 4.6. Baselines include proprietary judges (GPT-4.1, GPT-5 Thinking) and four open-weight model families (including OLMo-2, Gemma-3, and Qwen2.5), all served via vLLM at 64 concurrent requests.

Figure. Accuracy vs. throughput across five preference datasets and their average. Programmatic judges occupy a high-throughput regime no LLM can reach.

Programmatic judges reach an average accuracy of 78.11%—on par with OLMo-2-13B-Instruct and Qwen2.5-3B-Instruct—while running 47.25× faster.
On Prometheus, just 8 programs reach 88.78%, matching OLMo-2-7B-Instruct.
Against larger variants (Qwen2.5-14B, Gemma-3-12B), programs run ~50× faster while approaching their accuracy.

Takeaway. A committee of fewer than twenty synthesized programs matches mid-sized LLM judges on accuracy while occupying a throughput regime no LLM can reach—running 2× faster than the smallest model tested while outperforming it by over 30 accuracy points on some datasets.

Can Hybrid Evaluation Push the Pareto Frontier?

Standalone programs handle most pairs efficiently, but ambiguous cases may still benefit from an LLM. We ask: if programs route only the uncertain cases to an LLM, does the hybrid system outperform what either approach achieves alone?

Setup. We evaluate a staged policy where every pair is first scored by the program committee; low-confidence pairs—flagged by vote variance or aggregator posterior—are escalated to an LLM. Sweeping the escalation threshold traces the full hybrid accuracy–throughput curve. We compare against three baselines that route by query length, response length, or random selection.

Figure. Routing within the OLMo-2 family. Accuracy vs. throughput as the escalation threshold is swept across different routing signals for the 1B, 7B, and 13B model sizes. Endpoints represent pure-program (right) and pure-LLM (left) evaluation.

Figure. Hybrid evaluation pushes the Pareto frontier further across 12 LLM judges from OLMo-2, Gemma-3, and Qwen2.5. Each dashed line traces one judge's hybrid trajectory as the threshold is swept; the red envelope is the resulting PAJAMA frontier; the gray envelope is LLM-only. The red envelope dominates the gray one at every throughput level.

PAJAMA-derived signals consistently dominate length-based and random routing baselines at every throughput budget.
For OLMo-2-7B-Instruct, hybrid routing yields +5.0% accuracy at 2.9× higher throughput.
For Qwen2.5-3B-Instruct, hybrid routing yields +2.6% accuracy at 2.2× higher throughput.
Moreover, the fallback mechanism adds negligible latency overhead compared to model-based routers.

Takeaway. Program-derived routing signals strictly dominate every baseline tested: at any throughput level they yield higher accuracy, and at any accuracy level they yield higher throughput. Because the routing decision is derived from the programs already running, it adds negligible overhead—unlike model-based routers that require an additional forward pass.

Are Programmatic Labels Good Enough for Reward Models?

LLM judges are widely used to generate preference labels for reward model training, but proprietary APIs make this expensive at scale. We ask: can programmatic labels replace proprietary LLM supervision for reward model distillation—at a fraction of the cost?

Setup. We sample 20,000 preference pairs from Prometheus and JudgeLM, relabel them with PAJAMA's programs, and fine-tune Qwen2.5-3B-Instruct under the Bradley–Terry objective. We compare against the same pipeline using original GPT-4 labels, evaluating in-domain accuracy and out-of-domain generalization on RewardBench (Chat, Chat Hard, Reasoning).
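For readers unfamiliar with the Bradley–Terry objective, the per-pair loss can be sketched as below. It models the probability that the preferred response wins as sigmoid(r_chosen − r_rejected) and minimizes the negative log of that probability. The scalar rewards here stand in for reward-model outputs; batching, the optimizer, and the fine-tuning loop are not shown, and the function name is hypothetical.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Equals -log(sigmoid(margin)), computed stably as softplus(-margin).
    """
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When both rewards are equal the loss is log 2 (about 0.693); it shrinks as the chosen response's reward exceeds the rejected one's and grows when the ordering is wrong. In this setup, which response counts as "chosen" is determined by the labeling source, either PAJAMA's programs or GPT-4.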

Training Data	Labeling Source	API Calls	Estimated Cost	In-domain Acc.	RewardBench Chat	RewardBench Chat Hard	RewardBench Reasoning	RewardBench Avg.
Prometheus	GPT-4	20,000 samples	$363.97	97.23	68.44	40.79	55.70	54.98
Prometheus	PAJAMA	80 programs (one-time)	$7.21 (50× cheaper)	92.20	79.33	30.04	60.58	56.65 (+1.67)
JudgeLM	GPT-4	20,000 samples	$296.37	90.24	67.88	38.82	65.13	57.28
JudgeLM	PAJAMA	80 programs (one-time)	$6.49 (45× cheaper)	82.79	67.88	44.52	72.91	61.77 (+4.49)

Table. Reward model distillation with PAJAMA vs. GPT-4 labels.

In-domain, PAJAMA labels remain competitive, trailing GPT-4 labels by about 5 to 7.5 points (92.20% vs. 97.23% on Prometheus; 82.79% vs. 90.24% on JudgeLM) at 45–50× lower API cost.
Out-of-domain on RewardBench, PAJAMA-trained models outperform GPT-4-trained ones on average: +1.67 points on Prometheus and +4.49 points on JudgeLM.
The largest gains come from Reasoning (+4.88 on Prometheus, +7.78 on JudgeLM) and Chat (+10.89 on Prometheus), where programmatic judges encode explicit quality criteria that transfer well out-of-domain.
Program synthesis cost is O(1)—fixed regardless of dataset size—while GPT-4 labeling scales O(n), so the cost gap widens as more pairs are labeled.
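The scaling argument can be checked with a few lines of arithmetic on the Prometheus row of the table. This sketch assumes GPT-4 labeling cost extrapolates linearly from the reported $363.97 per 20,000 pairs and that PAJAMA's $7.21 synthesis cost stays fixed; it ignores the local compute needed to run the programs. The constant and function names are hypothetical.

```python
GPT4_COST_PER_PAIR = 363.97 / 20_000  # reported cost spread over 20,000 pairs
PAJAMA_FIXED_COST = 7.21              # one-time synthesis of 80 programs


def gpt4_cost(n_pairs: int) -> float:
    """Per-sample API labeling: cost grows linearly with the number of pairs."""
    return GPT4_COST_PER_PAIR * n_pairs


def cost_ratio(n_pairs: int) -> float:
    """How many times cheaper the fixed synthesis cost is at a given scale."""
    return gpt4_cost(n_pairs) / PAJAMA_FIXED_COST


# Number of pairs at which per-sample labeling costs as much as synthesis.
break_even_pairs = PAJAMA_FIXED_COST / GPT4_COST_PER_PAIR
```

Under these assumptions the ratio is about 50× at 20,000 pairs and about 250× at 100,000 pairs, and the two costs are equal at roughly 396 pairs.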

Takeaway. Programmatic judges are an attractive labeling source for alignment workflows with limited annotation budgets. Out-of-domain, PAJAMA-trained reward models outperform GPT-4-trained ones on RewardBench, gaining +1.67 points on Prometheus and +4.49 on JudgeLM, while costing 45–50× less. Because synthesis cost is fixed, the cost advantage widens as the dataset grows.

Key Contributions

① A New Evaluation Paradigm

We distill LLM judging logic into a committee of executable Python programs, replacing per-sample model inference with a one-time synthesis and fast local execution.

② The PAJAMA System

A hybrid evaluation system that synthesizes diverse programmatic judges, calibrates and aggregates their verdicts via weak supervision, then routes uncertain cases to an LLM fallback.

③ Advancing the Pareto Frontier

Across four model families, PAJAMA matches strong LLMs at a fraction of the cost and pushes the accuracy–throughput Pareto frontier beyond what any LLM-only system can reach.

④ Cheap, High-Quality Reward Signals

Reward models distilled from programmatic judges outperform those trained on proprietary GPT-4 labels on RewardBench, at 45–50× lower labeling cost.

💡 Still calling ChatGPT / Claude / Gemini for every evaluation? Synthesize your programmatic judges once and run them locally — competitive quality, a fraction of the cost.

Interactive Demo

PAJAMA ships with an Evaluation Studio built in Streamlit. Inspect individual judge programs, edit them, regenerate via chat, and download labeled datasets.

Citation

If you find this work useful and interesting, please cite our paper. If you use the preference datasets, please also cite the original benchmark papers.

Latest Version:

@article{huang2026codifying,
  title   = {Codifying the Judge: Scalable Evaluation via Program Distillation},
  author  = {Huang, Tzu-Heng and Qiu, Shengqi and Sala, Frederic},
  year    = {2026}
}

Preliminary Workshop Version (accepted at ICML 2025 PRAL Workshop: Programmatic Representations for Agent Learning):

@article{huang2025time,
  title   = {Time to Impeach LLM-as-a-Judge: Programs are the Future of Evaluation},
  author  = {Huang, Tzu-Heng and Vishwakarma, Harit and Sala, Frederic},
  journal = {arXiv preprint arXiv:2506.10403},
  year    = {2025}
}

More Works from SprocketLab

Time to Impeach LLM-as-a-Judge: Programs are the Future of Evaluation

The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators
