GPQA benchmark teardown and scoring checklist
A practical teardown of GPQA and GPQA-Diamond. Use the checklist to run, score, and report evaluations safely.

Short answer: GPQA (Graduate-Level Google-Proof Q&A) is a small, expert-written multiple-choice benchmark meant to be hard even with web access. It is widely used to test deep STEM reasoning across biology, chemistry, and physics. It should be reported with strict, reproducible scoring rules because small methodology differences can move scores by several points.
- What it is: 448 expert-authored questions in the main set, plus an extended set and a harder 198-question GPQA-Diamond subset commonly used on leaderboards.
- Why it matters: PhD-level domain experts score around 65% (roughly 74% when clear mistakes identified in hindsight are discounted), while skilled non-experts with unrestricted web access score around 34%.
- What to report: accuracy (overall + per-domain), prompt/decoding settings, refusal/abstention handling, and variance/CI.
- Core scoring rule: deterministic multiple-choice selection with a strict parser, with explicit handling for malformed outputs.
- Leakage constraint: do not publish GPQA question text, answer options, or answer keys; describe methodology and aggregate results only.
- Trade-off: GPQA is high-signal for hard science reasoning, but small and narrow, so scores have higher variance than broad benchmarks.
What is GPQA (and what does it actually measure)?
GPQA stands for Graduate-Level Google-Proof Q&A benchmark. It is a dataset and evaluation setup designed to test whether a model can answer very difficult, expert-authored STEM questions that resist shallow pattern matching and quick search. Its value comes from how often it forces multi-step reasoning, not just memorization.
In practice, GPQA is most useful as a stress test for “scientific QA” claims. If a model is marketed as capable of graduate-level science assistance, GPQA is one of the first benchmarks used to quantify that claim.
GPQA in one line
GPQA measures: closed-ended scientific reasoning under high difficulty, limited shortcuts, and multi-domain STEM coverage.
Why is GPQA considered “Google-proof”?
“Google-proof” is a claim about shortcut resistance. The benchmark is designed so that looking up a phrase, formula, or keyword is unlikely to produce the answer directly. Reported validation results show non-experts performing near chance even with extended time and open-web access, while domain experts perform far better.
This does not mean retrieval is useless in general. It means GPQA is a poor fit for “retrieve a snippet, copy the answer” workflows, and tends to reward background knowledge and reasoning.
What “Google-proof” does (and does not) imply
- Does: reduces direct lookup wins; increases need for multi-hop reasoning; amplifies calibration and uncertainty issues.
- Does not: guarantee zero contamination; guarantee every question is unambiguous; guarantee broad coverage of all scientific fields.
GPQA main vs extended vs diamond: which one are people scoring?
Most confusion around GPQA results comes from mixing splits. In practice, you will see three variants discussed. Always name the split and dataset revision when publishing results.
- Main set: 448 questions across biology, chemistry, and physics.
- Extended set: 546 questions in the original release, used in some tooling and analysis contexts; counts may vary by repository/version.
- GPQA-Diamond: a curated 198-question subset widely used for standardized leaderboards and frontier comparisons.
Practical guidance: use GPQA-Diamond for leaderboard comparability. Use the main set when you want per-domain breakdowns and slightly more stability, but expect wider error bars than large benchmarks.
How GPQA was constructed and validated (why people trust it more than “random hard questions”)
GPQA’s reputation comes from process: questions were written by domain experts and validated through multiple stages intended to ensure difficulty and reduce ambiguity. Some questions also include explanations used for adjudication and later error detection, even if typical evaluations only use the question, options, and gold label.
This setup makes GPQA a common reference point in oversight research. It creates conditions where non-experts struggle to evaluate correctness, which mirrors real-world evaluation and auditing challenges.
But: acknowledge the limits
No benchmark is perfect. GPQA is small, and small benchmarks are sensitive to a handful of ambiguous items, prompt/decoding changes, and contamination over time. Treat single-number comparisons as directional unless methodology is closely matched.
Dataset access and leakage policy (what not to publish)
GPQA is available in the broader ecosystem via dataset hubs and repositories, and some variants may be access-controlled to reduce casual scraping. Regardless of access method, maintainers request that people do not reveal examples (question text, answer options, or answer keys) online to reduce leakage into training corpora.
Leakage-safe writing: publish methodology and aggregate results, not raw items. In internal processes, reference item IDs or hashes instead of pasting full questions into systems that may become public.
Leakage-safe writing rules (use these in internal docs too)
- OK to share: aggregate scores, per-domain aggregates, methodology, prompt templates (without question text), and evaluation code that loads data but does not print it.
- Do not share: raw questions, answer choices, gold answers, or screenshots of dataset rows.
- In bug reports: reference a question ID/hash internally; avoid pasting the item into tickets that sync to public systems.
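The ID/hash rule above needs nothing beyond the standard library. A minimal sketch, where the `item_ref` helper and the `gpqa-` prefix are illustrative, not an official convention:

```python
import hashlib

def item_ref(question_text: str, revision: str) -> str:
    """Return a short, stable reference for a benchmark item without
    exposing its content. Hashing the raw text together with the dataset
    revision makes the reference unique per version while revealing
    nothing if it leaks into a public tracker."""
    digest = hashlib.sha256(f"{revision}\n{question_text}".encode("utf-8")).hexdigest()
    return f"gpqa-{digest[:12]}"

ref = item_ref("example placeholder text, not a real GPQA item", "rev-abc123")
print(ref)  # "gpqa-" followed by 12 hex characters
```

Store only the reference in tickets and logs; keep the reverse mapping in an access-controlled system.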
GPQA scoring methodology: the core rules you must lock down
GPQA looks simple (multiple-choice accuracy), but reproducibility depends on a small set of decisions. Lock these down and report them consistently to make comparisons meaningful. For small splits like GPQA-Diamond, uncertainty reporting is not optional if you want defensible claims.
Scoring checklist (tool-agnostic)
- Split: main vs extended vs GPQA-Diamond, including exact dataset revision and any filtering.
- Prompt format: identical instructions, option formatting, and answer-format constraints across runs.
- Decoding: specify `temperature`, `top_p`, `max_tokens`, and whether you use majority vote or single-sample.
- Randomness control: fixed seeds where applicable; deterministic decoding preferred for benchmarking.
- Answer extraction: strict parser (e.g., single letter A/B/C/D) with explicit handling for prose, multiple letters, or no answer.
- Refusals/abstentions: define how refusals are counted (typically incorrect) and track refusal rate separately.
- Scoring: exact-match to the gold label; no judge model needed if extraction is reliable.
- Metrics beyond accuracy: per-domain accuracy; calibration if you collect confidence; token usage and latency if comparing deployment cost.
- Variance: confidence intervals (e.g., Wilson) and/or bootstrap, especially on n=198.
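The variance bullet is easy to act on. A minimal Wilson 95% interval sketch (stdlib only) that shows why n=198 needs error bars:

```python
import math

def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# On GPQA-Diamond (n=198), 119/198 ≈ 60.1% accuracy:
lo, hi = wilson_interval(119, 198)
print(f"{lo:.3f}-{hi:.3f}")
```

The interval spans roughly 0.53 to 0.67, about ±7 points, which is why single-run Diamond deltas of a few points are rarely conclusive on their own.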
Chance baseline (interpretation sanity check)
GPQA is commonly formatted as four-option multiple choice, so chance accuracy is 25%. Scores near chance can indicate parsing failures or systematic refusals rather than true capability. Extremely high results should trigger a contamination and methodology audit.
Suggested prompting and decoding standards (so your score means what you think it means)
There is no single official prompt. Standardization is what separates a useful benchmark number from a prompt-specific artifact. Keep output constraints simple so parsing is reliable.
A minimal, leakage-safe prompt template
- Request a brief rationale (optional for internal use) and a final answer letter.
- Constrain the final output to one of `A`, `B`, `C`, or `D`.
- Prefer deterministic decoding (`temperature=0`) for headline scores; if you use self-consistency, report it explicitly.
One explicit trade-off: CoT transparency vs comparability
Allowing chain-of-thought can increase accuracy and make debugging easier. It can also reduce comparability across providers and create extra text that breaks parsers. A common compromise is to keep reasoning internal while enforcing a strict, machine-parsable final answer field.
How to run a GPQA evaluation (reference implementations)
Teams typically pick a standardized batch runner or a prompt/program optimization framework. The tooling matters less than logging the same metadata and enforcing the same scoring rules. Avoid emitting or storing raw dataset items in logs.
Path A: LM Evaluation Harness (standardized batch scoring)
GPQA fits naturally into a batch eval suite alongside other benchmarks. The core implementation tasks are a dataset adapter, a prompt spec, and a robust answer extractor. Keep decoding deterministic for comparability.
```python
# Pseudocode: GPQA run skeleton (do not print dataset items)
# 1) Load GPQA split (main or diamond) from an approved source
# 2) For each question: build prompt with A/B/C/D choices
# 3) Call model with deterministic decoding (temperature=0)
# 4) Parse final answer letter; if invalid, mark incorrect and log reason
# 5) Compute overall + per-domain accuracy and export a run card (JSON)
run = {
    "benchmark": "GPQA-Diamond",
    "dataset_revision": "<commit-or-version>",
    "model": "<provider/model-version>",
    "decoding": {"temperature": 0, "top_p": 1.0, "max_tokens": 16},
    "metrics": {"accuracy": None, "bio": None, "chem": None, "phys": None},
    "parsing_fail_rate": None,
    "refusal_rate": None,
}
```
Path B: DSPy (prompt/program optimization, then held-out evaluation)
If you use DSPy to optimize a pipeline, treat GPQA as test-only when possible. If tuning is unavoidable, tune on a separate split and clearly label what was optimized. Evaluate once on GPQA-Diamond with frozen settings.
```python
# Pseudocode: DSPy-style flow (leakage-safe)
# 1) Load a tuning set (not the target test split)
# 2) Compile/optimize a ChainOfThought or structured signature
# 3) Freeze the program, then evaluate once on GPQA-Diamond
# 4) Log settings, tuning data provenance, and evaluation split
signature = "question, choices -> answer"
optimizer = "MIPROv2"  # example
# compile(...)
# evaluate(...)
# report accuracy + per-domain + parsing failure reasons
```
Interpreting GPQA scores: what a “good” number means (and what it doesn’t)
GPQA is often cited because it shows a meaningful gap between experts, non-experts, and many AI baselines. Because it is small, score differences can be partly noise unless you control methodology tightly. Use uncertainty intervals when communicating results.
Use score bands cautiously
- ~25%: chance-level; often indicates parsing or refusal issues, or genuine inability.
- ~30–40%: above chance but still far from expert performance; historically common for strong baselines in some reports.
- ~50–70%: approaching expert-reported ranges; requires stronger contamination and methodology scrutiny.
Comparing leaderboards: the three comparability traps
- Split mismatch: main vs diamond vs extended are not comparable.
- Prompt/decoder mismatch: temperature, few-shot vs zero-shot, and self-consistency can move results.
- Contamination controls: policies vary and are often under-disclosed.
When comparing across sources, prioritize those that publish full run metadata: model version, prompt, decoding, scoring code, and dataset revision. Treat unexplained spreads as at least partly methodology noise.
What to measure beyond accuracy (because accuracy alone hides failure modes)
1) Per-domain breakdown
Report biology vs chemistry vs physics. A model can look acceptable overall while being weak in one domain, which matters for domain-specific products. Per-domain reporting also helps detect dataset or prompt biases.
2) Calibration and confidence (optional)
Calibration helps distinguish “often right but overconfident” from “often wrong and aware of it.” GPQA does not require confidence by default, but you can add it and report measures like ECE or reliability curves. If you collect confidence, describe how it is elicited and parsed.
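If you do collect confidence, a binned expected calibration error (ECE) is straightforward. A minimal sketch, assuming confidences are already parsed into probabilities in [0, 1]:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: per-bin |accuracy - mean confidence|, weighted by bin
    size. `confidences` are probabilities in [0, 1]; `correct` are 0/1
    outcomes. A minimal sketch, not a library implementation."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece

# Well-calibrated toy case: 80% confidence, 4 of 5 correct -> ECE ≈ 0
print(expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0]))
```

With only 198 items, report ECE alongside a reliability curve rather than as a lone number; sparse bins are noisy.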
3) Refusal/abstention rate
Track how often the model declines to answer. In a multiple-choice benchmark it typically counts as incorrect, but it is still valuable telemetry. Report refusal rate separately from accuracy.
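A sketch of scoring that counts refusals as incorrect while reporting them separately (the record fields `pred` and `gold` are illustrative, with `pred=None` marking a refusal or parse failure):

```python
def score_run(records):
    """records: list of dicts with 'pred' (letter, or None for a refusal /
    parse failure) and 'gold'. Refusals count as incorrect for accuracy
    but are surfaced as separate telemetry."""
    n = len(records)
    correct = sum(1 for r in records if r["pred"] == r["gold"])
    refusals = sum(1 for r in records if r["pred"] is None)
    return {"accuracy": correct / n, "refusal_rate": refusals / n}

result = score_run([
    {"pred": "A", "gold": "A"},
    {"pred": None, "gold": "B"},   # refusal: incorrect, but tracked
    {"pred": "C", "gold": "D"},
])
print(result)
```

Keeping the two rates in one run card makes it obvious when an apparent capability drop is really a refusal-behavior change.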
4) Parsing failure taxonomy
- Multiple letters selected
- No letter selected
- Letter outside A–D
- Answer embedded in prose with ambiguous extraction
Logging these prevents “benchmark regressions” that are actually formatting regressions. It also makes it easier to harden your parser without changing the model. Keep the taxonomy stable across runs.
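The taxonomy above can be implemented as a small classifier run on every output that fails strict extraction. A sketch with illustrative bucket names, assuming four options A–D:

```python
import re

def classify_parse_failure(output: str) -> str:
    """Bucket a model output into a parsing-failure taxonomy.
    Assumes four options A-D; labels are illustrative, not a standard."""
    letters = re.findall(r"\b([A-Z])\b", output)
    in_range = [l for l in letters if l in "ABCD"]
    if len(set(in_range)) > 1:
        return "multiple_letters"
    if not letters:
        return "no_letter"
    if not in_range:
        return "letter_out_of_range"
    if len(output.split()) > 1:
        return "embedded_in_prose"
    return "ok"

print(classify_parse_failure("B or C"))  # multiple_letters
print(classify_parse_failure("E"))       # letter_out_of_range
```

Counting these buckets per run is what lets you harden the parser (or the prompt) without touching the model.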
GPQA vs MMLU-Pro vs HLE: when to use which
GPQA is often discussed alongside other evaluation staples. Use it when you need a targeted stress test for deep STEM reasoning under shortcut-resistant conditions. For broad, general assistant capability, pair it with broader benchmarks.
| Benchmark | Best for | Strength | Limitation |
|---|---|---|---|
| GPQA / GPQA-Diamond | Deep STEM reasoning, oversight-style difficulty | Expert-written, “Google-proof” design, strong expert-gap signal | Small and narrow; higher variance; leakage-sensitive |
| MMLU-Pro | Broad knowledge + reasoning across many subjects | Coverage and comparability; strong general-purpose signal | Less targeted at extreme expert-level STEM difficulty than GPQA |
| HLE | Frontier-level academic expert evaluation (often broader, sometimes multimodal) | Hard and diverse; can emphasize precision and calibration | More evaluation complexity; extraction and judging details can dominate |
Rule of thumb: use GPQA-Diamond for hard science reasoning that resists lookup. Use MMLU-Pro for broad coverage. Add HLE when you need frontier difficulty across many domains and can support more rigorous evaluation.
GPQA reporting template (copy/paste scorecard)
Use a standard scorecard for internal and external reporting. Consistent run cards reduce confusion and prevent “benchmark theater.” Include dataset revision, model version, prompt spec, and scoring rules.
Run card: required fields
| Field | Example | Why it matters |
|---|---|---|
| Benchmark + split | GPQA-Diamond (198) | Split mismatch is the most common comparability failure |
| Dataset source + revision | Dataset hub + commit hash | Locks the exact items evaluated |
| Model identifier | provider/model + version date | APIs can change; version pinning matters |
| Prompt spec | zero-shot, strict A–D output | Prompting can move scores materially |
| Decoding | temp=0, top_p=1 | Stochastic decoding adds variance |
| Scoring/parser | regex: final answer letter | Prevents silent parsing errors |
| Accuracy (overall) | 0.XX | Headline metric |
| Accuracy (bio/chem/phys) | 0.XX / 0.XX / 0.XX | Detects domain imbalance |
| Parsing fail rate | X% | Often the true cause of regressions |
| Refusal rate | X% | Separates safety behavior from capability |
| CI / uncertainty | Wilson 95% CI | Small-N needs error bars |
Scoring checklist (printable)
- Split locked: main/extended/diamond named and versioned
- No few-shot drift: same number of shots across models (preferably zero-shot unless justified)
- Decoder locked: temp/top_p/max_tokens identical
- Parser tested: unit tests for common malformed outputs
- Refusals defined: counted as incorrect and separately tracked
- Leakage-safe logs: never store question text in artifacts
- Variance reported: CI/bootstraps included
Common failure modes I see in GPQA writeups
“We got X% on GPQA” (but it’s unclear which GPQA)
Always specify GPQA main vs GPQA-Diamond. Without the split and revision, the number is not actionable. If you filtered items, say how and why.
Answer formatting inflates or deflates the score
Models often output variants like “The answer is (C)” or just “C.” Naive parsing can fail on these and silently depress scores. Use a robust extractor and report the parsing-failure rate alongside accuracy.
Contamination hand-waving
As GPQA becomes more discussed, contamination risk rises. State your policy on training and fine-tuning data, and how you reduce benchmark leakage. If you cannot guarantee anything, say so and interpret results cautiously.
Who this is for: ML engineers and eval researchers who need a reproducible GPQA / GPQA-Diamond score for model releases; AI safety and oversight teams using expert-gap-style benchmarks; PMs comparing LLM providers for scientific QA. If you only need a broad general snapshot, start with MMLU-Pro and add GPQA when you need a harder, science-specific stress test.

