4 Hours. 30 Hackers. 9 Benchmarks.

<aside> ℹ️

We are converting GovBench into a non-profit 501(c)(6). Interested in this work or in donating? Follow us on LinkedIn and subscribe to our email list.

</aside>

On June 14, 2025, GovBench brought together 30 hackers and government experts in Washington, D.C. to answer one question in many different ways: how well do LLMs perform across government domains? Teams spanned nearly every cabinet-level department in the federal government.

tl;dr: Building government-specific benchmarks at scale, and at reasonable cost, requires combining AI-generated synthetic data with human domain expertise and review.

Why This Work Matters

As governments worldwide prepare to spend billions on AI chatbots, we first need a clear picture of how these models behave on real government tasks. GovBench fills that gap—just as code‑generation benchmarks did for software—by measuring LLM performance across public‑sector domains. Our work focuses on governments as users of the technology, not on pushing the frontier of general AI capabilities.

Our Methodology

Creating domain-specific evals and benchmarks is an evolving area of research. At the time of the hackathon, our methodology was as follows:

[Figure: Five-step pipeline from source docs to graded results.]

Some key elements of this approach: starting from authoritative government source documents, generating candidate benchmark items with LLMs, and keeping human domain experts in the loop to review items and grade results.

For more on our methodology, check out our sample open-source benchmark, ConstitutionBench.
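To make the shape of such a pipeline concrete, here is a minimal, hypothetical sketch in Python. It is not the GovBench implementation: the data model, the stubbed LLM call, and the exact-match grading are all simplifying assumptions.

```python
"""Illustrative doc-to-graded-results benchmark pipeline (sketch only):
source documents -> synthetic items -> human review -> model answers -> graded results.
Names, steps, and grading below are assumptions, not the GovBench implementation.
"""

from dataclasses import dataclass


@dataclass
class BenchmarkItem:
    source_doc: str         # excerpt of the government source document
    question: str           # synthetic question drafted from the excerpt
    reference_answer: str   # expected answer, confirmed by a domain expert
    approved: bool = False  # set True once a human reviewer signs off


def generate_items(documents: list[str]) -> list[BenchmarkItem]:
    """Draft synthetic Q/A pairs from source documents.

    A real pipeline would call an LLM here; this stub fabricates
    placeholder items so the sketch runs end to end.
    """
    return [
        BenchmarkItem(
            source_doc=doc,
            question=f"What does the following passage require? {doc[:80]}",
            reference_answer="<expert-confirmed answer>",
        )
        for doc in documents
    ]


def human_review(items: list[BenchmarkItem]) -> list[BenchmarkItem]:
    """Domain experts approve, edit, or discard each draft item."""
    for item in items:
        item.approved = True  # in practice, a reviewer decides this per item
    return [item for item in items if item.approved]


def collect_model_answers(items: list[BenchmarkItem], model) -> list[str]:
    """Run the model under test on every approved question."""
    return [model(item.question) for item in items]


def grade(items: list[BenchmarkItem], answers: list[str]) -> float:
    """Score answers against references (exact match, for simplicity)."""
    correct = sum(
        ans.strip().lower() == item.reference_answer.strip().lower()
        for item, ans in zip(items, answers)
    )
    return correct / len(items) if items else 0.0


if __name__ == "__main__":
    docs = ["Article I, Section 8 enumerates the powers of Congress ..."]
    items = human_review(generate_items(docs))
    answers = collect_model_answers(items, model=lambda q: "<model answer>")
    print(f"accuracy: {grade(items, answers):.2%}")
```

In practice, the generation and grading steps would call real models and use a more nuanced rubric than exact match, but the overall doc-to-graded-results flow stays the same.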

About the Hackers

Participants self-reported details about their backgrounds.