4 Hours. 30 Hackers. 9 Benchmarks.

<aside> ℹ️

We are converting GovBench into a non-profit 501(c)(6). Interested in this work or in donating? Follow us on LinkedIn and subscribe to our email list.

</aside>

On June 14, 2025, GovBench brought together 30 hackers and government experts in Washington, D.C. to answer one question in many different ways: how well do LLMs perform across government domains? Teams spanned nearly every cabinet-level department in the federal government.

tl;dr: Building government-specific benchmarks at scale, and at reasonable cost, requires combining AI-generated synthetic data with human domain expertise and review.

Why This Work Matters

As governments worldwide prepare to spend billions on AI chatbots, we first need a clear picture of how these models behave on real government tasks. GovBench fills that gap—just as code‑generation benchmarks did for software—by measuring LLM performance across public‑sector domains. Our work focuses on governments as users of the technology, not on pushing the frontier of general AI capabilities.

Our Methodology

Creating domain-specific evals and benchmarks is an evolving area of research. At the time of the hackathon, our methodology was as follows:

[Figure: Five-step pipeline from source docs to graded results.]

Some key elements of this approach: starting from authoritative government source documents, generating candidate benchmark items with LLMs, and keeping human domain experts in the loop to review items and grade results.

For more on our methodology, check out our sample open-source benchmark, ConstitutionBench.
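To make the shape of such a pipeline concrete, here is a minimal, hypothetical sketch in Python. It is not the GovBench implementation: the data model, the stubbed LLM call, and the exact-match grading are all simplifying assumptions.

```python
"""Illustrative doc-to-graded-results benchmark pipeline (sketch only):
source documents -> synthetic items -> human review -> model answers -> graded results.
Names, steps, and grading below are assumptions, not the GovBench implementation.
"""

from dataclasses import dataclass


@dataclass
class BenchmarkItem:
    source_doc: str         # excerpt of the government source document
    question: str           # synthetic question drafted from the excerpt
    reference_answer: str   # expected answer, confirmed by a domain expert
    approved: bool = False  # set True once a human reviewer signs off


def generate_items(documents: list[str]) -> list[BenchmarkItem]:
    """Draft synthetic Q/A pairs from source documents.

    A real pipeline would call an LLM here; this stub fabricates
    placeholder items so the sketch runs end to end.
    """
    return [
        BenchmarkItem(
            source_doc=doc,
            question=f"What does the following passage require? {doc[:80]}",
            reference_answer="<expert-confirmed answer>",
        )
        for doc in documents
    ]


def human_review(items: list[BenchmarkItem]) -> list[BenchmarkItem]:
    """Domain experts approve, edit, or discard each draft item."""
    for item in items:
        item.approved = True  # in practice, a reviewer decides this per item
    return [item for item in items if item.approved]


def collect_model_answers(items: list[BenchmarkItem], model) -> list[str]:
    """Run the model under test on every approved question."""
    return [model(item.question) for item in items]


def grade(items: list[BenchmarkItem], answers: list[str]) -> float:
    """Score answers against references (exact match, for simplicity)."""
    correct = sum(
        ans.strip().lower() == item.reference_answer.strip().lower()
        for item, ans in zip(items, answers)
    )
    return correct / len(items) if items else 0.0


if __name__ == "__main__":
    docs = ["Article I, Section 8 enumerates the powers of Congress ..."]
    items = human_review(generate_items(docs))
    answers = collect_model_answers(items, model=lambda q: "<model answer>")
    print(f"accuracy: {grade(items, answers):.2%}")
```

In practice, the generation and grading steps would call real models and use a more nuanced rubric than exact match, but the overall doc-to-graded-results flow stays the same.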

About the Hackers

Participants self-reported details about their backgrounds.