AI Benchmarks on US Government Domains | By: Glenn Parham & Justin W. Lin | Last Updated: May 14, 2025
About: A benchmark evaluating LLMs on the US DoD’s Joint Staff. How well do LLMs “know” the US Military?
The benchmark includes 4,610 questions (3,910 multiple-choice + 700 free-form) across the Joint Staff’s directorates.
The benchmark was synthetically generated from the **Joint Staff’s Joint Publications:** https://huggingface.co/datasets/GovBench/JCSJointPublications/tree/main
All evaluations were run with the UK AI Security Institute’s open-source Inspect framework: https://inspect.aisi.org.uk/
View sample questions here: https://huggingface.co/datasets/GovBench/JCSJointPublications
Efficiency Index:

$$
\text{Efficiency Index} = \log\!\left( \frac{\text{Overall Performance}}{\dfrac{\text{Input Token Cost} + \text{Output Token Cost}}{2}} \right)
$$
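A minimal sketch of how this index could be computed. Assumptions not stated in the document: the log is base 10 and overall performance enters as a percentage value (e.g. 87 for 87%), which appears consistent with the indices in the table below; the token prices in the example are purely illustrative, not taken from this benchmark.

```python
import math

def efficiency_index(overall_pct: float, input_cost: float, output_cost: float) -> float:
    """Efficiency Index = log10(overall score / mean of input and output token costs).

    overall_pct: overall benchmark score as a percentage (e.g. 87 for 87%).
    input_cost / output_cost: token prices, e.g. USD per million tokens.
    """
    avg_cost = (input_cost + output_cost) / 2
    return math.log10(overall_pct / avg_cost)

# Hypothetical example: a model scoring 87% priced at $2/M input and $8/M output tokens.
print(round(efficiency_index(87, 2.0, 8.0), 1))  # → 1.2
```

Because costs appear in the denominator inside a log, the index rewards cheap models: halving the average token price adds about 0.3 to the index regardless of score.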
Model Provider | Model Name | Overall ⬇️ | J1: Personnel | J2: Intelligence | J3: Operations | J4: Logistics | J5: Planning | J6: Communications Systems | Efficiency Index | Model Parameter Count (in billions) | Sample Size |
---|---|---|---|---|---|---|---|---|---|---|---|
OpenAI | o3 | 89% | 82% | 92% | 90% | 86% | 90% | 92% | 0.6 | unknown | n=4610 |
OpenAI | gpt-4.1 | 87% | 79% | 86% | 90% | 85% | 88% | 89% | 1.2 | unknown | n=4610 |
Anthropic | claude-3-7-sonnet-20250219 | 85% | 77% | 87% | 87% | 80% | 84% | 85% | 1.0 | unknown | n=4610 |
DeepSeek | DeepSeek-V3 | 80% | 72% | 82% | 81% | 75% | 80% | 85% | 1.8 | unknown | n=4610 |
Anthropic | claude-3-5-haiku-20241022 | 80% | 66% | 84% | 85% | 75% | 81% | 84% | 1.5 | unknown | n=4610 |
OpenAI | o3-mini | 78% | 66% | 82% | 78% | 77% | 76% | 81% | 1.5 | unknown | n=4610 |
Meta | Llama-3.1-405B-Instruct-Turbo | 76% | 68% | 78% | 83% | 72% | 73% | 81% | 1.3 | 405 | n=4610 |
OpenAI | gpt-4o | 75% | 69% | 80% | 78% | 68% | 73% | 77% | 1.1 | unknown | n=4610 |
Meta | Llama-4-Scout-17B-16E-Instruct | 75% | 61% | 81% | 79% | 71% | 73% | 79% | 2.3 | 17 | n=4610 |
Google | gemini-2.0-flash | 75% | 67% | 78% | 78% | 72% | 74% | 75% | 2.5 | unknown | n=4610 |
Meta | Llama-3-70B-Instruct-Turbo | 73% | 64% | 75% | 78% | 66% | 71% | 80% | 1.9 | 70 | n=4610 |
OpenAI | gpt-3.5-turbo | 66% | 56% | 71% | 71% | 66% | 64% | 67% | 1.8 | unknown | n=4610 |
Google | gemma-2-9b-it | 66% | 52% | 70% | 70% | 62% | 69% | 71% | 2.3 | 9 | n=4610 |
Mistral | Mistral-Small-24B-Instruct-2501 | 65% | 53% | 68% | 72% | 60% | 61% | 71% | 1.9 | 24 | n=4610 |
Meta | Llama-3.1-8B-Instruct-Turbo | 65% | 55% | 70% | 70% | 62% | 64% | 71% | 2.6 | 8 | n=4610 |
Google | gemma-3-4b-it-61e0f008 | 61% | 48% | 67% | 65% | 64% | 58% | 64% | 3.3 | 4 | n=4610 |
Meta | Llama-3.2-3B-Instruct-Turbo | 61% | 53% | 67% | 59% | 62% | 57% | 66% | 3.0 | 3 | n=4610 |
Qwen | Qwen2.5-7B-Instruct-Turbo | 61% | 49% | 66% | 65% | 62% | 58% | 60% | 2.3 | 7 | n=4610 |
Mistral | Mistral-7B-Instruct-v0.3 | 58% | 44% | 63% | 60% | 59% | 55% | 60% | 2.5 | 7 | n=4610 |
Microsoft | Phi-4-mini-instruct | 56% | 47% | 58% | 58% | 55% | 54% | 60% | 2.5 | 3 | n=4610 |
<aside> 💡
Have questions/concerns? Reach out to the team: [email protected] & [email protected]
Follow us on LinkedIn!
</aside>