AI Benchmarks on US Gov. domains | By: Glenn Parham & Justin W. Lin | Last Updated: May 14, 2025

JointStaffBench

About: A benchmark evaluating LLMs’ knowledge of the US DoD’s Joint Staff, scored by directorate (J1 through J6). How well do LLMs “know” the US military?

JointStaff Bench (3).pdf

Results

| Model Provider | Model Name | Overall ⬇️ | J1: Personnel | J2: Intelligence | J3: Operations | J4: Logistics | J5: Planning | J6: Communications Systems | Efficiency Index | Model Parameter Count (in billions) | Sample Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI | o3 | 89% | 82% | 92% | 90% | 86% | 90% | 92% | 0.6 | unknown | n=4610 |
| OpenAI | gpt-4.1 | 87% | 79% | 86% | 90% | 85% | 88% | 89% | 1.2 | unknown | n=4610 |
| Anthropic | claude-3-7-sonnet-20250219 | 85% | 77% | 87% | 87% | 80% | 84% | 85% | 1.0 | unknown | n=4610 |
| DeepSeek | DeepSeek-V3 | 80% | 72% | 82% | 81% | 75% | 80% | 85% | 1.8 | unknown | n=4610 |
| Anthropic | claude-3-5-haiku-20241022 | 80% | 66% | 84% | 85% | 75% | 81% | 84% | 1.5 | unknown | n=4610 |
| OpenAI | o3-mini | 78% | 66% | 82% | 78% | 77% | 76% | 81% | 1.5 | unknown | n=4610 |
| Meta | Llama-3.1-405B-Instruct-Turbo | 76% | 68% | 78% | 83% | 72% | 73% | 81% | 1.3 | 405 | n=4610 |
| OpenAI | gpt-4o | 75% | 69% | 80% | 78% | 68% | 73% | 77% | 1.1 | unknown | n=4610 |
| Meta | Llama-4-Scout-17B-16E-Instruct | 75% | 61% | 81% | 79% | 71% | 73% | 79% | 2.3 | 17 | n=4610 |
| Google | gemini-2.0-flash | 75% | 67% | 78% | 78% | 72% | 74% | 75% | 2.5 | unknown | n=4610 |
| Meta | Llama-3-70B-Instruct-Turbo | 73% | 64% | 75% | 78% | 66% | 71% | 80% | 1.9 | 70 | n=4610 |
| OpenAI | gpt-3.5-turbo | 66% | 56% | 71% | 71% | 66% | 64% | 67% | 1.8 | unknown | n=4610 |
| Mistral | Mistral-Small-24B-Instruct-2501 | 65% | 53% | 68% | 72% | 60% | 61% | 71% | 1.9 | 24 | n=4610 |
| Google | gemma-2-9b-it | 66% | 52% | 70% | 70% | 62% | 69% | 71% | 2.3 | 9 | n=4610 |
| Meta | Llama-3.1-8B-Instruct-Turbo | 65% | 55% | 70% | 70% | 62% | 64% | 71% | 2.6 | 8 | n=4610 |
| Google | gemma-3-4b-it-61e0f008 | 61% | 48% | 67% | 65% | 64% | 58% | 64% | 3.3 | 4 | n=4610 |
| Meta | Llama-3.2-3B-Instruct-Turbo | 61% | 53% | 67% | 59% | 62% | 57% | 66% | 3.0 | 3 | n=4610 |
| Qwen | Qwen2.5-7B-Instruct-Turbo | 61% | 49% | 66% | 65% | 62% | 58% | 60% | 2.3 | 7 | n=4610 |
| Mistral | Mistral-7B-Instruct-v0.3 | 58% | 44% | 63% | 60% | 59% | 55% | 60% | 2.5 | 7 | n=4610 |
| Microsoft | Phi-4-mini-instruct | 56% | 47% | 58% | 58% | 55% | 54% | 60% | 2.5 | 3 | n=4610 |
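
When comparing rows that differ by only a point or two, it can help to estimate the sampling uncertainty implied by the reported sample size. The sketch below is one rough way to do that, not part of the benchmark's own methodology. It assumes each Overall score is a simple proportion of n=4610 independently graded questions and uses the normal approximation for a 95% interval. The function name `accuracy_margin` and the handful of scores picked from the table are illustrative choices.

```python
import math


def accuracy_margin(accuracy: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a proportion.

    Assumes the n questions are independent and equally weighted, which the
    source page does not confirm.
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


# Reported sample size and a few Overall scores copied from the results table.
N = 4610
overall = {
    "o3": 0.89,
    "gpt-4.1": 0.87,
    "gpt-4o": 0.75,
    "Phi-4-mini-instruct": 0.56,
}

for model, acc in overall.items():
    margin = accuracy_margin(acc, N)
    print(f"{model}: {acc:.0%} +/- {margin:.1%}")
```

Under these assumptions the margin stays below about 1.5 percentage points for any accuracy at n=4610, so one-point gaps between neighboring rows fall within sampling noise. The per-directorate columns likely rest on fewer questions each, so their margins would be wider.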

Insights

FAQs

<aside> 💡

Have questions/concerns? Reach out to the team: [email protected] & [email protected]

Follow us on LinkedIn!

</aside>

GovBench Hackathon