ExoBrain
agentic AI, benchmarks and evals, inference economics, model releases

Top agentic tool users

Kimi K2 Thinking demonstrates superior agentic performance and cost-efficiency compared to leading proprietary models on complex dual-control benchmarks.

ExoBrain


This chart shows the new elite of tool-using models. Kimi K2 Thinking’s 93% score on τ²-Bench is impressive and puts it ahead of GPT-5. The benchmark tests dual-control scenarios in which an AI agent must guide a human through complex technical support tasks while maintaining coherence across hundreds of interactions.

The overall Artificial Analysis Intelligence Index v3.0 combines 10 evaluations: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2025, IFBench, AA-LCR, Terminal-Bench Hard, and τ²-Bench. On this index K2 Thinking comes in just behind GPT-5, and ahead of Grok 4 and Claude 4.5 Sonnet. Interestingly, running all of these benchmarks cost $379 for K2 Thinking, versus $913 for GPT-5 and $1,888 for Grok 4. See the full breakdown and track agentic performance, cost and speed on the excellent Artificial Analysis website.
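To put the quoted benchmark costs side by side, here is a minimal sketch that expresses each one as a multiple of the K2 Thinking figure. It assumes the three dollar amounts are directly comparable (same evaluation suite, same pricing basis). The names `index_run_cost_usd` and `cost_ratios` are illustrative only and do not come from Artificial Analysis.

```python
# Cost in USD to run the full Intelligence Index, as quoted in the article.
index_run_cost_usd = {
    "Kimi K2 Thinking": 379,
    "GPT-5": 913,
    "Grok 4": 1888,
}


def cost_ratios(costs, baseline):
    """Return each model's cost as a multiple of the baseline model's cost."""
    base = costs[baseline]
    return {model: cost / base for model, cost in costs.items()}


ratios = cost_ratios(index_run_cost_usd, "Kimi K2 Thinking")

# Print cheapest first.
for model, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{model}: {ratio:.1f}x the cost of Kimi K2 Thinking")
```

On these figures, GPT-5 comes out at roughly 2.4 times and Grok 4 at roughly 5 times the cost of K2 Thinking for the same set of evaluations.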
