Netfox
Technology

Xiaomi MiMo V2.5 Pro Leads GDPval-AA Agentic Benchmarks

Galvin Prescott
May 4, 2026 • 5 min
Xiaomi's MiMo V2.5 Pro tops the GDPval-AA agentic benchmark with a score of 1578, outperforming Kimi K2.6 and DeepSeek V4 Pro in real-world work tasks.

Xiaomi’s MiMo V2.5 Pro has secured a leading position on the GDPval-AA benchmark, scoring 1578 and outperforming established peers in agentic real-world work tasks. The model’s release coincides with a planned shift to open weights, potentially placing it at the top of the open-weights intelligence hierarchy.

MiMo V2.5 Pro establishes a new baseline for agentic task performance

The recent performance data from Artificial Analysis places MiMo V2.5 Pro at the top of the GDPval-AA index, a benchmark specifically designed to measure how effectively AI models handle multi-step, agentic workflows. With a score of 1578, the model leads a competitive field that includes DeepSeek V4 Pro (1554), GLM-5.1 (1535), and Moonshot’s Kimi K2.6 (1484).

MiMo V2.5 Pro leads its peer group on agentic tasks. The model scores 1578 on GDPval-AA, which places it in the top tier for real-world work tasks among recent releases.

This advancement represents a focused iterative improvement by Xiaomi. The V2.5 Pro arrived just over a month after the March 19, 2026, release of its predecessor, MiMo V2 Pro. In that window, the developers achieved an 11% increase in instruction following (IFBench) and a 6% gain in reasoning (HLE). However, the gains are not uniform across all metrics; the model saw a marginal regression in critical reasoning (CritPt), falling from 5% to 4%, suggesting that while its ability to follow complex paths has improved, its underlying verification logic may be under strain.

The model’s score of 54 on the broader Intelligence Index ties it with Kimi K2.6. Once the weights are released, a move Xiaomi has publicly signaled as coming "soon," it would technically become the highest-scoring open-weights model currently available, marginally ahead of DeepSeek V4 Pro.

Architectural tradeoffs and the cost of token efficiency

MiMo V2.5 Pro utilizes a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters, though only 42 billion are active during any single inference step. This design allows the model to maintain a 1-million-token context window while remaining on the "Pareto frontier" of the Intelligence vs. Cost index.
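To make the sparsity concrete, the short sketch below works out what share of the model's parameters is active per inference step. It uses only the two parameter counts quoted above; the function name is illustrative.

```python
# Parameter counts as quoted in the article.
TOTAL_PARAMS = 1_000_000_000_000  # 1 trillion total parameters
ACTIVE_PARAMS = 42_000_000_000    # 42 billion active per inference step


def active_fraction(active: int, total: int) -> float:
    """Share of parameters that participate in a single inference step."""
    return active / total


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
print(f"Active share per step: {fraction:.1%}")  # prints "Active share per step: 4.2%"
```

In other words, roughly one parameter in 24 is exercised per step, which is the property that lets a model of this total size be priced like a much smaller one.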

MiMo V2.5 Pro is considerably more token efficient than models in the same Intelligence tier.

The model's API pricing is $1.00 per million input tokens and $3.00 per million output tokens. In practical terms, running the Artificial Analysis Intelligence Index cost approximately $462 on MiMo V2.5 Pro, lower than the $544 for GLM-5.1 and less than half the $948 required for Kimi K2.6.

However, this cost efficiency comes with higher token consumption relative to its own lineage. The V2.5 Pro used approximately 92 million output tokens to complete the Intelligence Index, a 19% increase over the 77 million tokens used by the V2 Pro. While it remains more efficient than Kimi K2.6 (170M tokens) and GLM-5.1 (110M tokens), the trend suggests that Xiaomi is accepting higher token usage in exchange for its improved agentic scores.
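The arithmetic behind these comparisons can be checked with a few lines of Python. The figures are the ones quoted in this article. The output-spend estimate assumes that every output token was billed at the $3.00 list price, which the article does not confirm, so treat that line as a rough bound rather than a reported number.

```python
# Per-run figures for the Intelligence Index, as quoted in the article.
runs = {
    "MiMo V2.5 Pro": {"cost_usd": 462, "output_tokens_m": 92},
    "GLM-5.1": {"cost_usd": 544, "output_tokens_m": 110},
    "Kimi K2.6": {"cost_usd": 948, "output_tokens_m": 170},
}
V2_PRO_OUTPUT_TOKENS_M = 77   # predecessor's output tokens, in millions
OUTPUT_PRICE_PER_M = 3.00     # USD per million output tokens (list price)


def pct_change(new: float, old: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100


mimo = runs["MiMo V2.5 Pro"]

# Token growth versus the predecessor (the article rounds this to 19%).
growth = pct_change(mimo["output_tokens_m"], V2_PRO_OUTPUT_TOKENS_M)
print(f"Output-token growth vs V2 Pro: {growth:.1f}%")

# Assumption: all output tokens billed at list price.
output_spend = mimo["output_tokens_m"] * OUTPUT_PRICE_PER_M
print(f"Estimated output-token spend: ${output_spend:.0f} of ${mimo['cost_usd']}")

# Relative savings against the two peers named in the article.
for name in ("GLM-5.1", "Kimi K2.6"):
    saving = -pct_change(mimo["cost_usd"], runs[name]["cost_usd"])
    print(f"Cheaper than {name} by {saving:.0f}%")
```

Under that pricing assumption, output tokens would account for more than half of the run cost, which is why a 19% rise in output volume matters even at a low per-token price.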

Coding challenge logs reveal a "brittle" agentic strategy

While benchmarks suggest superior agency, real-world programming challenges provide a more nuanced view of how MiMo handles dynamic environments. In the recent Word Gem Puzzle coding contest, the previous MiMo V2 Pro finished in second place, trailing Kimi K2.6 but outperforming frontier models like GPT-5.5.

Detailed move logs from the contest highlight a specific implementation constraint. While Kimi K2.6 utilized an "aggressive greedy loop" to actively slide tiles and solve puzzles, MiMo’s strategy was effectively a "static scanner." The model did not actually move tiles during the challenge; instead, it scanned the initial grid for existing long-form words and claimed them in a single batch.
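The "static scanner" behavior can be illustrated with a minimal sketch. The contest's actual rules, grid format, and word list are not described here, so the grid layout, the dictionary, the minimum word length, and the function name below are all hypothetical. The point is only that the scanner reads rows and columns of the initial state and never moves a tile.

```python
def scan_static_grid(grid: list[str], dictionary: set[str], min_len: int = 4) -> list[str]:
    """Return dictionary words already present in rows or columns.

    No tile is moved: only the initial grid state is inspected.
    """
    # Rows as given, plus columns read top to bottom.
    lines = list(grid) + ["".join(col) for col in zip(*grid)]
    found = set()
    for line in lines:
        for start in range(len(line)):
            for end in range(start + min_len, len(line) + 1):
                candidate = line[start:end]
                if candidate in dictionary:
                    found.add(candidate)
    # Longest words first, then alphabetical, to mimic a single batch claim.
    return sorted(found, key=lambda w: (-len(w), w))


# Hypothetical word list and boards, for illustration only.
dictionary = {"STONE", "PLANT", "BRICK", "TONE", "PLAN"}

intact = ["STONE", "XQZKA", "PLANT", "JVWMH", "BRICK"]
scrambled = ["OSTNE", "XQZKA", "LPATN", "JVWMH", "RBCIK"]

intact_words = scan_static_grid(intact, dictionary)
scrambled_words = scan_static_grid(scrambled, dictionary)
print(intact_words)     # ['BRICK', 'PLANT', 'STONE', 'PLAN', 'TONE']
print(scrambled_words)  # []
```

On the intact board the scanner claims every seed word at once. On the scrambled board, where the same letters sit in the same rows but out of order, it finds nothing, because recovering the words would require moving tiles.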

Breakdown of individual evaluation results for MiMo V2.5 Pro.

This approach proved highly effective on grids where seed words remained intact but failed entirely on larger, more scrambled boards where active tile manipulation was required. For operators, this indicates that MiMo’s high agentic scores may stem from high-speed pattern recognition and execution rather than robust, adaptive problem-solving. It excels when the "path" is visible in the initial state but may struggle in highly fluid environments compared to models with more active "greedy" heuristics.

Factual accuracy regressions and the hallucination gap

Despite the gains in instruction following, MiMo V2.5 Pro shows signs of regression in factual reliability. On the AA-Omniscience Index—a measure of factual accuracy and hallucination—the model’s score dropped to 4, down from the V2 Pro’s score of 5.

The analysis of the model's outputs showed a hallucination rate of 25%, paired with a relatively low accuracy rate of 23%. This suggests that while the model is less likely to "invent" facts than some smaller peers, it frequently fails to retrieve the correct information, leading to a "low-confidence" performance profile in knowledge-heavy tasks.
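The two rates can be related to the index score with a small calculation. This sketch rests on two assumptions that the article does not state: that the hallucination rate is measured over the questions the model did not answer correctly, and that the index is the share of correct answers minus the share of incorrect answers, scaled to 100. The function names are illustrative.

```python
def incorrect_share(accuracy: float, hallucination_rate: float) -> float:
    """Share of all questions answered incorrectly.

    Assumption: the hallucination rate applies only to the questions
    that were not answered correctly.
    """
    return (1 - accuracy) * hallucination_rate


def omniscience_index(accuracy: float, hallucination_rate: float) -> float:
    """Assumed index: correct share minus incorrect share, times 100."""
    return 100 * (accuracy - incorrect_share(accuracy, hallucination_rate))


ACCURACY = 0.23            # from the article
HALLUCINATION_RATE = 0.25  # from the article

wrong = incorrect_share(ACCURACY, HALLUCINATION_RATE)
declined = 1 - ACCURACY - wrong
index = omniscience_index(ACCURACY, HALLUCINATION_RATE)

print(f"Correct:   {ACCURACY:.1%}")
print(f"Incorrect: {wrong:.1%}")
print(f"Declined:  {declined:.1%}")
print(f"Index:     {index:.2f}")
```

Under these assumptions the index works out to about 3.75, which rounds to the reported score of 4. It also implies that the model declined or failed to answer well over half of the questions, which matches the "low-confidence" profile described above.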

Compared to proprietary frontier models, MiMo V2.5 Pro remains a tool primarily suited for structured agentic tasks where the environment is well-defined and the penalty for a lack of dynamic adaptation is low. The upcoming release of its weights will be a critical test for the open-weights community, determining if Xiaomi’s specific "static" agentic optimization can be fine-tuned into a more resilient general-purpose solver.
