Lunar New Year 2026: China's Open-Source LLMs Explode — How to Choose Between Kimi, Qwen, GLM, and MiniMax
Disclaimer: This post is machine-translated from the original Chinese article: https://ai-coding.wiselychen.com/china-ai-models-2026-lunar-new-year-comparison/
The original work is written in Chinese; the English version is translated by AI.
Author: Wisely Chen · Date: February 2026 · Series: AI Agent Complete Guide / IT Architecture Series
Keywords: open-source model comparison, Agent selection, cost optimization, MoE architecture, Chinese models, benchmark comparison
Why I’m writing this
This Lunar New Year is a watershed moment for China’s AI industry.
After Kimi K2.5 dropped in late January, every major player went all-in in February: Alibaba’s Qwen3.5 (cost -60%), Zhipu’s GLM-5 (Intelligence Index 50+), MiniMax M2.5 (fastest), ByteDance’s Seedance 2.0 (video generation praised by Elon Musk).
In just one month, it felt like everyone collectively hit the “accelerate” button.
But then comes the question: which one should you use?
Kimi K2.5’s Agent Swarm for complex automation? Qwen3.5 to slash your costs? GLM-5 or MiniMax for coding?
So instead of going deep only on Kimi K2.5, this post pulls all the Lunar New Year headliners onto the same stage for a head-to-head comparison. The core takeaway:
There is no “best model,” only “the model that fits your scenario.” The price-performance ratio of open-source models is now approaching top-tier international closed-source models — but only if you know how to pick.
The four contenders in 30 seconds
| Dimension | Kimi K2.5 | Qwen3.5 | GLM-5 | MiniMax M2.5 |
|---|---|---|---|---|
| Total Params | 1T (MoE) | 397B (MoE) | 744B (MoE) | 230B |
| Active Params | 32B (3.2%) | 170B | 400B | - |
| Context | 256K | 256K | 256K | 256K |
| Input Cost | $0.60/M | ~$0.20/M | ~$0.30/M | ~$0.15/M |
| Key Strength | Agent Swarm + Multimodal | 60% cheaper | Reasoning | Speed + Tool Calling |
| License | Modified MIT | MIT | MIT | MIT |
| Vendor | Moonshot AI | Alibaba | Zhipu AI | MiniMax |
One-liner version:
- Kimi K2.5: 100 sub-agents working in parallel, vision and video understanding, strongest on Agent tasks
- Qwen3.5: Cheapest, 60% lower cost than previous gen, first pick for large-scale enterprise deployments
- GLM-5: Strongest reasoning, Intelligence Index 50+, on par with Claude Opus 4.5
- MiniMax M2.5: Fastest, most accurate Tool Calling, ideal for high-throughput scenarios
Kimi K2.5: The Agent Swarm revolution
I’ve already written a detailed technical assessment of Kimi K2.5 in another post. Here I’ll only cover the key differences when compared against the other three.
The signature weapon: Agent Swarm
The most fundamental difference between Kimi K2.5 and the other three models is its native group-collaboration capability.
A traditional Agent executing 50 one-minute sub-tasks linearly takes 50 minutes. Kimi’s Orchestrator can break a task into a DAG (directed acyclic graph) and dispatch up to 100 specialized sub-agents in parallel, with up to 1,500 tool calls per task. The same 50 sub-tasks finish in about 11 minutes.
The BrowseComp numbers make it concrete: standard mode 60.6%, Swarm mode shoots up to 78.4%.
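To make the orchestration pattern concrete, here is a minimal sketch of DAG-style fan-out using Python’s asyncio. The task graph and `run_subagent` below are hypothetical stand-ins, not Kimi’s actual API: independent nodes run concurrently, and dependent nodes wait on their parents.

```python
import asyncio

# Hypothetical task graph: each node lists the nodes it depends on.
DAG = {
    "crawl_news":    [],
    "crawl_filings": [],
    "extract_facts": ["crawl_news", "crawl_filings"],
    "write_report":  ["extract_facts"],
}

async def run_subagent(name: str) -> str:
    # Stand-in for dispatching one specialized sub-agent.
    await asyncio.sleep(1.0)  # pretend each sub-task takes 1 second
    return f"{name}: done"

async def run_dag(dag: dict[str, list[str]]) -> dict[str, str]:
    tasks: dict[str, asyncio.Task] = {}

    async def run_node(name: str) -> str:
        # Wait for all dependencies first; independent nodes overlap.
        await asyncio.gather(*(tasks[dep] for dep in dag[name]))
        return await run_subagent(name)

    for name in dag:
        tasks[name] = asyncio.create_task(run_node(name))
    return dict(zip(dag, await asyncio.gather(*tasks.values())))

# The two crawls run in parallel, so the whole graph takes ~3s, not 4s.
print(asyncio.run(run_dag(DAG)))
```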
Native multimodality
Among the four models, only Kimi K2.5 was trained with vision and text mixed from day one. It integrates the 400M-parameter MoonViT encoder, so it doesn’t have to “translate” images into text before reasoning.
VideoMMMU 86.6%, beating GPT-5.2’s 85.9%. None of the other three models can touch this level of video understanding.
To be clear: audio is not a native capability of K2.5. Moonshot AI has a separate Kimi-Audio model, and the app side stitches them together.
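For reference, this is roughly what a vision request looks like against an OpenAI-compatible endpoint. It is a sketch only: the base URL and model id below are placeholders, so check Moonshot’s docs for the real values.

```python
import base64
from openai import OpenAI  # pip install openai

# Placeholder endpoint and key -- substitute the values from Moonshot's docs.
client = OpenAI(base_url="https://api.moonshot.example/v1", api_key="YOUR_KEY")

with open("mockup.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="kimi-k2.5",  # hypothetical model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Turn this UI mockup into React components."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```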
Where it fits
Complex Agent automation (intelligence analysis, competitive monitoring), visual coding (UI mockup to code), and applications that need multimodality. Input cost is $0.60/M tokens — 12% of Claude’s.
Qwen3.5: The new king of the cost war
Release date: February 16, 2026 · Vendor: Alibaba
Qwen3.5 is the most ambitious pricing killer of this Lunar New Year season. 397B parameters with an MoE architecture activating 170B during inference.
The cost advantage: how do they pull off -60%?
Not marketing fluff — it’s real compute-efficiency gains:
- FP8 quantization + MoE optimization: 8-bit low precision plus mixture-of-experts selection brings inference cost way down
- 8x throughput: same GPU resources can handle 8x the concurrent requests
- API pricing: input cost around $0.20/1M tokens (one-third of Kimi)
For enterprises: with the same $10,000 budget, Qwen3.5 runs 3x the workload you’d get from Kimi, or 25x what you’d get from Claude.
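The arithmetic behind those multiples is straightforward; a quick sketch using the input prices from the table above (output prices would shift the exact numbers):

```python
# Input price per 1M tokens, taken from the comparison table above.
PRICE = {"Qwen3.5": 0.20, "Kimi K2.5": 0.60, "Claude Opus 4.5": 5.00}

budget = 10_000  # USD
for model, price in PRICE.items():
    print(f"{model:16s} ~{budget / price:,.0f}M input tokens")

# Qwen3.5 buys 0.60 / 0.20 = 3x Kimi's volume and 5.00 / 0.20 = 25x Claude's.
```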
Qwen3.5-Coder: the on-prem option for coding
Released alongside it, Qwen3.5-Coder-Next has 233B parameters, activates roughly 30B per inference, and runs locally on a single H100.
What that means: enterprises can deploy directly on their own servers — the entire code review process never leaves the building, and it’s 40% cheaper than the API on top of that. For the “data privacy + coding needs” scenario, this is currently the most pragmatic option.
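As a sketch of what “never leaves the building” looks like in practice: serve the weights with a local inference engine such as vLLM (its `vllm serve` command exposes an OpenAI-compatible API, on port 8000 by default), then point the standard client at localhost. The served-model id below is hypothetical.

```python
from openai import OpenAI  # pip install openai

# Talk to a locally hosted vLLM server; nothing leaves your network.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("change.diff") as f:
    diff = f.read()

resp = client.chat.completions.create(
    model="Qwen3.5-Coder-Next",  # hypothetical served-model id
    messages=[
        {"role": "system", "content": "You are a strict code reviewer."},
        {"role": "user", "content": f"Review this diff:\n\n{diff}"},
    ],
)
print(resp.choices[0].message.content)
```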
Where it fits
Cost-sensitive enterprise AI applications, large-scale RAG (256K context), long-document analysis, on-prem deployed coding tools.
GLM-5: The ceiling of reasoning
Release date: February 11-12, 2026 · Vendor: Zhipu AI
GLM-5 is the most “hardcore” release of the holiday season. 744B parameters, activating 400B per inference — that active parameter count alone is more than double Qwen3.5’s total active parameters (170B).
First open-source model to break Intelligence Index 50
On Zhipu’s published Intelligence Index, GLM-5 is the first open-source model to hit 50+:
- GLM-5: 50.2
- Claude Opus 4.5: 50.0
- GPT-5.2: 49.8
It’s a weighted average across 7 standardized benchmarks. Plainly: GLM-5’s reasoning is now on par with top-tier international closed-source models.
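Zhipu’s exact benchmarks and weights aren’t reproduced here, so treat everything in this sketch as a placeholder; the mechanics are simply a weighted mean over seven category scores:

```python
# Placeholder category names, scores, and (equal) weights --
# Zhipu's actual benchmarks and weighting are not reproduced here.
scores = {"reasoning": 52.1, "coding": 51.4, "math": 49.8, "agents": 50.5,
          "knowledge": 48.9, "long_context": 49.0, "instruction": 50.0}
weights = {k: 1 / len(scores) for k in scores}

index = sum(weights[k] * scores[k] for k in scores)
print(f"Intelligence Index ~ {index:.1f}")
```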
Coding ability
HumanEval hits 92.8% — the strongest code generator of the four models. LiveCodeBench 89.2%, also the highest. If your need is “write code” rather than “fix code,” GLM-5 is the best pick.
(“Fix code” still belongs to Claude, with its rock-solid SWE-Bench 80.9%.)
Where it fits
Complex reasoning tasks, coding applications (code generation and review), and scenarios where you need “not to lose to the international top tier.” Input cost around $0.30/M tokens.
MiniMax M2.5: The sweet spot between speed and precision
Release date: February 11-12, 2026 · Vendor: MiniMax
MiniMax M2.5 is the most “low-key” yet most “pragmatic” option. It’s only 230B parameters — much smaller than the other three — but it shines in real-world applications.
The most accurate Tool Calling
MiniMax M2.5 scores 77.2% on τ-Bench (Tau-Bench), beating every other model compared here. This benchmark specifically tests the model’s ability to call external tools: understanding when a tool is needed, filling in arguments correctly, reasoning over returned results, and handling failed calls.
In Agent frameworks like OpenClaw, Dify, and n8n, the accuracy of Tool Calling directly determines the success rate of your automation pipelines.
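Tool Calling is easy to picture with a concrete request. Below is a minimal OpenAI-style tool definition (the endpoint and model id are placeholders); what τ-Bench effectively grades is whether the model invokes this function at the right moment and fills `order_id` correctly instead of answering in free text.

```python
from openai import OpenAI  # pip install openai

# Placeholder endpoint and key -- substitute MiniMax's real values.
client = OpenAI(base_url="https://api.minimax.example/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the shipping status of an order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="minimax-m2.5",  # hypothetical model id
    messages=[{"role": "user", "content": "Where is order #A1042?"}],
    tools=tools,
)
# A well-behaved model returns a structured tool call with
# order_id="A1042" rather than guessing an answer in plain text.
print(resp.choices[0].message.tool_calls)
```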
Speed and cost
The smaller parameter count pays off in real ways:
- Highest throughput: handles the most concurrent requests on the same hardware
- Lowest latency: shortest time-to-first-token
- Lowest cost: input cost around $0.15/1M tokens, 33x cheaper than Claude
Where it fits
High-throughput Agent applications (precise Tool Calling), batch processing, extreme cost optimization, real-time applications that need low latency.
Full benchmark comparison
| Domain | Benchmark | Kimi K2.5 | GLM-5 | Qwen3.5 | MiniMax M2.5 | GPT-5.2 | Claude Opus |
|---|---|---|---|---|---|---|---|
| Agent Collaboration | HLE-Full (w/ Tools) | 50.2% | 48.6% | 47.2% | 45.1% | 45.5% | 43.2% |
| Agent Search | BrowseComp | 78.4% | 72.3% | 68.5% | 71.2% | 65.8% | 57.8% |
| Tool Calling | τ-Bench | 74.6% | 76.1% | 74.8% | 77.2% | 72.3% | 68.9% |
| Code Repair | SWE-Bench | 76.8% | 77.2% | 75.4% | 71.3% | 80.0% | 80.9% |
| Code Generation | HumanEval | 87.3% | 92.8% | 89.1% | 85.6% | 90.2% | 88.4% |
| Visual Math | MathVision | 84.2% | 81.5% | 79.8% | 77.2% | 83.0% | N/A |
| Pure Math Reasoning | AIME 2025 | 96.1% | 95.2% | 94.8% | 93.1% | 100% | 92.8% |
| Long Video Understanding | VideoMMMU | 86.6% | N/A | N/A | N/A | 85.9% | 82.1% |
| Live Coding | LiveCodeBench | 85.0% | 89.2% | 84.5% | 82.1% | 87.3% | 64.0% |
| Chinese Understanding | CMMLU | 84.5% | 92.1% | 89.3% | 86.2% | 80.2% | 79.1% |
Where each model wins
- Kimi K2.5: First in Agent collaboration, Agent search, and multimodal (vision and video), sweeping every Agent and multimodal row
- GLM-5: First in code generation (92.8%), live coding (89.2%), and Chinese understanding (92.1%) — three categories
- MiniMax M2.5: First in Tool Calling (77.2%), best on speed and cost
- Qwen3.5: No standout #1 anywhere, but balanced across the board and cheapest of all
“Strongest” doesn’t mean “most suitable”
Take SWE-Bench:
- Claude 80.9% — you’re shipping Linux kernel patches and need close to 100% correctness? Pick Claude
- GLM-5 77.2% — startup writing business logic, 77% is plenty, and it’s 17x cheaper
- Kimi 76.8% — you need to code while looking at design docs (multimodal)? Kimi is the only choice
The question isn’t “who’s strongest” — it’s “what is your scenario most sensitive to.”
Cost comparison: what enterprises actually care about
| Model | Input ($/1M) | Output ($/1M) | vs. Claude input |
|---|---|---|---|
| MiniMax M2.5 | ~$0.15 | ~$0.50 | 33x cheaper |
| Qwen3.5 | ~$0.20 | ~$0.80 | 25x cheaper |
| GLM-5 | ~$0.30 | ~$1.20 | 17x cheaper |
| Kimi K2.5 | $0.60 | $2.50 | 8x cheaper |
| GPT-5.2 | $1.25 | $10.00 | 4x cheaper |
| Claude Opus 4.5 | $5.00 | $25.00 | baseline |
Takeaway: same budget, Kimi K2.5 gets you 8x the workload (vs. Claude). Pick MiniMax or Qwen and you’re looking at 25-33x.
On-prem deployment cost
| Model | Active Params | GPU Required | Monthly Cost | Where it fits |
|---|---|---|---|---|
| Qwen3.5-Coder | 30B | 1x H100 | ~$2,500 | Coding teams, data privacy |
| MiniMax M2.5 | - | 1x A100 | ~$1,800 | High-throughput apps |
| Kimi K2.5 | 32B | 1x H100 | ~$2,500 | Agent Swarm apps |
| GLM-5 | 400B | 2x H100 | ~$5,000 | Reasoning-heavy tasks |
If monthly traffic exceeds 100M tokens, on-prem is actually cheaper than API. For enterprises, that’s the dividing line.
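Where that dividing line sits depends on your blended API price and your GPU bill. A back-of-envelope sketch: at Claude-class output pricing (~$25/M), a ~$2,500/month H100 box breaks even right around 100M tokens a month; against the cheaper open-source APIs the break-even is far higher, so run the numbers with your own input/output mix.

```python
def breakeven_tokens_m(monthly_gpu_cost: float, api_price_per_m: float) -> float:
    """Monthly token volume (in millions) above which on-prem wins."""
    return monthly_gpu_cost / api_price_per_m

# 1x H100 box (~$2,500/month, from the table above) vs various blended prices.
for blended in (0.50, 5.00, 25.00):  # cheap open-source API ... Claude-class
    print(f"API @ ${blended}/M -> break-even at "
          f"{breakeven_tokens_m(2500, blended):,.0f}M tokens/month")
```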
Selection matrix by use case
| Use Case | First Pick | Runner-up | Why |
|---|---|---|---|
| Agent automation (complex) | Kimi K2.5 | GLM-5 | Native parallel Agent Swarm |
| Agent automation (simple, high volume) | MiniMax M2.5 | Qwen3.5 | Accurate Tool Calling + fast |
| Coding + code review | GLM-5 | Qwen3.5-Coder | Highest HumanEval 92.8% |
| Long-document RAG | Qwen3.5 | Kimi K2.5 | Cheapest + 256K context |
| Multimodal (image/video) | Kimi K2.5 | — | Only one with native video understanding |
| Extreme cost optimization | MiniMax M2.5 | Qwen3.5 | Input $0.15/M |
| Reasoning-heavy | GLM-5 | Kimi K2.5 | Intelligence Index 50+ |
| On-prem (data sovereignty) | Qwen3.5 | GLM-5 | MIT open-source + low hardware bar |
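If you run several of these models behind one gateway, the matrix collapses into a tiny routing function. A sketch with hypothetical use-case labels and model ids, mirroring the first-pick column:

```python
# First-pick routing that mirrors the selection matrix above.
ROUTES = {
    "complex_agent":   "kimi-k2.5",
    "bulk_agent":      "minimax-m2.5",
    "coding":          "glm-5",
    "long_doc_rag":    "qwen3.5",
    "multimodal":      "kimi-k2.5",
    "cost_critical":   "minimax-m2.5",
    "heavy_reasoning": "glm-5",
    "on_prem":         "qwen3.5",
}

def pick_model(use_case: str) -> str:
    # Qwen3.5 as the default: no standout #1, but balanced and cheapest.
    return ROUTES.get(use_case, "qwen3.5")

print(pick_model("coding"))  # -> glm-5
```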
My real-world results on OpenClaw
The benchmark numbers are above; here are real scenarios. I swapped the lobster (my OpenClaw agent) from Opus 4.6 to Kimi K2.5 and ran it for a few days.
| Dimension | Opus 4.6 | Kimi K2.5 | Delta |
|---|---|---|---|
| Chinese understanding | Top tier | Top tier | No diff |
| Instruction precision | Top tier | Near top tier | Occasionally misses edge cases |
| Task completion | 95% | 93% | -2% (acceptable) |
| Response speed | Medium | Fast | +30% |
| Cost | 1x | 0.2x | 80% savings |
Scenario 1: structured extraction from a 50-page PDF. Opus was flawless at $2.50; Kimi hit 97% correctness at $0.25, a 10x cost saving.
Scenario 2: code review of 500 lines of Python. Opus surfaced 8 issues; Kimi surfaced 7, missing 1 edge case that wasn’t production-affecting.
Scenario 3: complex business-logic system design. Opus offered 5 perspectives; Kimi offered 4, missing 1. 95% satisfying, but not 100%.
Conclusion: trading 2-3% of perfection for 80% cost savings is an absolute steal for Agent applications.
Honestly: the three traps in model selection
Trap 1: Getting fooled by benchmarks
GLM-5 Intelligence Index 50.2 vs. Claude Opus 50.0 — the number differs by 0.2, but cost differs by 17x. Don’t let the ranking hypnotize you. Look at your use case, not the leaderboard.
Trap 2: Chasing “perfection” too hard
If you need 99.99% perfection, don’t look at open-source models. But if 95-98% is enough (and it is for most scenarios), the cost savings from open-source let you try 100 more new ideas.
Trap 3: Ignoring deployment freedom
Qwen3.5 and GLM-5 are both MIT open-source. Deploying on your own servers = data sovereignty + zero fear of vendor price hikes. For enterprises, that value may exceed the raw smartness of the model itself.
Closing
Lunar New Year 2026, China’s open-source LLMs have collectively entered their “maturity phase.”
Kimi K2.5, Qwen3.5, GLM-5, and MiniMax M2.5 aren’t so much competing with each other as covering different scenarios. Each tops a different dimension:
- Kimi: Agent and multimodal
- Qwen: Cost and deployment freedom
- GLM: Reasoning and Chinese understanding
- MiniMax: Speed and tool calling
The truth underneath: the parameter era is over, the MoE (mixture-of-experts) era has arrived. No more racing on “total parameters” — now it’s about “activation efficiency” and “application fit.”
My current setup: Kimi K2.5 for core Agent apps, MiniMax M2.5 for high-throughput tasks, GLM-5 for coding, Qwen3.5 for cost-sensitive enterprise apps.
Not as backups. This is the main lineup.
References
Kimi K2.5
- One Hundred Agents, One Command - Kimi K2.5 Automation
- MoonshotAI/Kimi-K2.5 - GitHub
- Kimi K2.5 Tech Blog - Moonshot AI
- Kimi K2.5 API Quickstart
Comparative analyses
- LLM Benchmark Comparison - Artificial Analysis
- China Open-source LLM Lunar New Year Release Roundup - Zhihu Column
- Open-source LLM Cost Analysis Report - AI Commons