Qwen 3.6-27B Local Deployment: Sonnet 4.6-Class AI Agent Running on a DGX Spark / Mac mini
Disclaimer: This post is machine-translated from the original Chinese article: https://ai-coding.wiselychen.com/qwen-3-6-27b-sonnet-level-home-inference/
The original work is written in Chinese; the English version is translated by AI.
📝 Translation Note
This article is translated from the Chinese original. The author writes primarily in Traditional Chinese, and the original version is the canonical source — including the latest updates, comments, and follow-up discussions.
👉 Read the Chinese original: Qwen 3.6-27B 本地部署:DGX Spark / Mac mini 跑出 Sonnet 4.6 等級 AI Agent
If you spot any translation errors or have feedback, please refer to the Chinese version as the source of truth.

Table of Contents
- TL;DR
- Why this is yet another “boring IT architecture” post
- Part 1: Why now? Why Qwen 3.6-27B?
- Part 2: Qwen 3.6-27B benchmarks beat Opus 4.5
- Part 3: What hardware are people running this on, and how fast?
- What this means for enterprise IT architects
- FAQ
TL;DR
- Qwen 3.6-27B (dense, Apache 2.0) wins 6 of 12 benchmarks against Claude Opus 4.5, ties 1, and lands at Sonnet 4.6 class on SWE-bench / Terminal-Bench
- One developer hit 136 tokens/sec on an NVIDIA DGX Spark ($4,699, 49W draw), peaking at 209 t/s with 10 parallel agents
- A “Sonnet 4.6-class AI coding agent” no longer needs a cloud API — it runs on a box smaller than a microwave under your desk
- What IT architects should do: redo the on-prem AI Coding ROI math. The assumptions you made 6 months ago are stale.
Why this is yet another “boring IT architecture” post
Another entry in the “boring IT architecture” series.
Over the past year, every local LLM discussion has been about performance — which model to run, which quantization to use, which GPU to buy. But for enterprise IT architecture, only one question really matters:
“Is local actually good enough to replace the Anthropic / OpenAI API?”
The answer used to be “not even close.” Six months ago, if you told your CTO “let’s run our Claude Code workload on-prem,” they’d have laughed at you — local model quality was a generation behind the commercial APIs and engineers wouldn’t touch them.
But on April 22, 2026, when the Qwen team dropped Qwen 3.6-27B, the answer changed.
This post isn’t a benchmark flex. It answers what IT architects actually care about:
- What class is Qwen 3.6-27B at, on benchmarks? Answer: Sonnet 4.6 class. Not a metaphor — the actual numbers.
- Can you actually run it? On what hardware? Answer: a $4,699 NVIDIA DGX Spark that sits under your desk at a 49W draw — about five LED bulbs' worth of power.
- Is your on-prem AI Coding architecture diagram still valid? Answer: no. Time to redraw it.
Part 1 | Why now? Why Qwen 3.6-27B?
A few terms first, because all of this has been moving fast for the last 6 months.
What is Qwen 3.6-27B
Qwen 3.6-27B is the latest open-source model from Alibaba’s Qwen team, released in April 2026:
- Architecture: Dense (not MoE), all 27B parameters active
- Context: Native 262,144 tokens, extendable to 1M (YaRN)
- Quantization: Official FP8 build (~27GB VRAM), plus 54 community quants (llama.cpp, LM Studio, Ollama, etc.)
- License: Apache 2.0 (no commercial restriction)
- Capabilities: Text + vision (multimodal), native tool calling
The architectural choice itself is a signal.
In 2025 the open-source world collectively pivoted to MoE — the Qwen team themselves shipped a 397B-A17B MoE flagship as Qwen 3.5. But for the 3.6 generation, they put the strongest agentic coding capability back into a 27B dense. Why?
Because dense 27B is exactly the sweet spot for “one consumer GPU / one workstation.” MoE has fewer active parameters but more total parameters, which means more VRAM. Dense 27B at FP8 only needs 27GB — DGX Spark’s 128GB unified memory has plenty of headroom, and even an RTX 5090 (32GB) fits it.
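The VRAM claim is easy to sanity-check. A minimal sketch of the arithmetic — raw weight footprint only, ignoring activation buffers and quantization metadata, which is why measured numbers (like Rapid-MLX's 14.9GB for 4-bit) run slightly higher:

```python
# Back-of-envelope VRAM check: at FP8, one parameter is one byte, so the
# weights of a 27B dense model come to ~27 GB. Runtime overheads (activations,
# KV cache, quant scales) come on top and vary by inference stack.

def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Raw weight footprint in GB: 1e9 params * bytes/param / 1e9 bytes/GB."""
    return params_billion * bytes_per_param

fp8 = weight_gb(27, 1.0)   # FP8: 1 byte per parameter
q4  = weight_gb(27, 0.5)   # 4-bit: 0.5 byte per parameter, before metadata

print(f"FP8: {fp8:.0f} GB, 4-bit: {q4:.1f} GB")  # FP8: 27 GB, 4-bit: 13.5 GB
```

The 13.5GB figure for 4-bit is the floor; real quants carry per-group scales and zero points, landing near the 14.9GB Rapid-MLX reports.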
The Qwen team’s bet is clear: the battle for agentic coding is on-device, not in the cloud.
Why this timing matters
Lay the last 18 months of milestones side by side:
| Date | Event | Significance |
|---|---|---|
| 2024 Q4 | Local LLMs limited to demos | Quality a generation behind commercial APIs |
| 2025 Q1 | Qwen 3.5-27B makes local tool calling usable | First time local could go to production |
| 2025 Q4 | Claude Opus 4.5 released | Anthropic flagship, state-of-the-art |
| 2025 Q4 | Mac Studio M3 Ultra runs 70B models | Consumer hardware caught up |
| 2026 Q2 (now) | Qwen 3.6-27B hits Sonnet 4.6 class on a home machine | All on-prem ROI assumptions need to be redone |
Chris Maddern’s observation on X is sharp:
“Opus 4.5 was released 5 months ago, the gap is closing. Opus 4.5 was the breakthrough moment for ‘good enough to stop writing code’… real local coding inference is coming.”
We used to talk about a “6-month frontier gap” — open-source models were typically half a year behind commercial ones. That gap has compressed to a single quarter, and on some axes it’s flat.
For enterprise IT, this isn’t trivia. It’s an input to architectural decisions, and the input just changed.
Part 2 | Qwen 3.6-27B benchmarks beat Opus 4.5
The Qwen team published 12 official benchmarks. Who makes up the comparison field?
- Qwen 3.5-27B (the previous-gen dense)
- Gemma4-31B (Google’s dense counterpart)
- Qwen 3.6-35B-A3B (their own MoE variant)
- Qwen 3.5-397B-A17B (the previous-gen flagship MoE)
- Claude 4.5 Opus (Anthropic’s flagship at the time)
Let’s go straight to the results.
Full benchmark comparison
| Benchmark | Category | Qwen 3.6-27B | Qwen 3.5-397B-A17B | Claude 4.5 Opus | Winner |
|---|---|---|---|---|---|
| Terminal-Bench 2.0 | Agentic Terminal | 59.3 | 52.5 | 59.3 | Tied with Opus |
| SWE-bench Pro | Agentic Coding | 53.5 | 50.9 | 57.1 | Opus |
| SWE-bench Verified | Agentic Coding | 77.2 | 76.2 | 80.9 | Opus |
| SWE-bench Multilingual | Multilingual Coding | 71.3 | 69.3 | 77.5 | Opus |
| QwenClawBench | Real-World Agent | 53.4 | 51.8 | 52.3 | Qwen |
| QwenWebBench (Elo) | Artifacts | 1487 | 1186 | 1536 | Opus |
| NL2Repo | Long-Horizon Coding | 36.2 | 32.2 | 43.2 | Opus |
| SkillsBench | Agent Skills | 48.2 | 30.0 | 45.3 | Qwen |
| Claw-Eval (pass^3) | Real-World Agent | 60.6 | 48.1 | 59.6 | Qwen |
| GPQA Diamond | Graduate Reasoning | 87.8 | 88.4 | 87.0 | Qwen (slim) |
| MMMU | Multimodal Reasoning | 82.9 | 85.0 | 80.7 | Qwen |
| RealWorldQA | Image Reasoning | 84.1 | 83.9 | 77.0 | Qwen |
Score: Qwen 3.6-27B wins 6, ties 1, and loses 5 against Claude 4.5 Opus.
A single open-source 27B dense model beating Anthropic's 5-month-old flagship on 6 of 12 official benchmarks.
What does the pattern tell us?
The 5 benchmarks where Qwen 3.6-27B loses:
- SWE-bench Pro: 53.5 vs 57.1 (–3.6)
- SWE-bench Verified: 77.2 vs 80.9 (–3.7)
- SWE-bench Multilingual: 71.3 vs 77.5 (–6.2)
- NL2Repo: 36.2 vs 43.2 (–7.0)
- QwenWebBench (Elo): 1487 vs 1536 (–49)
The losses cluster in pure coding, long-horizon repo reasoning, and web artifacts. No surprise — Opus 4.5 was a flagship optimized for writing code. The biggest gap (NL2Repo, long-span repo comprehension) is exactly where Opus has the most natural advantage.
The 6 benchmarks where Qwen 3.6-27B wins:
- QwenClawBench (real-world agent tasks): 53.4 vs 52.3
- SkillsBench (agent skills): 48.2 vs 45.3
- Claw-Eval (real-world agent pass^3): 60.6 vs 59.6
- GPQA Diamond (graduate-level reasoning): 87.8 vs 87.0
- MMMU (multimodal reasoning): 82.9 vs 80.7
- RealWorldQA (image reasoning): 84.1 vs 77.0 (+7.1)
The wins are in real-world agent tasks, agent skills, reasoning, and multimodal.
That’s exactly the capability mix an AI coding agent actually needs — not “write a single PR in one shot,” but long-running, tool-using, screen-aware, reasoning-heavy workloads.
The Sonnet 4.6 comparison — where the title comes from
Opus 4.5 (November 2025) is overkill for most enterprise AI coding workloads. In Q1 2026, Anthropic released Claude Sonnet 4.6 — that’s the model Claude Code is actually running for most daily work today.
Side by side:
| Benchmark | Qwen 3.6-27B | Claude Sonnet 4.6 | Gap |
|---|---|---|---|
| SWE-bench Verified | 77.2 | 79.6 | Sonnet +2.4 |
| Terminal-Bench 2.0 | 59.3 | 59.1 | Qwen +0.2 |
Not a metaphor. Literal numbers:
- Terminal-Bench 2.0: Qwen 3.6-27B edges out Sonnet 4.6.
- SWE-bench Verified: trails Sonnet 4.6 by only 2.4 points — within the margin of statistical noise.
Sonnet 4.6 API pricing: $3 / $15 per million tokens (input / output). A heavy AI coding engineer using 5M input + 1M output tokens per day burns $30/day, or $7,500/year.
That’s money you no longer have to spend.
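The API bill quoted above is worth showing step by step, since the rest of the TCO argument hangs on it. A quick recomputation using the article's own pricing and usage figures (250 working days per year is the assumption):

```python
# Sonnet 4.6 API cost for a heavy AI coding engineer, per the article's
# figures: $3/M input tokens, $15/M output, 5M input + 1M output per day.

def daily_api_cost(m_in: float, m_out: float,
                   price_in: float = 3.0, price_out: float = 15.0) -> float:
    """Daily spend in USD for m_in / m_out millions of tokens."""
    return m_in * price_in + m_out * price_out

per_day = daily_api_cost(5, 1)   # $15 of input + $15 of output
per_year = per_day * 250         # assuming ~250 working days

print(per_day, per_year)  # 30.0 7500.0
```

Plug in your own team's token telemetry; the break-even point moves fast with usage intensity.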
Part 3 | What hardware are people running this on, and how fast?
Beating Opus 4.5 on benchmarks is half the story. The other half — the half enterprise IT actually cares about — is can you run it, and on what?
The data point comes from this tweet
On April 22, 2026, X user Mitko Vasilev (@iotcoi) posted a terminal screenshot:
“Qwen3.6-27B-FP8 + Dflash + DDTree, 256k context, 10 agents ~200 tokens/sec max decode 136 t/s average on a single tiny GB10 GPU at 49W power”
User LotusDecoder quoted it and added (translated from Chinese):
“Beautiful — a small home desktop running Qwen 3.6-27B-FP8 inference at 136 tokens/sec. Performance is probably approaching Haiku 4.5.”
(Actually an underestimate — it’s Sonnet 4.6 class.)
The three components — one piece of hardware and two pieces of software — are worth unpacking.
Hardware: NVIDIA DGX Spark / GB10
NVIDIA DGX Spark is NVIDIA’s 2026 “home AI workstation.” Specs:
| Item | Spec |
|---|---|
| SoC | NVIDIA GB10 Grace Blackwell Superchip |
| CPU | 20-core ARM (10 × Cortex-X925 @ 4GHz + 10 × Cortex-A725 @ 2.8GHz) |
| GPU | 6,144 CUDA cores, Blackwell architecture |
| Memory | 128GB LPDDR5X unified memory |
| Storage | 4TB NVMe SSD |
| Network | ConnectX 200 Gbps (two Sparks can interconnect to run 405B models) |
| FP4 perf | 1 petaFLOP |
| Price | $4,699 USD |
| Form factor | Desktop, slightly smaller than a Mac Studio |
The key is the GB10 Superchip’s unified memory — all 128GB is GPU-addressable. Qwen 3.6-27B FP8 only takes 27GB; the remaining 100GB can be used for a massive KV cache for long context, or to keep multiple models loaded for hot-swapping.
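How far does ~100GB of headroom actually go as KV cache? A rough sizing sketch — the architecture details below (layer count, GQA KV heads, head dimension) are assumptions for illustration, since the article doesn't publish the model config; the order of magnitude is the point:

```python
# KV cache sizing: 2 tensors (K and V) per layer, each kv_heads * head_dim
# elements per token. Layer/head/dim values below are ASSUMED, not official.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

per_tok = kv_bytes_per_token(layers=48, kv_heads=8, head_dim=128,
                             bytes_per_elem=1)   # FP8 cache
budget = 100 * 10**9                             # ~100 GB after weights

print(per_tok, budget // per_tok)  # ~96 KB/token → ~1M tokens of cache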
The hardware is positioned clearly: it’s not trying to replace H100s in a data center — it’s trying to replace the MacBook Pro on an engineer’s desk. Give every AI coding engineer one. Put it under their desk. Their agents run on their hardware.
Software: What are Dflash and DDTree?
Mitko’s tweet mentions Dflash + DDTree — that’s the inference acceleration secret.
- DFlash: Block Diffusion Flash Speculative Decoding. In short, a small draft model “predicts” a whole block of candidate tokens at once, and the target model verifies them in a single forward pass. From the z-lab open-source project.
- DDTree: An improved version of DFlash that arranges candidate tokens into a tree, producing multiple candidate paths in the same draft pass and picking the best at verification time. The paper measures 4.84× → 6.90× speedup on Qwen3-8B HumanEval, and 4.78× → 6.75× on GSM8K.
What this combination means: without changing hardware, pure software optimization lifts token throughput by 6–8x on the same GPU.
That’s where Mitko’s 136 t/s comes from. Without DFlash + DDTree, the same hardware would only do 20–30 t/s.
Is the throughput actually enough?
At peak, 209 t/s spread across 10 parallel agents works out to roughly 20 t/s per agent (the 136 t/s average implies closer to 14 t/s each).
Reference points:
| Scenario | Throughput | Feel |
|---|---|---|
| Human reading speed | ~5 t/s | Slow enough to read along |
| Claude Sonnet 4.6 API (typical) | ~50–80 t/s | Fast, but rate-limited |
| Mitko’s DGX Spark (per agent) | ~20 t/s | 4x faster than reading |
| Mitko’s DGX Spark (peak, all agents) | 209 t/s | Only saturates with 10 parallel agents |
In other words, a single agent at 20 t/s is slightly slower than the Claude API, but 10 agents can run simultaneously with no rate limiting. For autonomous agent workflows, subagents, and cron-driven background tasks, total throughput beats the API.
Power: what 49W means
What’s 49W in context?
- One LED bulb: ~10W
- MacBook Pro at full tilt: ~100W
- DGX Spark running Qwen 3.6-27B agent workload: 49W
- RTX 4090 single-card LLM: ~400W
- A100 server: ~400–700W
8 working hours/day × 250 days × 49W = 98 kWh/year. At Taiwan industrial rates, that’s about NT$300/year — call it $10/year USD.
The number is small enough to not even register in any cost model. Compared to the API’s $7,500/year, electricity is a rounding error.
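The electricity figure, spelled out. The NT$3/kWh industrial rate is an approximation; actual Taiwanese tariffs vary by season and contract, but the conclusion is insensitive to the exact rate:

```python
# Annual electricity cost of a 49W box running 8h/day, 250 days/year.
# NT$3/kWh is an ASSUMED industrial rate, not an official tariff.

hours = 8 * 250              # working hours per year
kwh = hours * 49 / 1000      # watt-hours → kWh

nt_per_kwh = 3.0
cost_nt = kwh * nt_per_kwh

print(kwh, round(cost_nt))   # 98.0 kWh → ~NT$294/year, under US$10
```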
What about Mac Studio? Rapid-MLX numbers
Mac Studio is more common in enterprises than DGX Spark. The Mac numbers landed three days later. Apple Silicon inference engine Rapid-MLX (Apache 2.0, OpenAI-compatible API) shipped Day-0 Qwen 3.6 support in v0.6.1. On a Mac Studio M3 Ultra (256GB) it gets 4-bit at 36.5 t/s (14.9GB used) and 8-bit at 18.9 t/s (32.3GB used) — coding eval 100% pass, stress test 8/8.
These numbers are revealing in the other direction: DGX Spark's 136 t/s isn't a hardware blowout — it's the DFlash + DDTree software stack. Without acceleration, GB10 itself does only ~20–30 t/s, in the same league as the M3 Ultra.
For IT architects the takeaway is direct: if your company already gives engineers Mac Studios (a lot of Silicon Valley shops do), you don't need to buy DGX Sparks. `pip install rapid-mlx` and you're running Qwen 3.6-27B; 36.5 t/s is plenty for single-engineer interactive coding.
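Because Rapid-MLX (like most local inference servers) exposes an OpenAI-compatible API, pointing a client at it takes only the standard library. A minimal sketch — the port, path, and model name below are assumptions; check your server's startup output for the real values:

```python
# Minimal client for a local OpenAI-compatible endpoint (Rapid-MLX, vLLM,
# llama.cpp server all speak this shape). Base URL and model name are
# ASSUMPTIONS — substitute what your server actually reports.
import json
import urllib.request

def build_payload(prompt: str, model: str) -> dict:
    """Standard Chat Completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str,
         base_url: str = "http://localhost:8080/v1",
         model: str = "qwen3.6-27b-4bit") -> str:
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Requires a running local server; prints the model's reply.
    print(chat("Write a one-line Python hello world."))
```

The same client works unchanged against any tier of the architecture below — Mac mini, Mac Studio, or DGX Spark — which is what makes swapping hardware behind it cheap.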
Cost structure side by side
Per-developer math:
| Option | Year 1 | Year 2 | Year 3 | 3-year TCO |
|---|---|---|---|---|
| Claude Sonnet 4.6 API | $7,500 | $7,500 | $7,500 | $22,500 |
| DGX Spark + Qwen 3.6-27B | $4,699 + $10 elec | $10 | $10 | $4,729 |
3-year per-developer TCO delta: $17,771.
Scaled up:
- 10-person AI coding team: 3-year savings $177,710
- 100-person team: 3-year savings $1,777,100
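The table's per-developer math, as a function you can re-run with your own token volumes and hardware quotes:

```python
# 3-year per-developer TCO, using the article's figures as defaults:
# $7,500/year API spend vs. $4,699 hardware + ~$10/year electricity.

def tco_3yr_api(per_year: float = 7500.0) -> float:
    return per_year * 3

def tco_3yr_local(hardware: float = 4699.0,
                  elec_per_year: float = 10.0) -> float:
    return hardware + elec_per_year * 3

api, local = tco_3yr_api(), tco_3yr_local()
print(api, local, api - local)  # 22500.0 4729.0 17771.0
```

Multiply the delta by headcount for the team-level figures above; the model deliberately ignores ops labor, which is the honest unknown in any on-prem plan.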
But the dollars aren’t the real story. The real story is:
- Data stays in-house — every line of code, every prompt, every output stays on the corporate network. Finance, healthcare, defense, and legal can finally use AI coding.
- No API rate limits — autonomous agents run when they need to, no queueing.
- No single-vendor dependency — Anthropic triples prices tomorrow, you don’t care.
What this means for enterprise IT architects
If you’re a CTO / VP of Engineering / IT architecture lead, the actual ask of this article is:
Three checkpoints
- What was your reason 6 months ago for refusing on-prem AI coding?
- “Model quality is a generation behind commercial APIs”? → Now less than a quarter behind, and tied on some metrics
- “Hardware too expensive, payback period too long”? → $4,699 per box pays for itself in under 8 months of API spend
- “Engineers won’t want to use it”? → They’ll want Sonnet 4.6 class. The gap is tooling, not quality.
- Are those reasons still valid in April 2026?
- If 2 or more are no longer valid, your architectural assumptions are stale.
- What does a pilot cost?
- One DGX Spark: $4,699
- One engineer’s weekend: install Qwen 3.6-27B, hook it into Claude Code or aider, run it for a week
- Total under $6,000 — below most enterprise IT’s “no approval needed” threshold.
Caveats — being honest
This isn’t a fluff piece for open-source models. The caveats matter:
- For top-tier pure coding quality, Opus 4.5 / Opus 4.6 still win. SWE-bench Pro is 3.6 points apart, NL2Repo is 7 points — critical tasks should still fall back to commercial APIs.
- Qwen is open-sourced by a Chinese team. Apache 2.0 itself has no geopolitical restriction, but specific compliance environments (government, defense, certain financial products) will still get blocked by legal.
- Dflash + DDTree isn’t a mainstream inference stack. vLLM / TGI are still catching up on integrating block diffusion speculative decoding — production deployment has engineering learning costs.
- 136 t/s is a best-case configuration. Different prompt, different quantization, you might get 80 t/s — measure for yourself.
- Official benchmarks have marketing baked in. Qwen’s own numbers ≠ real performance on your workload.
But even after discounting all the caveats, the conclusion still stands: a Sonnet 4.6-class AI coding agent can run on the small machine on your desk.
How does the architecture diagram update?
Back to the question this “boring IT architecture” series keeps asking — what should the architecture look like?
Old on-prem AI coding architecture (2024–2025):
| Layer | Hardware / Service | Notes |
|---|---|---|
| Compute | Central GPU server (4× A100) | ~$100K capex |
| Access | SSH / internal API | Engineers competing for GPU quota |
| Bottleneck | Parallel agents don’t fit | Rate limiting happens internally |
New on-prem AI coding architecture (2026):
| Layer | Hardware / Service | Purpose |
|---|---|---|
| Engineer layer (80% of tasks) | Personal desktop AI workstation per engineer (see table) | Local Qwen 3.6-27B, autonomous agents run unbounded, no rate limits |
| Department layer (long-repo reasoning) | Central GPU server running 70B+ models | Cross-repo, long-context tasks |
| Critical fallback | Claude Opus / Sonnet API | Used for critical tasks when compliance allows |
| Data layer | Langfuse audit trail | All prompts + responses stay on the corporate network |
Engineer-layer hardware options (running Qwen 3.6-27B):
| Hardware | Memory | Bandwidth | Quant | Expected speed | Price (USD) | Use case |
|---|---|---|---|---|---|---|
| NVIDIA DGX Spark | 128GB unified | LPDDR5X | FP8 (27GB) + large KV cache | ~136 t/s (with Dflash+DDTree, 10 parallel agents) | $4,699 | Heavy agent workload, parallel throughput |
| Mac Studio M3 Ultra 256GB | 256GB unified | ~819 GB/s | 4-bit / 8-bit (Rapid-MLX measured) | 36.5 t/s (4-bit) / 18.9 t/s (8-bit) | ~$5,599+ | macOS ecosystem, interactive + medium agent parallelism |
| Mac mini M4 Pro 64GB | 64GB unified | 273 GB/s | 4-bit / 8-bit | ~12–18 t/s (estimated, limited community data) | ~$2,199 | Solo developer, interactive use |
| Mac mini M4 Pro 48GB | 48GB unified | 273 GB/s | 4-bit / 8-bit (watch context size) | ~12–18 t/s (estimated) | ~$1,799 | Budget-conscious, short-context tasks |
| Mac mini M4 base | up to 32GB | 120 GB/s | 4-bit only (~15GB) | Notably slow | ~$1,299 | Quality compromise — not recommended |
Selection rule of thumb:
- Need to run autonomous agents (10 parallel, background tasks) → DGX Spark, with the Dflash+DDTree inference stack the parallel throughput crushes any Mac
- Solo heavy coding + macOS ecosystem → Mac Studio M3 Ultra; 4-bit at 36.5 t/s is already faster than a DGX Spark without DFlash, and `pip install rapid-mlx` gets you running in one line
- Solo interactive coding (one or two agents in your IDE) → Mac mini M4 Pro 64GB is enough, at half the price
- Budget < $1,500 → either wait for a used DGX Spark or stay on Claude Code subscription. Base Mac mini running 4-bit 27B isn’t Sonnet 4.6 class anymore.
This isn’t about replacing the commercial APIs entirely — it’s about moving 80% of daily agentic coding workload back inside the company, and leaving the hard 20% to Claude / GPT.
The cost is 20–30% of what you used to pay, data sovereignty is 100% yours, and you no longer ride the API vendor’s mood swings.
FAQ
Q: We’re in finance / healthcare / government. Qwen is a Chinese model — can we use it?
Apache 2.0 has no geopolitical restriction in itself, but legal in highly regulated industries will have concerns about “weights trained by a Chinese team.” Two practical paths: (1) wait for the next Llama / Mistral generation to catch up (historically 2–3 months behind), or (2) pilot Qwen 3.6-27B for internal non-sensitive tools, prove the workflow works, then evaluate. Specific compliance depends on your regulator’s guidance for open-source AI models.
Q: Can’t get a DGX Spark, or budget won’t approve it. Alternatives?
Depends on budget and intensity:
- $2,000–$2,500 personal: Mac mini M4 Pro 64GB, enough for solo interactive use (12–18 t/s). Tighter budget can drop to 48GB but context window narrows.
- $4,000–$5,000 workstation: Mac Studio M3 Ultra (128/256GB unified) running Qwen 3.6-27B. Rapid-MLX measured 4-bit 36.5 t/s and 8-bit 18.9 t/s. macOS ecosystem is friendlier for most engineers.
- Existing Windows / Linux workstation upgrade: RTX 5090 (32GB GDDR7) on a single card runs FP8 27B; with vLLM, throughput hits 100+ t/s. Good fit for teams with desktop hardware in place.
For agent parallelism and maximum t/s, DGX Spark is still the most cost-effective — 128GB unified memory + FP4 petaFLOP is something Mac mini / Mac Studio can’t match.
Q: Is the Dflash + DDTree inference stack stable enough for production?
Not yet mainstream. For production, wait a quarter — vLLM / SGLang are integrating block diffusion speculative decoding and there’ll be more mature deployment paths soon. Right now (April 2026) it works, but you’ll be debugging it yourself.
Q: So is Claude / Anthropic finished?
No. Opus 4.6 / Opus 4.7 are still the strongest coding models, and critical tasks will still run on commercial APIs. But Anthropic has lost “you have no choice” pricing power — that’s a structural change. Future API pricing pressure will increase, or commercial models will need to build deeper moats around agentic workflows / tool ecosystems / enterprise features. For enterprise IT, this is good news.
Q: Why didn’t Qwen go MoE for 3.6? What’s the advantage of dense 27B?
Qwen themselves shipped a 397B-A17B MoE flagship in the 3.5 generation, but for 3.6 they put the strongest agentic coding capability back into 27B dense. The reason: while MoE has fewer active parameters, total parameter count is higher and so is VRAM, which doesn’t fit “consumer GPU / desktop workstation” — the deployment sweet spot. Dense 27B at FP8 is just 27GB VRAM, and an RTX 5090 (32GB), Mac mini M4 Pro 64GB, or DGX Spark 128GB all fit it. The Qwen team’s bet: agentic coding’s battlefield is on-device, not in the cloud.
Q: 3-year TCO of on-prem Qwen 3.6-27B vs Claude API — what’s the gap?
For a heavy AI coding engineer (5M input + 1M output tokens/day): Sonnet 4.6 API is $7,500/year per person, $22,500 over 3 years. DGX Spark + Qwen 3.6-27B is $4,699 hardware + ~$30 of electricity over 3 years = $4,729 TCO. Per-person savings: $17,771. 10-person team: $177,710. 100-person team: $1,777,100. But the real value isn’t just dollars — it’s also data not leaving the company, no API rate limits, and no single-vendor dependency.
Closing
Back to that terminal screenshot from the tweet: 10 agents running in parallel, the green number resting at 209 t/s, and underneath it, 49W.
None of the individual numbers are earth-shattering. But put them together — 27B dense open-source, Sonnet 4.6-class benchmarks, a $4,699 desktop machine, 49W power draw, 10 parallel agents — and you get a structural inflection point.
Six months ago, enterprise IT architects had to explain to their boss “why we should pay Anthropic $7,500/year per engineer.”
Six months from now, the question flips: “Why are we still paying Anthropic $7,500/year per engineer instead of buying a $4,699 DGX Spark?”
That’s the math the “boring IT architecture” series keeps doing — architecture decisions aren’t about following trends, they’re about watching when the numbers cross the break-even point.
In April 2026, Qwen 3.6-27B crossed it.
Further reading
- Enterprise On-Prem LLM Architecture Blueprint
- Three Paths for AI Coding On-Prem
- ToolCall-15: Local Tool Calling Benchmark — Qwen 27B Sweet Spot
- Taalas ASIC: Burning LLMs into Silicon — The Future of Inference Cost
- Qwen Team Exodus: An Open-Source Governance Crisis
Sources
- Qwen 3.6-27B on Hugging Face
- NVIDIA DGX Spark product page
- Claude Sonnet 4.6 benchmark data
- DFlash paper and source
- DDTree research page
- Rapid-MLX v0.6.1 — Day-0 Qwen 3.6 Apple Silicon benchmarks
- Tweets from Mitko Vasilev (@iotcoi), LotusDecoder, Chris Maddern (@chrismaddern)