Wisely Chen | AI Agents, On-Prem LLMs & Enterprise AI Architecture

Practical notes on enterprise AI transformation, agentic workflows, and AI security.

文章歸檔

共 33 篇文章

Qwen 3.6-27B Local Deployment: Sonnet 4.6-Class AI Agent Running on a DGX Spark / Mac mini

Qwen 3.6-27B, an open-source dense model, hits 136 tokens/sec on the $4,699 NVIDIA DGX Spark — beating Claude Opus 4.5 on benchmarks and edging out Sonnet 4.6 on Terminal-Bench. This post walks IT architects through hardware options for local Qwen 3.6-27B deployment (DGX Spark vs Mac mini M4 Pro 64GB), 12 official benchmarks, the Dflash + DDTree inference stack, a 3-year TCO comparison ($22,500 vs $4,729 per developer), and the architectural rewrites this triggers for on-prem AI Agent setups.

"Opus Is Too Smart, So It Shouldn't Be Doing the Planning" — A Paper That Flips the Agent Ops Paradigm

Columbia's AgentOpt paper ran 9 models in 81 combinations and proved it: Ministral 8B as Planner + Opus as Solver hits 74.27% accuracy, while Opus as Planner sits at 31.71%. Putting the most expensive model in the Planner seat is the worst move — because it's so strong it skips the tools and answers raw. Anthropic's own Advisor Tool is course-correcting in the same direction: cheap models run the main loop, Opus steps back as on-call advisor. The unit of agent pipeline optimization isn't single-model capability — it's how the model combo fits the specific task.

Cracking the Cache: From Gemma4 to Claude Code, Save 80% on Tokens

Open Claude Code in the morning, type one sentence — and 2–10% of your monthly quota is gone. I ran an experiment locally with Gemma4 and watched prompt processing drop from 31 seconds to 0.25 seconds — a 100x speedup. Then I dug into the Claude Code source and unpacked Anthropic's multi-layer cache architecture: DYNAMIC_BOUNDARY splitting, two-tier TTL, cache-break detection. From the KV cache fundamentals in Transformers, to the MLSys 2024 Prompt Cache paper, to the daily money-saving habits — once you understand the mechanism, the same plan can stretch 3–5x further.

Harness Engineering Architecture: AI Can Write Code, But It Can't Ship to Production on Its Own

Amazon let AI fix a bug; AI deleted the entire production environment. DataTalks.Club: AI wiped the whole database. An e-commerce team lost millions of orders to an AI change. Three incidents, one pattern: reset → rebuild → clean state. This post lays out the full picture of Harness Engineering in one architecture diagram — from Amazon's ban to OpenAI's Control Plane, from three-layer defense to the seven-component reference architecture, from five failure modes to three things your team can start tomorrow.

Harness Engineering Fully Decoded: When AI Agents Finish Writing Code, Is Your Repo Ready to Catch It Automatically?

3 OpenAI engineers shipped 1 million lines of code in 5 months using Codex—0 lines written by humans. They call this Harness Engineering—not the engineering of writing code, but the engineering of building constraints and feedback loops. Inspired by this, Ryan Carson published a complete Control-Plane Pattern: risk tier contract, preflight gate, SHA discipline, remediation loop. Last time we talked four layers of defense. This time we look at how a full control plane catches Agent output at speed.

OpenClaw's Five Ways to Browse the Web: From Search API to Taking Over Your Browser

AI agent 'browsing the web' isn't one thing — it's five. Pick the wrong mode and you're either missing capability or handing your accounts to an AI. OpenClaw's five web access architectures — Search API, Web Fetch, Managed Browser, Remote CDP, Extension Relay — each have wildly different capability ranges, security risks, and appropriate use cases. This post breaks down every layer: from the safest search APIs to the most dangerous full browser takeover, including the Accessibility Tree vs screenshot efficiency gap, the manual-login sweet spot for Managed Browser, and WebMCP's future potential.

Token Export: China's AI Is No Longer Selling Products—It's Selling Tokens

China's AI export is undergoing a qualitative shift—from selling products to selling Tokens. In February 2026, Chinese models (MiniMax, Kimi, GLM) overtook the US in production-grade Token call volume for the first time. GLM-5 walked away clean from distillation accusations, beat GPT-5.2 on SWE-bench, was trained entirely on Huawei chips, and its API is 5-8x cheaper. Stack together China's electricity price ($0.08/kWh vs US $0.18/kWh), open-source talent density, and hardware autonomy, and you see a new trade paradigm forming: exporting SOTA-90% reasoning capability to the world in a metered, priceable way. This isn't a story about the tech race. It's a story about cost structure.

The Channel War: OpenClaw, Anthropic, and Who Gets to Decide the Future of AI Agents

OpenClaw v2.19 shipped an Apple Watch MVP. Anthropic blocked OAuth to shut out third-party subscribers. Sam Altman recruited Peter Steinberger and embraced open source. Put these three things together and you see something beyond a technical competition — you see the most brutal reality of the AI industry: whoever controls the channel decides the model's fate.

OpenClaw Week: From the Claude Code 1.5 Era to a Digital Jarvis | Weekly Vlog EP8

A week-long deep dive into OpenClaw — from creator Peter's builder philosophy to the Memory architecture (AGENTS.md, SOUL.md), three token-saving tricks (cut 50%+ easily), and the 'new employee' enterprise security strategy. The AI agent that comes closest to a real digital Jarvis. Worth your time to understand it properly.

OpenClaw Token Optimization Guide: How to Cut AI Agent Operating Cost by 97%

Real intelligence isn’t paying for the most expensive model—it’s careful prompt and system design. This post shares five core optimization strategies—session initialization, model routing, local heartbeats, prompt caching, and rate limiting—shown in practice to reduce OpenClaw cost from ~$1,500/month to under $50.

Moltbot Security Hardening in Practice: A Complete Four-Layer Defense-in-Depth Guide for AI Agents

You don’t need to be a security expert—just be willing to spend an afternoon reading the docs carefully. This post distills Moltbot community battle-tested experience into a four-layer defense-in-depth playbook: Isolation, Quarantine, Rollback, and Transparency. It covers AI Agent Security, Prompt Injection Defense, LLM Agent Security, and an end-to-end Agentic Security framework.

When Unix Philosophy Meets AI: The Command Line Renaissance

When I was a kid I read a book called Unix Power Tools. There was a line I remembered for almost twenty years: ‘Command line pipeline is the best UI interface in the world.’ Back then I had no idea what it meant. But after Claude Code burst onto the scene in April 2025, I finally understood: a brain that understands the world through text plugged into an interface that exposes the world’s state through text. This isn’t retro—it’s structurally the most reasonable choice.

CaMeL: Google DeepMind’s Prompt-Injection Defense Architecture

Simon Willison called this ‘the first credible prompt injection defense’ he’s seen. CaMeL’s core design splits one agent into two: a low-privilege agent that reads external data, and a high-privilege agent that makes decisions—so ‘reading data’ and ‘taking actions’ are always separated.