Disclaimer: This post is machine-translated from the original Chinese article: https://ai-coding.wiselychen.com/openclaw-cost-optimization-guide-97-percent-reduction/
The original work is written in Chinese; the English version is translated by AI.

The problem: hidden costs in the default configuration


A reader DM’d me last week:

“My OpenClaw ran for a week and the API bill was $200, but I barely used it.”

Not an isolated case.

In my architecture analysis post, I explained why OpenClaw is called a “token shredder.” That post was about “why it’s expensive.” This one is about “how to make it cheaper.”

Based on several community write-ups, I tested a set of optimizations myself. The root problem: the default setup optimizes for capability, so OpenClaw routes everything, including trivial “are you there?” heartbeats, to expensive flagship models (Claude Opus).

It’s like “hiring a lawyer to check your inbox.” Financially irrational.

Community benchmarks show up to 97% reduction. In my own tests, just the first two optimizations can easily cut 50%+.


Five core optimization strategies


From my tests, OpenClaw cost optimization can be summarized into five strategies:

# | Strategy | Primary effect
1 | Session initialization | eliminate context bloat
2 | Model routing | tier tasks intelligently
3 | Local heartbeats | zero-cost monitoring with Ollama
4 | Prompt caching | 90% discount on static content
5 | Rate limiting | prevent budget blow-ups

Let’s break them down.


Strategy 1: session initialization — eliminate context bloat


This is the deepest, most hidden cost killer.

By default, OpenClaw tends to re-upload the entire conversation history (all Slack/WhatsApp/Telegram history) every time, which makes each request consume a lot of tokens.

Default behavior vs optimized rules

Setup | Context size | Cost per session
Default (bloated) | 50KB+ | $0.40
Optimized (lean) | 8KB | $0.05

Savings: 87%.

What does the default load?

  • full history
  • logs
  • memory dumps

After optimization, load only essentials

  • SOUL.md
  • USER.md
  • IDENTITY.md

How to implement

Define “session initialization rules” in the system prompt to prevent automatically loading old chat history.

Add to SOUL.md:

## Session Management

When I say "new session":
1. Clear all conversation history from context
2. Store important information in memory/YYYY-MM-DD.md
3. Start fresh with only core files (SOUL.md, USER.md)

## Startup Rules

At session start, ONLY load:
1. SOUL.md
2. USER.md
3. memory/today.md (if exists)

Do NOT auto-load:
- Full conversation history
- All channel messages
- Historical memory files older than 2 days
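
To sanity-check the lean-versus-bloated numbers on your own machine, measure the files the agent actually loads at startup. A minimal sketch, assuming the workspace lives at ~/.openclaw/workspace (adjust the paths to your install):

# size of the lean startup set (SOUL.md, USER.md, today's memory)
du -ch ~/.openclaw/workspace/SOUL.md \
       ~/.openclaw/workspace/USER.md \
       ~/.openclaw/workspace/memory/today.md 2>/dev/null | tail -1

# size of the accumulated memory files the default behavior could pull in
du -ch ~/.openclaw/workspace/memory/*.md 2>/dev/null | tail -1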

Strategy 2: tiered models and intelligent routing


Don’t use one model for everything. Assign different tiers of “employees” based on task difficulty.

A three-tier model architecture

Tier | Name | Model | Task types | Relative cost
Tier 1 | High intelligence | Claude Opus / GPT-5.2 | architecture decisions, complex refactors, security analysis | 100% (baseline)
Tier 2 | Routine work | Claude Haiku / DeepSeek V3 | code generation, research, drafting | 10–50× cheaper
Tier 3 | Brainless tasks | Gemini Flash / Local LLM | classification, heartbeats, simple queries | 0–1%

Core principle

Model routing selects the most appropriate model automatically based on complexity, balancing cost and performance.

Add routing rules to SOUL.md

System prompt optimization is about giving the agent a “soul”:

## Model Selection

Model selection: When in doubt, use Haiku first.

Context management: Load ONLY essentials.

Optimization goal: Optimize for tokens (treat token usage as a success metric).
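
If you want the routing to be more explicit than those three one-liners, you can spell out the tiers from the table above directly in SOUL.md. This is only a sketch; the model names on the right are the aliases from openclaw.json, so swap in whatever aliases your config actually defines.

## Model Routing Rules

Route by task type, not by habit:
- Architecture decisions, security analysis, complex refactors → opus
- Code generation, research, report drafting → haiku (or deepseek)
- Classification, heartbeats, yes/no checks → flash or the local model

If a task does not clearly need Tier 1, start one tier lower and escalate
only when the cheaper model's answer is not good enough.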

Strategy 3: local heartbeats — zero-cost monitoring


By default, OpenClaw runs a “Heartbeat” check every 30 minutes. This isn’t just “is the system alive?”—it’s a full background monitoring loop.

What does heartbeat do?

  • check unread email (anything urgent?)
  • scan calendar (what’s in the next 24–48 hours?)
  • monitor social mentions (Twitter/Telegram @’s)
  • check weather (if you might go out)
  • confirm system status and connectivity

Problem statement

Using paid APIs for heartbeat can cost $5–15/day (idle overhead).

Solution: localize it

Route heartbeats to local Ollama, taking the cost from $5/day to $0—and without consuming API rate limits.

Configuration flow

System Heartbeat → Route: Ollama (Local Host) → Model: llama3.2:3b → Cost: $0.00

Steps

  1. Install Ollama
  2. Pull a lightweight model: ollama pull llama3.2:3b
  3. Update config (see openclaw.json below)
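
A minimal command sketch for steps 1 and 2, assuming a Linux or macOS host (on macOS the desktop installer from ollama.ai works too):

# install Ollama (official Linux install script)
curl -fsSL https://ollama.com/install.sh | sh

# pull the lightweight heartbeat model
ollama pull llama3.2:3b

# local sanity check: this answer never touches a paid API
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.2:3b", "prompt": "Reply OK if you are alive.", "stream": false}'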

Option B: cheap cloud models

If you don’t want to run Ollama locally, use very cheap hosted models:

  • Gemini Flash-Lite: $0.50 / million tokens
  • DeepSeek V3.2: $0.27 / million tokens

Monthly cost: ~$0.50–$1.00 for 24/7 stability.
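
If you go this route, the heartbeat block shown in the openclaw.json later in this post can point at the hosted model instead of Ollama. The provider and model strings below are placeholders (I have not verified the exact identifiers OpenClaw expects), so treat this as a shape, not a copy-paste:

{
  "heartbeat": {
    "provider": "google",
    "model": "gemini-flash-lite",
    "every": "1h"
  }
}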


Strategy 4: prompt caching — 90% discount for static content


System prompts and static files like SOUL.md rarely change. With caching, reusing that content can get you a 90% price discount.

OpenClaw will enable prompt caching automatically when supported. To maximize cache hit rate, the key is to keep static content in dedicated files.

Step 1: identify what to cache

✓ Cache this | ❌ Don’t cache
system prompt (rarely changes) | daily memory files (change often)
SOUL.md (operating principles) | recent user messages (refresh each session)
USER.md (goals & context) | tool output (changes per task)
reference materials (pricing, docs, specs) |
tool docs (rarely updated) |
project templates (standard structure) |

Step 2: structure for caching

/workspace/
├── SOUL.md          ← cache this (stable)
├── USER.md          ← cache this (stable)
├── TOOLS.md         ← cache this (stable)
├── memory/
│   ├── MEMORY.md    ← don't cache (frequently updated)
│   └── 2026-02-03.md ← don't cache (daily notes)
└── projects/
    └── [PROJECT]/REFERENCE.md ← cache this (stable docs)

Step 3: enable caching in config

Update ~/.openclaw/openclaw-config.json to enable prompt caching:

{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-haiku-4-5"
      },
      "cache": {
        "enabled": true,
        "ttl": "5m",
        "priority": "high"
      }
    },
    "models": {
      "anthropic/claude-sonnet-4-5": {
        "alias": "sonnet",
        "cache": true
      },
      "anthropic/claude-haiku-4-5": {
        "alias": "haiku",
        "cache": false
      }
    }
  }
}

Note: Caching is most effective with Sonnet (better for large-prompt reasoning tasks). Haiku is already very efficient, so caching has smaller benefits.

Config options:

Option | Meaning
cache.enabled | true/false: global prompt caching on/off
cache.ttl | time-to-live: “5m” (default), “30m” (longer sessions), “24h”
cache.priority | “high” (prefer caching), “low” (balance cost/speed)
models.cache | per-model true/false: recommended for Sonnet, optional for Haiku

Step 4: cache-hit strategy

To maximize caching efficiency:

1. Batch requests within a 5-minute window

  • send multiple API calls back-to-back
  • reduce cache misses between requests

2. Keep the system prompt stable

  • don’t edit SOUL.md mid-session
  • changes invalidate cache; batch updates into a maintenance window

3. Layer your context

  • core system prompt (highest priority)
  • stable workspace files
  • dynamic daily notes (don’t cache)

4. For projects, separate stable vs dynamic content

  • product-reference.md (stable, cache)
  • project-notes.md (dynamic, don’t cache)
  • don’t let note edits invalidate caches
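
For a sense of what this looks like at the API level: Anthropic's Messages API lets you mark a system block as cacheable with cache_control, and an identical block reused within the TTL is billed at the cheaper cache-read rate. OpenClaw handles this for you when caching is enabled; the raw request below is only a sketch of the mechanism, and the model id is illustrative:

curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "system": [
      {
        "type": "text",
        "text": "<contents of SOUL.md and other stable files>",
        "cache_control": {"type": "ephemeral"}
      }
    ],
    "messages": [{"role": "user", "content": "Draft the weekly outreach email."}]
  }'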

Real example: outbound email drafting

Assume you use Sonnet to write 50 outbound emails per week (reasoning + personalization):

No cache | With cache (batched)
system prompt: 5KB × 50 = 250KB/week | system prompt: 1 write + 49 cached reads
cost: $0.75/week | cost: $0.016/week
50 emails × 8KB = $1.20/week | 50 emails (~50% cache hit) = $0.60/week
Total: $1.95/week ≈ $102/year | Total: $0.62/week ≈ $32/year
 | Savings: ~$70/year

Before/after comparison

❌ Before | ✓ After
system prompt sent every request | system prompt reused via cache
cost: 5KB × 100 calls = $0.30 | cost: 5KB × 100 calls = $0.003
no cache strategy | batch within 5 minutes
random cache misses | 90% cache hit on static content
repeated-content cost: $100+/month | repeated-content cost: ~$10/month
single project: $50–100/month | single project: $5–15/month
multi-project: $300–500/month | multi-project: $30–75/month

Step 5: monitor cache performance

Use session_status to check caching:

openclaw shell
session_status

# cache metrics:
# Cache hits: 45/50 (90%)
# Cache tokens used: 225KB (vs 250KB without cache)
# Cost savings: $0.22 this session

Or query the API directly:

# check usage for the last 24 hours
curl https://api.anthropic.com/v1/usage \
  -H "Authorization: Bearer $ANTHROPIC_API_KEY" | jq '.usage.cache'

Metrics to track:

Metric | Meaning
cache hit rate > 80% | the strategy is working
cached tokens < 30% of input | system prompt may be too large
cache writes keep increasing | prompt is changing too often
session cost down 50% vs last week | combined effect of caching + routing

Compounding effects with other optimizations

Caching amplifies previous savings:

Optimization | Before | After | With caching
session initialization (lean context) | $0.40 | $0.05 | $0.005
model routing (Haiku default) | $0.05 | $0.02 | $0.002
heartbeat via Ollama | $0.02 | $0 | $0
rate limiting (batch) | $0 | $0 | $0
prompt caching | $0 | $0 | -$0.015
Combined total | $0.47 | $0.07 | $0.012

When not to use caching

  • Haiku workloads (too cheap): cache overhead > savings
  • frequently changing prompts: invalidation cost > benefits
  • tiny requests (<1KB): cache overhead eats the discount
  • dev/testing: too much iteration → constant misses

Strategy 5: budget control and rate limiting


Prevent “runaway automation” from burning your budget overnight.

Four guardrails

Guardrail | Setting
API spacing | at least 5 seconds between calls
search limits | 10 seconds between searches; max 5 per batch
error handling | on 429, stop and wait 5 minutes
budget caps | daily budget $5 (warn at 75%)

Add to SOUL.md

## Rate Limits

ALWAYS follow these limits:
- API calls: minimum 5 seconds between calls
- Web searches: maximum 5 per batch, 10 seconds between each
- If you hit 429 error, STOP and wait 5 minutes
- If you hit rate limit, STOP and report to user

## Budget Awareness

- Daily budget: $5
- Warning at 75% ($3.75)
- Hard stop at 100%
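
The rules above are instructions to the agent. If you also fire off API calls from your own scripts, the same guardrails are easy to enforce mechanically. A rough bash sketch, where call_api and the URLs are placeholders for whatever requests you actually make:

#!/usr/bin/env bash
# Hypothetical wrapper: 5-second spacing between calls, back off on HTTP 429.
TASKS=("https://example.com/task-1" "https://example.com/task-2")  # placeholders

call_api() {
  # stand-in for your real request; prints only the HTTP status code
  curl -s -o /dev/null -w "%{http_code}" "$1"
}

for task in "${TASKS[@]}"; do
  status=$(call_api "$task")
  if [ "$status" = "429" ]; then
    echo "429 received: stopping and waiting 5 minutes" >&2
    sleep 300
  fi
  sleep 5  # minimum 5 seconds between calls
done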

Configure openclaw.json

{
  "budget": {
    "daily_limit_usd": 5,
    "monthly_limit_usd": 100,
    "warning_threshold": 0.75,
    "action_on_limit": "pause_and_notify"
  }
}

Key config: openclaw.json


Edit ~/.openclaw/openclaw.json to enable multi-model routing:

{
  "defaults": {
    "primary": "anthropic/claude-haiku-4-5"
  },
  "heartbeat": {
    "provider": "ollama",
    "model": "llama3.2:3b",
    "every": "1h"
  },
  "model_routing": {
    "default": "claude-haiku-4-5",
    "complex_reasoning": "claude-opus-4",
    "code_generation": "deepseek-r1",
    "simple_tasks": "gemini-2.0-flash"
  },
  "prompt_caching": {
    "enabled": true,
    "cached_files": ["SOUL.md", "USER.md", "IDENTITY.md"]
  },
  "budget": {
    "daily_limit_usd": 5,
    "monthly_limit_usd": 100,
    "warning_threshold": 0.75
  }
}

Key changes:

  • set default model to Haiku (instead of Opus)
  • point heartbeats to local Ollama
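
After editing, make sure the file is still valid JSON before relying on it; a syntax error can silently push you back to the expensive defaults. One quick check, assuming jq is installed (how you reload OpenClaw depends on your setup, so that step is not shown):

jq . ~/.openclaw/openclaw.json > /dev/null && echo "openclaw.json: valid JSON" || echo "openclaw.json: syntax error"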

Real cases: savings by user type


User type | Usage pattern | Before | After | Savings
Everyday | 24 heartbeats/day | $200/month | $70/month | 65%
Heavy | intensive sub-agent use | $943/month | $347/month | 62%
Matt’s extreme case | full local LLM deployment | $1,500+/month | <$50/month | 97%

Validation & monitoring: ensure it’s working


Use session_status to check:

✅ Context Size: 2-8KB (target)
✅ Heartbeat: Ollama/Local (not API)
✅ Default Model: Haiku
✅ Cache Hit Rate: > 80%

If costs don’t drop, check whether openclaw.json and your system prompt were reloaded correctly.


Optimization checklist


Step | Action | Notes
1 | Install Ollama | download and run llama3.2:3b
2 | Update config | set local heartbeat + model aliases in openclaw.json
3 | Build a soul | create SOUL.md rules for routing + initialization
4 | Set guardrails | implement rate limits + budget warnings
5 | Verify | run session_status to confirm metrics

Honest caveats

This post synthesizes Matt Ganzak and VelvetShark’s optimization experience plus my own tests.

But a few honest notes:

1. Not all optimizations are free

Using cheaper models for heartbeat can sometimes feel “slower.” Gemini Flash-Lite’s first response can be 200–500ms slower than Opus.

If your scenario is latency-sensitive (e.g., real-time trading), that trade-off might not be worth it.

2. Clearing context reduces continuity

A “new session” command saves money, but it also makes the agent “forget” what you were just talking about.

My approach: only use “new session” when switching topics; keep continuity within a single topic.

3. Model routing needs tuning

Default routing rules won’t perfectly fit your environment. I spent about two days tuning the boundaries (“which tasks use which model”).

For example, I found DeepSeek R1 is enough for Python scripts, but complex Bash scripts still benefit from Sonnet.

4. This isn’t set-and-forget

OpenClaw will evolve, Anthropic pricing will change, and new models will appear.

I re-check my configuration about once a month.


FAQ

Q: Will OpenClaw be less capable after optimization?

Depends on what you mean by “capable.” For the hardest tasks: no—complex reasoning can still route to Opus. For average response speed: slightly—cheaper models often have higher latency. In my experience, day-to-day use doesn’t feel different, but some edge cases need manual model switching.

Q: Should I run local Ollama or use Gemini Flash-Lite?

Depends on your setup. If you have spare local hardware (Mac mini, old PC), Ollama can make it $0. If you’re on a cloud VM or don’t want to operate local services, Gemini Flash-Lite is simpler. Personally I use Gemini—$1/month for stability is worth more than saving that $1.

Q: Isn’t RAG better for context bloat?

This is about OpenClaw’s design philosophy. RAG is great for “looking things up,” but OpenClaw wants “persistent cognitive state.” RAG can harm memory continuity and persona consistency. See my architecture analysis for more.

Q: Do strict guardrails hurt agent autonomy?

Yes. It’s a trade-off. My principle: I’d rather the agent get blocked sometimes than wake up to a $500 bill. Tune limits based on your risk tolerance.

Q: Do these optimizations apply to Claude Code?

Partially. Concepts like model routing and budget monitoring transfer. But Claude Code doesn’t have heartbeats and doesn’t have OpenClaw’s “homework read” pattern, so context bloat is less severe. Claude Code’s main costs come from long sessions and heavy tool use—optimization looks different.



Resources & tools


Resource | Description
Cost Calculator | calculator.vlvt.sh
Local LLM | ollama.ai (download Llama 3.2)
Config Templates | see examples under ~/.openclaw/

Follow for more: @mattganzak

