Cut OpenClaw API Costs by 95%

OpenClaw is powerful, but the API costs can surprise you. A single agent running heartbeats every 30 minutes, handling a few cron jobs, and responding to messages on Telegram can easily generate $150-$600 per month in LLM API costs if you configure it carelessly.

The problem is not the framework. The problem is that the default configuration most people copy from tutorials – Claude or GPT for everything, 30-minute heartbeats, bloated context files, no caching, no consideration of cheaper providers – is expensive. The good news: most of that cost is recoverable through mechanical changes that take an afternoon to implement.

The single biggest lever: stop defaulting to American frontier models. DeepSeek V3 costs $0.14 input / $0.28 output per million tokens. Claude Opus costs $15 / $75. That is roughly a 107x price difference on input and 268x on output, for quality that is comparable on most agent tasks.

This post covers the specific fixes that matter, with real configurations and expected savings.

Where the money goes

Before optimizing anything, you need to know what you’re paying for.

OpenClaw’s costs break down into five categories:

LLM API calls. Every message, every heartbeat, every cron execution is a model inference. If you default to Claude Opus at $15 per million input tokens and $75 per million output tokens and run 500 turns per month, you can see how this compounds.

Context accumulation. Your conversation history grows every turn. After 10 rounds of back-and-forth, you might be sending 150,000 tokens of context with each message. You pay for that context every single time unless prompt caching is working.

Heartbeat and background tasks. A heartbeat that wakes your agent every 30 minutes is 48 turns per day, or ~1,440 per month. If each turn costs $0.10, that’s $144 just for the heartbeat.

Multi-round reasoning. Complex tasks that require the model to research, draft, revise, and finalize can make 5-10 API calls for a single user request. Each call carries the full context.

Model selection waste. Using Claude Opus to check your inbox or format a log file is like hiring a senior engineer to make coffee. It works, but you’re paying many times what the task is worth.

The baseline problem

Here is what an unoptimized OpenClaw setup looks like:

{
  "agents": {
    "defaults": {
      "model": "anthropic/claude-opus-4-6",
      "heartbeat": {
        "every": "30m"
      }
    }
  }
}

This configuration uses Opus for everything and runs a heartbeat every 30 minutes.

Claude Opus costs $15 per million input tokens and $75 per million output tokens. A typical heartbeat turn with full context might consume 8,000 input tokens and 200 output tokens. That’s about $0.135 per heartbeat ($0.12 for input plus $0.015 for output). Run that 48 times per day and you’re at $6.48 daily, or about $194 per month, just for the heartbeat.

Add normal conversation turns, a few cron jobs, and maybe some multi-agent coordination, and you’re easily over $400 per month.

That is the default experience if you do not tune anything.

Fix 0: Use cheaper providers

Before you optimize anything else, question the assumption that you need Claude or GPT.

Chinese LLMs like DeepSeek, GLM, MiniMax, and Qwen offer frontier-comparable quality at a fraction of the cost. DeepSeek V3 costs $0.14 input / $0.28 output per million tokens. Claude Haiku, Anthropic’s cheapest model, costs $0.80 / $4. That is a 5.7x difference on input and a 14x difference on output. For many tasks, the cheaper Chinese model is more than sufficient.

Here is a pricing comparison in order of output cost:

Model	Input (per 1M)	Output (per 1M)	Quality Tier
DeepSeek V3.2	$0.14	$0.28	GPT-4 class
GPT-5 Nano	$0.05	$0.40	Budget
MiniMax M2.7	$0.30	$1.20	Coding-focused
Qwen 3.5-Plus	$0.26	$1.56	Strong multilingual
GLM-5	$1.00	$3.20	Chinese-optimized
Claude Haiku 4.5	$0.80	$4.00	Fast, basic
GPT-5.2	$1.75	$14.00	Flagship
Claude Sonnet 4.6	$3.00	$15.00	Capable
Claude Opus 4.6	$15.00	$75.00	Premium

The math is simple. A typical heartbeat turn uses 8,000 input tokens and 200 output tokens. On Claude Opus, that costs $0.135. On DeepSeek V3, it costs about $0.0012. That is roughly a 115x difference.

Run 48 heartbeats per day on Opus and you pay $6.48 daily, or about $194 per month. Run the same heartbeats on DeepSeek and you pay about $0.056 daily, or about $1.70 per month.
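
If you want to rerun this arithmetic for your own token counts, here is a minimal Python sketch. The prices are copied from the table above; the function names are mine, and the calculation ignores caching and any provider-side discounts.

```python
# Prices in USD per million tokens (input, output), copied from the table above.
PRICES = {
    "Claude Opus 4.6": (15.00, 75.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Haiku 4.5": (0.80, 4.00),
    "DeepSeek V3.2": (0.14, 0.28),
}


def turn_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one turn, ignoring caching and discounts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def monthly_heartbeat_cost(model: str, interval_minutes: int,
                           input_tokens: int = 8_000, output_tokens: int = 200,
                           days: int = 30) -> float:
    """Heartbeat cost over `days` days at a fixed interval."""
    turns_per_day = 24 * 60 // interval_minutes
    return turns_per_day * days * turn_cost(model, input_tokens, output_tokens)


if __name__ == "__main__":
    for model in PRICES:
        print(f"{model:18s} ${monthly_heartbeat_cost(model, 30):8.2f} per month")
```

With the default 8,000 / 200 token turn and a 30-minute interval, this works out to about $194.40 per month on Opus and about $1.69 on DeepSeek.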

The configuration:

{
  "models": {
    "providers": {
      "deepseek": {
        "baseUrl": "https://api.deepseek.com",
        "apiKey": "your-deepseek-key",
        "models": [
          {
            "id": "deepseek-chat",
            "name": "DeepSeek V3",
            "cost": {
              "input": 0.14,
              "output": 0.28
            }
          }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": "deepseek/deepseek-chat"
    }
  }
}

The trade-offs: DeepSeek API latency is higher than Claude or GPT (2-4 seconds vs 1-2 seconds). The public API can be rate-limited during peak usage. And while DeepSeek V3 scores competitively on benchmarks, it is not identical to Claude Opus on every task.

But for 80% of agent work, like simple queries, formatting, log summaries, and routine monitoring, the quality difference does not matter. The cost difference does.

For tasks that genuinely need frontier reasoning, route those to Claude Sonnet or GPT-5. For everything else, use DeepSeek.

Expected savings: 80-95% on API costs if you move most workload to DeepSeek or similar Chinese models.

Fix 1: Model routing

Once you have a cheap baseline provider, the next step is intelligent routing.

The pattern: use the cheapest model that can handle the task. DeepSeek V3 for most work, Claude Sonnet for complex reasoning, Claude Opus almost never.

A heartbeat that checks three conditions and returns HEARTBEAT_OK does not need Claude Sonnet. Neither does formatting a log file, translating text, or answering a simple question. DeepSeek handles these fine at a fraction of the cost.

The routing hierarchy:

DeepSeek V3 ($0.14 / $0.28): default for everything
Claude Sonnet ($3 / $15): when DeepSeek struggles or you need very high quality
Claude Opus ($15 / $75): almost never; reserved for critical decisions that actually require it

Configure it:

{
  "agents": {
    "defaults": {
      "model": "deepseek/deepseek-chat",
      "heartbeat": {
        "every": "60m",
        "model": "deepseek/deepseek-chat"
      }
    }
  }
}

For tasks that need more power, override at the task level:

openclaw cron add \
  --name "weekly-review" \
  --cron "0 9 * * MON" \
  --model "anthropic/claude-sonnet-4-6" \
  --message "Summarize the week's work and suggest priorities"

The key discipline: do not default to expensive models and hope for the best. Default to the cheapest model that works and escalate only when necessary.
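
To make the "default cheap, escalate only when necessary" rule concrete, here is a small Python sketch. The tier names and the helper functions are hypothetical and not part of OpenClaw; only the model IDs come from the configurations in this post.

```python
# Hypothetical task tiers mapped to the model IDs used in this post.
ROUTES = {
    "routine": "deepseek/deepseek-chat",       # heartbeats, formatting, summaries
    "complex": "anthropic/claude-sonnet-4-6",  # multi-step reasoning
    "critical": "anthropic/claude-opus-4-6",   # rare, high-stakes decisions
}
ESCALATION_ORDER = ["routine", "complex", "critical"]


def pick_model(tier: str = "routine") -> str:
    """Return the model for a tier; unknown tiers fall back to the cheapest."""
    return ROUTES.get(tier, ROUTES["routine"])


def escalate(tier: str) -> str:
    """Return the next tier up, or the same tier if already at the top."""
    index = ESCALATION_ORDER.index(tier)
    return ESCALATION_ORDER[min(index + 1, len(ESCALATION_ORDER) - 1)]
```

The point of the fallback in `pick_model` is that anything unclassified lands on the cheap model, so expensive models are only ever reached by an explicit decision.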

Expected savings: 50-80% beyond the baseline provider switch.

Fix 2: Local models via Ollama

For tasks that run on a schedule and do not require cutting-edge reasoning, local models eliminate API costs entirely.

Ollama is a local LLM runtime. You download model weights once, and every inference after that is free. No API keys, no per-token charges, no rate limits.

Install Ollama:

curl -fsSL https://ollama.com/install.sh | sh

Pull a capable model:

ollama pull llama3.3:70b

Configure OpenClaw to use it:

{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://localhost:11434",
        "apiKey": "ollama-local",
        "api": "ollama",
        "models": [
          {
            "id": "llama3.3:70b",
            "name": "Llama 3.3 70B",
            "cost": {
              "input": 0,
              "output": 0
            }
          }
        ]
      }
    }
  }
}

Now route low-stakes tasks to the local model:

openclaw cron add \
  --name "inbox-triage" \
  --cron "0 */6 * * *" \
  --model "ollama/llama3.3:70b" \
  --message "Check inbox and flag anything urgent"

Hardware requirements: a 70B model needs about 40GB of RAM when quantized. If you have a Mac with 48GB+ unified memory or a Linux server with 64GB, this works. If not, smaller models like llama3.2:3b run on 8GB machines but with reduced capability.
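
Where does the 40GB figure come from? A rough rule of thumb is parameters times bits per weight, divided by 8, plus some headroom for the KV cache and runtime. The sketch below assumes 4-bit quantization and a 15% overhead; both numbers are my assumptions, and real usage varies with quantization format and context length.

```python
def quantized_model_gb(params_billions: float, bits_per_weight: float = 4.0,
                       overhead: float = 0.15) -> float:
    """Rough memory estimate in GB for a quantized model.

    Billions of parameters times bytes per weight gives GB of weights;
    `overhead` is an assumed allowance for KV cache and runtime buffers.
    """
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * (1 + overhead)


if __name__ == "__main__":
    print(f"70B at 4-bit: ~{quantized_model_gb(70):.0f} GB")
    print(f"3B at 4-bit:  ~{quantized_model_gb(3):.1f} GB")
```

Under these assumptions a 70B model lands at roughly 40GB and a 3B model at under 2GB, which is why the small model fits comfortably on an 8GB machine.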

The trade-off is performance. Local models are slower than API calls (5-15 seconds for a response vs 1-2 seconds), and current open-weight models perform below GPT-4 and Claude Sonnet on complex reasoning tasks. But for routine work, like log summaries, inbox checks, and cron reports, they are more than adequate.

Expected savings: 100% on tasks routed to local models. If 30% of your workload can run locally, that is 30% of your API bill gone.

Fix 3: Prompt caching

Prompt caching reduces the cost of repeated context by storing it on the provider’s side and charging up to 90% less for cache reads.

The way this works: your system prompt and conversation history get sent with every request. If that content has not changed since the last request and the provider supports caching, the provider serves it from cache instead of reprocessing it. On Claude Sonnet, for example, you pay about $0.30 per million tokens for cache reads instead of $3.00 for fresh input.

Both Anthropic and OpenAI support this. Enable it in OpenClaw:

{
  "agents": {
    "defaults": {
      "params": {
        "cacheRetention": "short"
      }
    }
  }
}

The cache expires after 5 minutes of inactivity by default. That means if you send a message, wait 6 minutes, and send another, the cache is cold and you pay full price again.

For setups with frequent interactions, caching can save up to 60-70% on input token costs in the best case, when most of each request is served from a warm cache. For setups where messages are spaced more than 5 minutes apart, the benefit is smaller.
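
How much caching actually saves depends on what fraction of your input tokens are cache reads. A minimal sketch, assuming a 90% discount on cache reads and ignoring any premium the provider charges for cache writes (function names are mine):

```python
def effective_input_price(base_price: float, cached_fraction: float,
                          read_discount: float = 0.90) -> float:
    """Blended input price per million tokens for a given cache-hit fraction."""
    cached_price = base_price * (1 - read_discount)
    return cached_fraction * cached_price + (1 - cached_fraction) * base_price


def input_savings(cached_fraction: float, read_discount: float = 0.90) -> float:
    """Fraction of input cost saved; cache-write premiums are not modeled."""
    return cached_fraction * read_discount


if __name__ == "__main__":
    for fraction in (0.4, 0.7, 0.9):
        print(f"{fraction:.0%} cached -> {input_savings(fraction):.0%} saved")
```

With 70% of input tokens served from cache, the savings come to 63%. Sparse usage, where the cache is usually cold, pushes the cached fraction and the savings toward zero.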

One caveat: context compaction invalidates the cache. Every time OpenClaw summarizes old messages to free up space, the cache breaks and the next request pays full price. This is why keeping context files small matters: smaller context means less frequent compaction.

Expected savings: 30-50% on input tokens for high-frequency usage. Less if your interactions are sparse.

Fix 4: Heartbeat optimization

The default heartbeat interval is 30 minutes. That is 48 turns per day. If each turn costs $0.10, that is $144 per month just for the heartbeat.

The first fix is to increase the interval:

{
  "agents": {
    "defaults": {
      "heartbeat": {
        "every": "60m"
      }
    }
  }
}

That cuts heartbeat costs in half immediately.

The second fix is to slim down HEARTBEAT.md. A bloated heartbeat file with 15 checks and paragraphs of instructions costs more tokens per turn than a minimal checklist.

Before:

# Heartbeat Checklist

Please check the following items carefully and let me know if anything needs attention:

- [ ] Review my inbox for urgent messages from clients or team members
- [ ] Check calendar for upcoming meetings in the next 2 hours
- [ ] Verify server health metrics are within normal ranges
- [ ] Scan error logs for any critical failures
- [ ] Monitor API rate limits and usage
...

After:

# Heartbeat

- [ ] Urgent inbox items
- [ ] Meetings next 2h
- [ ] Server health

Keep it to 3-5 essential checks. Everything else can be a dedicated cron job.

The third fix is to use a cheap model for heartbeats:

{
  "agents": {
    "defaults": {
      "heartbeat": {
        "every": "60m",
        "model": "deepseek/deepseek-chat"
      }
    }
  }
}

DeepSeek V3 costs $0.14 input / $0.28 output per million tokens. A heartbeat turn with 1,500 input tokens and 100 output tokens costs about $0.00024. Run that 24 times per day and you are at about $0.006 daily, or about $0.17 per month. Compare that to roughly $200 per month with Opus at 30-minute intervals.

Expected savings: 95-99% on heartbeat costs.

Fix 5: Context window management

Every token in the context window costs money. Common culprits: loading entire files when only a summary is needed, including full conversation history when only recent messages matter, and repeating content that is already in the system prompt.

The first fix is to keep MEMORY.md and SOUL.md focused. A 5,000-word personality file is wasteful. Aim for under 1,000 words.

The second fix is to use targeted reads:

Instead of:

Read the full contents of memory/notes.md

Use:

Read only the last 30 days from memory/notes.md

The third fix is to enable session pruning. This automatically trims old tool results and context after the cache TTL expires:

{
  "agents": {
    "defaults": {
      "contextPruning": {
        "mode": "cache-ttl",
        "ttl": "1h"
      }
    }
  }
}

This ensures that post-idle requests do not re-cache oversized history.

Expected savings: 20-40% on context costs.

Fix 6: Cron job frequency

Do you really need that competitor price check every 6 hours? Most metrics that matter can be checked daily or weekly.

Change this:

openclaw cron add \
  --name "price-check" \
  --cron "0 */6 * * *" \
  --message "Check competitor pricing"

To this:

openclaw cron add \
  --name "price-check" \
  --cron "0 9 * * 1" \
  --message "Check competitor pricing"

That is once per week instead of four times per day. Saves 96% of the API calls for that job.
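
To sanity-check a schedule change before committing to it, you can count how often a cron expression fires in a week. This is a deliberately tiny matcher of my own, not a full cron implementation: it handles only `*`, `*/n`, and single numbers, so day names like `MON` are not supported.

```python
from datetime import datetime, timedelta


def field_matches(field: str, value: int) -> bool:
    """Match one cron field. Supports only '*', '*/n', and a single number."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return value == int(field)


def runs_per_week(expr: str) -> int:
    """Count the minutes in one week that match a five-field cron expression."""
    minute, hour, dom, month, dow = expr.split()
    t = datetime(2024, 1, 1)  # a Monday; used only as the start of a sample week
    count = 0
    for _ in range(7 * 24 * 60):
        cron_dow = (t.weekday() + 1) % 7  # cron: 0 = Sunday; Python: 0 = Monday
        if (field_matches(minute, t.minute) and field_matches(hour, t.hour)
                and field_matches(dom, t.day) and field_matches(month, t.month)
                and field_matches(dow, cron_dow)):
            count += 1
        t += timedelta(minutes=1)
    return count


if __name__ == "__main__":
    before = runs_per_week("0 */6 * * *")
    after = runs_per_week("0 9 * * 1")
    print(before, after, f"{1 - after / before:.1%} fewer runs")
```

The two schedules above come out at 28 runs per week versus 1, which is where the 96% figure comes from.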

Audit all your cron jobs and ask: does this need to run this often? For most monitoring and reporting tasks, the answer is no.

Expected savings: 10-30% on total costs, depending on how many cron jobs you have.

Fix 7: QMD for local memory

OpenClaw’s default memory system embeds chunks of your workspace files and stores them in SQLite. Every time you query memory, it calls the embedding API. That costs money.

QMD is a local hybrid search engine that runs embeddings and reranking on your machine using local GGUF models. No API costs for memory retrieval.

Install QMD:

npm install -g @tobilu/qmd

Configure OpenClaw to use it:

{
  "memory": {
    "provider": "qmd",
    "qmd": {
      "searchMode": "hybrid",
      "limits": {
        "timeoutMs": 4000
      }
    }
  }
}

QMD indexes your workspace files, session transcripts, and any other markdown collections you point it at. Everything runs locally. No embedding API calls.

The trade-off is setup complexity and hardware requirements. QMD needs SQLite with extensions and enough RAM to run the local embedding model. For most setups, that is manageable.

Expected savings: Eliminates all embedding API costs. If you are paying $20-$50 per month for embeddings, this brings that to zero.

Fix 8: Output token limits

Output tokens cost 2-8x more than input tokens, depending on the model. The default max_tokens is often set to 4,096, but most responses are under 1,000 tokens.

Set it explicitly based on actual usage:

{
  "agents": {
    "defaults": {
      "params": {
        "max_tokens": 1024
      }
    }
  }
}

For tasks that need longer output, override at the task level. But for general conversation, 1,024 is usually enough.

This does not save tokens directly because the model stops generating when it is done regardless of the limit. But it does prevent runaway generation and enforces discipline on output length.

Expected savings: 5-15% by preventing occasional long outputs that were not necessary.

Fix 9: Output compression prompts

Output tokens cost 2-8x more than input tokens, depending on the model. Every word the model generates costs money. LLMs are trained to be conversational, which means they waste tokens on pleasantries, hedging, and filler.

A response to “how does authentication work?” might open with “Here’s how authentication works in this app” and spend several sentences framing context before delivering actual facts. That padding is expensive.

The fix is to prompt for compressed output.

The caveman approach. This is a Claude Code skill that instructs the model to respond like a caveman, removing articles, pleasantries, hedging, and conversational filler while keeping technical accuracy.

Normal Claude (69 tokens):

The reason your React component is re-rendering is likely because you're 
creating a new object reference on each render cycle. When you pass an 
inline object as a prop, React's shallow comparison sees it as a different 
object every time, which triggers a re-render. I'd recommend using useMemo 
to memoize the object.

Caveman Claude (19 tokens):

New object ref each render. Inline object prop = new ref = re-render. 
Wrap in `useMemo`.

That is a 72% reduction. Same fix, same technical accuracy, far fewer tokens.

For OpenClaw, add this to your system prompt or agent instructions:

Respond concisely. Remove articles (a, an, the), pleasantries ("I'd be happy to"), 
hedging ("it might be worth considering"), and unnecessary framing. State facts 
directly. Keep code blocks and technical terms exact. Write git commits and 
documentation normally.

The “be concise” directive. Simply adding “Be concise” or “Answer briefly” to your prompts reduces output length by 30-45% on average. It is less aggressive than caveman mode but still effective.

For heartbeats and cron jobs that produce logs or reports:

openclaw cron add \
  --name "daily-summary" \
  --cron "0 9 * * *" \
  --model "deepseek/deepseek-chat" \
  --message "Summarize yesterday's work. Be concise. 2-3 sentences max."

Persona-based compression. Instruct the model to adopt a concise persona: “You are a concise technical writer” or “You are a terse Unix sysadmin.” This implicitly guides response length without requiring explicit instructions in every prompt.

{
  "agents": {
    "defaults": {
      "systemPrompt": "You are a concise technical assistant. Provide direct, fact-based answers. Skip pleasantries and unnecessary context."
    }
  }
}

The trade-off is readability. Caveman mode works great for logs, internal reports, and technical Q&A where you just need the information. It reads strangely in customer-facing chat or documentation. Use it selectively.

Expected savings: 30-75% on output tokens depending on how aggressively you compress.

The combined result

Apply all of these fixes and the cost profile changes dramatically.

Before:

Default model: Claude Opus ($15 input / $75 output)
Heartbeat: every 30 minutes on Opus
No caching
Bloated context files
Frequent cron jobs
Embedding API calls for memory

Monthly cost: $400-$600

After:

Default model: DeepSeek V3 ($0.14 input / $0.28 output)
Heartbeat: every 60 minutes on DeepSeek
Prompt caching enabled
Trimmed context files
Optimized cron frequency
Local models (Ollama) for sensitive/routine tasks
QMD for local memory

Monthly cost: $5-$20

That is a 95-99% reduction.

The breakdown by fix:

Chinese LLMs: 80-95% baseline savings
Model routing: 50-80% additional savings on escalated tasks
Local models: 100% savings on routed tasks
Prompt caching: 30-50% on input tokens
Heartbeat optimization: 95-99% on heartbeat costs
Context management: 20-40% on context costs
Cron frequency: 10-30% on scheduled tasks
QMD: Eliminates embedding costs
Output compression: 30-75% on output tokens
Output limits: 5-15% on generation waste

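The fixes compound, but the percentages multiply rather than add: each fix applies to whatever cost is left after the previous one. Using DeepSeek as your default gets you most of the way there. Output compression takes another 30-75% off the remaining output-token spend. Everything else stacks on what is left.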
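
A quick way to see how stacked savings behave: when several fixes apply to the same cost pool, multiply the fractions that remain. A minimal sketch (the function name is mine):

```python
from math import prod


def combined_savings(savings: list[float]) -> float:
    """Total savings when each fraction applies to what the previous one left."""
    return 1 - prod(1 - s for s in savings)


if __name__ == "__main__":
    # An 80% provider saving followed by 30% output compression on the remainder.
    print(f"{combined_savings([0.80, 0.30]):.1%}")
    # A 95% provider saving followed by 75% output compression on the remainder.
    print(f"{combined_savings([0.95, 0.75]):.2%}")
```

An 80% saving followed by a 30% saving is 86% in total, not 110%. This only holds when the fixes hit the same cost pool; a fix that touches only heartbeats or only embeddings has to be weighted by that category's share of the bill.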

What you lose

These optimizations are not free. There are trade-offs.

API latency on Chinese models. DeepSeek’s public API responds in 2-4 seconds. Claude and GPT respond in 1-2 seconds. If you are running an interactive chat where every half-second matters, this is noticeable. The fix is to use DeepSeek for background tasks and cron jobs, and route interactive chat to faster providers when latency matters.

Rate limits on public APIs. DeepSeek’s free tier has usage caps. Heavy workloads can hit them. The fix is either paying for a tier with higher limits or switching to third-party hosts like Together AI or Fireworks that offer DeepSeek inference with better throughput guarantees.

Response quality on edge cases. DeepSeek V3 scores competitively on benchmarks, but it is not identical to Claude Opus on every task. For most agent work, the difference does not matter. For critical reasoning tasks like legal analysis, complex debugging, and high-stakes decision-making, you might want Claude Sonnet or better. The fix is selective escalation. Default to cheap, escalate when necessary.

Latency on local models. A local 70B model takes 5-15 seconds to respond. An API call takes 1-2 seconds. If speed matters, this is a problem. The fix is to use local models only for background tasks, not interactive chat.

Cache invalidation on compaction. Prompt caching saves money, but context compaction breaks the cache. Keeping context small reduces compaction frequency, but you cannot eliminate it entirely on long-running sessions.

Readability with aggressive output compression. Caveman mode cuts 70%+ of output tokens but reads strangely. “New object ref each render” is efficient but not conversational. This is fine for logs, internal reports, and technical Q&A. It is not appropriate for customer-facing chat or polished documentation. The fix is selective use: compress where readability does not matter, keep natural language where it does.

Setup complexity. The default configuration is simple. Optimizing it requires understanding model routing, heartbeat mechanics, cron scheduling, and memory architecture. That is an afternoon of work, not five minutes.

The question is whether the trade-offs are worth the savings. For most personal OpenClaw setups, they are.

When this matters

If you are running OpenClaw casually, with a few messages per day, no cron jobs, and no heartbeat, you probably do not need to optimize aggressively. Your bill is already under $10 per month even without switching to DeepSeek.

If you are running OpenClaw as a production assistant, with continuous heartbeat, multiple cron jobs, group chat integrations, and multi-agent coordination, unoptimized costs on Claude Opus can hit $600 per month. Switch to DeepSeek and apply the other fixes and you can get that under $20.

The pattern: light usage is cheap even unoptimized. Heavy usage on frontier models is expensive unless you tune it. Heavy usage on Chinese models is cheap even without perfect tuning.

What I would add next

Provider cost monitoring. Track which provider handled each request and what it cost. Surface a daily report showing provider distribution and total spend. This makes it obvious when routing logic is sending too much traffic to expensive models.

Per-task model selection. Rather than defaulting everything to DeepSeek and manually overriding complex tasks, build a router that analyzes the task and picks the right model automatically. This is more sophisticated but removes the manual decision-making.

Multi-provider fallback. Configure multiple cheap providers (DeepSeek, GLM, MiniMax, Qwen) with automatic failover. If DeepSeek is rate-limited or down, the request falls back to the next cheapest option. This improves reliability without increasing baseline cost.
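
A rough sketch of what that failover could look like in Python. Everything here is hypothetical: `ProviderError`, the provider callables, and the stub functions stand in for real client code, and this is not an existing OpenClaw feature.

```python
from typing import Callable, Sequence, Tuple


class ProviderError(Exception):
    """Hypothetical error raised when a provider is rate-limited or down."""


def call_with_fallback(prompt: str,
                       providers: Sequence[Tuple[str, Callable[[str], str]]]
                       ) -> Tuple[str, str]:
    """Try providers in order (cheapest first); return (provider_name, reply)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Stub providers for illustration only.
def rate_limited(prompt: str) -> str:
    raise ProviderError("rate limited")


def healthy(prompt: str) -> str:
    return "ok: " + prompt


if __name__ == "__main__":
    print(call_with_fallback("ping", [("deepseek", rate_limited), ("qwen", healthy)]))
```

Ordering the list by price keeps the baseline cost unchanged: the more expensive entries are only touched when the cheaper ones fail.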

Hybrid local + cloud workflows. Use local models for 80% of tasks and reserve API models for the 20% that actually need them. This is easier with model failover: try local first, escalate to cloud if the local model struggles.

Automated cost tracking. Log every API call with its token count and cost. Surface a daily or weekly report showing where the money went. Without this, you are optimizing blind.
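
The logging side of this is simple enough to sketch. The record fields and function names below are my own, and the prices are passed in per call rather than looked up from any OpenClaw setting.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class CallRecord:
    provider: str
    input_tokens: int
    output_tokens: int
    input_price: float   # USD per million tokens
    output_price: float  # USD per million tokens

    @property
    def cost(self) -> float:
        """Cost of this call in USD."""
        return (self.input_tokens * self.input_price
                + self.output_tokens * self.output_price) / 1_000_000


def spend_by_provider(records: Iterable[CallRecord]) -> Dict[str, float]:
    """Sum the cost of all logged calls per provider."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.provider] += record.cost
    return dict(totals)


if __name__ == "__main__":
    log = [
        CallRecord("deepseek", 8_000, 200, 0.14, 0.28),
        CallRecord("anthropic", 8_000, 200, 15.00, 75.00),
    ]
    for provider, total in spend_by_provider(log).items():
        print(f"{provider:10s} ${total:.4f}")
```

Even a report this crude makes it obvious when one expensive provider accounts for most of the spend.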

Dynamic heartbeat intervals. Instead of a fixed 60-minute interval, adjust based on activity. If you have not sent a message in 4 hours, the heartbeat can slow down. If you are actively chatting, it can wake more often.
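
One possible shape for that logic, as a Python sketch. The 4-hour threshold and the 60-minute interval come from the paragraph above; the 30-minute "active" window and the 180-minute slow interval are placeholder values I picked for illustration.

```python
def heartbeat_interval_minutes(idle_minutes: float) -> int:
    """Map time since the last user message to a heartbeat interval."""
    if idle_minutes < 30:    # actively chatting: wake more often
        return 30
    if idle_minutes < 240:   # idle for less than 4 hours: normal pace
        return 60
    return 180               # idle for 4 hours or more: slow down


if __name__ == "__main__":
    for idle in (5, 120, 300):
        print(f"idle {idle:3d} min -> heartbeat every {heartbeat_interval_minutes(idle)} min")
```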

The broader point

The default OpenClaw configuration is optimized for getting started, not for cost efficiency. Most tutorials use Claude or GPT because those are the most familiar options, not because they are the cheapest.

But the cheapest option that works is usually the right option. DeepSeek V3 costs less than 1/100th of Claude Opus and less than 1/20th of Claude Sonnet on both input and output. For 80% of agent work, that quality difference does not justify the price difference.

The tuning is mechanical. Most of it is configuration changes, not code. An afternoon of work, spent switching providers, adjusting heartbeat intervals, enabling caching, and trimming context files, can cut your bill by 95% without breaking core functionality.

The hard part is not the technical work. The hard part is questioning the defaults. Everyone uses Claude because everyone uses Claude. That does not make it the right choice for your workload. This post is the roadmap for making a different choice.
