
Inside an OpenClaw Turn

Edge LLM serving, from the moment a user message lands to the moment a token streams back — with the prompt cache, the KV cache, and the auto-compaction event that breaks them.

Workload: OpenClaw agent --local driving ollama with qwen2.5:7b-instruct-q4_K_M on a 24 GB M3 MacBook Air. Numbers below come from real traces collected as part of my WukLab edge-inference work.

1. The five phases of a single turn

A turn isn't just a forward pass. The bulk of the wall-clock budget on the first invocation lives outside the model: Node startup, plugin load, provider discovery, auth, lane admission, session deserialization, prompt assembly. The Gantt below is from a real cold-prime trace.

Figure: turn 0, cold prime (qwen2.5:7B, 15.7k-tok prompt), 192.8 s wall-clock total.

1. plugin / auth boot: 6.1 s (process-scoped: Node boot, plugin load)
2. CLI parse + lane admission: 3.2 s (request-scoped)
3. pre-prompt setup: 3.7 s (prompt-scoped: session load, template render)
4. LLM load + decode: 172.1 s (model-scoped: prefill 15.7k tok, decode 240 tok)
5. teardown: 10 ms
The framework residual. Phase 1 alone is ~6 s every time the CLI starts — because every turn is a fresh Node process. A daemon-mode invocation (channel inside the long-lived gateway) collapses the framework overhead to ~400–650 ms per turn. That delta is one of the levers the edge-inference work optimizes against.
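The shape of that residual is easy to reproduce. A minimal sketch, using a fresh Python interpreter as a stand-in for the Node CLI boot (an assumption; absolute numbers will differ from OpenClaw's): a per-turn process spawn pays a fixed boot tax that a resident daemon never sees.

```python
import subprocess
import sys
import time

def fresh_process_turn() -> None:
    # One full interpreter boot + exit per turn, like a fresh Node process.
    subprocess.run([sys.executable, "-c", "pass"], check=True)

def daemon_turn() -> None:
    # The process is already resident; only the request-scoped work remains.
    pass

for name, turn in [("fresh process", fresh_process_turn), ("daemon", daemon_turn)]:
    t0 = time.perf_counter()
    for _ in range(5):
        turn()
    print(f"{name}: {(time.perf_counter() - t0) / 5 * 1e3:.1f} ms/turn")
```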

2. What the LLM phase actually does

Inside Phase 4, ollama itself splits the work into model load plus three sub-phases: prefix-cache match, prefill, and decode. The KV cache is built during prefill and is what makes every subsequent token cheap.

model load: mmap weights.
prefix-cache match: longest common prefix vs. the last run's KV cache.
prefill (new tokens): compute K, V for the suffix; 7B → ~80 tok/s.
decode (loop): stream output tokens; 7B → ~5 tok/s (bandwidth-bound).

KV cache (per layer × head), qwen2.5-7B with q8_0 KV: ~140 KB / token / sequence, written by prefill and read on every decode step; at num_ctx=32k, ~4.6 GB resident.
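The resident-size figure is straight multiplication; a quick check using the page's own per-token number:

```python
# KV-cache footprint grows linearly with context length.
KV_BYTES_PER_TOKEN = 140e3   # ~140 KB/token/sequence (qwen2.5-7B, q8_0 KV, from the diagram)
NUM_CTX = 32_768

resident_gb = KV_BYTES_PER_TOKEN * NUM_CTX / 1e9
print(f"KV resident at num_ctx={NUM_CTX}: ~{resident_gb:.1f} GB")  # ~4.6 GB
```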

3. Two caches, one frame

| Cache | What it stores | Granularity | Hit means | Miss costs |
|---|---|---|---|---|
| Prompt cache (a.k.a. prefix cache) | Last run's token sequence plus the KV it produced, in DRAM | Per position: matches the longest common prefix between the new prompt and the cached one | Skip prefill for the matched prefix entirely (0 ms for that range) | Recompute K, V for every diverging token (prefill at ~80 tok/s on 7B) |
| KV cache | K and V tensors for every layer / head / position, in DRAM (q8-quantized here) | Per token × layer × head | Each decode step reads existing K, V instead of recomputing attention from scratch | Without it, each decode step is O(L²) instead of O(L): not deployable |
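The prompt-cache row reduces to a longest-common-prefix scan. A minimal sketch of that decision (illustrative, not ollama's actual code), using the ~80 tok/s 7B prefill rate from the table:

```python
def common_prefix_len(cached: list[int], prompt: list[int]) -> int:
    """Longest common prefix between the cached token tape and the new prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n

def prefill_seconds(cached: list[int], prompt: list[int], prefill_tok_s: float = 80.0) -> float:
    hit = common_prefix_len(cached, prompt)
    return (len(prompt) - hit) / prefill_tok_s   # the matched prefix costs ~0

cached = list(range(15_700))           # stand-in for last run's 15.7k-token tape
append = cached + [101, 102, 103]      # plain chat append: suffix-only change
mutated = [999] + cached               # top-of-prompt mutation

print(f"append:  {prefill_seconds(cached, append):.2f} s")   # ~0.04 s: a HIT
print(f"mutated: {prefill_seconds(cached, mutated):.0f} s")  # ~196 s: a full MISS
```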

4. Eight scenarios — when does the prefix cache survive?

These come from a real measurement suite: same model, same 15,700-token base prompt, varying only what changes between turns. Each row's bar shows the call sequence's cache outcomes, left to right.

1. Cold prime (warm residual after unload)

model just loaded, KV empty — baseline
MISS 82 s

2. Plain chat append (3 turns, same session)

fresh Node each turn, ollama keeps KV warm
HIT 445 ms · HIT 482 ms · HIT 442 ms

3. Tool-turn attempt

main agent has no tool bindings — emits text only
HIT 442 ms

4. Retrieval inlined into the user message (RAG)

prompt grows by ~4 KB of retrieved context
HIT 232 ms

5. Session growth — 6 turns piling up

does the framework residual grow with session size?
HIT ×6 · 133 ms

6. Retrieval at system msg (template late-insert)

RAG snippet inserted into the system message instead of the user turn
SURPRISE 129 s

7. Search + fetch + write (multi-tool agent)

5 LLM calls + 4 tool calls, each call extends the suffix
HIT · tool · HIT · tool · HIT · tool · HIT

8. Agentic flow + auto-compaction + followup

compaction summarizes the tool chain between turn 0 and turn 1
turn 0: HIT · turn 1: MISS 135 s
Legend: HIT = cache hit (prefix reused) · MISS = full miss (cold prefill) · SURPRISE = surprise miss (top-of-prompt mutation) · tool = tool call
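Scenario 6 is the one worth internalizing. A toy prompt-assembly sketch (string-level, not OpenClaw's real template) shows why the same RAG snippet hits in scenario 4 but surprise-misses in scenario 6:

```python
import os

SYSTEM  = "You are the agent."
HISTORY = "<user> earlier turns ... <assistant> earlier answers ..."
RAG     = "retrieved context ..."

cached   = f"{SYSTEM}\n{HISTORY}"                 # what ollama KV-cached last call
user_rag = f"{SYSTEM}\n{HISTORY}\n<user> {RAG}"   # scenario 4: retrieval extends the suffix
sys_rag  = f"{SYSTEM} {RAG}\n{HISTORY}"           # scenario 6: retrieval mutates the top

# os.path.commonprefix is purely lexical, so it doubles as a prefix-match probe.
print(len(os.path.commonprefix([cached, user_rag])))  # == len(cached): prefix survives, HIT
print(len(os.path.commonprefix([cached, sys_rag])))   # dies inside the system msg: SURPRISE
```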

5. The 35× spike — what auto-compaction does to the cache

In scenario 8, the prefix cache survives turn 0 because each LLM call appends tokens to the same growing message list. Between turn 0 and turn 1, however, OpenClaw's runtime compacts the session: the chain of tool-call / tool-result messages is replaced with a short summary message. The new prompt prefix is not the prefix ollama just KV-cached — it's a different sequence from the system message onward.

After turn 0, the KV cache layout matches turn 0's message tape: system · user₀ · tool_call₁ · tool_result₁ · tool_call₂ · tool_result₂ · assistant₀. Turn 0 prefill: HIT, because the ollama KV cache covers the whole tape.

Before turn 1, OpenClaw compacts the session jsonl: system · summary (replaces the tool chain) · user₁. Turn 1 prefill: MISS, because the first token after the system message differs; the cache is invalid from position 30 onward and everything after it is recomputed.
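The same probe, run over a toy version of that message tape (the rendering and token counts are illustrative, not OpenClaw's jsonl format), shows how little survives compaction:

```python
def diverge_at(a: list[str], b: list[str]) -> int:
    """First position where the two token tapes differ."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))

def render(messages: list[tuple[str, str]]) -> list[str]:
    # Toy renderer: one "token" per word, role tags included.
    return " ".join(f"<{role}> {text}" for role, text in messages).split()

turn0 = render([
    ("system",      "You are the agent ..."),
    ("user",        "do the task"),
    ("tool_call",   "search(...)"),
    ("tool_result", "..."),
    ("tool_call",   "fetch(...)"),
    ("tool_result", "..."),
    ("assistant",   "done"),
])
turn1 = render([
    ("system",  "You are the agent ..."),
    ("summary", "searched and fetched; results condensed"),  # replaces the tool chain
    ("user",    "follow-up question"),
])

pos = diverge_at(turn0, turn1)
print(f"KV valid through position {pos}; everything after it is a cold prefill")
```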

Empirically, the prefill latency on the follow-up turn jumps from 3.9 s to 135 s: a ~35× spike, sitting ~24× off the trend line that fits every other partial-hit data point in the suite. That's the cleanest demonstration in the dataset of a compaction-induced cache miss, and it's the thing the optimization work targets.

6. Try it — simulate a turn

(Interactive simulator widget: reports TTFT, prefill rate, decode rate, and total wall-clock for a simulated turn.)
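The widget itself doesn't embed here, but its arithmetic is small. A sketch using the section-2 rates (assumptions to swap for your own traces; the 0.5 s framework residual is the daemon-mode figure from section 1):

```python
def simulate_turn(prompt_tokens: int,
                  cached_prefix: int,
                  output_tokens: int,
                  prefill_tok_s: float = 80.0,   # 7B prefill, section-2 diagram
                  decode_tok_s: float = 5.0,     # 7B decode, section-2 diagram
                  framework_s: float = 0.5) -> tuple[float, float]:
    """Return (TTFT, total wall-clock) in seconds for one simulated turn."""
    prefill_s = (prompt_tokens - cached_prefix) / prefill_tok_s
    ttft_s = framework_s + prefill_s
    total_s = ttft_s + output_tokens / decode_tok_s
    return ttft_s, total_s

# Cold prime vs. warm append on the 15.7k-token prompt:
for label, hit in [("cold", 0), ("warm", 15_700)]:
    ttft, total = simulate_turn(15_700, hit, output_tokens=240)
    print(f"{label}: TTFT {ttft:.1f} s, total {total:.1f} s")
```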

7. Verified numbers

Measured on M3 MacBook Air, 24 GB unified, ollama 0.23.2 (OLLAMA_FLASH_ATTENTION=1, OLLAMA_KV_CACHE_TYPE=q8_0). Full suite in SoC-LAB/openclaw_test/docs/memory_budget_findings.md.

| Scenario | Model | num_ctx | parallel | peak RSS (GB) | prefill (tok/s) | decode (tok/s) | p50 TTFT (s) |
|---|---|---|---|---|---|---|---|
| S1 | 3B | 4096 | 1 | 2.1 | 1416 | 37.2 | 7.2 |
| S3 | 7B | 4096 | 1 | 4.6 | 78 | 35.2 | 34.1 |
| S5 | 7B | 32768 | 1 | 5.5 | 77 | 65.1 | 64.3 |
| S8 | 14B | 32768 | 1 | 11.3 | 93 | 62.6 | 135.5 |
| S9 | 14B | 32768 | 2 | 14.1 | 82 | 32.4 | 210.4 |
| S11 | 14B | 32768 | 4 | 15.3 | 69 | 1.8 | 517.9 |
The 14B / 32k page-compressor cliff. S10 and S11 predict 18.3 / 21.3 GB of resident set, but the kernel hands back only 14.7 / 15.4 GB; the missing 4–6 GB has been compressed by the macOS page compressor. The predicted-vs-observed gap reaches 28 %, and that cliff is the hard ceiling on what 24 GB of unified memory can serve under this workload. It is also why the simulator demo above caps at 14B / 32k.
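For reference, the gap arithmetic behind that 28 % figure, using the predicted and observed numbers quoted above:

```python
predicted_gb = {"S10": 18.3, "S11": 21.3}   # weights + KV budget, as predicted
observed_gb  = {"S10": 14.7, "S11": 15.4}   # kernel-reported resident set

for s in predicted_gb:
    gap = 1 - observed_gb[s] / predicted_gb[s]
    print(f"{s}: predicted {predicted_gb[s]:.1f} GB, resident {observed_gb[s]:.1f} GB, "
          f"gap {gap:.0%} (held by the page compressor)")
# S10 lands near 20 %, S11 near 28 %: the figure quoted above.
```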