Edge LLM serving, from the moment a user message lands to the moment a token streams back —
with the prompt cache, the KV cache, and the auto-compaction event that breaks them.
Workload: OpenClaw agent --local driving ollama with qwen2.5:7b-instruct-q4_K_M
on a 24 GB M3 MacBook Air. Numbers below come from real traces collected as part of my WukLab edge-inference work.
1. The five phases of a single turn
A turn isn't just a forward pass. The bulk of the wall-clock budget on the first invocation lives
outside the model: Node startup, plugin load, provider discovery, auth, lane admission,
session deserialization, prompt assembly. The Gantt below is from a real cold-prime trace.
The framework residual. Phase 1 alone is ~6 s every time the CLI starts — because every
turn is a fresh Node process. A daemon-mode invocation (channel inside the long-lived gateway)
collapses the framework overhead to ~400–650 ms per turn. That delta is one of the levers
the edge-inference work optimizes against.
2. What the LLM phase actually does
Inside Phase 4, ollama itself splits the work into three sub-phases: model load (near zero when the weights are already resident), prefill over the prompt, and token-by-token decode. The KV cache is built during prefill
and is what makes every subsequent token cheap.
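The same split is visible in ollama's response metadata: a non-streaming /api/generate call returns load_duration, prompt_eval_duration, and eval_duration (all in nanoseconds) plus token counts, which is enough to reconstruct per-phase rates. A minimal sketch of reading them out; the host, model tag, and prompt below are placeholders, not values from the traces above.

```python
import json
import urllib.request

# One non-streaming generate call against a local ollama instance.
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "qwen2.5:7b-instruct-q4_K_M",
        "prompt": "Summarize the last tool result in one sentence.",
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
resp = json.loads(urllib.request.urlopen(req).read())

NS = 1e9  # ollama reports durations in nanoseconds
load_s    = resp.get("load_duration", 0) / NS         # sub-phase 1: model load
prefill_s = resp.get("prompt_eval_duration", 0) / NS  # sub-phase 2: prefill, builds the KV cache
decode_s  = resp.get("eval_duration", 0) / NS         # sub-phase 3: decode, one token per step

print(f"load    {load_s:6.2f} s")
print(f"prefill {prefill_s:6.2f} s  ({resp.get('prompt_eval_count', 0) / max(prefill_s, 1e-9):.0f} tok/s)")
print(f"decode  {decode_s:6.2f} s  ({resp.get('eval_count', 0) / max(decode_s, 1e-9):.0f} tok/s)")
```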
3. Two caches, one frame
| Cache | What it stores | Granularity | Hit means | Miss costs |
|---|---|---|---|---|
| Prompt cache (a.k.a. prefix cache) | Last run's token sequence + the KV it produced, in DRAM | Per position — matches the longest common prefix between the new prompt and the cached one | Skip prefill for the matched prefix entirely (0 ms for that range) | Recompute K, V for every diverging token (prefill at ~80 tok/s on 7B) |
| KV cache | K and V tensors for every layer / head / position, in DRAM (q8 quantized here) | Per token × layer × head | Decode step reads existing K, V instead of recomputing attention from scratch | Without it, each decode step is O(L²) instead of O(L) — not deployable |
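The KV-cache row above also pins down the memory cost: two tensors (K and V) per layer, per KV head, per position. A back-of-envelope sketch; the qwen2.5-7B geometry assumed here (28 layers, 4 KV heads via GQA, head dim 128) and the ~1 byte/element for q8_0 are my assumptions, not numbers from the traces.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, num_ctx, bytes_per_elem, parallel=1):
    """K and V tensors for every layer/head/position, per parallel slot."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K + V
    return per_token * num_ctx * parallel

# Assumed qwen2.5-7B geometry: 28 layers, 4 KV heads, head_dim 128.
# q8_0 is roughly 1 byte per element (block overhead ignored).
print(f"{kv_cache_bytes(28, 4, 128, 1, 1.0) / 1024:.0f} KiB per token")
print(f"{kv_cache_bytes(28, 4, 128, 4096, 1.0) / 2**30:.2f} GiB at 4k ctx")
print(f"{kv_cache_bytes(28, 4, 128, 32768, 1.0) / 2**30:.2f} GiB at 32k ctx")
```

Under those assumptions, growing from 4k to 32k context costs a bit under 1 GB, in the same range as the RSS delta between the two 7B rows in the verified-numbers table below.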
4. Eight scenarios — when does the prefix cache survive?
These come from a real measurement suite: same model, same 15,700-token base prompt; only what changes between turns is varied. Each row lists the call sequence's cache outcomes left to right.
| # | Scenario | What varies | Cache outcomes (left to right) |
|---|---|---|---|
| 1 | Cold prime (warm residual after unload) | model just loaded, KV empty — baseline | MISS 82 s |
| 2 | Plain chat append (3 turns, same session) | fresh Node each turn, ollama keeps KV warm | HIT 445 ms, HIT 482 ms, HIT 442 ms |
| 3 | Tool-turn attempt | main agent has no tool bindings — emits text only | HIT 442 ms |
| 4 | Retrieval inlined into the user message (RAG) | prompt grows by ~4 KB of retrieved context | HIT 232 ms |
| 5 | Session growth — 6 turns piling up | does the framework residual grow with session size? | HIT ×6, final hit 133 ms |
| 6 | Retrieval at system msg (template late-insert) | RAG snippet inserted into the system message instead of the user turn | SURPRISE MISS 129 s |
| 7 | Search + fetch + write (multi-tool agent) | 5 LLM calls + 4 tool calls, each call extends the suffix | HIT, tool, HIT, tool, HIT, tool, HIT |
| 8 | Agentic flow + auto-compaction + followup | compaction summarizes the tool chain between turn 0 and turn 1 | turn-0 HIT, turn-1 MISS 135 s |

Legend: HIT = cache hit (prefix reused) · MISS = full miss (cold prefill) · SURPRISE MISS = surprise miss (top-of-prompt mutation) · tool = tool call.
5. The 35× spike — what auto-compaction does to the cache
In scenario 8, the prefix cache survives turn 0 because each LLM call appends tokens to the same
growing message list. Between turn 0 and turn 1, however, OpenClaw's runtime compacts
the session: the chain of tool-call / tool-result messages is replaced with a short summary message.
The new prompt prefix is not the prefix ollama just KV-cached — it's a different sequence
from the system message onward.
Empirically the prefill latency on the followup turn jumps from 3.9 s to 135 s — a
~35× spike, sitting ~24× off the trend line that fits every other partial-hit
data point in the suite. That's the cleanest demonstration in the dataset of a compaction-induced cache
miss, and it's the thing the optimization work targets.
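A minimal sketch of why the prefix dies, using a toy whitespace tokenizer and invented message contents; the real runtime matches on the model tokenizer's token ids and the real summary is longer, but the shape of the failure is the same: the first divergence lands immediately after the system message, so essentially the whole prompt has to be re-prefilled.

```python
def tokens(messages):
    # Toy tokenizer: whitespace split over the rendered prompt. Real prefix
    # matching happens on the model tokenizer's token ids.
    return " ".join(f"[{m['role']}] {m['content']}" for m in messages).split()

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

system = {"role": "system", "content": "You are a careful coding agent."}
turn0 = [
    system,
    {"role": "user", "content": "find the failing test and fix it"},
    {"role": "assistant", "content": "calling search(tests)"},
    {"role": "tool", "content": "FAILED test_cache.py::test_prefix"},
    {"role": "assistant", "content": "calling read(test_cache.py)"},
    {"role": "tool", "content": "def test_prefix(): ..."},
    {"role": "assistant", "content": "patched the off-by-one, tests pass"},
]
# Auto-compaction: everything after the system message is replaced by a short
# summary message, then the follow-up user turn is appended.
turn1 = [
    system,
    {"role": "assistant", "content": "summary: fixed off-by-one in test_cache.py"},
    {"role": "user", "content": "now run the full suite"},
]

cached, fresh = tokens(turn0), tokens(turn1)
lcp = common_prefix_len(cached, fresh)
print(f"matched prefix: {lcp} of {len(fresh)} tokens")  # stops right after the system message
print(f"tokens to re-prefill: {len(fresh) - lcp}")
```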
6. Try it — simulate a turn
[Interactive simulator: reports TTFT, prefill rate, decode rate, and total wall-clock for the simulated turn.]
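The widget boils down to four numbers. A minimal sketch of the same arithmetic under the simple model used throughout this post: TTFT is the framework residual plus uncached prompt tokens over the prefill rate, and total wall-clock adds output tokens over the decode rate. The values plugged in below are illustrative, not measurements.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    prompt_tokens: int   # full prompt length after assembly
    cached_prefix: int   # tokens covered by the prompt/prefix cache
    output_tokens: int   # tokens the model will decode
    prefill_tps: float   # prefill rate (tok/s)
    decode_tps: float    # decode rate (tok/s)
    framework_s: float   # Node/plugin/session residual before the LLM call

def simulate(t: Turn):
    prefill_s = (t.prompt_tokens - t.cached_prefix) / t.prefill_tps
    decode_s = t.output_tokens / t.decode_tps
    ttft = t.framework_s + prefill_s
    return {
        "TTFT (s)": round(ttft, 1),
        "prefill rate (tok/s)": t.prefill_tps,
        "decode rate (tok/s)": t.decode_tps,
        "total wall-clock (s)": round(ttft + decode_s, 1),
    }

# Illustrative warm-cache turn: 8,000-token prompt, nearly all of it cached.
print(simulate(Turn(8_000, 7_800, 200, 80, 5.0, 0.5)))
# Same turn after a compaction-style miss: no prefix matches.
print(simulate(Turn(8_000, 0, 200, 80, 5.0, 0.5)))
```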
7. Verified numbers
Measured on M3 MacBook Air, 24 GB unified, ollama 0.23.2 (OLLAMA_FLASH_ATTENTION=1,
OLLAMA_KV_CACHE_TYPE=q8_0). Full suite in
SoC-LAB/openclaw_test/docs/memory_budget_findings.md.
| Scenario | Model | num_ctx | parallel | peak RSS (GB) | prefill (tok/s) | decode (tok/s) | p50 TTFT (s) |
|---|---|---|---|---|---|---|---|
| S1 | 3B | 4096 | 1 | 2.11 | 416 | 37.2 | 7.2 |
| S3 | 7B | 4096 | 1 | 4.67 | 83 | 5.2 | 34.1 |
| S5 | 7B | 32768 | 1 | 5.57 | 76 | 5.1 | 64.3 |
| S8 | 14B | 32768 | 1 | 11.39 | 36 | 2.6 | 135.5 |
| S9 | 14B | 32768 | 2 | 14.18 | 23 | 2.4 | 210.4 |
| S11 | 14B | 32768 | 4 | 15.36 | 9 | 1.8 | 517.9 |
The 14B / 32k page-compressor cliff. S10 and S11 predict 18.3 / 21.3 GB of resident set, but the kernel hands back only 14.7 / 15.4 GB; the missing 4–6 GB has been compressed by the macOS page compressor. The predicted-vs-observed gap reaches 28 %, and that configuration marks the hard ceiling for what 24 GB of unified memory can serve under this workload. The same effect explains why the demo above caps at 14B / 32k.
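The ceiling itself is weights-plus-KV arithmetic. A rough sketch; with the inputs assumed here, the parallel-4 case comes out around 22 GB, near the 21.3 GB prediction quoted above and well past what 24 GB of unified memory can hold once the OS takes its share. The 14B geometry (48 layers, 8 KV heads, head dim 128), the ~9 GB q4_K_M weight file, and the flat runtime allowance are my assumptions, not figures from the suite.

```python
GIB = 2**30

def predicted_rss_gib(weights_gib, n_layers, n_kv_heads, head_dim,
                      num_ctx, parallel, kv_bytes_per_elem=1.0, overhead_gib=1.0):
    """Weights + per-slot KV cache + a flat runtime allowance (all assumed)."""
    kv = 2 * n_layers * n_kv_heads * head_dim * kv_bytes_per_elem * num_ctx * parallel
    return weights_gib + kv / GIB + overhead_gib

# Assumed 14B geometry (48 layers, 8 KV heads, head_dim 128), ~9 GiB q4_K_M
# weights, q8_0 KV (~1 byte/element), 1 GiB flat overhead.
for parallel in (1, 2, 4):
    print(f"parallel={parallel}: {predicted_rss_gib(9.0, 48, 8, 128, 32768, parallel):.1f} GiB predicted")
```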