Edge LLM serving, from the moment a user message lands to the moment a token streams back —
with the prompt cache, the KV cache, and the auto-compaction event that breaks them.
Workload: OpenClaw agent --local driving ollama with qwen2.5:7b-instruct-q4_K_M
on a 24 GB M3 MacBook Air. Numbers below come from real traces collected as part of my WukLab edge-inference work.
1. The five phases of a single turn
A turn isn't just a forward pass. The bulk of the wall-clock budget on the first invocation lives
outside the model: Node startup, plugin load, provider discovery, auth, lane admission,
session deserialization, prompt assembly. The Gantt below is from a real cold-prime trace.
The framework residual. Phase 1 alone is ~6 s every time the CLI starts — because every
turn is a fresh Node process. A daemon-mode invocation (channel inside the long-lived gateway)
collapses the framework overhead to ~400–650 ms per turn. That delta is one of the levers
the edge-inference work optimizes against.
2. What the LLM phase actually does
Inside Phase 4, ollama itself splits the work into three sub-phases: model load (near zero when the weights are already resident), prefill over the prompt, and token-by-token decode. The KV cache is built during prefill
and is what makes every subsequent token cheap.
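The same split is visible in ollama's response metadata: a non-streaming /api/generate call returns load_duration, prompt_eval_duration, and eval_duration (all in nanoseconds) plus token counts, which is enough to reconstruct per-phase rates. A minimal sketch of reading them out; the host, model tag, and prompt below are placeholders, not values from the traces above.

```python
import json
import urllib.request

# One non-streaming generate call against a local ollama instance.
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "qwen2.5:7b-instruct-q4_K_M",
        "prompt": "Summarize the last tool result in one sentence.",
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
resp = json.loads(urllib.request.urlopen(req).read())

NS = 1e9  # ollama reports durations in nanoseconds
load_s    = resp.get("load_duration", 0) / NS         # sub-phase 1: model load
prefill_s = resp.get("prompt_eval_duration", 0) / NS  # sub-phase 2: prefill, builds the KV cache
decode_s  = resp.get("eval_duration", 0) / NS         # sub-phase 3: decode, one token per step

print(f"load    {load_s:6.2f} s")
print(f"prefill {prefill_s:6.2f} s  ({resp.get('prompt_eval_count', 0) / max(prefill_s, 1e-9):.0f} tok/s)")
print(f"decode  {decode_s:6.2f} s  ({resp.get('eval_count', 0) / max(decode_s, 1e-9):.0f} tok/s)")
```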
3. Two caches, one frame
| Cache | What it stores | Granularity | Hit means | Miss costs |
|---|---|---|---|---|
| Prompt cache (a.k.a. prefix cache) | Last run's token sequence + the KV it produced, in DRAM | Per position — matches the longest common prefix between the new prompt and the cached one | Skip prefill for the matched prefix entirely (0 ms for that range) | Recompute K, V for every diverging token (prefill at ~80 tok/s on 7B) |
| KV cache | K and V tensors for every layer / head / position, in DRAM (q8 quantized here) | Per token × layer × head | Decode step reads existing K, V instead of recomputing attention from scratch | Without it, each decode step is O(L²) instead of O(L) — not deployable |
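The KV-cache row above also pins down the memory cost: two tensors (K and V) per layer, per KV head, per position. A back-of-envelope sketch; the qwen2.5-7B geometry assumed here (28 layers, 4 KV heads via GQA, head dim 128) and the ~1 byte/element for q8_0 are my assumptions, not numbers from the traces.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, num_ctx, bytes_per_elem, parallel=1):
    """K and V tensors for every layer/head/position, per parallel slot."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K + V
    return per_token * num_ctx * parallel

# Assumed qwen2.5-7B geometry: 28 layers, 4 KV heads, head_dim 128.
# q8_0 is roughly 1 byte per element (block overhead ignored).
print(f"{kv_cache_bytes(28, 4, 128, 1, 1.0) / 1024:.0f} KiB per token")
print(f"{kv_cache_bytes(28, 4, 128, 4096, 1.0) / 2**30:.2f} GiB at 4k ctx")
print(f"{kv_cache_bytes(28, 4, 128, 32768, 1.0) / 2**30:.2f} GiB at 32k ctx")
```

Under those assumptions, growing from 4k to 32k context costs a bit under 1 GB, in the same range as the RSS delta between the two 7B rows in the verified-numbers table below.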
4. Eight scenarios — when does the prefix cache survive?
These come from a real measurement suite: same model, same 15,700-token base prompt; only what changes between turns is varied. Each row lists the call sequence's cache outcomes left to right.
| # | Scenario | What varies | Cache outcomes (left to right) |
|---|---|---|---|
| 1 | Cold prime (warm residual after unload) | model just loaded, KV empty — baseline | MISS 82 s |
| 2 | Plain chat append (3 turns, same session) | fresh Node each turn, ollama keeps KV warm | HIT 445 ms, HIT 482 ms, HIT 442 ms |
| 3 | Tool-turn attempt | main agent has no tool bindings — emits text only | HIT 442 ms |
| 4 | Retrieval inlined into the user message (RAG) | prompt grows by ~4 KB of retrieved context | HIT 232 ms |
| 5 | Session growth — 6 turns piling up | does the framework residual grow with session size? | HIT ×6, final hit 133 ms |
| 6 | Retrieval at system msg (template late-insert) | RAG snippet inserted into the system message instead of the user turn | SURPRISE MISS 129 s |
| 7 | Search + fetch + write (multi-tool agent) | 5 LLM calls + 4 tool calls, each call extends the suffix | HIT, tool, HIT, tool, HIT, tool, HIT |
| 8 | Agentic flow + auto-compaction + followup | compaction summarizes the tool chain between turn 0 and turn 1 | turn-0 HIT, turn-1 MISS 135 s |

Legend: HIT = cache hit (prefix reused) · MISS = full miss (cold prefill) · SURPRISE MISS = surprise miss (top-of-prompt mutation) · tool = tool call.
5. The 35× spike — what auto-compaction does to the cache
In scenario 8, the prefix cache survives turn 0 because each LLM call appends tokens to the same
growing message list. Between turn 0 and turn 1, however, OpenClaw's runtime compacts
the session: the chain of tool-call / tool-result messages is replaced with a short summary message.
The new prompt prefix is not the prefix ollama just KV-cached — it's a different sequence
from the system message onward.
Empirically the prefill latency on the followup turn jumps from 3.9 s to 135 s — a
~35× spike, sitting ~24× off the trend line that fits every other partial-hit
data point in the suite. That's the cleanest demonstration in the dataset of a compaction-induced cache
miss, and it's the thing the optimization work targets.
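A minimal sketch of why the prefix dies, using a toy whitespace tokenizer and invented message contents; the real runtime matches on the model tokenizer's token ids and the real summary is longer, but the shape of the failure is the same: the first divergence lands immediately after the system message, so essentially the whole prompt has to be re-prefilled.

```python
def tokens(messages):
    # Toy tokenizer: whitespace split over the rendered prompt. Real prefix
    # matching happens on the model tokenizer's token ids.
    return " ".join(f"[{m['role']}] {m['content']}" for m in messages).split()

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

system = {"role": "system", "content": "You are a careful coding agent."}
turn0 = [
    system,
    {"role": "user", "content": "find the failing test and fix it"},
    {"role": "assistant", "content": "calling search(tests)"},
    {"role": "tool", "content": "FAILED test_cache.py::test_prefix"},
    {"role": "assistant", "content": "calling read(test_cache.py)"},
    {"role": "tool", "content": "def test_prefix(): ..."},
    {"role": "assistant", "content": "patched the off-by-one, tests pass"},
]
# Auto-compaction: everything after the system message is replaced by a short
# summary message, then the follow-up user turn is appended.
turn1 = [
    system,
    {"role": "assistant", "content": "summary: fixed off-by-one in test_cache.py"},
    {"role": "user", "content": "now run the full suite"},
]

cached, fresh = tokens(turn0), tokens(turn1)
lcp = common_prefix_len(cached, fresh)
print(f"matched prefix: {lcp} of {len(fresh)} tokens")  # stops right after the system message
print(f"tokens to re-prefill: {len(fresh) - lcp}")
```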
6. Try it — simulate a turn
[Interactive simulator: reports TTFT, prefill rate, decode rate, and total wall-clock for the simulated turn.]
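The widget boils down to four numbers. A minimal sketch of the same arithmetic under the simple model used throughout this post: TTFT is the framework residual plus uncached prompt tokens over the prefill rate, and total wall-clock adds output tokens over the decode rate. The values plugged in below are illustrative, not measurements.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    prompt_tokens: int   # full prompt length after assembly
    cached_prefix: int   # tokens covered by the prompt/prefix cache
    output_tokens: int   # tokens the model will decode
    prefill_tps: float   # prefill rate (tok/s)
    decode_tps: float    # decode rate (tok/s)
    framework_s: float   # Node/plugin/session residual before the LLM call

def simulate(t: Turn):
    prefill_s = (t.prompt_tokens - t.cached_prefix) / t.prefill_tps
    decode_s = t.output_tokens / t.decode_tps
    ttft = t.framework_s + prefill_s
    return {
        "TTFT (s)": round(ttft, 1),
        "prefill rate (tok/s)": t.prefill_tps,
        "decode rate (tok/s)": t.decode_tps,
        "total wall-clock (s)": round(ttft + decode_s, 1),
    }

# Illustrative warm-cache turn: 8,000-token prompt, nearly all of it cached.
print(simulate(Turn(8_000, 7_800, 200, 80, 5.0, 0.5)))
# Same turn after a compaction-style miss: no prefix matches.
print(simulate(Turn(8_000, 0, 200, 80, 5.0, 0.5)))
```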
7. Verified numbers
Measured on M3 MacBook Air, 24 GB unified, ollama 0.23.2 (OLLAMA_FLASH_ATTENTION=1,
OLLAMA_KV_CACHE_TYPE=q8_0). Full suite in
SoC-LAB/openclaw_test/docs/memory_budget_findings.md.
| Scenario | Model | num_ctx | parallel | peak RSS (GB) | prefill (tok/s) | decode (tok/s) | p50 TTFT (s) |
|---|---|---|---|---|---|---|---|
| S1 | 3B | 4096 | 1 | 2.11 | 416 | 37.2 | 7.2 |
| S3 | 7B | 4096 | 1 | 4.67 | 83 | 5.2 | 34.1 |
| S5 | 7B | 32768 | 1 | 5.57 | 76 | 5.1 | 64.3 |
| S8 | 14B | 32768 | 1 | 11.39 | 36 | 2.6 | 135.5 |
| S9 | 14B | 32768 | 2 | 14.18 | 23 | 2.4 | 210.4 |
| S11 | 14B | 32768 | 4 | 15.36 | 9 | 1.8 | 517.9 |
The 14B / 32k page-compressor cliff. S10 and S11 predict 18.3 / 21.3 GB of resident set, but the kernel hands back only 14.7 / 15.4 GB; the missing 4–6 GB has been compressed by the macOS page compressor. The predicted-vs-observed gap reaches 28 %, and that configuration marks the hard ceiling for what 24 GB of unified memory can serve under this workload. The same effect explains why the demo above caps at 14B / 32k.
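The ceiling itself is weights-plus-KV arithmetic. A rough sketch; with the inputs assumed here, the parallel-4 case comes out around 22 GB, near the 21.3 GB prediction quoted above and well past what 24 GB of unified memory can hold once the OS takes its share. The 14B geometry (48 layers, 8 KV heads, head dim 128), the ~9 GB q4_K_M weight file, and the flat runtime allowance are my assumptions, not figures from the suite.

```python
GIB = 2**30

def predicted_rss_gib(weights_gib, n_layers, n_kv_heads, head_dim,
                      num_ctx, parallel, kv_bytes_per_elem=1.0, overhead_gib=1.0):
    """Weights + per-slot KV cache + a flat runtime allowance (all assumed)."""
    kv = 2 * n_layers * n_kv_heads * head_dim * kv_bytes_per_elem * num_ctx * parallel
    return weights_gib + kv / GIB + overhead_gib

# Assumed 14B geometry (48 layers, 8 KV heads, head_dim 128), ~9 GiB q4_K_M
# weights, q8_0 KV (~1 byte/element), 1 GiB flat overhead.
for parallel in (1, 2, 4):
    print(f"parallel={parallel}: {predicted_rss_gib(9.0, 48, 8, 128, 32768, parallel):.1f} GiB predicted")
```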