Holo3-35B-A3B · OSWorld-Verified · NUC16: KV-cache & accelerator study

Why prefill dominates, and whether the iGPU / 96 GB can help

The question

The Holo3 / OSWorld run is served entirely on one Intel Panther Lake NUC (Core Ultra 7 356H, 96 GB unified, Xe3 iGPU, NPU 5). It is prefill-bound: ~89% of compute time is prompt processing, not generation. Two levers were measured against that wall — (1) managing the KV cache so the prefill stops being re-done every step, and (2) offloading to the iGPU / spending the 96 GB. All numbers below are measured on the live machine (llama.cpp b9535, Holo3-35B-A3B i1-Q4_K_M).

Headline. The prefill wall is quadratic in steps — an artifact of how the agent edits its own context, not of the model or the hardware. Making the conversation append-only turns it linear (a 4–10× prefill reduction, runtime-agnostic), and it is affordable precisely because of the 96 GB: this hybrid model keeps only 10 of 40 layers in the position-indexed KV cache, so its full 262K-token context costs only ~5 GB of KV — the box can simply never evict. The iGPU, separately, is 2.5× faster at prefill but is blocked by two concrete correctness bugs, so CPU remains the fastest correct path today.

1 · The prefill wall: KV reuse collapses every step

Each agent step sends a screenshot and the model emits a short action (a JSON tool call). Decode is tiny (~280 tokens/step); the cost is prompt processing. With a working prefix cache, step N should reuse step N−1's KV and prefill only the new tokens. It does not — and the per-step trace of one 100-step task shows exactly why:

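| step | prompt total | KV reused (cache_n) | re-prefilled | reuse |
|------|--------------|---------------------|--------------|-------|
| 2    | 5,233        | 3,003               | 2,230        | 0.57  |
| 3    | 7,479        | 5,229               | 2,250        | 0.70  |
| 4    | 7,688        | 947                 | 6,741        | 0.12  |
| 50   | 17,233       | 947                 | 16,286       | 0.055 |
| 100  | 27,246       | 947                 | 26,299       | 0.035 |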

From step 4 onward, cache_n is frozen at exactly 947 — the static system+tools preamble — while the prompt grows to 27k. Reuse decays as 947 / total. The cause is in the agent loop: to honor H Company's max_images=3 budget, older screenshots are rewritten in place to the text "[screenshot evicted]". In a causal transformer, mutating a token that sits early in the context invalidates the KV of everything after it — so the longest reusable prefix collapses back to the 947-token preamble that precedes the first screenshot, every step.
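The collapse can be reproduced with a toy token list. The sketch below is illustrative only: one list item stands in for a whole block of tokens, the 4-item preamble stands in for the 947-token one, and `build_prompt` / `reusable_prefix` are hypothetical helpers, not the agent's or llama.cpp's code. It compares the first eviction step (3 to 4) under in-place eviction and under an append-only history.

```python
from typing import List

PREAMBLE = ["<sys>"] * 4   # toy stand-in for the 947-token system+tools preamble
MAX_IMAGES = 3             # screenshot budget from the text
PLACEHOLDER = "[screenshot evicted]"


def build_prompt(step: int, evict_in_place: bool) -> List[str]:
    """Toy prompt sent at `step`: preamble, then (screenshot, action) per past
    step, ending with the current screenshot. One list item = one 'token'."""
    tokens = list(PREAMBLE)
    for s in range(1, step + 1):
        too_old = s <= step - MAX_IMAGES
        tokens.append(PLACEHOLDER if (evict_in_place and too_old) else f"<img{s}>")
        if s < step:
            tokens.append(f"<act{s}>")
    return tokens


def reusable_prefix(prev: List[str], cur: List[str]) -> int:
    """Length of the longest common prefix; KV past the first mismatch is stale."""
    n = 0
    for a, b in zip(prev, cur):
        if a != b:
            break
        n += 1
    return n


for step in (3, 4):
    for evict in (True, False):
        prev, cur = build_prompt(step - 1, evict), build_prompt(step, evict)
        kept = reusable_prefix(prev, cur)
        mode = "in-place eviction" if evict else "append-only"
        print(f"step {step} {mode}: reuse {kept}/{len(cur)} tokens")
```

At step 4 the evicting variant shares only the preamble with the previous prompt, while the append-only variant shares the entire previous prompt. The toy does not try to reproduce why the measured cache_n stays at exactly 947 on later steps; it only shows where the first mismatch lands.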

The real shape of the cost. Because the whole suffix is re-prefilled each step, total prefill over a task is the sum of the per-step re-prefilled counts, which grows quadratically with step count: a 100-step task pays for ~1.6M prefilled tokens. One screenshot alone is ~2,014 tokens, so the context fills fast and the per-step bill keeps climbing (from ~75 s early to 400 s+ on CPU as the context fills; this is the ~7.7 min/step seen on cap-bound tasks). Across 113 scored tasks, model inference is 98% of wall-clock, and ~89% of that is prefill. This is the single biggest latency lever.
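As a rough check on the ~1.6M figure, the five measured rows of the trace table can be summed with linear interpolation between them. This is a sketch under two assumptions that are not in the measurements: growth between measured rows is linear, and a working prefix cache would cost about the average of steps 2 and 3 (where reuse still works) on every step.

```python
# (step, re-prefilled tokens) copied from the per-step trace table
MEASURED = [(2, 2230), (3, 2250), (4, 6741), (50, 16286), (100, 26299)]


def interpolate(step: int) -> float:
    """Piecewise-linear estimate of re-prefilled tokens at `step`
    (assumption: growth between measured rows is linear)."""
    for (s0, v0), (s1, v1) in zip(MEASURED, MEASURED[1:]):
        if s0 <= step <= s1:
            return v0 + (v1 - v0) * (step - s0) / (s1 - s0)
    raise ValueError(f"step {step} is outside the measured range")


steps = range(2, 101)
evicting_total = sum(interpolate(s) for s in steps)

# Assumption: with a stable prefix, each step prefills only the new delta,
# taken here as the mean of the two steps where reuse still worked.
delta = (2230 + 2250) / 2
stable_total = delta * len(steps)
ratio = evicting_total / stable_total

print(f"in-place eviction: ~{evicting_total / 1e6:.2f}M prefilled tokens")
print(f"stable prefix:     ~{stable_total / 1e3:.0f}k prefilled tokens")
print(f"ratio:             ~{ratio:.1f}x")
```

The interpolated sum lands near 1.6M tokens, and the stable-prefix total near 0.2M, which is where the several-fold prefill reduction comes from.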

1b · Where a task's wall-clock actually goes (per-step timeline)

In the timeline style of OSWorld-Human (Abhyankar, Qi & Zhang), here is one real successful OSWorld task (chrome, 8 steps, score 1.0) decomposed per step into prefill, decode, and action. OSWorld-Human reports that "each successive step can take 3× longer than steps at the beginning of a task" — that is exactly what the top track shows, and the cause here is mechanical: at step 4 the 3-screenshot eviction breaks the KV prefix, so prefill triples (≈59 s → ≈175 s) and never recovers.

[Figure: per-step timeline of one OSWorld task (8 steps, success). x-axis: wall-clock seconds, 0 to 1,200. Top track: measured (current), 1,200 s. Bottom track: KV-stable (projected), 622 s (1.9× shorter). Annotation at step 4: KV reuse collapses, prefill 3×. Legend: prefill (prompt processing, incl. screenshot encode); decode (reasoning + action); action (VM execution).]

Prefill is 89% of model time here (1,048 s of 1,200 s). The lower track projects the append-only KV-stable case, where steps 4–8 prefill only the ~2k-token delta instead of re-processing the whole window: the task contracts ~1.9× (1,200→622 s) and the per-step cliff disappears. (The terminal-slot scheme of §2b instead holds prefill at a constant ~6k/step — a smaller win on an 8-step task like this, but the decisive one on long, high-step tasks where today's re-prefill balloons to 27k.) Unlike OSWorld-Human's multi-module agents (where separate planning / reflection / judging model calls dominate), Holo3 is one model call per step — so the whole latency is a single prefill+decode, and KV-cache management is the latency lever.

2 · Managing the KV cache for OSWorld

The model is qwen3_5_moe — a hybrid: of its 40 layers, only the 10 full-attention layers hold a position-indexed KV cache; the other 30 are Gated-DeltaNet / linear-attention layers carrying an O(1) recurrent state (one running summary per sequence, not per-token KV). That single fact reshapes every option.

Option A: --cache-reuse (KV-shift the reusable prefix). Status: ruled out. Verified inert in the llama.cpp source. It is disabled at startup because the model is multimodal (server-context.cpp:997) and refused per request because every prompt carries image tokens (:2716, can_cache_reuse = can_shift && !has_mtmd). Even text-only it would be unsound: the hybrid memory reports can_shift=true by delegating to its attention half (llama-memory-hybrid.cpp:133), but the recurrent seq_add only bumps a single tail cell's position (llama-memory-recurrent.cpp:304). It cannot reconstruct a gapped-prefix running state, so it would silently corrupt 30 of 40 layers. KV-shift is meaningful only for the 10 attention layers.

Option B: append-only context, never evict. Status: the lever. The collapse is self-inflicted by the in-place eviction. If the history is append-only (old screenshots stay where they are, and each step only adds the new screenshot + action), the prefix stays byte-identical and the KV (attention and recurrent state) is reused. Per-step prefill drops from the 8–27k re-prefill to the ~2k-token delta; total prefill goes from quadratic to linear (~200k vs ~1.6M tokens on a 100-step task): a 4–10× cut, on CPU, with no new hardware.

Why 96 GB is what unlocks it. Append-only means the context can grow to a whole task (~2k tok/step × 100 ≈ 200k tokens). Holo3 supports 262K. The usual objection is KV memory — but here only 10/40 layers are cached, so a full 262K window is ~5 GB of KV, not tens of GB. The 96 GB unified pool holds the 20 GB weights + a 262K context + the OSWorld VM with room to spare, so the agent can simply never evict for any OSWorld task. The memory budget is what converts "stop evicting" from impossible to free.
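The ~5 GB figure can be sanity-checked with the usual KV-size formula. The head configuration below (2 KV heads of dimension 256, f16 cache entries) is an assumption chosen for illustration and is not stated in this write-up; only the layer counts and the 262K window come from the text.

```python
def kv_cache_bytes(n_ctx: int, n_cached_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of the position-indexed KV cache: one K and one V vector per
    cached layer, per KV head, per context position."""
    per_token = 2 * n_cached_layers * n_kv_heads * head_dim * bytes_per_elem
    return n_ctx * per_token


N_CTX = 262_144        # the 262K window from the text
N_KV_HEADS = 2         # ASSUMPTION: not given in the source
HEAD_DIM = 256         # ASSUMPTION: not given in the source

hybrid = kv_cache_bytes(N_CTX, 10, N_KV_HEADS, HEAD_DIM)   # only the 10 attention layers
all_40 = kv_cache_bytes(N_CTX, 40, N_KV_HEADS, HEAD_DIM)   # if every layer kept per-token KV

GIB = 1024 ** 3
print(f"10 cached layers: {hybrid / GIB:.1f} GiB")
print(f"40 cached layers: {all_40 / GIB:.1f} GiB")
```

Under these assumed head sizes the 10-layer cache comes out at 5 GiB for the full window, and caching all 40 layers would cost four times as much. The recurrent layers' fixed-size state is not counted here because it does not grow with context length.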

The honest caveat. max_images=3 is H Company's documented protocol — the 80.4%-leaderboard config. Keeping every screenshot deviates from it, so the accuracy effect (does Holo3 do as well, better, or worse with a long visual history?) is an open A/B question, not a settled win. The current run is a pure replication and is left untouched; this is the next-run experiment. Cheaper variants on the same axis: lower screenshot resolution (fewer image tokens), or summarize-then-drop old turns at a stable boundary instead of mutating them — both trade fidelity for prefill and both need the same A/B.

2b · Can screenshots live in swappable KV slots?

A sharper idea than "never evict": lay the prompt out as [preamble][img-slot-1][img-slot-2][img-slot-3][text] and each step swap the oldest screenshot out of its slot, prefilling only the new image + appended text while reusing the rest of the KV. On a standard transformer this is a real, published technique — at a measured cost:

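| method                  | mechanism                                        | recompute  | quality cost              |
|-------------------------|--------------------------------------------------|------------|---------------------------|
| Prompt Cache (MLSys'24) | position-anchored modules; mask cross-attention  | 0%         | <1 pt (if self-contained) |
| CacheBlend (EuroSys'25) | recompute high-deviation tokens                  | 5–18%      | ~0.01 F1                  |
| EPIC                    | recompute chunk-boundary tokens (LegoLink)       | ~16–20 tok | 0–7%                      |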

But the literal "swap a middle slot" version is blocked on this model. Holo3 is hybrid: 30 of 40 layers are Gated-DeltaNet with a sequential recurrent state ($s_t = s_{t-1} \cdot g_t + k_t \cdot d_t$). Editing a token at position p invalidates every state after p, with no per-token KV to splice. Since the evicted screenshot sits near the front, a middle swap re-scans roughly the whole suffix (today's cost). Causal attention adds the same staleness on the 10 full-attention layers. (This is also why --cache-reuse / KV-shift is unsound here.)

The twist that makes it work — losslessly. The same strict causality cuts both ways: a state snapshot at position p is an exact summary of the prefix [0..p]. So put the ≤3 screenshots at the END, checkpoint the full state (attention KV + recurrent ssm/conv) at the image-region start, and each step restore the checkpoint and re-prefill only the current image window (~6k tokens, constant) — exact for all 40 layers, not approximate. llama.cpp already ships this: --ctx-checkpoints (on by default, n=32) with llama_state_seq_get/set_data_ext(PARTIAL_ONLY) serializing the recurrent state (server-context.cpp:2033, llama-memory-recurrent.cpp:864). No model change — agent/server orchestration only. The forced corrections to the idea: the slots must be terminal, and the mechanism is checkpoint-restore, not KV-pointer-swap.
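The "exact, not approximate" claim follows from the recurrence itself, and a scalar toy makes it concrete. The sketch below is not llama.cpp code: the state is a single float, each token is a hypothetical (gate, key, delta) triple, and `scan` simply applies the recurrence quoted above. It shows that restoring a snapshot taken at the start of the image region and prefilling only the new image window reproduces a from-scratch recompute, while editing an early token changes the final state.

```python
from typing import List, Tuple

Token = Tuple[float, float, float]  # toy (g_t, k_t, d_t) per token


def scan(state: float, tokens: List[Token]) -> float:
    """Apply s_t = s_{t-1} * g_t + k_t * d_t over `tokens`, starting from `state`."""
    for g, k, d in tokens:
        state = state * g + k * d
    return state


# Everything before the terminal image slots (preamble + text history).
prefix = [(0.9, 1.0, 0.5), (0.8, 0.5, 2.0), (0.95, 2.0, 0.25)]
# Image window for the next step: the oldest image dropped, a new one added.
new_images = [(0.6, 0.3, 4.0), (0.85, 1.2, 0.75)]

# Snapshot of the state at the position where the image region starts.
checkpoint = scan(0.0, prefix)

# Restore the snapshot and prefill only the new image window.
restored = scan(checkpoint, new_images)
# Reference: recompute the whole sequence from position 0.
full = scan(0.0, prefix + new_images)

# Contrast: mutate an early token in place; every later state changes.
edited_prefix = [(0.9, 1.0, 0.0)] + prefix[1:]
after_edit = scan(0.0, edited_prefix + new_images)

print("checkpoint restore matches full recompute:", restored == full)
print("early in-place edit changes final state:  ", after_edit != full)
```

The equality is exact because restoring and continuing performs the same operations in the same order as the full scan. A real implementation would also need to carry the attention KV for the same prefix alongside the recurrent state, which this toy leaves out.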

So the instinct ("only compute the swapped-in screenshot") is right. Terminal slots keep the 3-image count (close to the model's training regime) at a constant ~6k-tok/step with bounded memory; append-only is cheaper (~2k/step) but keeps every image and grows context. Both reorder the prompt vs the interleaved leaderboard layout, so both warrant an accuracy A/B. The distinction from --cache-reuse matters: checkpoint-restore replays the full exact state, whereas KV-shift fakes token positions and silently corrupts the recurrent half.

3 · iGPU (Xe3 / Vulkan) & the 96 GB throughput lever

The premise that the accelerators are unreachable turned out to be false: the Xe3 is usable through llama.cpp's Vulkan backend (Mesa, no intel-compute-runtime needed; OpenCL NEO and Level-Zero GPU+NPU runtimes are in fact now installed). Raw throughput, measured at full offload:

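| stage                 | CPU (live) | iGPU Vulkan    | ratio |
|-----------------------|------------|----------------|-------|
| prefill pp512         | ~47 t/s    | 124.4 t/s      | 2.6×  |
| prefill pp2048        | ~47 t/s    | 115.4 t/s      | 2.5×  |
| prefill pp8192        | ~47 t/s    | 120.7 t/s      | 2.6×  |
| decode (batch 1)      | 14.7 t/s   | 8–10 t/s       | 0.6×  |
| decode (4 concurrent) | 14.7 t/s   | 16.9 t/s (agg) | 1.2×  |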

Prefill — the bottleneck — is 2.5× faster on the iGPU (the Xe3's KHR_coopmat fp16 matrix cores win the big GEMMs). Decode is memory-bandwidth-bound and slower at batch 1, but scales with concurrency (B1→B4 = 8.5→16.9 t/s aggregate) — the 96 GB continuous-batching lever, which would need a parallel runner driving the 4 server slots (sublinear, ~B2–3 on this box). On a prefill-dominated workload the net would be ~2× end-to-end.
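The ~2× estimate is consistent with a simple Amdahl-style calculation over the measured ratios. This is a back-of-the-envelope sketch: it takes the 89% prefill share and the table's speedup ratios as given, ignores the small action-execution share of wall-clock, and treats the 4-concurrent aggregate decode rate as if it applied to the whole workload, which is a throughput view rather than a single-task latency.

```python
def end_to_end_speedup(prefill_frac: float, prefill_x: float, decode_x: float) -> float:
    """Amdahl-style estimate: each share of CPU model time is divided by the
    speedup of its stage, and the new total is compared with the old one."""
    decode_frac = 1.0 - prefill_frac
    return 1.0 / (prefill_frac / prefill_x + decode_frac / decode_x)


PREFILL_FRAC = 0.89   # share of model time spent in prefill, from the text

batch1 = end_to_end_speedup(PREFILL_FRAC, prefill_x=2.5, decode_x=0.6)
batch4 = end_to_end_speedup(PREFILL_FRAC, prefill_x=2.5, decode_x=1.2)

print(f"iGPU, decode at batch 1:       ~{batch1:.2f}x")
print(f"iGPU, decode at 4 concurrent:  ~{batch4:.2f}x")
```

Even with decode running slower at batch 1, the estimate stays a little under 2×, because decode is such a small share of the total; with concurrent decode it moves a little above 2×.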

But it is correctness-blocked today. The 2.5× requires full offload (-ngl 99), and on this Xe3/Vulkan stack two pieces break: (1) the 248k-vocab output projection garbles tokens (greedy decode answers "3." for "capital of France"); (2) the vision encoder crashes the GPU (vk::DeviceLostError in clip_image_batch_encode) — fatal, since every step needs vision. The only correct offload (output + vision on CPU) was measured slower than CPU on every axis. So CPU is the fastest correct path.

How narrow is the bug? llama.cpp's kernel test suite (test-backend-ops, synthetic tensors, no model) on Vulkan0: 947/947 MUL_MAT tests pass vs the CPU reference, and the broad sweep surfaced zero numerical failures — unimplemented ops merely report not supported → CPU fallback. So the corruption is not a general matmul defect; it is narrow (the untested very-large output shape, or an op fallback in the head), which makes it the kind of bug that gets fixed upstream — after which a ~2× GPU path opens. That is a research/upstream effort, not a config flag.

Bottom line

| lever                             | uses 96 GB?                | expected                        | status                                 |
|-----------------------------------|----------------------------|---------------------------------|----------------------------------------|
| Append-only / KV-stable context   | yes (262K ctx ≈ 5 GB KV)   | 4–10× prefill, runtime-agnostic | next-run A/B (accuracy)                |
| Terminal slots + state checkpoint | yes (bounded 3-img KV)     | constant ~6k/step (vs 8–27k)    | buildable now (--ctx-checkpoints); A/B |
| --cache-reuse                     |                            |                                 | ruled out (inert + unsound)            |
| iGPU Vulkan offload               | weights+KV in unified mem  | ~2× end-to-end                  | blocked (2 kernel bugs)                |
| Continuous batching               | yes (4× context)           | ~1.8–2.5× throughput (B2–3)     | needs parallel runner                  |

The actionable, hardware-agnostic win is KV-cache management: the OSWorld prefill wall is a quadratic artifact of in-place screenshot eviction, and the 96 GB unified memory — combined with the hybrid model's tiny KV footprint — makes an append-only, never-evict context affordable, turning the wall linear. The iGPU has the throughput to add a further ~2×, gated on two specific, now-localized llama.cpp/Mesa correctness bugs. Measured on the live NUC16 run; the replication itself is left untouched.