Holo3 · OSWorld · NUC16

Progress

–

tasks done / total

–

success rate

–

scored

–

elapsed

–

ETA remaining

–

mean task wall

official Holo3-35B-A3B leaderboard runs: 82.56 / 78.15 % (mean 80.4) at the same 100-step budget

Current task

idle

Live system telemetry

Serving aggregate

Success rate by domain

Package power (W) & CPU util — last ~3 h

Tasks

Device bring-up — why CPU (GPU/NPU tried first)

…

Root cause — why the NPU and iGPU can't serve this model

NPU (Intel NPU 5, OpenVINO) — blocked at the software-stack level, twice over:

Architecture gap. Holo3 is qwen3_5_moe: a hybrid Gated-DeltaNet (linear-attention/SSM) × 256-expert sparse MoE vision-language model. The NPU is only reachable through OpenVINO, and optimum-intel/OpenVINO cannot export this architecture — OpenVINO GenAI's VLM pipeline supports only dense Qwen3-VL; no MoE-VLM, no DeltaNet ops.
Hardware ceiling. NPU 5 is a static-graph accelerator for small models; ~20 GB of INT4 MoE weights plus a growing KV/SSM state in token-by-token agentic decode is the opposite of its design point.

iGPU (Xe3, llama.cpp Vulkan) — three independent, measured failure layers:

Vision encoder crashes the device. The CLIP/mmproj graph raises vk::DeviceLostError inside clip_image_batch_encode → GPU reset. Driver/kernel immaturity on brand-new Xe3 silicon (mesa anv).
Output head computes garbage. Full offload (-ngl 99) yields deterministic wrong tokens at temp 0; isolated to the 248,320-vocab lm_head matmul — all 40 transformer blocks alone are numerically correct (-ngl 40). A silent Vulkan kernel bug (likely workgroup/dimension overflow on the huge vocab matrix); narrower repro than upstream issue #21888.
No physics win even when correct. Decoding a 3B-active MoE is memory-bandwidth-bound, and the iGPU shares the same ~90 GB/s DDR5-5600 as the CPU — it adds dispatch overhead on scattered expert reads without adding bandwidth: 10 tok/s vs 19.9 tok/s on CPU. The iGPU cannot win this workload on this silicon even bug-free.

Proposed solution (concise). Now, on this box: stay on CPU and recover the broken KV prefix-cache — restructure the agent's message history so screenshot eviction stops invalidating the prefix (measured ~25K-token re-prefills at ~60 tok/s dominate step latency; est. 2–3× faster steps), plus a loop-detector that early-FAILs runaway episodes. Upstream: file the two minimal repros (lm_head-only corruption; clip DeviceLost) — either fix unlocks vision-prefill offload to the iGPU. Right long-term fix: heterogeneous placement once OpenVINO gains MoE-VLM support — vision tower on NPU, compute-bound prefill on iGPU, bandwidth-bound expert decode on CPU — or edge silicon with real bandwidth headroom (≥200 GB/s class) where the GPU path pays off.