Every step of the full 361-task OSWorld-Verified run of Holo3-35B-A3B on the Intel Panther Lake NUC (CPU-only llama.cpp) is instrumented end to end. The run reproduces the official leaderboard config (100-step budget, screenshot-only, single rollout) — but alongside the score it emits a step-indexed, replayable corpus of state, action, and measured cost.
Two granularities. The atomic record is one step (one model call); the unit a router actually
reasons over is one whole task — a trajectory of steps from the first screenshot to answer,
plus episode-level aggregates. The research target is an upper-layer model router that sends each step
to an appropriately-sized model. The intended heuristic was large model for early planning, small model for
the routine tail — and because the corpus is step-indexed, that hypothesis is directly testable.
We test it below, and the data refutes the simple version.
Measured so far: 933 steps across 63 completed tasks (median 8, mean 14.8, max 100 steps/task). Projected to the full 361-task set: roughly 8,000–15,000 rows.
One logical row = one agent step (one model call). It is assembled by joining the per-step sources below on
(domain, task_id, step) and attaching the 2 Hz telemetry samples that fall inside the step's ts window.
| source file | grain | contributes |
|---|---|---|
| raw/model_calls.jsonl | per step | tokens, server timings, tool, reasoning length, parse success |
| …/episode_profile.json → steps_profile[] | per step | latency split (predict / exec), 5-domain RAPL Joules |
| …/traj.jsonl | per step | action code, note, thought, screenshot filename, reward/done |
| raw/system_samples.csv | 2 Hz | host telemetry: CPU/GPU/NPU, power, memory, thermals |
Plus, on disk per step: the full-resolution screenshot PNG; per task: a screen recording (mp4) and the OSWorld config snapshot.
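The join described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the pipeline's actual code: the field names `ts_start`, `ts_end`, `predict_s`, and `pkg_w_mean` are assumptions, the records are assumed to be already parsed into dicts, and the telemetry numbers are made up for the example.

```python
from statistics import mean

def join_step_rows(model_calls, steps_profile, traj, telemetry):
    """Build one logical row per agent step from the per-step sources plus telemetry."""
    def index(records):
        return {(r["domain"], r["task_id"], r["step"]): r for r in records}

    profile_by_key = index(steps_profile)
    traj_by_key = index(traj)
    rows = []
    for call in model_calls:
        key = (call["domain"], call["task_id"], call["step"])
        row = dict(call)
        for source in (profile_by_key.get(key, {}), traj_by_key.get(key, {})):
            for field, value in source.items():
                # Skip nulls (steps_profile.tool is always null) and keep values already set.
                if value is not None:
                    row.setdefault(field, value)
        # Telemetry has no step key, so attach samples inside the step's time window.
        window = [s["pkg_w"] for s in telemetry
                  if row["ts_start"] <= s["ts"] < row["ts_end"]]
        row["pkg_w_mean"] = mean(window) if window else None
        rows.append(row)
    return rows

# Hypothetical, hand-made records for chrome/06fe7178 step 1 (timestamps and watts are illustrative).
step_key = {"domain": "chrome", "task_id": "06fe7178", "step": 1}
rows = join_step_rows(
    model_calls=[{**step_key, "tool": "key", "ts_start": 0.0, "ts_end": 1.0}],
    steps_profile=[{**step_key, "tool": None, "predict_s": 74.9}],
    traj=[{**step_key, "done": False}],
    telemetry=[{"ts": 0.0, "pkg_w": 16.0}, {"ts": 0.5, "pkg_w": 17.0}, {"ts": 1.0, "pkg_w": 30.0}],
)
print(rows[0]["tool"], rows[0]["predict_s"], rows[0]["pkg_w_mean"])
```

Taking the tool from `model_calls` first and ignoring null values means the always-null `steps_profile.tool` can never overwrite it.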
Example values are real, sampled from chrome/06fe7178 step 1
(task: "Can you make my computer bring back the last tab I shut down?").
The row above is one step. A task is the whole sequence of those rows for one instruction,
wrapped with episode-level aggregates — and it is the unit a router policy acts on. Below: the episode schema
(real values from chrome/030eeff7) and that task's complete 14-step trajectory.
chrome/030eeff7: every per-step quantity a step-level router consumes, for one full episode, grouped as
output · KV state · cost coefficients · label.
ctx = full context (prefilled + reused); reuse% = KV-hit rate (reused ÷ ctx);
split = prefill/decode share of the step (recoverable from the prefill s and decode s columns);
switch s = full re-prefill penalty if the router swapped models at that step (= ctx ÷ pf t/s);
score = task-level label (the only outcome signal; per-step quality is the missing
counterfactual). reasoning = <think> chars (task max: 1047).
Task goal: enable Chrome's "Do Not Track".
| step | tool | reasoning | pred | ctx | reused | reuse% | pf t/s | dec t/s | prefill s | decode s | switch s | kJ | score | what the step did |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | click | 1047 | 438 | 3007 | 0 | 0% | 44.2 | 16.7 | 68.0 | 26.2 | 68 | 4.8 | 1.0 | open ⋮ menu — plan route to Settings | |
| 2 | click | 364 | 237 | 5271 | 3003 | 57% | 36.1 | 15.1 | 62.8 | 15.7 | 146 | 4.1 | 1.0 | click Settings | |
| 3 | click | 517 | 291 | 7507 | 5267 | 70% | 34.5 | 14.5 | 64.9 | 20.1 | 218 | 4.4 | 1.0 | go to Privacy & security | |
| 4 | scroll | 988 | 436 | 7719 | 947 | 12% | 37.0 | 14.8 | 183.2 | 29.5 | 209 | 10.9 | 1.0 | DNT not visible — scroll to look | |
| 5 | click | 1018 | 409 | 7972 | 947 | 12% | 37.7 | 14.6 | 186.1 | 27.9 | 211 | 10.9 | 1.0 | still hidden — open Privacy guide | |
| 6 | click | 749 | 324 | 8196 | 947 | 12% | 38.4 | 14.6 | 188.8 | 22.1 | 213 | 10.9 | 1.0 | Next through guide | |
| 7 | click | 328 | 240 | 8394 | 947 | 11% | 38.9 | 14.5 | 191.7 | 16.5 | 216 | 10.8 | 1.0 | Next | |
| 8 | click | 246 | 230 | 8600 | 947 | 11% | 39.5 | 14.3 | 193.7 | 16.1 | 218 | 11.0 | 1.0 | Next (tersest step) | |
| 9 | click | 381 | 257 | 8810 | 947 | 11% | 40.0 | 14.3 | 196.5 | 17.9 | 220 | 11.2 | 1.0 | Next | |
| 10 | click | 480 | 299 | 9016 | 947 | 11% | 40.8 | 14.2 | 198.0 | 21.0 | 221 | 11.4 | 1.0 | guide done → back to settings | |
| 11 | click | 332 | 268 | 9249 | 947 | 10% | 41.3 | 14.1 | 201.0 | 19.0 | 224 | 11.5 | 1.0 | open Third-party cookies | |
| 12 | click | 333 | 277 | 9479 | 947 | 10% | 41.9 | 14.0 | 203.7 | 19.8 | 226 | 11.7 | 1.0 | found DNT toggle (Advanced) | |
| 13 | click | 523 | 268 | 9719 | 947 | 10% | 42.5 | 13.9 | 206.5 | 19.3 | 229 | 11.9 | 1.0 | Confirm dialog | |
| 14 | answer | 315 | 274 | 9912 | 947 | 10% | 42.9 | 13.7 | 209.1 | 19.9 | 231 | 12.1 | 1.0 | answer → DONE (toggle on) | |
Per-step time decomposed as prefill = prompt_tokens ÷ pf t/s and decode = predicted_tokens ÷ dec t/s; the remainder (HTTP + sampling) is <0.6 s every step. Task totals: prefill 2354 s (89%), decode 291 s (11%), other 6 s; 38.2 Wh.
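As a worked check of this decomposition, the sketch below recomputes the cost columns for step 2 of chrome/030eeff7 from its raw counters. It assumes that prompt_tokens is the uncached part of the context (ctx − reused), which is what the tabulated values imply; the function and key names are illustrative.

```python
def step_costs(ctx, reused, pred, pf_tps, dec_tps):
    """Derive the per-step cost coefficients from token counts and measured throughputs."""
    new_tokens = ctx - reused            # assumption: prompt_tokens = uncached tokens
    prefill_s = new_tokens / pf_tps
    decode_s = pred / dec_tps
    return {
        "reuse_pct": 100 * reused / ctx,
        "prefill_s": prefill_s,
        "decode_s": decode_s,
        "prefill_share": prefill_s / (prefill_s + decode_s),
        "switch_s": ctx / pf_tps,        # full re-prefill of the whole context
    }

# Step 2 of chrome/030eeff7, values taken from the table above.
step2 = step_costs(ctx=5271, reused=3003, pred=237, pf_tps=36.1, dec_tps=15.1)
for name, value in step2.items():
    print(f"{name}: {value:.2f}")
```

Rounded, this reproduces the table row: reuse 57%, prefill 62.8 s, decode 15.7 s, switch 146 s.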
A smaller model mainly raises decode speed (the dec t/s coefficient), but decode is only 11% here, so the ceiling on model-size routing is low
unless the small model also prefills cheaper. The dominant term is prefill = prompt_tokens ÷ pf t/s,
governed by context management (the reused column), not model choice. And the trap:
switching models mid-trajectory invalidates the KV cache (it is weight-specific), forcing a full
re-prefill of the whole context on the new model — at 89% prefill with reuse already fragile, a router that
flips models often can cost more than it saves. So the trainable objective is per-step difficulty
(does this step need the big model?) traded against switch cost (is the saving worth a re-prefill?),
with cost = {prefill_s, decode_s, energy} decomposed per candidate model.
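The switch-cost trade-off can be made concrete with a break-even count: how many steps the new model must serve before its per-step saving repays the re-prefill. The sketch below uses the measured step 8 of chrome/030eeff7 for the big model; the small-model throughputs are purely hypothetical (no second model was measured), and the penalty follows the switch s definition (ctx ÷ pf t/s), evaluated at the new model's prefill rate.

```python
import math

def switch_break_even_steps(ctx, new_pf_tps, big_step_s, small_step_s):
    """Steps the new model must serve before a switch repays its full re-prefill."""
    switch_s = ctx / new_pf_tps                  # whole context re-prefilled on the new model
    saving_per_step = big_step_s - small_step_s
    if saving_per_step <= 0:
        return None                              # the switch never pays back
    return math.ceil(switch_s / saving_per_step)

# Measured big-model step 8: ctx 8600, prefill 193.7 s + decode 16.1 s = 209.8 s.
# Hypothetical small model A: 2x faster at both prefill and decode.
both_faster = switch_break_even_steps(ctx=8600, new_pf_tps=79.0,
                                      big_step_s=209.8, small_step_s=104.9)
# Hypothetical small model B: 2x faster decode only (prefill unchanged at 39.5 t/s).
decode_only = switch_break_even_steps(ctx=8600, new_pf_tps=39.5,
                                      big_step_s=209.8, small_step_s=201.75)
print(both_faster, decode_only)
```

Under these assumed speedups the first case pays back after 2 steps and the second only after 28, which is the point made above: a router that only speeds up decode has little to offer unless the small model also prefills cheaper.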
| episode length | tasks | solve rate (score = 1.0) |
|---|---|---|
| 1–3 steps | 4 | 75% |
| 4–9 steps | 32 | 75% |
| 10–29 steps | 19 | 58% |
| 30–99 steps | 7 | 29% |
| 100 (capped) | 1 | 0% |
Solve rate collapses as tasks get longer — and long tasks are also where cost concentrates (latency rises with trajectory length). The long tail is the expensive, low-yield regime any efficiency layer most wants to act on.
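The bucketing behind this table is simple to reproduce from episode-level records. The sketch below is one way to do it, assuming each episode is reduced to a `(n_steps, score)` pair; the four example episodes are the ones quoted elsewhere on this page, not the full 63-task set.

```python
BUCKETS = [
    (1, 3, "1–3 steps"),
    (4, 9, "4–9 steps"),
    (10, 29, "10–29 steps"),
    (30, 99, "30–99 steps"),
    (100, 100, "100 (capped)"),
]

def solve_rate_by_length(episodes):
    """Return {bucket: (tasks, solve_rate)} with the strict criterion score == 1.0."""
    counts = {label: [0, 0] for _, _, label in BUCKETS}
    for n_steps, score in episodes:
        for lo, hi, label in BUCKETS:
            if lo <= n_steps <= hi:
                counts[label][0] += 1
                counts[label][1] += 1 if score == 1.0 else 0
                break
    return {label: (n, solved / n if n else None)
            for label, (n, solved) in counts.items()}

# (steps, score): chrome/06fe7178, the os Spotify task, chrome/030eeff7, chrome/99146c54.
example = [(2, 1.0), (7, 0.0), (14, 1.0), (100, 0.0)]
rates = solve_rate_by_length(example)
for label, (n, rate) in rates.items():
    print(label, n, rate)
```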
All 10 OSWorld application domains, across every completed task. prefill% /
decode% = share of compute time (prefill_s = prompt_tokens ÷ pf t/s, decode_s = predicted ÷
dec t/s); together the two columns give the prefill/decode split.
solve% is strict (score = 1.0). Sorted by prefill share.
| domain | done / total | mean steps | solve% | prefill% | decode% | mean reasoning_len |
|---|---|---|---|---|---|---|
| calc | 4 / 47 | 25.5 | 100% | 90.0 | 9.8 | 462 | |
| thunderbird | 3 / 15 | 19.0 | 100% | 89.2 | 10.6 | 593 | |
| chrome | 32 / 46 | 16.7 | 63% | 88.7 | 11.1 | 588 | |
| os | 3 / 24 | 11.0 | 33% | 88.6 | 11.1 | 667 | |
| multi_apps | 3 / 93 | 13.7 | 67% | 88.5 | 11.2 | 544 | |
| vlc | 3 / 17 | 11.3 | 33% | 87.9 | 11.8 | 568 | |
| writer | 4 / 23 | 13.5 | 50% | 87.2 | 10.2 | 476 | |
| gimp | 4 / 26 | 5.8 | 75% | 86.9 | 12.9 | 543 | |
| impress | 4 / 47 | 9.0 | 25% | 86.6 | 13.2 | 660 | |
| vs_code | 3 / 23 | 6.0 | 100% | 86.1 | 10.3 | 400 | |
| all | 63 / 361 | 14.8 | — | 88.7 | 11.0 | — |
A full per-step trajectory, like the chrome flagship above, for each of OSWorld's 10
domains (one representative completed task each, all fact-checked against the run).
Columns are the router-relevant cost coefficients; reuse% = KV-hit rate; the
prefill/decode split of each step follows from the prefill s and decode s columns.
"Search for a one way flight from Dublin to Vienna on 10th next month for 2 adults."
| step | tool | reasoning | pred | ctx | reuse% | pf t/s | dec t/s | prefill s | decode s | kJ | what the step did |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | click | 656 | 267 | 3011 | 0% | 43.5 | 16.6 | 69.3 | 16.1 | 4.5 | Click 'Yes, I agree' to dismiss privacy dialog | |
| 2 | click | 564 | 230 | 5220 | 58% | 34.8 | 15.4 | 63.7 | 14.9 | 4.2 | Select 'One way' trip type | |
| 3 | click | 279 | 166 | 7402 | 70% | 33.4 | 14.4 | 65.5 | 11.5 | 4.2 | Click 'From' field to change departure airport | |
| 4 | click | 319 | 181 | 7543 | 13% | 36.1 | 15.0 | 182.6 | 12.1 | 10.2 | Select Ireland in origin country list | |
| 5 | click | 496 | 197 | 7693 | 12% | 36.7 | 14.6 | 183.9 | 13.5 | 10.3 | Pick Dublin as departure airport | |
| 6 | click | 336 | 168 | 7828 | 12% | 37.2 | 14.6 | 185.0 | 11.5 | 10.3 | Select Austria in destination country list | |
| 7 | click | 307 | 177 | 7970 | 12% | 37.5 | 14.6 | 187.2 | 12.1 | 10.4 | Pick Vienna as destination airport | |
| 8 | click | 611 | 323 | 8128 | 12% | 38.1 | 14.6 | 188.6 | 22.1 | 10.9 | Choose July 10 in the date calendar | |
| 9 | click | 353 | 215 | 8318 | 11% | 38.5 | 14.6 | 191.3 | 14.7 | 10.8 | Increase adult passenger count to 2 | |
| 10 | click | 340 | 196 | 8487 | 11% | 39.1 | 14.4 | 192.8 | 13.6 | 10.8 | Click 'Done' to close passenger dropdown | |
| 11 | click | 501 | 231 | 8647 | 11% | 39.6 | 14.3 | 194.4 | 16.2 | 11.1 | Click 'Search' to run the flight search | |
| 12 | answer | 722 | 384 | 8801 | 11% | 40.1 | 14.3 | 195.7 | 26.9 | 11.6 | Answer: found flight FR 7008, Dublin to Vienna |
"Could you assist me in placing my photo on the desktop and renaming it to export.jpg?"
| step | tool | reasoning | pred | ctx | reuse% | pf t/s | dec t/s | prefill s | decode s | kJ | what the step did |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | click | 494 | 242 | 3007 | 0% | 44.0 | 17.2 | 68.3 | 14.1 | 4.4 | Open the File menu to find export option | |
| 2 | click | 380 | 201 | 5197 | 58% | 35.0 | 15.8 | 62.7 | 12.7 | 4.0 | Click 'Export As...' in the File menu | |
| 3 | click | 761 | 339 | 7393 | 70% | 34.4 | 14.5 | 63.9 | 23.4 | 4.5 | Click filename field in Export dialog to edit | |
| 4 | key | 378 | 199 | 7584 | 12% | 36.4 | 14.9 | 182.2 | 13.4 | 10.2 | Select all existing filename text to replace it | |
| 5 | write | 316 | 181 | 7731 | 12% | 37.0 | 14.9 | 183.1 | 12.1 | 10.2 | Type new filename 'export.jpg' | |
| 6 | click | 308 | 175 | 7883 | 12% | 37.5 | 14.8 | 185.1 | 11.9 | 10.3 | Click Export to confirm Desktop save location | |
| 7 | click | 566 | 240 | 8036 | 12% | 37.9 | 14.6 | 187.2 | 16.4 | 10.6 | Confirm JPEG export options dialog (quality 90) | |
| 8 | answer | 348 | 194 | 8202 | 12% | 38.3 | 14.3 | 189.4 | 13.5 | 10.6 | Confirm image exported to Desktop/export.jpg |
"Compute the sum of "Revenue" and "Total Expenses" and put the results under two columns named "Total Revenue" and "Total Expenses" of a new sheet (Sheet2)"
| step | tool | reasoning | pred | ctx | reuse% | pf t/s | dec t/s | prefill s | decode s | kJ | what the step did |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | click | 460 | 255 | 3026 | 0% | 44.2 | 17.1 | 68.5 | 14.9 | 4.4 | Click + tab to create a new Sheet2 | |
| 2 | write | 477 | 272 | 5254 | 58% | 35.6 | 15.2 | 62.6 | 17.9 | 4.2 | Type 'Total Revenue' header into cell A1 | |
| 3 | click | 442 | 251 | 7491 | 70% | 34.6 | 14.3 | 64.8 | 17.5 | 4.3 | Click cell B1 to add second header | |
| 4 | click | 736 | 368 | 7681 | 13% | 36.7 | 15.0 | 183.1 | 24.6 | 10.7 | Retry clicking B1 (still on A1) | |
| 5 | click | 747 | 353 | 7896 | 12% | 37.4 | 14.8 | 185.5 | 23.9 | 10.8 | Another click attempt to select cell B1 | |
| 6 | click | 238 | 244 | 8123 | 12% | 38.1 | 14.7 | 187.8 | 16.6 | 10.6 | Overshot to C1; click back toward B1 | |
| 7 | write | 252 | 203 | 8348 | 12% | 38.8 | 14.7 | 190.1 | 13.9 | 10.6 | B1 selected; type 'Total Expenses' header | |
| 8 | click | 228 | 247 | 8536 | 11% | 39.3 | 14.7 | 192.5 | 16.8 | 10.9 | Click cell A2 for the revenue formula | |
| 9 | write | 404 | 289 | 8764 | 11% | 40.0 | 14.3 | 194.8 | 20.1 | 11.2 | Enter =SUM(Sheet1.A2:A20) in cell A2 | |
| 10 | key | 327 | 234 | 8993 | 11% | 40.7 | 14.3 | 197.3 | 16.4 | 11.2 | Press Enter to execute revenue SUM formula | |
| 11 | click | 578 | 330 | 9192 | 11% | 41.1 | 14.1 | 200.2 | 23.4 | 11.6 | Revenue=867786; click cell B2 next | |
| 12 | write | 395 | 300 | 9416 | 10% | 41.7 | 14.0 | 202.7 | 21.4 | 11.7 | Enter =SUM(Sheet1.B2:B20) in cell B2 | |
| 13 | key | 270 | 227 | 9651 | 10% | 42.1 | 14.0 | 206.1 | 16.3 | 11.7 | Press Enter to execute expenses SUM formula | |
| 14 | answer | 513 | 443 | 9852 | 10% | 42.7 | 13.8 | 208.0 | 32.2 | 12.5 | Done: A2=867786, B2=411686 under headers |
"I am making PPT on LibreOffice Impress for presentation tomorrow. I need to summarize contents on one slide use Impress "Summary Slide" feature. Could you make that for me?"
| step | tool | reasoning | pred | ctx | reuse% | pf t/s | dec t/s | prefill s | decode s | kJ | what the step did |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | click | 583 | 278 | 3025 | 0% | 44.2 | 17.2 | 68.5 | 16.2 | 4.5 | Open Insert menu hunting for Summary Slide option | |
| 2 | click | 1159 | 385 | 5267 | 57% | 35.6 | 15.4 | 63.1 | 25.0 | 4.5 | Insert menu lacks it; open Slide menu instead | |
| 3 | click | 811 | 328 | 7499 | 70% | 34.3 | 14.5 | 65.2 | 22.7 | 4.5 | Click Summary Slide option found in Slide menu | |
| 4 | answer | 1030 | 509 | 7702 | 13% | 36.8 | 14.8 | 183.3 | 34.3 | 11.1 | Report summary slide created as new slide 13 of 13 |
"Add page number for every page at the bottom left"
| step | tool | reasoning | pred | ctx | reuse% | pf t/s | dec t/s | prefill s | decode s | kJ | what the step did |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | click | 331 | 290 | 2999 | 0% | 44.2 | 16.5 | 67.9 | 17.5 | 4.5 | Open the Insert menu to find page-number options | |
| 2 | click | 530 | 296 | 5274 | 57% | 36.0 | 15.5 | 63.4 | 19.1 | 4.3 | Click 'Page Number' in the open Insert menu | |
| 3 | key | 1016 | 439 | 7535 | 70% | 34.8 | 12.8 | 65.1 | 34.3 | 5.1 | Undo: it wrongly inserted a literal '1' in the title | |
| 4 | click | 474 | 302 | 7796 | 12% | 37.0 | 14.8 | 185.2 | 20.4 | 10.6 | Undo done; reopen Insert to use Header/Footer path | |
| 5 | click | 690 | 356 | 8025 | 12% | 37.9 | 14.6 | 186.8 | 24.4 | 10.9 | Click 'Header and Footer' submenu in Insert menu | |
| 6 | click | 332 | 251 | 8273 | 11% | 38.7 | 14.7 | 189.7 | 17.1 | 10.8 | Select 'Footer' to enable the footer area | |
| 7 | click | 218 | 259 | 8491 | 11% | 39.2 | 14.5 | 192.6 | 17.9 | 11.0 | Click 'Default Page Style' to turn on the footer | |
| 8 | click | 643 | 327 | 8741 | 11% | 39.9 | 14.3 | 195.7 | 22.8 | 11.4 | Footer active; open Insert menu to add page number | |
| 9 | click | 507 | 322 | 8971 | 10% | 40.4 | 14.1 | 199.0 | 22.9 | 11.6 | Click 'Page Number' to insert field into footer | |
| 10 | key | 947 | 415 | 9224 | 10% | 41.1 | 14.1 | 201.6 | 29.4 | 12.0 | Page number '3' shows in footer; confirm insertion | |
| 11 | click | 276 | 217 | 9464 | 10% | 41.8 | 14.0 | 203.9 | 15.5 | 11.6 | Click main body to exit footer editing mode | |
| 12 | answer | 860 | 434 | 9666 | 10% | 42.1 | 13.7 | 207.1 | 31.7 | 12.5 | Footer page numbers verified; report task complete |
"I have file1.xlsx and file2.ods on my Desktop, each containing a single column. Using only the command line, help me merge these two columns into a single column by concatenating the strings from both rows, save the result as ~/Desktop/output.csv, and open it in LibreOffice Calc from the terminal"
| step | tool | reasoning | pred | ctx | reuse% | pf t/s | dec t/s | prefill s | decode s | kJ | what the step did |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | write | 284 | 210 | 3053 | 0% | 44.4 | 16.5 | 68.7 | 12.7 | 4.3 | List Desktop to confirm file1.xlsx and file2.ods exist | |
| 2 | write | 348 | 240 | 5261 | 58% | 35.2 | 15.1 | 62.8 | 15.9 | 4.1 | Convert file1.xlsx to CSV via LibreOffice headless | |
| 3 | write | 317 | 220 | 7498 | 70% | 34.4 | 13.8 | 65.1 | 15.9 | 4.3 | Convert file2.ods to CSV via LibreOffice headless | |
| 4 | write | 1195 | 414 | 7683 | 13% | 36.4 | 14.8 | 183.6 | 28.0 | 10.9 | Merge CSVs with paste -d '' into output.csv | |
| 5 | write | 181 | 155 | 7869 | 13% | 37.0 | 14.9 | 186.0 | 10.4 | 10.4 | Open output.csv in LibreOffice Calc from terminal | |
| 6 | click | 439 | 247 | 8028 | 12% | 37.6 | 14.7 | 186.9 | 16.8 | 10.6 | Click OK on Calc Text Import dialog to load data | |
| 7 | answer | 668 | 375 | 8215 | 12% | 38.3 | 14.5 | 188.4 | 25.8 | 11.0 | Report task complete: merged file opened in Calc |
"I want to install Spotify on my current system. Could you please help me?"
| step | tool | reasoning | pred | ctx | reuse% | pf t/s | dec t/s | prefill s | decode s | kJ | what the step did |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | click | 804 | 379 | 3005 | 0% | 43.1 | 16.4 | 69.7 | 23.1 | 4.9 | Open Ubuntu Software Center from the dock | |
| 2 | click | 857 | 405 | 5271 | 57% | 35.7 | 15.4 | 63.6 | 26.2 | 4.6 | Click the search icon in the Snap Store | |
| 3 | write | 309 | 214 | 7566 | 70% | 33.9 | 14.4 | 67.8 | 14.8 | 4.4 | Type 'Spotify' into the search field | |
| 4 | wait | 282 | 209 | 7752 | 12% | 37.1 | 14.8 | 183.7 | 14.1 | 10.3 | Wait for search results to finish loading | |
| 5 | click | 547 | 346 | 7948 | 12% | 37.7 | 14.6 | 185.9 | 23.7 | 10.8 | Click the official Spotify app in results | |
| 6 | click | 785 | 381 | 8226 | 11% | 38.4 | 14.5 | 189.5 | 26.2 | 11.1 | Click the green Install button | |
| 7 | answer | 1894 | 707 | 8500 | 11% | 38.5 | 14.4 | 196.0 | 49.2 | 12.5 | Auth dialog appeared; report install initiated |
"Could you help me back up all the email files in my inbox to ~/emails.bak? Please save them separately in eml format."
| step | tool | reasoning | pred | ctx | reuse% | pf t/s | dec t/s | prefill s | decode s | kJ | what the step did |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | click | 482 | 286 | 3016 | 0% | 44.3 | 17.3 | 68.1 | 16.6 | 4.4 | Click Inbox folder in sidebar to view emails | |
| 2 | key | 458 | 250 | 5261 | 57% | 35.1 | 15.3 | 64.1 | 16.3 | 4.2 | Press Ctrl+A to select all inbox messages | |
| 3 | right_click | 878 | 374 | 7481 | 70% | 33.8 | 13.8 | 65.7 | 27.0 | 4.8 | Right-click a selected email to open context menu | |
| 4 | click | 581 | 300 | 7703 | 12% | 36.3 | 14.7 | 186.0 | 20.4 | 10.8 | Click 'Save As...' in the context menu | |
| 5 | click | 307 | 227 | 7913 | 12% | 36.0 | 14.8 | 193.4 | 15.3 | 11.2 | Click Home in file chooser to reach home dir | |
| 6 | click | 264 | 231 | 8107 | 12% | 38.1 | 14.8 | 187.9 | 15.6 | 10.6 | Click Create Folder button in file chooser | |
| 7 | write | 145 | 151 | 8316 | 11% | 38.6 | 14.6 | 190.5 | 10.4 | 10.5 | Type 'emails.bak' as the new folder name | |
| 8 | click | 182 | 183 | 8471 | 11% | 39.2 | 14.6 | 191.9 | 12.5 | 10.7 | Click Create to make the emails.bak folder | |
| 9 | click | 294 | 234 | 8650 | 11% | 39.7 | 14.4 | 193.9 | 16.2 | 11.0 | Click Open to save emails into emails.bak | |
| 10 | click | 687 | 350 | 8857 | 11% | 40.4 | 14.3 | 195.7 | 24.5 | 11.5 | Open file manager from dock to verify backup | |
| 11 | double_click | 255 | 221 | 9094 | 11% | 40.8 | 14.1 | 199.4 | 15.6 | 11.3 | Double-click emails.bak folder to check contents | |
| 12 | answer | 812 | 399 | 9299 | 10% | 41.5 | 14.1 | 201.3 | 28.4 | 11.9 | Confirm both emails saved as .eml; report done |
"Help me modify the folder used to store my recordings to Desktop"
| step | tool | reasoning | pred | ctx | reuse% | pf t/s | dec t/s | prefill s | decode s | kJ | what the step did |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | click | 223 | 203 | 3001 | 0% | 43.4 | 16.3 | 69.1 | 12.5 | 4.4 | Click Tools menu in VLC menu bar | |
| 2 | click | 402 | 233 | 5240 | 57% | 35.1 | 15.3 | 64.0 | 15.2 | 4.2 | Click Preferences in the Tools dropdown menu | |
| 3 | click | 289 | 256 | 7472 | 70% | 34.1 | 12.7 | 65.6 | 20.1 | 4.6 | Open Input/Codecs tab in Preferences dialog | |
| 4 | click | 498 | 297 | 7715 | 12% | 36.9 | 14.8 | 183.8 | 20.0 | 10.7 | Click Browse next to Record directory field | |
| 5 | click | 454 | 294 | 7947 | 12% | 37.7 | 14.9 | 185.8 | 19.8 | 10.6 | Select Desktop folder in file browser | |
| 6 | click | 431 | 264 | 8185 | 11% | 38.4 | 14.6 | 188.7 | 18.0 | 10.7 | Click Choose to confirm Desktop directory | |
| 7 | click | 512 | 284 | 8403 | 11% | 39.1 | 14.5 | 190.9 | 19.6 | 10.9 | Click Save to apply preference changes | |
| 8 | answer | 786 | 392 | 8629 | 11% | 39.7 | 14.3 | 193.8 | 27.4 | 11.4 | Report recording folder set to Desktop, task done |
"Please help me change all the places in this document that say "text" to "test"."
| step | tool | reasoning | pred | ctx | reuse% | pf t/s | dec t/s | prefill s | decode s | kJ | what the step did |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | key | 473 | 231 | 3008 | 0% | 44.3 | 17.1 | 67.9 | 13.5 | 4.3 | Open Find & Replace dialog with Ctrl+H | |
| 2 | write | 202 | 168 | 5214 | 58% | 35.3 | 15.4 | 62.5 | 10.9 | 3.8 | Type 'text' into the Find field | |
| 3 | click | 268 | 198 | 7412 | 70% | 34.3 | 14.6 | 64.3 | 13.5 | 4.1 | Click the Replace input field | |
| 4 | write | 291 | 168 | 7583 | 13% | 36.3 | 15.0 | 183.0 | 11.2 | 10.3 | Type 'test' into the Replace field | |
| 5 | click | 328 | 209 | 7719 | 12% | 36.9 | 14.8 | 183.3 | 14.1 | 10.2 | Click Replace All button | |
| 6 | key | 609 | 274 | 7896 | 12% | 37.5 | 14.8 | 185.3 | 18.6 | 10.5 | Save the document with Ctrl+S | |
| 7 | key | 448 | 217 | 8074 | 12% | 38.0 | 14.8 | 187.3 | 14.7 | 10.5 | Close Find & Replace dialog with Esc | |
| 8 | answer | 355 | 304 | 8236 | 12% | 38.5 | 14.6 | 189.3 | 20.9 | 10.9 | Report all 10 'text' replaced with 'test' |
Same shape in every domain: a cold, cheap first step, then a one-time
prefill jump when the 3-screenshot window fills and KV reuse collapses to ~950 tokens — after which
prefill dominates every step regardless of the action. os is shown as a score-0
near-miss; the cost profile is identical to the successes.
The diagnosis is the same everywhere above: ~89% of inference time is prefill, and the KV prefix dies at the 3-screenshot eviction (reuse 70% → 10%), so the agent re-prefills ~7–8k tokens every step. A model router only moves the 11% decode term; the dominant 89% needs a different attack. Two levers, meant to stack, followed by how they compose with the router:
1 · KV-cache-stable context. Keep the prompt prefix invariant so screenshot eviction stops invalidating it — evict within a stable window, pin images at fixed offsets, or carry KV across steps — restoring the 70% reuse we briefly see at step 3 instead of the frozen 947. Corroboration from the companion OpenClaw prefix-cache experiment: auto-compaction that breaks the prefix cache produced a ≈24× prefill blow-up between two consecutive turns — the same failure mode, measured on a different stack. Estimated 2–3× faster steps here if the prefix is preserved. [measured diagnosis + measured corroboration]
2 · Stage→device affinity. Inference stages have different bottlenecks, so place each on the silicon that fits: vision-encode + prefill are compute-bound and parallel → an iGPU/NPU; decode is memory-bandwidth-bound and sequential → the CPU. Holo3 is prefill-dominated, so moving prefill onto a compute engine attacks the 89% directly. On this box that path is currently blocked (the Xe3 vision encoder crashes — see the live-tracker root-cause); the durable fix needs OpenVINO MoE-VLM support or higher-bandwidth edge silicon. [proposed; device bring-up measured]
3 · Compose with the router. Because prefill dominates and switching models invalidates the (weight-specific) KV cache, the layers are complementary, not competing: cache-stable context + stage affinity cut the 89% prefill; a difficulty-gated, switch-aware model router trims the 11% decode and skips steps that don't need the big model. Stack them — don't pick one.
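The prefix-invalidation mechanism behind lever 1 can be shown with a toy prompt model. The sketch assumes a prompt made of a fixed 947-token preamble followed by per-step screenshot blocks of 2200 tokens each; the block size and the layout are illustrative assumptions, chosen only to mirror the frozen 947 in the tables. It compares a sliding 3-screenshot window with an append-only prompt, used here purely as a contrast and not as the proposed fix, since it grows the context without bound.

```python
def common_prefix_len(prev_tokens, next_tokens):
    """Number of leading tokens two prompts share, i.e. what a prefix KV cache can reuse."""
    n = 0
    for a, b in zip(prev_tokens, next_tokens):
        if a != b:
            break
        n += 1
    return n

PREAMBLE = ["sys"] * 947                      # stable system + task tokens (illustrative)

def screenshot(i, size=2200):
    return [f"img{i}"] * size                 # stand-in for one screenshot's tokens

step3 = PREAMBLE + screenshot(1) + screenshot(2) + screenshot(3)
# Sliding window: screenshot 1 is evicted, so everything after the preamble shifts.
step4_sliding = PREAMBLE + screenshot(2) + screenshot(3) + screenshot(4)
# Append-only: the previous prompt is an exact prefix of the next one.
step4_append = step3 + screenshot(4)

print(common_prefix_len(step3, step4_sliding))   # only the preamble survives
print(common_prefix_len(step3, step4_append))    # the whole previous prompt survives
```

Any scheme that keeps the post-preamble tokens at the same positions across steps would land between these two extremes.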
The motivation assumed early steps are heavy (planning) and the tail is routine. With a step-indexed corpus that is directly checkable. An adversarial pass over the 36 successful episodes computed each effort proxy by within-task position (terciles), per task, then averaged across tasks. The output-side proxies (reasoning chars, generated tokens) are used because — unlike latency — they do not inflate mechanically as the prompt grows.
| within-task position | reasoning_len | pred tokens | new tok (prefill) | decode t/s | predict s |
|---|---|---|---|---|---|
| first step | 482 | 252 | 2990 | 16.83 | 82.7 |
| early tercile | 515 | 265 | 3277 | 15.78 | 100.2 |
| mid tercile | 501 | 266 | 5495 | 14.32 | 160.9 |
| late tercile | 563 ↑ | 320 ↑ | 7448 | 14.29 | 207.3 |
| last step | 664 ↑ | 410 ↑ | 7716 | 14.12 | 217.8 |
In a stuck run, reasoning_len stays at ~657–690 late (peaking at ~1144 mid-run)
while the agent degenerates into a click/scroll loop. A naïve "use the small model on late steps" policy would strip
capability exactly where the agent is stuck and needs it most, turning recoverable runs into hard failures.
Method: within each successful task, step i is normalized to (i−1)/(N−1) and binned early/mid/late by tercile, plus the literal first/last step; per-task means are averaged equally across the 36 tasks (a per-task trend, not a global step pool). n is small (median successful task ≈ 6 steps; 11 tasks have ≥9 steps), so treat magnitudes as directional.
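The method can be sketched as follows. The exact tercile boundaries (1/3 and 2/3 of the normalized position) are an assumption, since the text does not state them, and the example feeds in only one task (the reasoning column of chrome/030eeff7) rather than all 36.

```python
from statistics import mean

def position_bin(i, n):
    """Map 1-based step i of an n-step task to early/mid/late via (i-1)/(N-1)."""
    pos = (i - 1) / (n - 1) if n > 1 else 0.0
    if pos < 1 / 3:
        return "early"
    if pos < 2 / 3:
        return "mid"
    return "late"

def per_task_trend(tasks):
    """tasks: one list of proxy values per task (index 0 = step 1); equal weight per task."""
    across_tasks = {"early": [], "mid": [], "late": []}
    for values in tasks:
        binned = {"early": [], "mid": [], "late": []}
        for i, value in enumerate(values, start=1):
            binned[position_bin(i, len(values))].append(value)
        for name, vals in binned.items():
            if vals:                              # short tasks may leave a bin empty
                across_tasks[name].append(mean(vals))
    return {name: mean(vals) for name, vals in across_tasks.items() if vals}

# reasoning (chars) for the 14 steps of chrome/030eeff7, from the flagship table.
reasoning_030eeff7 = [1047, 364, 517, 988, 1018, 749, 328, 246, 381, 480, 332, 333, 523, 315]
trend = per_task_trend([reasoning_030eeff7])
print(trend)
```

Averaging per-task means, rather than pooling all steps, keeps one long episode from dominating the trend. A single task can also run against the aggregate, as this one does with its heavy early steps.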
chrome/06fe7178 · "bring back the last tab I shut down"
score 1.0 · 2 steps · 183 s wall · 2.22 Wh
| step | tool | new↻ | reused | pred | dec t/s | predict s | score |
|---|---|---|---|---|---|---|---|
| 1 | key | 3003 | 0 | 198 | 17.67 | 74.9 | 1.0 |
| 2 | answer→DONE | 2170 | 2999 | 269 | 15.63 | 75.9 | 1.0 |
new↻ = uncached tokens prefilled this step; reused = KV prefix carried over. Step 2 reuses 2999 tokens from step 1 — the rare case where the cache survives a step.
chrome/99146c54 · "delete browsing data automatically on close"
score 0.0 · 100 steps · 8.9 h wall · 38.2 Wh
| step | new↻ | reused | pred | dec t/s | predict s | score |
|---|---|---|---|---|---|---|
| 1 | 3007 | 0 | 221 | 17.68 | 75.5 | 0.0 |
| 51 | 16484 | 947 | 400 | 9.80 | 328.4 | 0.0 |
| 100 | 26299 | 947 | 308 | 6.21 | 520.5 | 0.0 |
KV reuse stays stuck at 947, so by step 100 the agent re-prefills 26k tokens every step (full context ≈ 27k) — prefill runs away and decode decays 17.7 → 6.2 t/s.
| cpu_util | cpu_mhz | server_rss | mem_avail | pkg_w | psys_w | gpu_mhz | npu_util | pkg_°C |
|---|---|---|---|---|---|---|---|---|
| 20.8% | 2156 | 38.0 GB | 50.2 GB | 16.2 | 29.3 | 0 (idle) | 0% (idle) | 78.0 |
The iGPU and NPU read idle by design — the device bring-up (on the live tracker) shows why this MoE-VLM serves only on CPU on this silicon.
Toward the step-level model router — now motivated by per-step difficulty rather than the refuted early/late split — the corpus is necessary, not yet sufficient.
✓ Fully covered — the state & the cost. Every feature a router conditions on at decision time (step index, domain, context size, cached-token ratio, the screenshot, the message history, the tool needed) and the full price term (per-step latency and energy) is measured. Most routing work hand-waves the cost side; here it's instrumented.
Two honest caveats baked into the schema:
(1) steps_profile.tool is always null (a known bug) — join the per-step tool from
model_calls.jsonl/traj.jsonl instead; nothing is lost.
(2) reasoning_len is the char count of the <think> block — the reasoning text
is discarded, which is the single most valuable missing field for a planning-vs-routine classifier and
is worth capturing in a re-run.