Holo3-35B-A3B · OSWorld-Verified · NUC16 — profiling corpus SCHEMA

what every benchmark step records, and why

What this is

Every step of the full 361-task OSWorld-Verified run of Holo3-35B-A3B on the Intel Panther Lake NUC (CPU-only llama.cpp) is instrumented end to end. The run reproduces the official leaderboard config (100-step budget, screenshot-only, single rollout) — but alongside the score it emits a step-indexed, replayable corpus of state, action, and measured cost.

Two granularities. The atomic record is one step (one model call); the unit a router actually reasons over is one whole task — a trajectory of steps from the first screenshot to answer, plus episode-level aggregates. The research target is an upper-layer model router that sends each step to an appropriately-sized model. The intended heuristic was large model for early planning, small model for the routine tail — and because the corpus is step-indexed, that hypothesis is directly testable. We test it below, and the data refutes the simple version.

Measured so far: 933 steps across 63 completed tasks (median 8, mean 14.8, max 100 steps/task). Projected to the full 361-task set: roughly 8,000–15,000 rows.

Captured from four time-aligned sources

One logical row = one agent step (one model call). It is assembled by joining the three per-step sources below on (domain, task_id, step), then attaching the 2 Hz telemetry samples whose ts falls inside the step's time window.

| source file | grain | contributes |
| --- | --- | --- |
| raw/model_calls.jsonl | per step | tokens, server timings, tool, reasoning length, parse success |
| …/episode_profile.json → steps_profile[] | per step | latency split (predict / exec), 5-domain RAPL Joules |
| …/traj.jsonl | per step | action code, note, thought, screenshot filename, reward/done |
| raw/system_samples.csv | 2 Hz | host telemetry: CPU/GPU/NPU, power, memory, thermals |
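A minimal sketch of the per-step join, assuming each source has already been parsed into a list of dicts that carry domain, task_id and step. The helper name join_step_rows and the toy records are illustrative; only the join key comes from the text above. A None value never overwrites a real one, which is one way to tolerate a field that is null in one source but present in another.

```python
from collections import defaultdict

def join_step_rows(*sources):
    """Merge per-step records into one row per (domain, task_id, step).

    Later sources win on field-name collisions, except that a None
    value never overwrites a value that is already present.
    """
    rows = defaultdict(dict)
    for records in sources:
        for rec in records:
            key = (rec["domain"], rec["task_id"], rec["step"])
            for field, value in rec.items():
                if value is not None or field not in rows[key]:
                    rows[key][field] = value
    return dict(rows)

# Toy records shaped like the three per-step sources (hypothetical layout).
model_calls = [{"domain": "chrome", "task_id": "06fe7178", "step": 1,
                "prompt_tokens": 3003, "tool": "key"}]
steps_profile = [{"domain": "chrome", "task_id": "06fe7178", "step": 1,
                  "t_predict_s": 74.946, "tool": None}]
traj = [{"domain": "chrome", "task_id": "06fe7178", "step": 1,
         "action": "hotkey('ctrl','shift','t')"}]

rows = join_step_rows(model_calls, steps_profile, traj)
row = rows[("chrome", "06fe7178", 1)]
```

Keying on the full (domain, task_id, step) tuple keeps steps from different tasks apart even when step numbers repeat.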

Plus, on disk per step: the full-resolution screenshot PNG; per task: a screen recording (mp4) and the OSWorld config snapshot.

One row per agent step — fields grouped by role

Example values are real, sampled from chrome/06fe7178 step 1 (task: "Can you make my computer bring back the last tab I shut down?").

STATE: what the router conditions on

| field | example | meaning |
| --- | --- | --- |
| domain | chrome | task category (10 OSWorld domains) |
| task_id | 06fe7178-4491-… | OSWorld example UUID |
| step | 1 | index within the episode (the "early vs late" planning axis) |
| screenshot_file | step_1_…031052.png | full-resolution observation, kept on disk for replay |
| prompt_tokens | 3003 | newly prefilled (uncached) tokens this step: the prefill workload (= llama.cpp prompt_n) |
| cached_tokens | 0 | KV prefix reused from prior step (0 = cold). Full context = prompt_tokens + cached_tokens |
| instruction | "…bring back the last tab…" | natural-language task string |

ACTION: what the model did

| field | example | meaning |
| --- | --- | --- |
| tool | key | one of click · double_click · right_click · drag · scroll · write · key · wait · answer |
| action | hotkey('ctrl','shift','t') | executed pyautogui code (with pixel coords when spatial) |
| note | "User wants to restore…" | durable memory the agent carries forward |
| thought | "…use Ctrl+Shift+T…" | one-line stated reasoning |
| reasoning_len | 436 | chars of the hidden <think> block (text itself not saved; see gap) |

COST: the price term of any routing objective

| field | example | meaning |
| --- | --- | --- |
| t_predict_s | 74.946 | model-call wall time |
| t_exec_s | 3.281 | environment action time |
| prefill_tps | 47.33 | prompt tokens / s |
| decode_tps | 17.67 | generated tokens / s |
| predicted_tokens | 198 | tokens generated this step |
| ↳ time split (derived) | prefill ≈ 63 s · decode ≈ 11 s | t_predict ≈ prompt_tokens/prefill_tps + predicted_tokens/decode_tps; prefill is ~70–90% of step time here |
| energy_j | {pkg 3918.6, core 3659.2, uncore 1.1, dram 92.2, psys 5572.3} | per-step Joules, 5 RAPL domains |

LABEL: inherited by every step of the task

| field | example | meaning |
| --- | --- | --- |
| score | 1.0 | OSWorld execution-based final score (0/1) |
| done_by_agent | true | agent self-terminated vs hit the 100-step cap |
| steps | 2 | total episode length |
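The derived time split can be checked against the example row. This is a small sketch using only the values listed for chrome/06fe7178 step 1; the function name time_split is illustrative.

```python
def time_split(prompt_tokens, prefill_tps, predicted_tokens, decode_tps):
    """Split a model call into prefill and decode seconds."""
    prefill_s = prompt_tokens / prefill_tps
    decode_s = predicted_tokens / decode_tps
    return prefill_s, decode_s

# Values from the example row above.
prefill_s, decode_s = time_split(3003, 47.33, 198, 17.67)
t_predict_s = 74.946

# Whatever the two terms do not explain (HTTP, sampling, and so on).
other_s = t_predict_s - prefill_s - decode_s
prefill_share = prefill_s / t_predict_s

print(f"prefill {prefill_s:.1f} s, decode {decode_s:.1f} s, "
      f"other {other_s:.2f} s, prefill share {prefill_share:.0%}")
```

For this step the two terms account for all but a fraction of a second of t_predict_s, and prefill is about 85% of the call.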

Zooming out — one task = one episode

The row above is one step. A task is the whole sequence of those rows for one instruction, wrapped with episode-level aggregates — and it is the unit a router policy acts on. Below: the episode schema (real values from chrome/030eeff7) and that task's complete 14-step trajectory.

EPISODE — the unit the router acts on
| field | example | meaning |
| --- | --- | --- |
| domain / task_id | chrome / 030eeff7-… | one OSWorld example |
| instruction | "enable 'Do Not Track' in Chrome…" | the natural-language goal |
| score | 1.0 | execution-based final score (0/1) |
| steps | 14 | trajectory length |
| done_by_agent | true | self-terminated via answer (vs hitting the 100-cap) |
| wall_s | 2720.2 | 45 min end-to-end |
| energy_wh | 38.27 | package energy for the whole task |
| mean_predict_s | 189.38 | mean model-call time across the 14 steps |
| mean_decode_tps | 14.53 | mean generated tokens/s |
| mean_prompt_tokens | 6726 | mean newly prefilled (uncached) tokens per step; the mean full context (ctx) over the 14 steps is ≈8,061 |
| steps_profile[] | [ 14 × step-row ] | the full per-step table below (the STATE/ACTION/COST schema above, ×14) |

One complete trajectory — all 14 steps of chrome/030eeff7

Every per-step quantity a step-level router consumes, for one full episode (goal: enable Chrome's "Do Not Track"), grouped as output · KV state · cost coefficients · label. ctx = full context (prefilled + reused); reuse% = KV-hit rate; split = prefill vs. decode share of the step (a bar on the page); switch s = full re-prefill penalty if the router swapped models at that step (= ctx ÷ pf t/s); score = task-level label (the only outcome signal; per-step quality is the missing counterfactual). The reasoning bar shows <think> chars against the task maximum of 1047.

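Column groups: output = reasoning, pred · KV state = ctx, reused, reuse% · cost coefficients = pf t/s, dec t/s, prefill s, decode s, switch s, kJ · label = score. The split column is a graphical bar and is not reproduced.

| step | tool | reasoning | pred | ctx | reused | reuse% | pf t/s | dec t/s | prefill s | decode s | switch s | kJ | score | what the step did |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | click | 1047 | 438 | 3007 | 0 | 0% | 44.2 | 16.7 | 68.0 | 26.2 | 68 | 4.8 | 1.0 | open ⋮ menu; plan route to Settings |
| 2 | click | 364 | 237 | 5271 | 3003 | 57% | 36.1 | 15.1 | 62.8 | 15.7 | 146 | 4.1 | 1.0 | click Settings |
| 3 | click | 517 | 291 | 7507 | 5267 | 70% | 34.5 | 14.5 | 64.9 | 20.1 | 218 | 4.4 | 1.0 | go to Privacy & security |
| 4 | scroll | 988 | 436 | 7719 | 947 | 12% | 37.0 | 14.8 | 183.2 | 29.5 | 209 | 10.9 | 1.0 | DNT not visible; scroll to look |
| 5 | click | 1018 | 409 | 7972 | 947 | 12% | 37.7 | 14.6 | 186.1 | 27.9 | 211 | 10.9 | 1.0 | still hidden; open Privacy guide |
| 6 | click | 749 | 324 | 8196 | 947 | 12% | 38.4 | 14.6 | 188.8 | 22.1 | 213 | 10.9 | 1.0 | Next through guide |
| 7 | click | 328 | 240 | 8394 | 947 | 11% | 38.9 | 14.5 | 191.7 | 16.5 | 216 | 10.8 | 1.0 | Next |
| 8 | click | 246 | 230 | 8600 | 947 | 11% | 39.5 | 14.3 | 193.7 | 16.1 | 218 | 11.0 | 1.0 | Next (tersest step) |
| 9 | click | 381 | 257 | 8810 | 947 | 11% | 40.0 | 14.3 | 196.5 | 17.9 | 220 | 11.2 | 1.0 | Next |
| 10 | click | 480 | 299 | 9016 | 947 | 11% | 40.8 | 14.2 | 198.0 | 21.0 | 221 | 11.4 | 1.0 | guide done → back to settings |
| 11 | click | 332 | 268 | 9249 | 947 | 10% | 41.3 | 14.1 | 201.0 | 19.0 | 224 | 11.5 | 1.0 | open Third-party cookies |
| 12 | click | 333 | 277 | 9479 | 947 | 10% | 41.9 | 14.0 | 203.7 | 19.8 | 226 | 11.7 | 1.0 | found DNT toggle (Advanced) |
| 13 | click | 523 | 268 | 9719 | 947 | 10% | 42.5 | 13.9 | 206.5 | 19.3 | 229 | 11.9 | 1.0 | Confirm dialog |
| 14 | answer | 315 | 274 | 9912 | 947 | 10% | 42.9 | 13.7 | 209.1 | 19.9 | 231 | 12.1 | 1.0 | answer → DONE (toggle on) |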

Per-step time decomposed as prefill = prompt_tokens ÷ pf t/s and decode = predicted_tokens ÷ dec t/s; the remainder (HTTP + sampling) is <0.6 s every step. Task totals: prefill 2354 s (89%), decode 291 s (11%), other 6 s; 38.2 Wh.
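The task totals can be recomputed from the per-step columns. The sketch below copies ctx, reused, pf t/s, pred, dec t/s and kJ for the 14 steps of chrome/030eeff7 and applies the two formulas above, plus switch s = ctx ÷ pf t/s and kJ → Wh. Because the table values are rounded, expect agreement only to within a few seconds.

```python
# (ctx, reused, pf_tps, pred_tokens, dec_tps, energy_kj), steps 1-14 of chrome/030eeff7
STEPS = [
    (3007, 0, 44.2, 438, 16.7, 4.8),
    (5271, 3003, 36.1, 237, 15.1, 4.1),
    (7507, 5267, 34.5, 291, 14.5, 4.4),
    (7719, 947, 37.0, 436, 14.8, 10.9),
    (7972, 947, 37.7, 409, 14.6, 10.9),
    (8196, 947, 38.4, 324, 14.6, 10.9),
    (8394, 947, 38.9, 240, 14.5, 10.8),
    (8600, 947, 39.5, 230, 14.3, 11.0),
    (8810, 947, 40.0, 257, 14.3, 11.2),
    (9016, 947, 40.8, 299, 14.2, 11.4),
    (9249, 947, 41.3, 268, 14.1, 11.5),
    (9479, 947, 41.9, 277, 14.0, 11.7),
    (9719, 947, 42.5, 268, 13.9, 11.9),
    (9912, 947, 42.9, 274, 13.7, 12.1),
]

# prefill covers only the uncached tokens; a model switch would prefill all of ctx.
prefill_s = [(ctx - reused) / pf for ctx, reused, pf, _, _, _ in STEPS]
decode_s = [pred / dec for _, _, _, pred, dec, _ in STEPS]
switch_s = [ctx / pf for ctx, _, pf, _, _, _ in STEPS]

total_prefill = sum(prefill_s)
total_decode = sum(decode_s)
prefill_share = total_prefill / (total_prefill + total_decode)
energy_wh = sum(kj for *_, kj in STEPS) * 1000 / 3600  # kJ -> J -> Wh

print(f"prefill {total_prefill:.0f} s, decode {total_decode:.0f} s, "
      f"share {prefill_share:.0%}, energy {energy_wh:.1f} Wh")
```

The share here ignores the small "other" term, so it is a slight overestimate of the prefill fraction of wall time.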

Difficulty is scattered, not front-loaded. Heavy-reasoning bursts appear at step 1 (planning the route to Settings, 1047 chars) and again at steps 4–5 (re-planning when "Do Not Track" wasn't where the model expected) — interleaved with terse routine click-through (steps 7–9, down to 246). A router keyed on step position would miss the step-4 spike entirely; the signal it needs is per-step difficulty.
Prefill, not decode, is the cost. 89% of compute time is prefill, 11% decode (other <1%). The KV prefix grows healthily through step 3 (reused 0 → 5267), then the 3-screenshot eviction at step 4 collapses reuse to 947 for the rest of the run, so the model re-prefills ~7–9k tokens every step (6,772 at step 4, 8,965 at step 14), even to press "Next." That alone more than doubles step time (≈85 → 215 s) and energy (≈4.4 → 11 kJ), independent of how hard the step is.
What a router needs from this — and the trap. A model router optimizes step cost, and the split says where it lives. Routing to a smaller model mainly speeds decode (its lever is the dec t/s coefficient) — but decode is only 11% here, so the ceiling on model-size routing is low unless the small model also prefills cheaper. The dominant term is prefill = prompt_tokens ÷ pf t/s, governed by context management (the reused column), not model choice. And the trap: switching models mid-trajectory invalidates the KV cache (it is weight-specific), forcing a full re-prefill of the whole context on the new model — at 89% prefill with reuse already fragile, a router that flips models often can cost more than it saves. So the trainable objective is per-step difficulty (does this step need the big model?) traded against switch cost (is the saving worth a re-prefill?), with cost = {prefill_s, decode_s, energy} decomposed per candidate model.
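One way to express the switch-cost side of that objective is sketched below. Everything about the small model is an assumption: the corpus contains no small-model measurements, so it is modelled here as a hypothetical 2× speed-up on both coefficients. The needs_big flag stands in for the per-step difficulty signal, which the corpus does not yet label.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelCost:
    prefill_tps: float  # prompt tokens per second
    decode_tps: float   # generated tokens per second

def stay_cost_s(model, ctx, reused, pred_tokens):
    # Staying on the same model re-prefills only the uncached part of the context.
    return (ctx - reused) / model.prefill_tps + pred_tokens / model.decode_tps

def switch_cost_s(model, ctx, pred_tokens):
    # Switching models discards the KV cache, so the whole context is prefilled.
    return ctx / model.prefill_tps + pred_tokens / model.decode_tps

def route_to_small(big, small, ctx, reused, pred_tokens, needs_big):
    """Switch only if the step is easy AND the re-prefill still comes out cheaper."""
    if needs_big:
        return False
    return switch_cost_s(small, ctx, pred_tokens) < stay_cost_s(big, ctx, reused, pred_tokens)

def hypothetical_small(big, speedup=2.0):
    # ASSUMPTION: a made-up small model, uniformly `speedup` times faster.
    return ModelCost(big.prefill_tps * speedup, big.decode_tps * speedup)

# Step 3 of chrome/030eeff7: warm cache (5267 of 7507 tokens reused).
big3 = ModelCost(34.5, 14.5)
warm = route_to_small(big3, hypothetical_small(big3), 7507, 5267, 291, needs_big=False)

# Step 8: reuse already collapsed to 947 tokens.
big8 = ModelCost(39.5, 14.3)
cold = route_to_small(big8, hypothetical_small(big8), 8600, 947, 230, needs_big=False)
```

Under this assumed speed-up, switching at step 3 loses (the warm cache is worth more than the faster model), while at step 8 it wins because almost nothing was being reused anyway. This one-step comparison ignores the cost of switching back later.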

Corpus at a glance — 63 tasks, 933 steps so far

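| metric | value |
| --- | --- |
| scored tasks | 63 |
| total steps | 933 |
| steps/task min · median · mean | 2 · 8 · 14.8 |
| steps/task p90 · max | 32 · 100 |

| episode length | tasks | solve rate (score = 1.0) |
| --- | --- | --- |
| 1–3 steps | 4 | 75% |
| 4–9 steps | 32 | 75% |
| 10–29 steps | 19 | 58% |
| 30–99 steps | 7 | 29% |
| 100 (capped) | 1 | 0% |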

Solve rate collapses as tasks get longer — and long tasks are also where cost concentrates (latency rises with trajectory length). The long tail is the expensive, low-yield regime any efficiency layer most wants to act on.

Per task type — is prefill always the bottleneck? (as of 63 tasks / 933 steps)

All 10 OSWorld application domains, across every completed task. prefill% / decode% = share of compute time (prefill_s = prompt_tokens ÷ pf t/s, decode_s = predicted_tokens ÷ dec t/s); split shows the same shares as a prefill vs. decode bar. solve% is strict (score = 1.0). Sorted by prefill share.

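| domain | done / total | mean steps | solve% | prefill% | decode% | mean reason |
| --- | --- | --- | --- | --- | --- | --- |
| calc | 4 / 47 | 25.5 | 100% | 90.0 | 9.8 | 462 |
| thunderbird | 3 / 15 | 19.0 | 100% | 89.2 | 10.6 | 593 |
| chrome | 32 / 46 | 16.7 | 63% | 88.7 | 11.1 | 588 |
| os | 3 / 24 | 11.0 | 33% | 88.6 | 11.1 | 667 |
| multi_apps | 3 / 93 | 13.7 | 67% | 88.5 | 11.2 | 544 |
| vlc | 3 / 17 | 11.3 | 33% | 87.9 | 11.8 | 568 |
| writer | 4 / 23 | 13.5 | 50% | 87.2 | 10.2 | 476 |
| gimp | 4 / 26 | 5.8 | 75% | 86.9 | 12.9 | 543 |
| impress | 4 / 47 | 9.0 | 25% | 86.6 | 13.2 | 660 |
| vs_code | 3 / 23 | 6.0 | 100% | 86.1 | 10.3 | 400 |
| all | 63 / 361 | 14.8 | | 88.7 | 11.0 | |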
Yes — prefill dominates every task type. Prefill is 86–90% of compute in all 10 domains (overall 88.7%, decode 11.0%, other <0.5%); not one is decode-bound — fact-checked across all 933 steps. The prefill wall is structural, not specific to one kind of task.
Is there a reasoning pattern? Not a consistent one. Within-task reasoning-length trends diverge by task type: chrome (the only robust sample, n=20) is ~flat (548 → 553), libreoffice-calc falls (639 → 490), while a few rise (thunderbird +24%, vs_code +22%). But outside chrome each domain rests on just 1–4 successful tasks, so the rise/fall labels are largely anecdotal. The dependable cross-domain fact is the prefill share, not a reasoning shape — which again argues for routing on measured per-step difficulty over any positional or task-type heuristic.

One complete trajectory for every task type

A full per-step trajectory, like the chrome flagship above, for each of OSWorld's 10 domains (one representative completed task each, all fact-checked against the run). Columns are the router-relevant cost coefficients; reuse% = KV-hit rate, split = prefill vs. decode share of the step (a bar on the page).

chrome f79439ad · 12 steps · ✓ solved · 36 min · 30.4 Wh · prefill 91%

"Search for a one way flight from Dublin to Vienna on 10th next month for 2 adults."

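| step | tool | reasoning | pred | ctx | reuse% | pf t/s | dec t/s | prefill s | decode s | kJ | what the step did |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | click | 656 | 267 | 3011 | 0% | 43.5 | 16.6 | 69.3 | 16.1 | 4.5 | Click 'Yes, I agree' to dismiss privacy dialog |
| 2 | click | 564 | 230 | 5220 | 58% | 34.8 | 15.4 | 63.7 | 14.9 | 4.2 | Select 'One way' trip type |
| 3 | click | 279 | 166 | 7402 | 70% | 33.4 | 14.4 | 65.5 | 11.5 | 4.2 | Click 'From' field to change departure airport |
| 4 | click | 319 | 181 | 7543 | 13% | 36.1 | 15.0 | 182.6 | 12.1 | 10.2 | Select Ireland in origin country list |
| 5 | click | 496 | 197 | 7693 | 12% | 36.7 | 14.6 | 183.9 | 13.5 | 10.3 | Pick Dublin as departure airport |
| 6 | click | 336 | 168 | 7828 | 12% | 37.2 | 14.6 | 185.0 | 11.5 | 10.3 | Select Austria in destination country list |
| 7 | click | 307 | 177 | 7970 | 12% | 37.5 | 14.6 | 187.2 | 12.1 | 10.4 | Pick Vienna as destination airport |
| 8 | click | 611 | 323 | 8128 | 12% | 38.1 | 14.6 | 188.6 | 22.1 | 10.9 | Choose July 10 in the date calendar |
| 9 | click | 353 | 215 | 8318 | 11% | 38.5 | 14.6 | 191.3 | 14.7 | 10.8 | Increase adult passenger count to 2 |
| 10 | click | 340 | 196 | 8487 | 11% | 39.1 | 14.4 | 192.8 | 13.6 | 10.8 | Click 'Done' to close passenger dropdown |
| 11 | click | 501 | 231 | 8647 | 11% | 39.6 | 14.3 | 194.4 | 16.2 | 11.1 | Click 'Search' to run the flight search |
| 12 | answer | 722 | 384 | 8801 | 11% | 40.1 | 14.3 | 195.7 | 26.9 | 11.6 | Answer: found flight FR 7008, Dublin to Vienna |

gimp 77b8ab4d · 8 steps · ✓ solved · 21 min · 18.0 Wh · prefill 90%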

"Could you assist me in placing my photo on the desktop and renaming it to export.jpg?"

| step | tool | reasoning | pred | ctx | reuse% | pf t/s | dec t/s | prefill s | decode s | kJ | what the step did |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | click | 494 | 242 | 3007 | 0% | 44.0 | 17.2 | 68.3 | 14.1 | 4.4 | Open the File menu to find export option |
| 2 | click | 380 | 201 | 5197 | 58% | 35.0 | 15.8 | 62.7 | 12.7 | 4.0 | Click 'Export As...' in the File menu |
| 3 | click | 761 | 339 | 7393 | 70% | 34.4 | 14.5 | 63.9 | 23.4 | 4.5 | Click filename field in Export dialog to edit |
| 4 | key | 378 | 199 | 7584 | 12% | 36.4 | 14.9 | 182.2 | 13.4 | 10.2 | Select all existing filename text to replace it |
| 5 | write | 316 | 181 | 7731 | 12% | 37.0 | 14.9 | 183.1 | 12.1 | 10.2 | Type new filename 'export.jpg' |
| 6 | click | 308 | 175 | 7883 | 12% | 37.5 | 14.8 | 185.1 | 11.9 | 10.3 | Click Export to confirm Desktop save location |
| 7 | click | 566 | 240 | 8036 | 12% | 37.9 | 14.6 | 187.2 | 16.4 | 10.6 | Confirm JPEG export options dialog (quality 90) |
| 8 | answer | 348 | 194 | 8202 | 12% | 38.3 | 14.3 | 189.4 | 13.5 | 10.6 | Confirm image exported to Desktop/export.jpg |

calc 42e0a640 · 14 steps · ✓ solved · 45 min · 38.0 Wh · prefill 89%

"Compute the sum of "Revenue" and "Total Expenses" and put the results under two columns named "Total Revenue" and "Total Expenses" of a new sheet (Sheet2)"

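| step | tool | reasoning | pred | ctx | reuse% | pf t/s | dec t/s | prefill s | decode s | kJ | what the step did |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | click | 460 | 255 | 3026 | 0% | 44.2 | 17.1 | 68.5 | 14.9 | 4.4 | Click + tab to create a new Sheet2 |
| 2 | write | 477 | 272 | 5254 | 58% | 35.6 | 15.2 | 62.6 | 17.9 | 4.2 | Type 'Total Revenue' header into cell A1 |
| 3 | click | 442 | 251 | 7491 | 70% | 34.6 | 14.3 | 64.8 | 17.5 | 4.3 | Click cell B1 to add second header |
| 4 | click | 736 | 368 | 7681 | 13% | 36.7 | 15.0 | 183.1 | 24.6 | 10.7 | Retry clicking B1 (still on A1) |
| 5 | click | 747 | 353 | 7896 | 12% | 37.4 | 14.8 | 185.5 | 23.9 | 10.8 | Another click attempt to select cell B1 |
| 6 | click | 238 | 244 | 8123 | 12% | 38.1 | 14.7 | 187.8 | 16.6 | 10.6 | Overshot to C1; click back toward B1 |
| 7 | write | 252 | 203 | 8348 | 12% | 38.8 | 14.7 | 190.1 | 13.9 | 10.6 | B1 selected; type 'Total Expenses' header |
| 8 | click | 228 | 247 | 8536 | 11% | 39.3 | 14.7 | 192.5 | 16.8 | 10.9 | Click cell A2 for the revenue formula |
| 9 | write | 404 | 289 | 8764 | 11% | 40.0 | 14.3 | 194.8 | 20.1 | 11.2 | Enter =SUM(Sheet1.A2:A20) in cell A2 |
| 10 | key | 327 | 234 | 8993 | 11% | 40.7 | 14.3 | 197.3 | 16.4 | 11.2 | Press Enter to execute revenue SUM formula |
| 11 | click | 578 | 330 | 9192 | 11% | 41.1 | 14.1 | 200.2 | 23.4 | 11.6 | Revenue=867786; click cell B2 next |
| 12 | write | 395 | 300 | 9416 | 10% | 41.7 | 14.0 | 202.7 | 21.4 | 11.7 | Enter =SUM(Sheet1.B2:B20) in cell B2 |
| 13 | key | 270 | 227 | 9651 | 10% | 42.1 | 14.0 | 206.1 | 16.3 | 11.7 | Press Enter to execute expenses SUM formula |
| 14 | answer | 513 | 443 | 9852 | 10% | 42.7 | 13.8 | 208.0 | 32.2 | 12.5 | Done: A2=867786, B2=411686 under headers |

impress af23762e · 4 steps · ✓ solved · 9 min · 6.9 Wh · prefill 79%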

"I am making PPT on LibreOffice Impress for presentation tomorrow. I need to summarize contents on one slide use Impress "Summary Slide" feature. Could you make that for me?"

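| step | tool | reasoning | pred | ctx | reuse% | pf t/s | dec t/s | prefill s | decode s | kJ | what the step did |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | click | 583 | 278 | 3025 | 0% | 44.2 | 17.2 | 68.5 | 16.2 | 4.5 | Open Insert menu hunting for Summary Slide option |
| 2 | click | 1159 | 385 | 5267 | 57% | 35.6 | 15.4 | 63.1 | 25.0 | 4.5 | Insert menu lacks it; open Slide menu instead |
| 3 | click | 811 | 328 | 7499 | 70% | 34.3 | 14.5 | 65.2 | 22.7 | 4.5 | Click Summary Slide option found in Slide menu |
| 4 | answer | 1030 | 509 | 7702 | 13% | 36.8 | 14.8 | 183.3 | 34.3 | 11.1 | Report summary slide created as new slide 13 of 13 |

writer 0e47de2a · 12 steps · ✓ solved · 38 min · 32.3 Wh · prefill 88%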

"Add page number for every page at the bottom left"

| step | tool | reasoning | pred | ctx | reuse% | pf t/s | dec t/s | prefill s | decode s | kJ | what the step did |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | click | 331 | 290 | 2999 | 0% | 44.2 | 16.5 | 67.9 | 17.5 | 4.5 | Open the Insert menu to find page-number options |
| 2 | click | 530 | 296 | 5274 | 57% | 36.0 | 15.5 | 63.4 | 19.1 | 4.3 | Click 'Page Number' in the open Insert menu |
| 3 | key | 1016 | 439 | 7535 | 70% | 34.8 | 12.8 | 65.1 | 34.3 | 5.1 | Undo: it wrongly inserted a literal '1' in the title |
| 4 | click | 474 | 302 | 7796 | 12% | 37.0 | 14.8 | 185.2 | 20.4 | 10.6 | Undo done; reopen Insert to use Header/Footer path |
| 5 | click | 690 | 356 | 8025 | 12% | 37.9 | 14.6 | 186.8 | 24.4 | 10.9 | Click 'Header and Footer' submenu in Insert menu |
| 6 | click | 332 | 251 | 8273 | 11% | 38.7 | 14.7 | 189.7 | 17.1 | 10.8 | Select 'Footer' to enable the footer area |
| 7 | click | 218 | 259 | 8491 | 11% | 39.2 | 14.5 | 192.6 | 17.9 | 11.0 | Click 'Default Page Style' to turn on the footer |
| 8 | click | 643 | 327 | 8741 | 11% | 39.9 | 14.3 | 195.7 | 22.8 | 11.4 | Footer active; open Insert menu to add page number |
| 9 | click | 507 | 322 | 8971 | 10% | 40.4 | 14.1 | 199.0 | 22.9 | 11.6 | Click 'Page Number' to insert field into footer |
| 10 | key | 947 | 415 | 9224 | 10% | 41.1 | 14.1 | 201.6 | 29.4 | 12.0 | Page number '3' shows in footer; confirm insertion |
| 11 | click | 276 | 217 | 9464 | 10% | 41.8 | 14.0 | 203.9 | 15.5 | 11.6 | Click main body to exit footer editing mode |
| 12 | answer | 860 | 434 | 9666 | 10% | 42.1 | 13.7 | 207.1 | 31.7 | 12.5 | Footer page numbers verified; report task complete |

multi_apps 3680a5ee · 7 steps · ✓ solved · 19 min · 15.5 Wh · prefill 88%

"I have file1.xlsx and file2.ods on my Desktop, each containing a single column. Using only the command line, help me merge these two columns into a single column by concatenating the strings from both rows, save the result as ~/Desktop/output.csv, and open it in LibreOffice Calc from the terminal"

| step | tool | reasoning | pred | ctx | reuse% | pf t/s | dec t/s | prefill s | decode s | kJ | what the step did |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | write | 284 | 210 | 3053 | 0% | 44.4 | 16.5 | 68.7 | 12.7 | 4.3 | List Desktop to confirm file1.xlsx and file2.ods exist |
| 2 | write | 348 | 240 | 5261 | 58% | 35.2 | 15.1 | 62.8 | 15.9 | 4.1 | Convert file1.xlsx to CSV via LibreOffice headless |
| 3 | write | 317 | 220 | 7498 | 70% | 34.4 | 13.8 | 65.1 | 15.9 | 4.3 | Convert file2.ods to CSV via LibreOffice headless |
| 4 | write | 1195 | 414 | 7683 | 13% | 36.4 | 14.8 | 183.6 | 28.0 | 10.9 | Merge CSVs with paste -d '' into output.csv |
| 5 | write | 181 | 155 | 7869 | 13% | 37.0 | 14.9 | 186.0 | 10.4 | 10.4 | Open output.csv in LibreOffice Calc from terminal |
| 6 | click | 439 | 247 | 8028 | 12% | 37.6 | 14.7 | 186.9 | 16.8 | 10.6 | Click OK on Calc Text Import dialog to load data |
| 7 | answer | 668 | 375 | 8215 | 12% | 38.3 | 14.5 | 188.4 | 25.8 | 11.0 | Report task complete: merged file opened in Calc |

os 94d95f96 · 7 steps · ✗ score 0.00 · 20 min · 16.4 Wh · prefill 84%

"I want to install Spotify on my current system. Could you please help me?"

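| step | tool | reasoning | pred | ctx | reuse% | pf t/s | dec t/s | prefill s | decode s | kJ | what the step did |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | click | 804 | 379 | 3005 | 0% | 43.1 | 16.4 | 69.7 | 23.1 | 4.9 | Open Ubuntu Software Center from the dock |
| 2 | click | 857 | 405 | 5271 | 57% | 35.7 | 15.4 | 63.6 | 26.2 | 4.6 | Click the search icon in the Snap Store |
| 3 | write | 309 | 214 | 7566 | 70% | 33.9 | 14.4 | 67.8 | 14.8 | 4.4 | Type 'Spotify' into the search field |
| 4 | wait | 282 | 209 | 7752 | 12% | 37.1 | 14.8 | 183.7 | 14.1 | 10.3 | Wait for search results to finish loading |
| 5 | click | 547 | 346 | 7948 | 12% | 37.7 | 14.6 | 185.9 | 23.7 | 10.8 | Click the official Spotify app in results |
| 6 | click | 785 | 381 | 8226 | 11% | 38.4 | 14.5 | 189.5 | 26.2 | 11.1 | Click the green Install button |
| 7 | answer | 1894 | 707 | 8500 | 11% | 38.5 | 14.4 | 196.0 | 49.2 | 12.5 | Auth dialog appeared; report install initiated |

thunderbird 9bc3cc16 · 12 steps · ✓ solved · 37 min · 31.4 Wh · prefill 90%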

"Could you help me back up all the email files in my inbox to ~/emails.bak? Please save them separately in eml format."

| step | tool | reasoning | pred | ctx | reuse% | pf t/s | dec t/s | prefill s | decode s | kJ | what the step did |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | click | 482 | 286 | 3016 | 0% | 44.3 | 17.3 | 68.1 | 16.6 | 4.4 | Click Inbox folder in sidebar to view emails |
| 2 | key | 458 | 250 | 5261 | 57% | 35.1 | 15.3 | 64.1 | 16.3 | 4.2 | Press Ctrl+A to select all inbox messages |
| 3 | right_click | 878 | 374 | 7481 | 70% | 33.8 | 13.8 | 65.7 | 27.0 | 4.8 | Right-click a selected email to open context menu |
| 4 | click | 581 | 300 | 7703 | 12% | 36.3 | 14.7 | 186.0 | 20.4 | 10.8 | Click 'Save As...' in the context menu |
| 5 | click | 307 | 227 | 7913 | 12% | 36.0 | 14.8 | 193.4 | 15.3 | 11.2 | Click Home in file chooser to reach home dir |
| 6 | click | 264 | 231 | 8107 | 12% | 38.1 | 14.8 | 187.9 | 15.6 | 10.6 | Click Create Folder button in file chooser |
| 7 | write | 145 | 151 | 8316 | 11% | 38.6 | 14.6 | 190.5 | 10.4 | 10.5 | Type 'emails.bak' as the new folder name |
| 8 | click | 182 | 183 | 8471 | 11% | 39.2 | 14.6 | 191.9 | 12.5 | 10.7 | Click Create to make the emails.bak folder |
| 9 | click | 294 | 234 | 8650 | 11% | 39.7 | 14.4 | 193.9 | 16.2 | 11.0 | Click Open to save emails into emails.bak |
| 10 | click | 687 | 350 | 8857 | 11% | 40.4 | 14.3 | 195.7 | 24.5 | 11.5 | Open file manager from dock to verify backup |
| 11 | double_click | 255 | 221 | 9094 | 11% | 40.8 | 14.1 | 199.4 | 15.6 | 11.3 | Double-click emails.bak folder to check contents |
| 12 | answer | 812 | 399 | 9299 | 10% | 41.5 | 14.1 | 201.3 | 28.4 | 11.9 | Confirm both emails saved as .eml; report done |

vlc 8ba5ae7a · 8 steps · ✓ solved · 22 min · 18.9 Wh · prefill 88%

"Help me modify the folder used to store my recordings to Desktop"

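| step | tool | reasoning | pred | ctx | reuse% | pf t/s | dec t/s | prefill s | decode s | kJ | what the step did |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | click | 223 | 203 | 3001 | 0% | 43.4 | 16.3 | 69.1 | 12.5 | 4.4 | Click Tools menu in VLC menu bar |
| 2 | click | 402 | 233 | 5240 | 57% | 35.1 | 15.3 | 64.0 | 15.2 | 4.2 | Click Preferences in the Tools dropdown menu |
| 3 | click | 289 | 256 | 7472 | 70% | 34.1 | 12.7 | 65.6 | 20.1 | 4.6 | Open Input/Codecs tab in Preferences dialog |
| 4 | click | 498 | 297 | 7715 | 12% | 36.9 | 14.8 | 183.8 | 20.0 | 10.7 | Click Browse next to Record directory field |
| 5 | click | 454 | 294 | 7947 | 12% | 37.7 | 14.9 | 185.8 | 19.8 | 10.6 | Select Desktop folder in file browser |
| 6 | click | 431 | 264 | 8185 | 11% | 38.4 | 14.6 | 188.7 | 18.0 | 10.7 | Click Choose to confirm Desktop directory |
| 7 | click | 512 | 284 | 8403 | 11% | 39.1 | 14.5 | 190.9 | 19.6 | 10.9 | Click Save to apply preference changes |
| 8 | answer | 786 | 392 | 8629 | 11% | 39.7 | 14.3 | 193.8 | 27.4 | 11.4 | Report recording folder set to Desktop, task done |

vs_code 0ed39f63 · 8 steps · ✓ solved · 22 min · 18.0 Wh · prefill 90%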

"Please help me change all the places in this document that say "text" to "test"."

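| step | tool | reasoning | pred | ctx | reuse% | pf t/s | dec t/s | prefill s | decode s | kJ | what the step did |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | key | 473 | 231 | 3008 | 0% | 44.3 | 17.1 | 67.9 | 13.5 | 4.3 | Open Find & Replace dialog with Ctrl+H |
| 2 | write | 202 | 168 | 5214 | 58% | 35.3 | 15.4 | 62.5 | 10.9 | 3.8 | Type 'text' into the Find field |
| 3 | click | 268 | 198 | 7412 | 70% | 34.3 | 14.6 | 64.3 | 13.5 | 4.1 | Click the Replace input field |
| 4 | write | 291 | 168 | 7583 | 13% | 36.3 | 15.0 | 183.0 | 11.2 | 10.3 | Type 'test' into the Replace field |
| 5 | click | 328 | 209 | 7719 | 12% | 36.9 | 14.8 | 183.3 | 14.1 | 10.2 | Click Replace All button |
| 6 | key | 609 | 274 | 7896 | 12% | 37.5 | 14.8 | 185.3 | 18.6 | 10.5 | Save the document with Ctrl+S |
| 7 | key | 448 | 217 | 8074 | 12% | 38.0 | 14.8 | 187.3 | 14.7 | 10.5 | Close Find & Replace dialog with Esc |
| 8 | answer | 355 | 304 | 8236 | 12% | 38.5 | 14.6 | 189.3 | 20.9 | 10.9 | Report all 10 'text' replaced with 'test' |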

Same shape in every domain: a cold, cheap first step, then a one-time prefill jump when the 3-screenshot window fills and KV reuse collapses to ~950 tokens — after which prefill dominates every step regardless of the action. os is shown as a score-0 near-miss; the cost profile is identical to the successes.

Attacking the prefill wall — KV-cache management & stage→device affinity

The diagnosis is the same everywhere above: ~89% of inference time is prefill, and the KV prefix dies at the 3-screenshot eviction (reuse 70% → 10%), so the agent re-prefills ~7–9k tokens every step. A model router mainly moves the 11% decode term; the dominant 89% needs a different attack. Two levers, meant to stack, followed by how they compose with the router:

1 · KV-cache-stable context. Keep the prompt prefix invariant so screenshot eviction stops invalidating it — evict within a stable window, pin images at fixed offsets, or carry KV across steps — restoring the 70% reuse we briefly see at step 3 instead of the frozen 947. Corroboration from the companion OpenClaw prefix-cache experiment: auto-compaction that breaks the prefix cache produced a ≈24× prefill blow-up between two consecutive turns — the same failure mode, measured on a different stack. Estimated 2–3× faster steps here if the prefix is preserved. [measured diagnosis + measured corroboration]

2 · Stage→device affinity. Inference stages have different bottlenecks, so place each on the silicon that fits: vision-encode + prefill are compute-bound and parallel → an iGPU/NPU; decode is memory-bandwidth-bound and sequential → the CPU. Holo3 is prefill-dominated, so moving prefill onto a compute engine attacks the 89% directly. On this box that path is currently blocked (the Xe3 vision encoder crashes — see the live-tracker root-cause); the durable fix needs OpenVINO MoE-VLM support or higher-bandwidth edge silicon. [proposed; device bring-up measured]

3 · Compose with the router. Because prefill dominates and switching models invalidates the (weight-specific) KV cache, the layers are complementary, not competing: cache-stable context + stage affinity cut the 89% prefill; a difficulty-gated, switch-aware model router trims the 11% decode and skips steps that don't need the big model. Stack them — don't pick one.
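Why lever 1 matters can be shown with a toy prompt. The sketch below is not the real Holo3 prompt template (which this page does not describe); the token labels and segment sizes are invented. It only illustrates that a prefix cache can reuse tokens up to the first position where two consecutive prompts differ, so dropping an early screenshot invalidates everything after it.

```python
def reusable_prefix(prev_tokens, cur_tokens):
    """Number of leading tokens shared by two consecutive prompts."""
    n = 0
    for a, b in zip(prev_tokens, cur_tokens):
        if a != b:
            break
        n += 1
    return n

# Toy segments (sizes are illustrative, not measured).
system = ["sys"] * 4

def shot(i):
    return [f"img{i}"] * 3   # screenshot i

def turn(i):
    return [f"txt{i}"]       # model output for step i

step3 = system + shot(1) + turn(1) + shot(2) + turn(2) + shot(3)

# Append-only context: step 4 just extends step 3.
step4_append = step3 + turn(3) + shot(4)

# Sliding 3-screenshot window: screenshot 1 is dropped from the middle.
step4_evict = system + turn(1) + shot(2) + turn(2) + shot(3) + turn(3) + shot(4)

reuse_append = reusable_prefix(step3, step4_append)
reuse_evict = reusable_prefix(step3, step4_evict)
print(reuse_append, "of", len(step3), "tokens reusable when appending")
print(reuse_evict, "of", len(step3), "tokens reusable after eviction")
```

In the toy, appending keeps the whole previous prompt reusable, while eviction leaves only the system segment, the same shape as the measured drop from 70% reuse to a fixed 947 tokens.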

Does effort front-load? — testing the routing hypothesis on real data

The motivation assumed early steps are heavy (planning) and the tail is routine. With a step-indexed corpus that is directly checkable. An adversarial pass over the 36 successful episodes computed each effort proxy by within-task position (terciles), per task, then averaged across tasks. The output-side proxies (reasoning chars, generated tokens) are used because — unlike latency — they do not inflate mechanically as the prompt grows.

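| within-task position | reasoning_len | pred tokens | new tok (prefill) | decode t/s | predict s |
| --- | --- | --- | --- | --- | --- |
| first step | 482 | 252 | 2990 | 16.83 | 82.7 |
| early tercile | 515 | 265 | 3277 | 15.78 | 100.2 |
| mid tercile | 501 | 266 | 5495 | 14.32 | 160.9 |
| late tercile | 563 ↑ | 320 ↑ | 7448 | 14.29 | 207.3 |
| last step | 664 ↑ | 410 ↑ | 7716 | 14.12 | 217.8 |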
Verdict: refuted. Within successful tasks, output-side effort rises early → late, it does not fall: reasoning text +29% and generated tokens +33% (per-task early→late ratio on tasks ≥3 steps, rising in ~5 of 6). And the very first step is the lightest, not the heaviest: despite a cold KV cache it has the shortest prompt, so it is the fastest step (≈83 s vs ≈207 s late). The clean "front-load the big model, shrink it for the tail" story does not hold here.
The dangerous failure mode for a position-based router. Failed runs do not taper to routine — they stay high or spike. The cap-bound looper holds reasoning_len ~657–690 late (peaking ~1144 mid-run) while degenerating into a click/scroll loop. A naïve "use the small model on late steps" policy would strip capability exactly where the agent is stuck and needs it most — turning recoverable runs into hard failures.
The defensible motivation. Route by per-step difficulty, not step position. Cheap orientation/confirmation steps genuinely exist and can run on a small model — but they are scattered throughout the trajectory (see the exemplar's step-7–9 click-through), not confined to a tail. And the router must never starve a stuck, high-reasoning step of capacity. (Latency is not a difficulty signal: per-step wall time rises 2.3× early→late only because prompt + KV grow 2.4×.)

Method: within each successful task, step i is normalized to (i−1)/(N−1) and binned early/mid/late by tercile, plus the literal first/last step; per-task means are averaged equally across the 36 tasks (a per-task trend, not a global step pool). n is small (median successful task ≈ 6 steps; 11 tasks have ≥9 steps), so treat magnitudes as directional.
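The binning described above can be sketched as follows. The boundary handling (a position exactly on 1/3 or 2/3 goes to the later bin) and the treatment of one-step tasks are assumptions, since the method note does not specify them. The two toy tasks reuse the reasoning lengths of the impress and vs_code trajectories shown earlier.

```python
from collections import defaultdict
from statistics import mean

def position_bin(i, n):
    """Map 1-based step i of an n-step task to 'early', 'mid' or 'late'."""
    pos = 0.0 if n == 1 else (i - 1) / (n - 1)
    if pos < 1 / 3:
        return "early"
    if pos < 2 / 3:
        return "mid"
    return "late"

def per_task_trend(tasks, field):
    """Average `field` per bin within each task, then average those
    per-task means equally across tasks (not a global step pool)."""
    per_bin = defaultdict(list)
    for steps in tasks.values():
        n = len(steps)
        by_bin = defaultdict(list)
        for i, step in enumerate(steps, start=1):
            by_bin[position_bin(i, n)].append(step[field])
        for name, values in by_bin.items():
            per_bin[name].append(mean(values))
    return {name: mean(values) for name, values in per_bin.items()}

tasks = {
    "impress/af23762e": [{"reasoning_len": r} for r in (583, 1159, 811, 1030)],
    "vs_code/0ed39f63": [{"reasoning_len": r}
                         for r in (473, 202, 268, 291, 328, 609, 448, 355)],
}
trend = per_task_trend(tasks, "reasoning_len")
```

Averaging per task first means a 100-step episode counts no more than a 4-step one, which is the point of a per-task trend.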

Real episode — short success

chrome/06fe7178 · "bring back the last tab I shut down"
score 1.0 · 2 steps · 183 s wall · 2.22 Wh

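| step | tool | new | reused | pred | dec t/s | predict s | score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | key | 3003 | 0 | 198 | 17.67 | 74.9 | 1.0 |
| 2 | answer→DONE | 2170 | 2999 | 269 | 15.63 | 75.9 | 1.0 |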

new = uncached tokens prefilled this step; reused = KV prefix carried over. Step 2 reuses 2999 tokens from step 1, the rare case where the cache survives a step.

Real episode — cap-bound looper

chrome/99146c54 · "delete browsing data automatically on close"
score 0.0 · 100 steps · 8.9 h wall · 38.2 Wh

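| step | new | reused | pred | dec t/s | predict s | score |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 3007 | 0 | 221 | 17.68 | 75.5 | 0.0 |
| 51 | 16484 | 947 | 400 | 9.80 | 328.4 | 0.0 |
| 100 | 26299 | 947 | 308 | 6.21 | 520.5 | 0.0 |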

KV reuse stays stuck at 947, so by step 100 the agent re-prefills 26k tokens every step (full context ≈ 27k) — prefill runs away and decode decays 17.7 → 6.2 t/s.

The contrast is the whole motivation in one picture. Step 1 of a 2-step win and step 1 of a 100-step failure are near-indistinguishable on cost (~75 s, ~3000 tokens, ~17.7 t/s), so the routing signal isn't in the raw cost, it's in the state. And the looper shows the runaway the router exists to prevent: per-step latency climbing 75 → 520 s and decode collapsing 17.7 → 6.2 t/s as context grows.

Host telemetry — 2 Hz, joined on ts

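| cpu_util | cpu_mhz | server_rss | mem_avail | pkg_w | psys_w | gpu_mhz | npu_util | pkg_°C |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 20.8% | 2156 | 38.0 GB | 50.2 GB | 16.2 | 29.3 | 0 (idle) | 0% (idle) | 78.0 |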

The iGPU and NPU read idle by design — the device bring-up (on the live tracker) shows why this MoE-VLM serves only on CPU on this silicon.
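Joining the 2 Hz samples to a step amounts to averaging the samples whose ts falls inside the step's time window. A minimal sketch, assuming each step has a known start and end timestamp (how those are stored is not stated here) and using invented sample values:

```python
def window_mean(samples, t_start, t_end, field):
    """Mean of `field` over samples with t_start <= ts < t_end, or None if empty."""
    values = [s[field] for s in samples if t_start <= s["ts"] < t_end]
    return sum(values) / len(values) if values else None

# Toy 2 Hz telemetry: one sample every 0.5 s (values are made up).
samples = [{"ts": 100.0 + 0.5 * k, "pkg_w": 16.0 + k} for k in range(6)]

# Hypothetical step window covering ts 100.5, 101.0 and 101.5.
step_pkg_w = window_mean(samples, 100.5, 102.0, "pkg_w")
```

A half-open window avoids counting a boundary sample in two consecutive steps.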

What this trains — and what it doesn't (yet)

Toward the step-level model router — now motivated by per-step difficulty rather than the refuted early/late split — the corpus is necessary, not yet sufficient.

✓ Fully covered — the state & the cost. Every feature a router conditions on at decision time (step index, domain, context size, cached-token ratio, the screenshot, the message history, the tool needed) and the full price term (per-step latency and energy) is measured. Most routing work hand-waves the cost side; here it's instrumented.

✗ The missing piece — the label. This is a single-model, single-rollout, end-scored benchmark, which leaves two gaps:
  • No counterfactual. Every step ran on the 35B. Nothing says whether a smaller model would have produced an acceptable action at that same state — "big was necessary here" is unlabeled.
  • No per-step quality signal. OSWorld scores only the final state; intermediate reward is ~0. You get task-level success/failure, not action-level correctness.
What rescues it — the corpus is replayable. Because the screenshot and (reconstructable) message history are saved per step, you can generate the missing labels offline, without re-running OSWorld: replay each logged state through a smaller model, compare its action to the 35B's (action-match, or downstream-success via a verifier), and that becomes the supervised "big-needed" label. The single-model run can't produce routing labels by itself — but it's purpose-built to feed that offline pass.
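The offline labelling pass could look roughly like this. The replay hook small_model_action is hypothetical (a callable that takes a logged step and returns the smaller model's proposed action string), and exact string matching is the crudest possible comparison; spatial actions with pixel coordinates would need a tolerance or a downstream verifier instead.

```python
import re

def normalize_action(action):
    """Canonical form for comparison: no whitespace, single quotes, lowercase."""
    return re.sub(r"\s+", "", action).replace('"', "'").lower()

def big_needed_labels(logged_steps, small_model_action):
    """Label a step 1 when the small model's action differs from the logged one."""
    labels = []
    for step in logged_steps:
        proposed = small_model_action(step)  # hypothetical replay hook
        match = normalize_action(proposed) == normalize_action(step["action"])
        labels.append({"step": step["step"], "big_needed": int(not match)})
    return labels

# Toy replay: a stub standing in for the smaller model.
logged = [
    {"step": 1, "action": "hotkey('ctrl','shift','t')"},
    {"step": 2, "action": "click(640, 360)"},
]

def stub_small_model(step):
    # Agrees on step 1 (modulo formatting), disagrees on step 2.
    return 'hotkey("ctrl", "shift", "t")' if step["step"] == 1 else "scroll(-3)"

labels = big_needed_labels(logged, stub_small_model)
```

Action mismatch is only a proxy: a different action can still lead to success, so a label of 1 here means "disagreed with the 35B", not "the small model would have failed".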

Two honest caveats baked into the schema: (1) steps_profile.tool is always null (a known bug) — join the per-step tool from model_calls.jsonl/traj.jsonl instead; nothing is lost. (2) reasoning_len is the char count of the <think> block — the reasoning text is discarded, which is the single most valuable missing field for a planning-vs-routine classifier and is worth capturing in a re-run.

Part of System Optimization for Edge Device Inference · WukLab @ UC San Diego · advised by Yiying Zhang