Holo3 · OSWorld · NUC16 — profiling corpus

What this is

Every step of the full 361-task OSWorld-Verified run of Holo3-35B-A3B on the Intel Panther Lake NUC (CPU-only llama.cpp) is instrumented end to end. The run reproduces the official leaderboard config (100-step budget, screenshot-only, single rollout) — but alongside the score it emits a step-indexed, replayable corpus of state, action, and measured cost.

Two granularities. The atomic record is one step (one model call); the unit a router actually reasons over is one whole task: a trajectory of steps from the first screenshot to the final answer, plus episode-level aggregates. The research target is an upper-layer model router that sends each step to an appropriately sized model. The intended heuristic was a large model for early planning and a small model for the routine tail, and because the corpus is step-indexed, that hypothesis is directly testable. We test it below, and the data refutes the simple version.

Measured so far: 933 steps across 63 completed tasks (median 8, mean 14.8, max 100 steps/task). Projected to the full 361-task set: roughly 8,000–15,000 rows.

Captured from four time-aligned sources

One logical row = one agent step (one model call). It is assembled by joining the three per-step sources on (domain, task_id, step), then attaching the 2 Hz telemetry samples whose ts falls inside that step's time window.

source file	grain	contributes
raw/model_calls.jsonl	per step	tokens, server timings, tool, reasoning length, parse success
…/episode_profile.json → steps_profile[]	per step	latency split (predict / exec), 5-domain RAPL Joules
…/traj.jsonl	per step	action code, note, thought, screenshot filename, reward/done
raw/system_samples.csv	2 Hz	host telemetry: CPU/GPU/NPU, power, memory, thermals

Plus, on disk per step: the full-resolution screenshot PNG; per task: a screen recording (mp4) and the OSWorld config snapshot.
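The join described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the record shapes, the `ts_start` / `ts_end` step-window fields and the toy values are assumptions; only the join key (domain, task_id, step) and the ts-window rule come from the text.

```python
from collections import defaultdict


def join_step_rows(model_calls, steps_profile, traj, samples):
    """Merge per-step records on (domain, task_id, step), then attach the
    telemetry samples whose ts falls inside each step's window."""
    rows = defaultdict(dict)
    # traj is merged last, so its tool value overrides a null tool
    # coming from steps_profile.
    for source in (model_calls, steps_profile, traj):
        for rec in source:
            key = (rec["domain"], rec["task_id"], rec["step"])
            rows[key].update(rec)

    for row in rows.values():
        # ts_start / ts_end are hypothetical names for the step window.
        start, end = row.get("ts_start"), row.get("ts_end")
        if start is None or end is None:
            row["telemetry"] = []
        else:
            row["telemetry"] = [s for s in samples if start <= s["ts"] <= end]
    return dict(rows)


# Made-up records, only to show the shape of the result.
model_calls = [{"domain": "chrome", "task_id": "t1", "step": 1, "prompt_tokens": 3003}]
steps_profile = [{"domain": "chrome", "task_id": "t1", "step": 1, "tool": None,
                  "t_predict_s": 74.9, "ts_start": 100.0, "ts_end": 175.0}]
traj = [{"domain": "chrome", "task_id": "t1", "step": 1, "tool": "key"}]
samples = [{"ts": 99.5, "pkg_w": 15.0}, {"ts": 100.5, "pkg_w": 16.2},
           {"ts": 101.0, "pkg_w": 16.4}, {"ts": 176.0, "pkg_w": 15.1}]

joined = join_step_rows(model_calls, steps_profile, traj, samples)
row = joined[("chrome", "t1", 1)]
print(row["tool"], row["prompt_tokens"], len(row["telemetry"]))
```

A linear scan over the samples is fine for a sketch; at 2 Hz over multi-hour runs a sorted array with bisection would be the more natural choice.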

One row per agent step — fields grouped by role

Example values are real, sampled from chrome/06fe7178 step 1 (task: "Can you make my computer bring back the last tab I shut down?").

STATE — what the router conditions on

field	example	meaning
domain	chrome	task category (10 OSWorld domains)
task_id	06fe7178-4491-…	OSWorld example UUID
step	1	index within the episode; the "early vs late" planning axis
screenshot_file	step_1_…031052.png	full-resolution observation, kept on disk for replay
prompt_tokens	3003	newly-prefilled (uncached) tokens this step; the prefill workload, = llama.cpp prompt_n
cached_tokens	0	KV prefix reused from prior step (0 = cold). Full context = prompt_tokens + cached_tokens
instruction	"…bring back the last tab…"	natural-language task string

ACTION — what the model did

field	example	meaning
tool	key	one of click · double_click · right_click · drag · scroll · write · key · wait · answer
action	hotkey('ctrl','shift','t')	executed pyautogui code (with pixel coords when spatial)
note	"User wants to restore…"	durable memory the agent carries forward
thought	"…use Ctrl+Shift+T…"	one-line stated reasoning
reasoning_len	436	chars of the hidden <think> block (text itself not saved; see the caveats at the end)

COST — the price term of any routing objective

field	example	meaning
t_predict_s	74.946	model-call wall time
t_exec_s	3.281	environment action time
prefill_tps	47.33	prompt tokens / s
decode_tps	17.67	generated tokens / s
predicted_tokens	198	tokens generated this step
↳ time split (derived)	prefill ≈ 63 s · decode ≈ 11 s	t_predict ≈ prompt_tokens/prefill_tps + predicted_tokens/decode_tps; prefill is ~70–90% of step time here
energy_j	{pkg 3918.6, core 3659.2, uncore 1.1, dram 92.2, psys 5572.3}	per-step Joules, 5 RAPL domains
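The derived time split can be checked directly from the example values of this step. The sketch below only restates the formula from the COST fields; the function name is illustrative.

```python
def split_predict_time(prompt_tokens, prefill_tps, predicted_tokens, decode_tps):
    """Decompose model-call time into prefill and decode seconds."""
    prefill_s = prompt_tokens / prefill_tps
    decode_s = predicted_tokens / decode_tps
    return prefill_s, decode_s


# Example values from chrome/06fe7178 step 1.
t_predict_s = 74.946
prefill_s, decode_s = split_predict_time(3003, 47.33, 198, 17.67)
other_s = t_predict_s - prefill_s - decode_s   # HTTP + sampling remainder
prefill_share = prefill_s / t_predict_s

print(f"prefill {prefill_s:.1f} s, decode {decode_s:.1f} s, "
      f"other {other_s:.2f} s, prefill share {prefill_share:.0%}")
```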

LABEL — inherited by every step of the task

field	example	meaning
score	1.0	OSWorld execution-based final score (0/1)
done_by_agent	true	agent self-terminated vs hit the 100-step cap
steps	2	total episode length

Zooming out — one task = one episode

The row above is one step. A task is the whole sequence of those rows for one instruction, wrapped with episode-level aggregates — and it is the unit a router policy acts on. Below: the episode schema (real values from chrome/030eeff7) and that task's complete 14-step trajectory.

EPISODE — the unit the router acts on

field	example	meaning
domain / task_id	chrome / 030eeff7-…	one OSWorld example
instruction	"enable 'Do Not Track' in Chrome…"	the natural-language goal
score	1.0	execution-based final score (0/1)
steps	14	trajectory length
done_by_agent	true	self-terminated via answer (vs hitting the 100-step cap)
wall_s	2720.2	45 min end-to-end
energy_wh	38.27	package energy for the whole task
mean_predict_s	189.38	mean model-call time across the 14 steps
mean_decode_tps	14.53	mean generated tokens/s
mean_prompt_tokens	6726	mean newly-prefilled (uncached) tokens per step (ctx − reused in the table below)
steps_profile[]	[ 14 × step-row ]	the full per-step table ↓ (the STATE/ACTION/COST schema above, ×14)

One complete trajectory — all 14 steps of `chrome/030eeff7`

Every per-step quantity a step-level router consumes, for one full episode, grouped as output · KV state · cost coefficients · label. ctx = full context (prefilled + reused); reuse% = KV-hit rate (reused ÷ ctx); split = prefill/decode share of the step (prefill s vs decode s); switch s = full re-prefill penalty if the router swapped models at that step (= ctx ÷ pf t/s); score = task-level label (the only outcome signal; per-step quality is the missing counterfactual). reasoning = <think> chars (task max 1047). Goal: enable Chrome's "Do Not Track".

		output		KV state			cost coefficients							label
step	tool	reasoning	pred	ctx	reused	reuse%	pf t/s	dec t/s	prefill s	decode s	split	switch s	kJ	score	what the step did
1	click	1047	438	3007	0	0%	44.2	16.7	68.0	26.2	72/28	68	4.8	1.0	open ⋮ menu; plan route to Settings
2	click	364	237	5271	3003	57%	36.1	15.1	62.8	15.7	80/20	146	4.1	1.0	click Settings
3	click	517	291	7507	5267	70%	34.5	14.5	64.9	20.1	76/24	218	4.4	1.0	go to Privacy & security
4	scroll	988	436	7719	947	12%	37.0	14.8	183.2	29.5	86/14	209	10.9	1.0	DNT not visible; scroll to look
5	click	1018	409	7972	947	12%	37.7	14.6	186.1	27.9	87/13	211	10.9	1.0	still hidden; open Privacy guide
6	click	749	324	8196	947	12%	38.4	14.6	188.8	22.1	90/10	213	10.9	1.0	Next through guide
7	click	328	240	8394	947	11%	38.9	14.5	191.7	16.5	92/8	216	10.8	1.0	Next
8	click	246	230	8600	947	11%	39.5	14.3	193.7	16.1	92/8	218	11.0	1.0	Next (tersest step)
9	click	381	257	8810	947	11%	40.0	14.3	196.5	17.9	92/8	220	11.2	1.0	Next
10	click	480	299	9016	947	11%	40.8	14.2	198.0	21.0	90/10	221	11.4	1.0	guide done → back to settings
11	click	332	268	9249	947	10%	41.3	14.1	201.0	19.0	91/9	224	11.5	1.0	open Third-party cookies
12	click	333	277	9479	947	10%	41.9	14.0	203.7	19.8	91/9	226	11.7	1.0	found DNT toggle (Advanced)
13	click	523	268	9719	947	10%	42.5	13.9	206.5	19.3	91/9	229	11.9	1.0	Confirm dialog
14	answer	315	274	9912	947	10%	42.9	13.7	209.1	19.9	91/9	231	12.1	1.0	answer → DONE (toggle on)

Per-step time decomposed as prefill = prompt_tokens ÷ pf t/s and decode = predicted_tokens ÷ dec t/s; the remainder (HTTP + sampling) is <0.6 s every step. Task totals: prefill 2354 s (89%), decode 291 s (11%), other 6 s; 38.2 Wh.
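The task totals can be reproduced from the per-step columns of the table. A short check, using only the prefill s, decode s and kJ values listed for the 14 steps:

```python
# prefill s, decode s and kJ columns of chrome/030eeff7, steps 1..14.
prefill_s = [68.0, 62.8, 64.9, 183.2, 186.1, 188.8, 191.7,
             193.7, 196.5, 198.0, 201.0, 203.7, 206.5, 209.1]
decode_s = [26.2, 15.7, 20.1, 29.5, 27.9, 22.1, 16.5,
            16.1, 17.9, 21.0, 19.0, 19.8, 19.3, 19.9]
energy_kj = [4.8, 4.1, 4.4, 10.9, 10.9, 10.9, 10.8,
             11.0, 11.2, 11.4, 11.5, 11.7, 11.9, 12.1]

prefill_total = sum(prefill_s)
decode_total = sum(decode_s)
prefill_share = prefill_total / (prefill_total + decode_total)
energy_wh = sum(energy_kj) / 3.6   # 1 Wh = 3.6 kJ

print(f"prefill {prefill_total:.0f} s, decode {decode_total:.0f} s, "
      f"prefill share {prefill_share:.0%}, energy {energy_wh:.1f} Wh")
```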

Difficulty is scattered, not front-loaded. Heavy-reasoning bursts appear at step 1 (planning the route to Settings, 1047 chars) and again at steps 4–5 (re-planning when "Do Not Track" wasn't where the model expected) — interleaved with terse routine click-through (steps 7–9, down to 246). A router keyed on step position would miss the step-4 spike entirely; the signal it needs is per-step difficulty.

Prefill — not decode — is the cost. 89% of compute time is prefill, 11% decode (other <1%). The KV prefix grows healthily through step 3 (reused 0 → 5267), then the 3-screenshot eviction at step 4 collapses reuse to 947 for the rest of the run — so the model re-prefills ~7–8k tokens every step, even to press "Next." That alone more than doubles step time (≈85 → 215 s) and energy (≈4.4 → 11 kJ), independent of how hard the step is.

What a router needs from this — and the trap. A model router optimizes step cost, and the split says where it lives. Routing to a smaller model mainly speeds decode (its lever is the dec t/s coefficient) — but decode is only 11% here, so the ceiling on model-size routing is low unless the small model also prefills cheaper. The dominant term is prefill = prompt_tokens ÷ pf t/s, governed by context management (the reused column), not model choice. And the trap: switching models mid-trajectory invalidates the KV cache (it is weight-specific), forcing a full re-prefill of the whole context on the new model — at 89% prefill with reuse already fragile, a router that flips models often can cost more than it saves. So the trainable objective is per-step difficulty (does this step need the big model?) traded against switch cost (is the saving worth a re-prefill?), with cost = {prefill_s, decode_s, energy} decomposed per candidate model.
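The switch-cost side of that trade-off can be sketched with the table's own definition, switch s = ctx ÷ pf t/s. Everything else here is an assumption for illustration: `saving_per_step_s` and `steps_on_new_model` are hypothetical estimates a router would have to supply, and the big model's prefill rate stands in for the new model's because no second model was measured.

```python
def switch_penalty_s(ctx_tokens, prefill_tps_new):
    """Full re-prefill time on the new model: ctx / pf t/s."""
    return ctx_tokens / prefill_tps_new


def worth_switching(ctx_tokens, prefill_tps_new, saving_per_step_s, steps_on_new_model):
    """True only if the expected savings outweigh the one-time re-prefill."""
    penalty = switch_penalty_s(ctx_tokens, prefill_tps_new)
    return saving_per_step_s * steps_on_new_model > penalty


# Step 7 of chrome/030eeff7: ctx 8394 tokens, pf 38.9 t/s.
penalty = switch_penalty_s(8394, 38.9)
print(f"switch penalty ≈ {penalty:.0f} s")

# Hypothetical: a small model saves 10 s per step over the 3-step click-through.
print(worth_switching(8394, 38.9, saving_per_step_s=10.0, steps_on_new_model=3))
```

Under these made-up savings the switch does not pay for itself, which is the trap the paragraph above describes; the comparison flips only when the per-step saving or the run length on the new model is large.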

Corpus at a glance — 63 tasks, 933 steps so far

metric	value
scored tasks	63
total steps	933
steps/task min · median · mean	2 · 8 · 14.8
steps/task p90 · max	32 · 100

episode length	tasks	solve rate (score = 1.0)
1–3 steps	4	75%
4–9 steps	32	75%
10–29 steps	19	58%
30–99 steps	7	29%
100 (capped)	1	0%

Solve rate collapses as tasks get longer — and long tasks are also where cost concentrates (latency rises with trajectory length). The long tail is the expensive, low-yield regime any efficiency layer most wants to act on.

Per task type — is prefill always the bottleneck? (as of 63 tasks / 933 steps)

All 10 OSWorld application domains, across every completed task. prefill% / decode% = share of compute time (prefill_s = prompt_tokens ÷ pf t/s, decode_s = predicted ÷ dec t/s). solve% is strict (score = 1.0). mean reason = mean reasoning_len per step, in chars. Sorted by prefill share.

domain	done / total	mean steps	solve%	prefill%	decode%	mean reason
calc	4 / 47	25.5	100%	90.0	9.8	462
thunderbird	3 / 15	19.0	100%	89.2	10.6	593
chrome	32 / 46	16.7	63%	88.7	11.1	588
os	3 / 24	11.0	33%	88.6	11.1	667
multi_apps	3 / 93	13.7	67%	88.5	11.2	544
vlc	3 / 17	11.3	33%	87.9	11.8	568
writer	4 / 23	13.5	50%	87.2	10.2	476
gimp	4 / 26	5.8	75%	86.9	12.9	543
impress	4 / 47	9.0	25%	86.6	13.2	660
vs_code	3 / 23	6.0	100%	86.1	10.3	400
all	63 / 361	14.8	—	88.7	11.0	—

Yes — prefill dominates every task type. Prefill is 86–90% of compute in all 10 domains (overall 88.7%, decode 11.0%, other <0.5%); not one is decode-bound — fact-checked across all 933 steps. The prefill wall is structural, not specific to one kind of task.

Is there a reasoning pattern? Not a consistent one. Within-task reasoning-length trends diverge by task type: chrome (the only robust sample, n=20) is ~flat (548 → 553), libreoffice-calc falls (639 → 490), while a few rise (thunderbird +24%, vs_code +22%). But outside chrome each domain rests on just 1–4 successful tasks, so the rise/fall labels are largely anecdotal. The dependable cross-domain fact is the prefill share, not a reasoning shape — which again argues for routing on measured per-step difficulty over any positional or task-type heuristic.

One complete trajectory for every task type

A full per-step trajectory, like the chrome flagship above, for each of OSWorld's 10 domains (one representative completed task each, all fact-checked against the run). Columns are the router-relevant cost coefficients; reuse% = KV-hit rate.

chrome/f79439ad · 12 steps · ✓ solved · 36 min · 30.4 Wh · prefill 91%

"Search for a one way flight from Dublin to Vienna on 10th next month for 2 adults."

step	tool	reasoning	pred	ctx	reuse%	pf t/s	dec t/s	prefill s	decode s	kJ	what the step did
1	click	656	267	3011	0%	43.5	16.6	69.3	16.1	4.5	Click 'Yes, I agree' to dismiss privacy dialog
2	click	564	230	5220	58%	34.8	15.4	63.7	14.9	4.2	Select 'One way' trip type
3	click	279	166	7402	70%	33.4	14.4	65.5	11.5	4.2	Click 'From' field to change departure airport
4	click	319	181	7543	13%	36.1	15.0	182.6	12.1	10.2	Select Ireland in origin country list
5	click	496	197	7693	12%	36.7	14.6	183.9	13.5	10.3	Pick Dublin as departure airport
6	click	336	168	7828	12%	37.2	14.6	185.0	11.5	10.3	Select Austria in destination country list
7	click	307	177	7970	12%	37.5	14.6	187.2	12.1	10.4	Pick Vienna as destination airport
8	click	611	323	8128	12%	38.1	14.6	188.6	22.1	10.9	Choose July 10 in the date calendar
9	click	353	215	8318	11%	38.5	14.6	191.3	14.7	10.8	Increase adult passenger count to 2
10	click	340	196	8487	11%	39.1	14.4	192.8	13.6	10.8	Click 'Done' to close passenger dropdown
11	click	501	231	8647	11%	39.6	14.3	194.4	16.2	11.1	Click 'Search' to run the flight search
12	answer	722	384	8801	11%	40.1	14.3	195.7	26.9	11.6	Answer: found flight FR 7008, Dublin to Vienna

gimp/77b8ab4d · 8 steps · ✓ solved · 21 min · 18.0 Wh · prefill 90%

"Could you assist me in placing my photo on the desktop and renaming it to export.jpg?"

step	tool	reasoning	pred	ctx	reuse%	pf t/s	dec t/s	prefill s	decode s	kJ	what the step did
1	click	494	242	3007	0%	44.0	17.2	68.3	14.1	4.4	Open the File menu to find export option
2	click	380	201	5197	58%	35.0	15.8	62.7	12.7	4.0	Click 'Export As...' in the File menu
3	click	761	339	7393	70%	34.4	14.5	63.9	23.4	4.5	Click filename field in Export dialog to edit
4	key	378	199	7584	12%	36.4	14.9	182.2	13.4	10.2	Select all existing filename text to replace it
5	write	316	181	7731	12%	37.0	14.9	183.1	12.1	10.2	Type new filename 'export.jpg'
6	click	308	175	7883	12%	37.5	14.8	185.1	11.9	10.3	Click Export to confirm Desktop save location
7	click	566	240	8036	12%	37.9	14.6	187.2	16.4	10.6	Confirm JPEG export options dialog (quality 90)
8	answer	348	194	8202	12%	38.3	14.3	189.4	13.5	10.6	Confirm image exported to Desktop/export.jpg

calc/42e0a640 · 14 steps · ✓ solved · 45 min · 38.0 Wh · prefill 89%

"Compute the sum of "Revenue" and "Total Expenses" and put the results under two columns named "Total Revenue" and "Total Expenses" of a new sheet (Sheet2)"

step	tool	reasoning	pred	ctx	reuse%	pf t/s	dec t/s	prefill s	decode s	kJ	what the step did
1	click	460	255	3026	0%	44.2	17.1	68.5	14.9	4.4	Click + tab to create a new Sheet2
2	write	477	272	5254	58%	35.6	15.2	62.6	17.9	4.2	Type 'Total Revenue' header into cell A1
3	click	442	251	7491	70%	34.6	14.3	64.8	17.5	4.3	Click cell B1 to add second header
4	click	736	368	7681	13%	36.7	15.0	183.1	24.6	10.7	Retry clicking B1 (still on A1)
5	click	747	353	7896	12%	37.4	14.8	185.5	23.9	10.8	Another click attempt to select cell B1
6	click	238	244	8123	12%	38.1	14.7	187.8	16.6	10.6	Overshot to C1; click back toward B1
7	write	252	203	8348	12%	38.8	14.7	190.1	13.9	10.6	B1 selected; type 'Total Expenses' header
8	click	228	247	8536	11%	39.3	14.7	192.5	16.8	10.9	Click cell A2 for the revenue formula
9	write	404	289	8764	11%	40.0	14.3	194.8	20.1	11.2	Enter =SUM(Sheet1.A2:A20) in cell A2
10	key	327	234	8993	11%	40.7	14.3	197.3	16.4	11.2	Press Enter to execute revenue SUM formula
11	click	578	330	9192	11%	41.1	14.1	200.2	23.4	11.6	Revenue=867786; click cell B2 next
12	write	395	300	9416	10%	41.7	14.0	202.7	21.4	11.7	Enter =SUM(Sheet1.B2:B20) in cell B2
13	key	270	227	9651	10%	42.1	14.0	206.1	16.3	11.7	Press Enter to execute expenses SUM formula
14	answer	513	443	9852	10%	42.7	13.8	208.0	32.2	12.5	Done: A2=867786, B2=411686 under headers

impress/af23762e · 4 steps · ✓ solved · 9 min · 6.9 Wh · prefill 79%

"I am making PPT on LibreOffice Impress for presentation tomorrow. I need to summarize contents on one slide use Impress "Summary Slide" feature. Could you make that for me?"

step	tool	reasoning	pred	ctx	reuse%	pf t/s	dec t/s	prefill s	decode s	kJ	what the step did
1	click	583	278	3025	0%	44.2	17.2	68.5	16.2	4.5	Open Insert menu hunting for Summary Slide option
2	click	1159	385	5267	57%	35.6	15.4	63.1	25.0	4.5	Insert menu lacks it; open Slide menu instead
3	click	811	328	7499	70%	34.3	14.5	65.2	22.7	4.5	Click Summary Slide option found in Slide menu
4	answer	1030	509	7702	13%	36.8	14.8	183.3	34.3	11.1	Report summary slide created as new slide 13 of 13

writer/0e47de2a · 12 steps · ✓ solved · 38 min · 32.3 Wh · prefill 88%

"Add page number for every page at the bottom left"

step	tool	reasoning	pred	ctx	reuse%	pf t/s	dec t/s	prefill s	decode s	kJ	what the step did
1	click	331	290	2999	0%	44.2	16.5	67.9	17.5	4.5	Open the Insert menu to find page-number options
2	click	530	296	5274	57%	36.0	15.5	63.4	19.1	4.3	Click 'Page Number' in the open Insert menu
3	key	1016	439	7535	70%	34.8	12.8	65.1	34.3	5.1	Undo: it wrongly inserted a literal '1' in the title
4	click	474	302	7796	12%	37.0	14.8	185.2	20.4	10.6	Undo done; reopen Insert to use Header/Footer path
5	click	690	356	8025	12%	37.9	14.6	186.8	24.4	10.9	Click 'Header and Footer' submenu in Insert menu
6	click	332	251	8273	11%	38.7	14.7	189.7	17.1	10.8	Select 'Footer' to enable the footer area
7	click	218	259	8491	11%	39.2	14.5	192.6	17.9	11.0	Click 'Default Page Style' to turn on the footer
8	click	643	327	8741	11%	39.9	14.3	195.7	22.8	11.4	Footer active; open Insert menu to add page number
9	click	507	322	8971	10%	40.4	14.1	199.0	22.9	11.6	Click 'Page Number' to insert field into footer
10	key	947	415	9224	10%	41.1	14.1	201.6	29.4	12.0	Page number '3' shows in footer; confirm insertion
11	click	276	217	9464	10%	41.8	14.0	203.9	15.5	11.6	Click main body to exit footer editing mode
12	answer	860	434	9666	10%	42.1	13.7	207.1	31.7	12.5	Footer page numbers verified; report task complete

multi_apps/3680a5ee · 7 steps · ✓ solved · 19 min · 15.5 Wh · prefill 88%

"I have file1.xlsx and file2.ods on my Desktop, each containing a single column. Using only the command line, help me merge these two columns into a single column by concatenating the strings from both rows, save the result as ~/Desktop/output.csv, and open it in LibreOffice Calc from the terminal"

step	tool	reasoning	pred	ctx	reuse%	pf t/s	dec t/s	prefill s	decode s	kJ	what the step did
1	write	284	210	3053	0%	44.4	16.5	68.7	12.7	4.3	List Desktop to confirm file1.xlsx and file2.ods exist
2	write	348	240	5261	58%	35.2	15.1	62.8	15.9	4.1	Convert file1.xlsx to CSV via LibreOffice headless
3	write	317	220	7498	70%	34.4	13.8	65.1	15.9	4.3	Convert file2.ods to CSV via LibreOffice headless
4	write	1195	414	7683	13%	36.4	14.8	183.6	28.0	10.9	Merge CSVs with paste -d '' into output.csv
5	write	181	155	7869	13%	37.0	14.9	186.0	10.4	10.4	Open output.csv in LibreOffice Calc from terminal
6	click	439	247	8028	12%	37.6	14.7	186.9	16.8	10.6	Click OK on Calc Text Import dialog to load data
7	answer	668	375	8215	12%	38.3	14.5	188.4	25.8	11.0	Report task complete: merged file opened in Calc

os/94d95f96 · 7 steps · ✗ score 0.00 · 20 min · 16.4 Wh · prefill 84%

"I want to install Spotify on my current system. Could you please help me?"

step	tool	reasoning	pred	ctx	reuse%	pf t/s	dec t/s	prefill s	decode s	kJ	what the step did
1	click	804	379	3005	0%	43.1	16.4	69.7	23.1	4.9	Open Ubuntu Software Center from the dock
2	click	857	405	5271	57%	35.7	15.4	63.6	26.2	4.6	Click the search icon in the Snap Store
3	write	309	214	7566	70%	33.9	14.4	67.8	14.8	4.4	Type 'Spotify' into the search field
4	wait	282	209	7752	12%	37.1	14.8	183.7	14.1	10.3	Wait for search results to finish loading
5	click	547	346	7948	12%	37.7	14.6	185.9	23.7	10.8	Click the official Spotify app in results
6	click	785	381	8226	11%	38.4	14.5	189.5	26.2	11.1	Click the green Install button
7	answer	1894	707	8500	11%	38.5	14.4	196.0	49.2	12.5	Auth dialog appeared; report install initiated

thunderbird/9bc3cc16 · 12 steps · ✓ solved · 37 min · 31.4 Wh · prefill 90%

"Could you help me back up all the email files in my inbox to ~/emails.bak? Please save them separately in eml format."

step	tool	reasoning	pred	ctx	reuse%	pf t/s	dec t/s	prefill s	decode s	kJ	what the step did
1	click	482	286	3016	0%	44.3	17.3	68.1	16.6	4.4	Click Inbox folder in sidebar to view emails
2	key	458	250	5261	57%	35.1	15.3	64.1	16.3	4.2	Press Ctrl+A to select all inbox messages
3	right_click	878	374	7481	70%	33.8	13.8	65.7	27.0	4.8	Right-click a selected email to open context menu
4	click	581	300	7703	12%	36.3	14.7	186.0	20.4	10.8	Click 'Save As...' in the context menu
5	click	307	227	7913	12%	36.0	14.8	193.4	15.3	11.2	Click Home in file chooser to reach home dir
6	click	264	231	8107	12%	38.1	14.8	187.9	15.6	10.6	Click Create Folder button in file chooser
7	write	145	151	8316	11%	38.6	14.6	190.5	10.4	10.5	Type 'emails.bak' as the new folder name
8	click	182	183	8471	11%	39.2	14.6	191.9	12.5	10.7	Click Create to make the emails.bak folder
9	click	294	234	8650	11%	39.7	14.4	193.9	16.2	11.0	Click Open to save emails into emails.bak
10	click	687	350	8857	11%	40.4	14.3	195.7	24.5	11.5	Open file manager from dock to verify backup
11	double_click	255	221	9094	11%	40.8	14.1	199.4	15.6	11.3	Double-click emails.bak folder to check contents
12	answer	812	399	9299	10%	41.5	14.1	201.3	28.4	11.9	Confirm both emails saved as .eml; report done

vlc/8ba5ae7a · 8 steps · ✓ solved · 22 min · 18.9 Wh · prefill 88%

"Help me modify the folder used to store my recordings to Desktop"

step	tool	reasoning	pred	ctx	reuse%	pf t/s	dec t/s	prefill s	decode s	kJ	what the step did
1	click	223	203	3001	0%	43.4	16.3	69.1	12.5	4.4	Click Tools menu in VLC menu bar
2	click	402	233	5240	57%	35.1	15.3	64.0	15.2	4.2	Click Preferences in the Tools dropdown menu
3	click	289	256	7472	70%	34.1	12.7	65.6	20.1	4.6	Open Input/Codecs tab in Preferences dialog
4	click	498	297	7715	12%	36.9	14.8	183.8	20.0	10.7	Click Browse next to Record directory field
5	click	454	294	7947	12%	37.7	14.9	185.8	19.8	10.6	Select Desktop folder in file browser
6	click	431	264	8185	11%	38.4	14.6	188.7	18.0	10.7	Click Choose to confirm Desktop directory
7	click	512	284	8403	11%	39.1	14.5	190.9	19.6	10.9	Click Save to apply preference changes
8	answer	786	392	8629	11%	39.7	14.3	193.8	27.4	11.4	Report recording folder set to Desktop, task done

vs_code/0ed39f63 · 8 steps · ✓ solved · 22 min · 18.0 Wh · prefill 90%

"Please help me change all the places in this document that say "text" to "test"."

step	tool	reasoning	pred	ctx	reuse%	pf t/s	dec t/s	prefill s	decode s	kJ	what the step did
1	key	473	231	3008	0%	44.3	17.1	67.9	13.5	4.3	Open Find & Replace dialog with Ctrl+H
2	write	202	168	5214	58%	35.3	15.4	62.5	10.9	3.8	Type 'text' into the Find field
3	click	268	198	7412	70%	34.3	14.6	64.3	13.5	4.1	Click the Replace input field
4	write	291	168	7583	13%	36.3	15.0	183.0	11.2	10.3	Type 'test' into the Replace field
5	click	328	209	7719	12%	36.9	14.8	183.3	14.1	10.2	Click Replace All button
6	key	609	274	7896	12%	37.5	14.8	185.3	18.6	10.5	Save the document with Ctrl+S
7	key	448	217	8074	12%	38.0	14.8	187.3	14.7	10.5	Close Find & Replace dialog with Esc
8	answer	355	304	8236	12%	38.5	14.6	189.3	20.9	10.9	Report all 10 'text' replaced with 'test'

Same shape in every domain: a cold, cheap first step, then a one-time prefill jump when the 3-screenshot window fills and KV reuse collapses to ~950 tokens — after which prefill dominates every step regardless of the action. os is shown as a score-0 near-miss; the cost profile is identical to the successes.

Attacking the prefill wall — KV-cache management & stage→device affinity

The diagnosis is the same everywhere above: ~89% of inference time is prefill, and the KV prefix dies at the 3-screenshot eviction (reuse 70% → 10%), so the agent re-prefills ~7–8k tokens every step. A model router mostly moves the 11% decode term; the dominant 89% needs a different attack. Two levers, meant to stack, followed by how they compose with the router:

1 · KV-cache-stable context. Keep the prompt prefix invariant so screenshot eviction stops invalidating it — evict within a stable window, pin images at fixed offsets, or carry KV across steps — restoring the 70% reuse we briefly see at step 3 instead of the frozen 947. Corroboration from the companion OpenClaw prefix-cache experiment: auto-compaction that breaks the prefix cache produced a ≈24× prefill blow-up between two consecutive turns — the same failure mode, measured on a different stack. Estimated 2–3× faster steps here if the prefix is preserved. [measured diagnosis + measured corroboration]

2 · Stage→device affinity. Inference stages have different bottlenecks, so place each on the silicon that fits: vision-encode + prefill are compute-bound and parallel → an iGPU/NPU; decode is memory-bandwidth-bound and sequential → the CPU. Holo3 is prefill-dominated, so moving prefill onto a compute engine attacks the 89% directly. On this box that path is currently blocked (the Xe3 vision encoder crashes — see the live-tracker root-cause); the durable fix needs OpenVINO MoE-VLM support or higher-bandwidth edge silicon. [proposed; device bring-up measured]

3 · Compose with the router. Because prefill dominates and switching models invalidates the (weight-specific) KV cache, the layers are complementary, not competing: cache-stable context + stage affinity cut the 89% prefill; a difficulty-gated, switch-aware model router trims the 11% decode and skips steps that don't need the big model. Stack them — don't pick one.
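Why the eviction in lever 1 is so costly can be shown with a toy model. The sketch below assumes the server reuses only the longest common token prefix between consecutive prompts, and that the prompt is a fixed head followed by the last three screenshots in order. Both the layout and the token counts are illustrative assumptions, not the agent's real prompt format.

```python
def common_prefix_len(a, b):
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_prompt(step, window=3, head_len=4, shot_len=5):
    """Toy prompt: a fixed head, then the last `window` screenshots in order."""
    head = [("head", i) for i in range(head_len)]
    first = max(1, step - window + 1)
    body = []
    for s in range(first, step + 1):
        body += [("shot", s, i) for i in range(shot_len)]
    return head + body


prev = []
reuse = {}
for step in range(1, 7):
    prompt = build_prompt(step)
    reuse[step] = common_prefix_len(prev, prompt)   # tokens the cache can keep
    prev = prompt

print(reuse)
```

In this toy, reuse grows for the first three steps and then falls back to the head length once the oldest screenshot is dropped, because every token after the head shifts. That is the same shape as the reused column (0 → 3003 → 5267 → 947); keeping the evicted slot's position stable is what would preserve the prefix.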

Does effort front-load? — testing the routing hypothesis on real data

The motivation assumed early steps are heavy (planning) and the tail is routine. With a step-indexed corpus that is directly checkable. An adversarial pass over the 36 successful episodes computed each effort proxy by within-task position (terciles), per task, then averaged across tasks. The output-side proxies (reasoning chars, generated tokens) are used because — unlike latency — they do not inflate mechanically as the prompt grows.

within-task position	reasoning_len	pred tokens	new tok (prefill)	decode t/s	predict s
first step	482	252	2990	16.83	82.7
early tercile	515	265	3277	15.78	100.2
mid tercile	501	266	5495	14.32	160.9
late tercile	563 ↑	320 ↑	7448	14.29	207.3
last step	664 ↑	410 ↑	7716	14.12	217.8

Verdict: refuted. Within successful tasks, output-side effort rises early → late, it does not fall: reasoning text +29% and generated tokens +33% (per-task early→late ratio on tasks ≥3 steps, rising in ~5 of 6). And the very first step is the lightest, not the heaviest — it has the shortest prompt and a cold KV cache, so it is the fastest step (≈83 s vs ≈207 s late). The clean "front-load the big model, shrink it for the tail" story does not hold here.

The dangerous failure mode for a position-based router. Failed runs do not taper to routine — they stay high or spike. The cap-bound looper holds reasoning_len ~657–690 late (peaking ~1144 mid-run) while degenerating into a click/scroll loop. A naïve "use the small model on late steps" policy would strip capability exactly where the agent is stuck and needs it most — turning recoverable runs into hard failures.

The defensible motivation. Route by per-step difficulty, not step position. Cheap orientation/confirmation steps genuinely exist and can run on a small model — but they are scattered throughout the trajectory (see the exemplar's step-7–9 click-through), not confined to a tail. And the router must never starve a stuck, high-reasoning step of capacity. (Latency is not a difficulty signal: per-step wall time rises 2.3× early→late only because prompt + KV grow 2.4×.)

Method: within each successful task, step i is normalized to (i−1)/(N−1) and binned early/mid/late by tercile, plus the literal first/last step; per-task means are averaged equally across the 36 tasks (a per-task trend, not a global step pool). n is small (median successful task ≈ 6 steps; 11 tasks have ≥9 steps), so treat magnitudes as directional.
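The method can be sketched as below. The bin boundaries (half-open at 1/3 and 2/3) and the restriction to tasks with at least 3 steps are assumptions made so the example is well defined; the text does not state how ties or very short tasks were handled. The input values are made up.

```python
from statistics import mean


def tercile(i, n):
    """Bin step i (1-based) of an n-step task by position (i-1)/(n-1)."""
    pos = (i - 1) / (n - 1)
    if pos < 1 / 3:
        return "early"
    if pos < 2 / 3:
        return "mid"
    return "late"


def per_task_profile(values):
    """Mean of one effort proxy per tercile, for a single task."""
    n = len(values)
    bins = {"early": [], "mid": [], "late": []}
    for i, v in enumerate(values, start=1):
        bins[tercile(i, n)].append(v)
    return {k: mean(v) for k, v in bins.items() if v}


def corpus_profile(tasks):
    """Average the per-task means with equal weight per task."""
    profiles = [per_task_profile(t) for t in tasks if len(t) >= 3]
    return {k: mean(p[k] for p in profiles if k in p)
            for k in ("early", "mid", "late")}


# Two made-up tasks: per-step values of some proxy such as reasoning_len.
tasks = [[10, 20, 30], [1, 2, 3, 4, 5, 6]]
profile = corpus_profile(tasks)
print(profile)
```

Averaging per task first means a 100-step episode counts no more than a 3-step one, which is the difference between a per-task trend and a global step pool.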

Real episode — short success

chrome/06fe7178 · "bring back the last tab I shut down"
score 1.0 · 2 steps · 183 s wall · 2.22 Wh

step	tool	new↻	reused	pred	dec t/s	predict s	score
1	key	3003	0	198	17.67	74.9	1.0
2	answer→DONE	2170	2999	269	15.63	75.9	1.0

new↻ = uncached tokens prefilled this step; reused = KV prefix carried over. Step 2 reuses 2999 tokens from step 1 — the rare case where the cache survives a step.

Real episode — cap-bound looper

chrome/99146c54 · "delete browsing data automatically on close"
score 0.0 · 100 steps · 8.9 h wall · 38.2 Wh

step	new↻	reused	pred	dec t/s	predict s
1	3007	0	221	17.68	75.5
51	16484	947	400	9.80	328.4
100	26299	947	308	6.21	520.5

KV reuse stays stuck at 947, so by step 100 the agent re-prefills 26k tokens every step (full context ≈ 27k) — prefill runs away and decode decays 17.7 → 6.2 t/s.
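The full-context and reuse figures follow from the new↻ and reused columns of the two episode tables. A small helper, with an illustrative name, makes the contrast explicit:

```python
def kv_state(new_tokens, reused_tokens):
    """Return (full context, KV-hit rate) for one step."""
    ctx = new_tokens + reused_tokens
    return ctx, reused_tokens / ctx


for label, new, reused in [
    ("short success, step 2", 2170, 2999),
    ("looper, step 51", 16484, 947),
    ("looper, step 100", 26299, 947),
]:
    ctx, rate = kv_state(new, reused)
    print(f"{label}: ctx={ctx}, reuse={rate:.1%}")
```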

The contrast is the whole motivation in one picture. Step 1 of a 2-step win and step 1 of a 100-step failure are near-indistinguishable on cost (~75 s, ~3000 tokens, ~17.7 t/s) — so the routing signal isn't in the raw cost, it's in the state. And the looper shows the runaway the router exists to prevent: latency 7× and decode collapsing 17.7 → 6.2 t/s as context grows.

Host telemetry — 2 Hz, joined on `ts`

cpu_util	cpu_mhz	server_rss	mem_avail	pkg_w	psys_w	gpu_mhz	npu_util	pkg_°C
20.8%	2156	38.0 GB	50.2 GB	16.2	29.3	0 (idle)	0% (idle)	78.0

The iGPU and NPU read idle by design — the device bring-up (on the live tracker) shows why this MoE-VLM serves only on CPU on this silicon.

What this trains — and what it doesn't (yet)

Toward the step-level model router — now motivated by per-step difficulty rather than the refuted early/late split — the corpus is necessary, not yet sufficient.

✓ Fully covered: the state & the cost. Every feature a router conditions on at decision time (step index, domain, context size, cached-token ratio, the screenshot, the message history, the tool needed) and the full price term (per-step latency and energy) are measured. Most routing work hand-waves the cost side; here it's instrumented.

✗ The missing piece — the label. This is a single-model, single-rollout, end-scored benchmark, which leaves two gaps:

No counterfactual. Every step ran on the 35B. Nothing says whether a smaller model would have produced an acceptable action at that same state — "big was necessary here" is unlabeled.
No per-step quality signal. OSWorld scores only the final state; intermediate reward is ~0. You get task-level success/failure, not action-level correctness.

What rescues it — the corpus is replayable. Because the screenshot and (reconstructable) message history are saved per step, you can generate the missing labels offline, without re-running OSWorld: replay each logged state through a smaller model, compare its action to the 35B's (action-match, or downstream-success via a verifier), and that becomes the supervised "big-needed" label. The single-model run can't produce routing labels by itself — but it's purpose-built to feed that offline pass.
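The offline labeling pass could look roughly like this. It is a sketch of the action-match variant only: `small_model` is a hypothetical callable standing in for a real smaller model, the record keys are illustrative, and the logged steps are made up.

```python
def normalize(action):
    """Strip whitespace so cosmetic differences do not count as a mismatch."""
    return "".join(action.split())


def label_big_needed(logged_steps, small_model):
    """Replay each logged state through a smaller model and mark the step
    as big-needed when its action differs from the logged 35B action."""
    labels = []
    for step in logged_steps:
        proposed = small_model(step["screenshot_file"], step["instruction"])
        match = normalize(proposed) == normalize(step["action"])
        labels.append({"step": step["step"], "big_needed": not match})
    return labels


def stub_small_model(screenshot_file, instruction):
    # Stand-in for a real smaller model; always proposes the same hotkey.
    return "hotkey('ctrl', 'shift', 't')"


# Made-up logged steps, shaped like the per-step rows.
logged = [
    {"step": 1, "screenshot_file": "step_1.png",
     "instruction": "bring back the last tab",
     "action": "hotkey('ctrl','shift','t')"},
    {"step": 2, "screenshot_file": "step_2.png",
     "instruction": "bring back the last tab",
     "action": "click(100, 200)"},
]
labels = label_big_needed(logged, stub_small_model)
print(labels)
```

Exact action match is a conservative proxy: a different action may still lead to success, which is why the text also mentions downstream success via a verifier as the alternative label.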

Two honest caveats baked into the schema: (1) steps_profile.tool is always null (a known bug) — join the per-step tool from model_calls.jsonl/traj.jsonl instead; nothing is lost. (2) reasoning_len is the char count of the <think> block — the reasoning text is discarded, which is the single most valuable missing field for a planning-vs-routine classifier and is worth capturing in a re-run.

Part of System Optimization for Edge Device Inference · WukLab @ UC San Diego · advised by Yiying Zhang