Virtual memory, the TLB, the three L1 indexing schemes (VIVT, PIPT, VIPT), main memory, and disk: one virtual address, traced end-to-end through every box that handles it.
A load instruction issues a 32-bit virtual address. Depending on which cache structure the CPU uses, the address is split differently and the lookup races (or follows) translation. The table below compares the data path for all three indexing schemes side by side.
Where the index and tag come from is the whole story. Assume a 32-bit VA, 4 KiB pages (12-bit page offset), and a 16 KiB 4-way set-associative L1 with 64 B lines (16 KiB / (4 × 64 B) = 64 sets → 6-bit index, 6-bit block offset). That leaves 20 tag bits.
In VIPT the 6 index bits all come from the page offset (bits 6–11), so they are invariant under translation. That's the trick: index lookup can fire on the raw VA without waiting for the TLB. Total index + offset bits ≤ page-offset bits is the design constraint; widen the cache and you lose the property.
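To make the split concrete, here is a minimal C sketch of the field extraction under the geometry above; the sample address is arbitrary, and the field widths follow directly from the numbers in the previous paragraph:

```c
/* Split a 32-bit VA under the geometry above: 4 KiB pages (12-bit page
 * offset), 16 KiB 4-way L1 with 64 B lines -> 64 sets. A sketch to show
 * the index lands entirely inside the page offset. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6   /* 64 B line  */
#define INDEX_BITS  6   /* 64 sets    */
#define PAGE_BITS   12  /* 4 KiB page */

int main(void) {
    uint32_t va = 0x12345678;  /* arbitrary example address */

    uint32_t block_off = va & ((1u << OFFSET_BITS) - 1);                  /* bits 0-5  */
    uint32_t set_index = (va >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);  /* bits 6-11 */
    uint32_t tag       = va >> (OFFSET_BITS + INDEX_BITS);                /* bits 12-31: translated, compared as a physical tag */

    /* The VIPT property: every index bit sits below the page boundary,
     * so the set can be selected before the TLB answers. */
    assert(OFFSET_BITS + INDEX_BITS <= PAGE_BITS);

    printf("VA 0x%08x -> tag 0x%05x, set %u, offset %u\n",
           va, tag, set_index, block_off);
    return 0;
}
```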
| | VIVT | PIPT | VIPT |
|---|---|---|---|
| Critical-path TLB? | No (only on miss) | Yes | No (TLB runs in parallel) |
| Aliasing (synonyms)? | Yes: two VAs → same PA can occupy two different lines | No | Yes iff index bits exceed page offset |
| Homonyms (same VA, different process)? | Yes: needs ASID tags or a flush on context switch | No | No |
| Permission bits checked when? | Late (on miss path) | Early (TLB) | Early (TLB, parallel) |
| Max practical cache size | Unrestricted by translation (flushes and alias management are the cost) | Unrestricted | ≤ ways × page size (e.g. 4 × 4 KiB = 16 KiB total) |
| Where you see it | Early ARM, some DSPs | Most server CPUs, L2/L3 everywhere | Most modern L1 (ARM Cortex-A, Apple M-series, AMD Zen, recent Intel) |
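The size row is the one people trip over, so here is a sketch of the arithmetic; the `check` helper and the three geometries are illustrative (the third row assumes Apple's 16 KiB pages):

```c
/* How many index bits spill past the page offset for a given geometry?
 * Zero spilled bits means the VIPT lookup behaves like PIPT; each spilled
 * bit doubles the number of sets a single physical line could alias into.
 * A sketch; sizes are in bytes and assumed to be powers of two. */
#include <stdio.h>

static unsigned log2u(unsigned x) {  /* x must be a power of two */
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

static void check(unsigned cache_bytes, unsigned ways,
                  unsigned line_bytes, unsigned page_bytes) {
    unsigned sets        = cache_bytes / (ways * line_bytes);
    unsigned index_bits  = log2u(sets);
    unsigned offset_bits = log2u(line_bytes);
    int spill = (int)(index_bits + offset_bits) - (int)log2u(page_bytes);

    printf("%3u KiB %u-way: %3u sets, index+offset=%2u bits, spill=%d %s\n",
           cache_bytes >> 10, ways, sets, index_bits + offset_bits,
           spill > 0 ? spill : 0,
           spill > 0 ? "(needs alias handling)" : "(pure VIPT)");
}

int main(void) {
    check(16 * 1024, 4, 64, 4096);    /* the worked example: pure VIPT        */
    check(32 * 1024, 4, 64, 4096);    /* one bit over: synonyms possible      */
    check(128 * 1024, 8, 64, 16384);  /* Apple-style: 16 KiB pages rescue it  */
    return 0;
}
```

The third line is why a 128 KiB 8-way L1 can still be pure VIPT on Apple silicon: 128 KiB / 8 ways = 16 KiB per way, exactly one 16 KiB page.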
On Apple M3 the L1D is 128 KiB per core, 8-way, VIPT, and the L2 is 16 MiB shared, PIPT. When ollama is decoding from a 14B model whose weights are mmapped straight into the cache hierarchy, every token streams the full weight set (several GiB) through L2/L3 → DRAM. That's why decode rate stays near-constant at the bandwidth ceiling regardless of context length: the L1 indexing trick wins us latency, but decode is governed by the L2/DRAM end of the same diagram.
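A back-of-envelope check of that claim; the weight size and bandwidth here are assumed round numbers for illustration, not measured M3 figures:

```c
/* Decode is one full pass over the weights per token, so
 * tokens/s ~= usable DRAM bandwidth / weight bytes.
 * Both inputs below are assumptions, not measurements. */
#include <stdio.h>

int main(void) {
    double weights_gib   = 8.0;    /* assumed: ~14B params at ~4-bit quant */
    double bandwidth_gbs = 100.0;  /* assumed: usable memory bandwidth     */

    double bytes_per_tok = weights_gib * 1024 * 1024 * 1024;
    double tok_per_s     = bandwidth_gbs * 1e9 / bytes_per_tok;

    printf("~%.1f tok/s at the bandwidth ceiling\n", tok_per_s);
    return 0;
}
```

Nothing in that division mentions the L1, which is the point: the indexing scheme decides load-to-use latency, the memory system decides throughput.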
The OpenClaw workflow demo picks up from here.