โ† Back to Projects

Memory Hierarchy & Cache Indexing

Virtual memory, the TLB, three flavors of L1 cache (VIVT, PIPT, VIPT), main memory, and disk — one virtual address, traced end-to-end through every box that handles it.

1. The whole pipeline at a glance

A load instruction issues a 32-bit virtual address. Depending on which cache structure the CPU uses, the address is split differently and the lookup races (or follows) translation. The picture below shows the data path for all three indexing schemes side by side.

[Figure: the three L1 indexing schemes side by side. VIVT: index and tag both come from VA bits, no TLB on a hit, TLB consulted only on an L1 miss. PIPT: the TLB translates VA → PA before the lookup, so translation sits on the critical path. VIPT: index from the VA while the TLB runs in parallel, then tag-check against the PA. All three miss into PA-tagged L2/L3, then DRAM, then disk on a page fault. Legend: hot path (hit), miss path (deeper level), address translation, parallel index / tag-check.]

2. Address-bit layout

Where the index and tag come from is the whole story. Assume a 32-bit VA, 4 KiB pages (12-bit page offset), and a 16 KiB 4-way set-associative L1 with 64 B lines (64 sets → 6-bit index, 6-bit block offset). That leaves 20 tag bits.

[Figure: address-bit layouts. VIVT splits the virtual address only: virtual tag (20 bits) | index (6) | offset (6). PIPT translates first, VPN (20 bits) → TLB → PPN (20 bits), then splits the physical address: physical tag (20) | index (6) | offset (6). VIPT takes index (6) and offset (6) straight from the VA while, in parallel, the TLB turns the VPN into the 20-bit PPN that is compared against the stored physical tag.]

In VIPT the 6 index bits all come from the page offset (bits 6–11), so they are invariant under translation. That's the trick: the index lookup can fire on the raw VA without waiting for the TLB. The design constraint is index + offset bits ≤ page-offset bits; grow the number of sets past what the page offset can cover and you lose the property (unless you add ways instead of sets).
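The split is just masks and shifts. A minimal sketch of the arithmetic, assuming the 16 KiB / 4-way / 64 B-line / 4 KiB-page geometry above (the example VA is arbitrary):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Geometry from Section 2: 16 KiB, 4-way, 64 B lines, 4 KiB pages. */
enum {
    OFFSET_BITS = 6,                             /* 64 B line                          */
    INDEX_BITS  = 6,                             /* 16 KiB / (4 ways * 64 B) = 64 sets */
    PAGE_BITS   = 12,                            /* 4 KiB page offset                  */
    TAG_BITS    = 32 - INDEX_BITS - OFFSET_BITS  /* 20                                 */
};

int main(void) {
    /* VIPT design constraint: every index bit must sit inside the page offset,
       otherwise the set chosen from the VA could differ from the PA's set. */
    assert(INDEX_BITS + OFFSET_BITS <= PAGE_BITS);

    uint32_t va     = 0x1234ABCD;                       /* arbitrary example VA */
    uint32_t offset = va & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (va >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t vtag   = va >> (OFFSET_BITS + INDEX_BITS); /* VIVT tags this; PIPT/VIPT tag the PA */

    printf("offset=0x%02x  set=%u  virtual tag=0x%05x  (%d tag bits)\n",
           offset, index, vtag, TAG_BITS);
    return 0;
}
```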

3. Walk an address through each pipeline
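A minimal sketch of that walk, under the same geometry, with a toy translate() standing in for the TLB (the mapping it applies is invented purely so the physical page number visibly differs from the virtual one):

```c
#include <stdint.h>
#include <stdio.h>

enum { OFFSET_BITS = 6, INDEX_BITS = 6, PAGE_BITS = 12 };

/* Toy translation: pretend the TLB maps every VPN to VPN ^ 0x55000, keeping the
   page offset untouched. Purely illustrative, not a real page table. */
static uint32_t translate(uint32_t va) {
    uint32_t vpn = va >> PAGE_BITS;
    uint32_t ppn = vpn ^ 0x55000;                /* invented mapping */
    return (ppn << PAGE_BITS) | (va & ((1u << PAGE_BITS) - 1));
}

static uint32_t index_of(uint32_t addr) { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
static uint32_t tag_of(uint32_t addr)   { return addr >> (OFFSET_BITS + INDEX_BITS); }

int main(void) {
    uint32_t va = 0x0040A2C4;

    /* VIVT: index and tag both come from the VA; translation happens only on a miss. */
    printf("VIVT: set %2u, tag 0x%05x (VA bits, no TLB on the hit path)\n",
           index_of(va), tag_of(va));

    /* PIPT: translate first, then index and tag with the PA; the TLB is on the critical path. */
    uint32_t pa = translate(va);
    printf("PIPT: set %2u, tag 0x%05x (PA bits, after the TLB)\n",
           index_of(pa), tag_of(pa));

    /* VIPT: index with VA bits (they sit inside the page offset, so they equal the PA's),
       translate in parallel, then compare the stored tag against the PA tag. */
    printf("VIPT: set %2u (from VA, same as %u from PA), tag 0x%05x (PA bits)\n",
           index_of(va), index_of(pa), tag_of(pa));
    return 0;
}
```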

4. Latency & cost — why VIPT wins on most modern cores

L1 (VIPT / VIVT hit): ~1 ns
L1 (PIPT hit): ~1.4 ns
TLB hit: ~0.4 ns
L2: ~4 ns
L3: ~12 ns
DRAM: ~80 ns
TLB miss + page walk: ~30 ns
NVMe (page fault): ~100 µs
HDD (page fault): ~10 ms
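Plugging those ballpark latencies into an average-memory-access-time formula shows where the hidden TLB turns up; the hit rates below are assumptions chosen only to make the arithmetic concrete:

```c
#include <stdio.h>

int main(void) {
    /* Ballpark latencies from the table above (ns). */
    double l1_vipt = 1.0, l1_pipt = 1.4, l2 = 4.0, l3 = 12.0, dram = 80.0;

    /* Assumed hit rates, purely for illustration. */
    double h1 = 0.95, h2 = 0.80, h3 = 0.70;

    double beyond_l1 = l2 + (1 - h2) * (l3 + (1 - h3) * dram);
    double amat_vipt = l1_vipt + (1 - h1) * beyond_l1;
    double amat_pipt = l1_pipt + (1 - h1) * beyond_l1;

    printf("AMAT (VIPT L1): %.2f ns\n", amat_vipt);  /* ~1.6 ns under these assumptions */
    printf("AMAT (PIPT L1): %.2f ns\n", amat_pipt);  /* ~2.0 ns: the 0.4 ns TLB sits on every hit */
    return 0;
}
```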

5. Side-by-side comparison

| | VIVT | PIPT | VIPT |
| --- | --- | --- | --- |
| Critical-path TLB? | No (only on a miss) | Yes | No (TLB runs in parallel) |
| Aliasing (synonyms)? | Yes: two VAs → same PA can sit in two ways | No | Only if index bits extend beyond the page offset |
| Homonyms (same VA, different process)? | Yes: needs PID tags or a flush on context switch | No | No |
| Permission bits checked when? | Late (on the miss path) | Early (TLB) | Early (TLB, in parallel) |
| Max practical cache size | Limited by process-tag width | Unrestricted | ≤ ways × page size (e.g. 4 × 4 KiB = 16 KiB) |
| Where you see it | Early ARM, some DSPs | Most server CPUs, L2/L3 everywhere | Most modern L1s (ARM Cortex-A, Apple M-series, AMD Zen, recent Intel) |

6. Why this matters for the WukLab edge-inference work

On Apple M3 the L1D is 128 KiB per core, 8-way, VIPT, and the L2 is 16 MiB shared, PIPT. When ollama decodes from a 14B model whose mmapped weights stream through the cache hierarchy, every token pulls the multi-GiB weight set through L2/L3 → DRAM. That's why the decode rate stays near-constant at the bandwidth ceiling regardless of context length (a back-of-the-envelope sketch follows): the L1 indexing trick wins us latency, but decode is governed by the L2/DRAM end of the same diagram. The OpenClaw workflow demo picks up from here.
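A minimal sketch of that bandwidth ceiling; both numbers below are stand-ins for illustration, not measurements of the M3 or of any particular quantization:

```c
#include <stdio.h>

int main(void) {
    /* Every decoded token streams (roughly) the whole weight set through L2/L3 -> DRAM.
       Both figures are assumptions for illustration only. */
    double weight_bytes   = 8.0e9;    /* e.g. a 14B model at ~4-5 bits per weight */
    double dram_bandwidth = 100.0e9;  /* assumed sustained bytes/s for the machine */

    double tokens_per_sec = dram_bandwidth / weight_bytes;  /* ~12.5 tok/s here */
    printf("bandwidth-bound decode rate: %.1f tokens/s\n", tokens_per_sec);
    return 0;
}
```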