Virtual memory, the TLB, the three L1 indexing schemes (VIVT, PIPT, VIPT), main memory, and disk: one virtual address, traced end-to-end through every box that handles it.
A load instruction issues a 32-bit virtual address. Depending on which cache structure the CPU uses, the address is split differently and the lookup races (or follows) translation. The table below compares the data path for all three indexing schemes side by side.
Where the index and tag come from is the whole story. Assume a 32-bit VA, 4 KiB pages (12-bit page offset), and a 16 KiB 4-way set-associative L1 with 64 B lines (16 KiB / (4 × 64 B) = 64 sets → 6-bit index, 6-bit block offset). That leaves 20 tag bits.
In VIPT the 6 index bits all come from the page offset (bits 6–11), so they are invariant under translation. That's the trick: index lookup can fire on the raw VA without waiting for the TLB. Total index + offset bits ≤ page-offset bits is the design constraint; widen the cache and you lose the property.
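To make the split concrete, here is a minimal C sketch of the field extraction under the geometry above; the sample address is arbitrary, and the field widths follow directly from the numbers in the previous paragraph:

```c
/* Split a 32-bit VA under the geometry above: 4 KiB pages (12-bit page
 * offset), 16 KiB 4-way L1 with 64 B lines -> 64 sets. A sketch to show
 * the index lands entirely inside the page offset. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6   /* 64 B line  */
#define INDEX_BITS  6   /* 64 sets    */
#define PAGE_BITS   12  /* 4 KiB page */

int main(void) {
    uint32_t va = 0x12345678;  /* arbitrary example address */

    uint32_t block_off = va & ((1u << OFFSET_BITS) - 1);                  /* bits 0-5  */
    uint32_t set_index = (va >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);  /* bits 6-11 */
    uint32_t tag       = va >> (OFFSET_BITS + INDEX_BITS);                /* bits 12-31: translated, compared as a physical tag */

    /* The VIPT property: every index bit sits below the page boundary,
     * so the set can be selected before the TLB answers. */
    assert(OFFSET_BITS + INDEX_BITS <= PAGE_BITS);

    printf("VA 0x%08x -> tag 0x%05x, set %u, offset %u\n",
           va, tag, set_index, block_off);
    return 0;
}
```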
| | VIVT | PIPT | VIPT |
|---|---|---|---|
| Critical-path TLB? | No (only on miss) | Yes | No (TLB runs in parallel) |
| Aliasing (synonyms)? | Yes: two VAs → same PA can occupy two different lines | No | Yes iff index bits exceed page offset |
| Homonyms (same VA, different process)? | Yes: needs ASID tags or a flush on context switch | No | No |
| Permission bits checked when? | Late (on miss path) | Early (TLB) | Early (TLB, parallel) |
| Max practical cache size | Unrestricted by translation (flushes and alias management are the cost) | Unrestricted | ≤ ways × page size (e.g. 4 × 4 KiB = 16 KiB total) |
| Where you see it | Early ARM, some DSPs | Most server CPUs, L2/L3 everywhere | Most modern L1 (ARM Cortex-A, Apple M-series, AMD Zen, recent Intel) |
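The size row is the one people trip over, so here is a sketch of the arithmetic; the `check` helper and the three geometries are illustrative (the third row assumes Apple's 16 KiB pages):

```c
/* How many index bits spill past the page offset for a given geometry?
 * Zero spilled bits means the VIPT lookup behaves like PIPT; each spilled
 * bit doubles the number of sets a single physical line could alias into.
 * A sketch; sizes are in bytes and assumed to be powers of two. */
#include <stdio.h>

static unsigned log2u(unsigned x) {  /* x must be a power of two */
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

static void check(unsigned cache_bytes, unsigned ways,
                  unsigned line_bytes, unsigned page_bytes) {
    unsigned sets        = cache_bytes / (ways * line_bytes);
    unsigned index_bits  = log2u(sets);
    unsigned offset_bits = log2u(line_bytes);
    int spill = (int)(index_bits + offset_bits) - (int)log2u(page_bytes);

    printf("%3u KiB %u-way: %3u sets, index+offset=%2u bits, spill=%d %s\n",
           cache_bytes >> 10, ways, sets, index_bits + offset_bits,
           spill > 0 ? spill : 0,
           spill > 0 ? "(needs alias handling)" : "(pure VIPT)");
}

int main(void) {
    check(16 * 1024, 4, 64, 4096);    /* the worked example: pure VIPT        */
    check(32 * 1024, 4, 64, 4096);    /* one bit over: synonyms possible      */
    check(128 * 1024, 8, 64, 16384);  /* Apple-style: 16 KiB pages rescue it  */
    return 0;
}
```

The third line is why a 128 KiB 8-way L1 can still be pure VIPT on Apple silicon: 128 KiB / 8 ways = 16 KiB per way, exactly one 16 KiB page.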
On Apple M3 the L1D is 128 KiB per core, 8-way, VIPT, and the L2 is 16 MiB shared, PIPT. When ollama is decoding from a 14B model whose weights are mmapped straight into the cache hierarchy, every token streams the full weight set (several GiB) through L2/L3 → DRAM. That's why decode rate stays near-constant at the bandwidth ceiling regardless of context length: the L1 indexing trick wins us latency, but decode is governed by the L2/DRAM end of the same diagram.
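A back-of-envelope check of that claim; the weight size and bandwidth here are assumed round numbers for illustration, not measured M3 figures:

```c
/* Decode is one full pass over the weights per token, so
 * tokens/s ~= usable DRAM bandwidth / weight bytes.
 * Both inputs below are assumptions, not measurements. */
#include <stdio.h>

int main(void) {
    double weights_gib   = 8.0;    /* assumed: ~14B params at ~4-bit quant */
    double bandwidth_gbs = 100.0;  /* assumed: usable memory bandwidth     */

    double bytes_per_tok = weights_gib * 1024 * 1024 * 1024;
    double tok_per_s     = bandwidth_gbs * 1e9 / bytes_per_tok;

    printf("~%.1f tok/s at the bandwidth ceiling\n", tok_per_s);
    return 0;
}
```

Nothing in that division mentions the L1, which is the point: the indexing scheme decides load-to-use latency, the memory system decides throughput.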
The OpenClaw workflow demo picks up from here.