6x memory reduction, 8x attention speedup, zero accuracy loss — and what it actually means
On March 24, 2026, four researchers at Google — Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni — published a blog post about a compression algorithm called TurboQuant. Within hours, memory stocks were falling. Micron dropped 3%. Western Digital lost 4.7%. SanDisk fell 5.7%.
The investor logic was simple: if AI models need 6x less memory, the industry buys 6x fewer memory chips. The AI memory boom — which has driven record revenue for memory manufacturers — suddenly looked smaller.
Imagine you're packing books into boxes. Some books are huge (encyclopedias) and some are tiny (pocket guides). If you use one box size for everything, the tiny books rattle around wasting space, and the huge books don't fit. That's the problem with compressing the KV cache — some values are 1,000x larger than others, so one-size-fits-all compression wastes precision on the small values.
TurboQuant's trick: before packing, shuffle all the books randomly. After shuffling, every box gets a mix of big and small books, so they're all roughly the same size. Now one box size works perfectly. That's PolarQuant — a random rotation that makes all values similar in magnitude.
The second trick: after packing, check for any books that got slightly damaged in the process. Record just one bit per book — "slightly too big" or "slightly too small." That's QJL — a 1-bit error correction that fixes the compression artifacts.
Result: the KV cache shrinks from 16 bits per value to 3.5 bits per value. No accuracy loss. No training needed. The paper (arxiv 2504.19874) will be presented at ICLR 2026.
To understand whether TurboQuant's claims are real, we first need to understand what it's compressing. If you've been following our LLM Inference at Scale series, you know the KV cache intimately. Let's recap the numbers from Part 1.
During inference, every token the model has processed gets stored as Key and Value vectors in the KV cache. For Llama-70B with a 1M-token context:
KV CACHE SIZE (from Part 1 of our series):
══════════════════════════════════════════
Model: Llama-70B (80 layers, 8 KV heads, d_head = 128)
Sequence: 1,000,000 tokens
Precision: BF16 (2 bytes per value)
Per layer, per token:
K: 8 heads x 128 dims = 1,024 values x 2 bytes = 2,048 bytes
V: same = 2,048 bytes
K + V per token per layer: 4,096 bytes = 4 KB
Per layer, all tokens:
1,000,000 tokens x 4,096 bytes = 4,096,000,000 bytes = 4.096 GB
All 80 layers:
4.096 GB x 80 = 327.68 GB
The model weights are 140 GB (70B params x 2 bytes).
KV cache / weights = 327.68 / 140 = 2.34x
The KV cache is 2.34x LARGER than the model itself.
This is why we spent Parts 2-5 building FSDP, Ulysses, and Ring Attention — to distribute this 327.68 GB across 32 GPUs (10.24 GB per GPU). TurboQuant attacks the problem from a completely different angle: instead of distributing the data, compress it.
What does the KV cache look like at different bit-widths? Let's derive it from the per-layer number:
KV CACHE AT DIFFERENT BIT-WIDTHS:
═════════════════════════════════
Starting point: 4.096 GB per layer at 16-bit (BF16).
To convert to b-bit: multiply by (b / 16).
Per layer 80 layers Per GPU (32 GPUs)
───────── ───────── ─────────────────
BF16 (16-bit) 4.096 GB 327.68 GB 10.24 GB
INT4 (4-bit) 1.024 GB 81.92 GB 2.56 GB
3.5-bit (TQ) 0.896 GB 71.68 GB 2.24 GB
2.5-bit (TQ) 0.640 GB 51.20 GB 1.60 GB
Derivation for 3.5-bit:
4.096 GB x (3.5 / 16) = 4.096 x 0.21875 = 0.896 GB per layer
0.896 x 80 = 71.68 GB total
71.68 / 32 = 2.24 GB per GPU
Compression ratio:
BF16 to 3.5-bit: 327.68 / 71.68 = 4.57x
BF16 to 2.5-bit: 327.68 / 51.20 = 6.40x
At 3.5 bits, the entire KV cache is 71.68 GB — it fits on a single H100 (80 GB HBM). At 2.5 bits, it's 51.20 GB, leaving 28.8 GB of headroom for weights and activations.
But there's a catch. You can't just slap 3.5-bit quantization on the KV cache and call it a day. Standard quantization at low bit-widths destroys accuracy. Let's see why.
The KV cache has a specific structure that makes naive quantization destructive: outliers. In transformer KV caches, the value distributions across dimensions are wildly different. Some dimensions have values in the range [-100, 100]. Others are in [-0.1, 0.1]. This is a well-documented phenomenon — it's why methods like SmoothQuant and AWQ exist.
THE OUTLIER PROBLEM:
════════════════════
KV cache values for one token, one KV head, across 8 dimensions
(simplified from the real d_head = 128):
Dim: 0 1 2 3 4 5 6 7
Value: 0.02 -0.05 0.03 87.4 0.01 -0.04 0.06 -42.1
Dims 3 and 7 are outliers (100-1000x larger than the rest).
Now let's try to quantize this vector using signed 4-bit integers (INT4). Signed INT4 has 16 levels: integers from -8 to +7. To map floating-point values to these 16 levels, we need a scale factor and a formula.
NAIVE INT4 QUANTIZATION (signed, 16 levels: -8 to +7):
═══════════════════════════════════════════════════════
Step 1: Find the range.
max_abs = max(|87.4|, |-42.1|, ...) = 87.4
We map the range [-87.4, +87.4] to integer levels [-8, +7].
Step 2: Compute the scale factor.
scale = max_abs / 7 = 87.4 / 7 = 12.486
(We use 7, not 8, because the positive range of signed INT4 is 0 to +7.
This gives us symmetric quantization around zero.)
Step 3: Quantize each value.
Formula: index = clamp(round(value / scale), -8, +7)
Dim 0 (value = 0.02):
index = round(0.02 / 12.486) = round(0.0016) = 0
Dim 3 (value = 87.4):
index = round(87.4 / 12.486) = round(6.999) = 7
Dim 7 (value = -42.1):
index = round(-42.1 / 12.486) = round(-3.372) = -3
Step 4: Dequantize (to see the error).
Formula: reconstructed = index x scale
Dim 0: reconstructed = 0 x 12.486 = 0.000
Error: |0.02 - 0.000| = 0.020. Relative error: 100%.
The value 0.02 is completely lost.
Dim 3: reconstructed = 7 x 12.486 = 87.402
Error: |87.4 - 87.402| = 0.002. Relative error: 0.002%.
The outlier is preserved perfectly.
Dim 7: reconstructed = -3 x 12.486 = -37.457
Error: |-42.1 - (-37.457)| = 4.643. Relative error: 11.0%.
Moderate error.
SUMMARY:
Dim Value Index Reconstructed Abs Error Rel Error
─── ───── ───── ───────────── ───────── ─────────
0 0.02 0 0.000 0.020 100.0%
1 -0.05 0 0.000 0.050 100.0%
2 0.03 0 0.000 0.030 100.0%
3 87.40 7 87.402 0.002 0.0%
4 0.01 0 0.000 0.010 100.0%
5 -0.04 0 0.000 0.040 100.0%
6 0.06 0 0.000 0.060 100.0%
7 -42.10 -3 -37.457 4.643 11.0%
6 out of 8 dimensions have 100% error — their information is destroyed.
The scale factor (12.486) is set by the outlier (87.4), so small values
all round to zero. This is catastrophic for model quality.
The standard fix is per-dimension normalization: give each dimension its own scale factor, quantize the normalized values, and store the scale factors alongside the quantized data.
PER-DIMENSION NORMALIZATION:
════════════════════════════
For each dimension d in the KV vector:
scale_d = max(|values across all tokens in dim d|)
normalized = value / scale_d (now in [-1, 1])
index = round(normalized x 7) (maps [-1,1] to [-7,+7])
reconstructed = (index / 7) x scale_d (to dequantize)
This gives each dimension its own range. Small dimensions get
a small scale factor, so their values don't round to zero.
But we must STORE one scale factor per dimension.
Let's compute the overhead for our Llama-70B setup.
STORAGE COST WITH PER-DIMENSION NORMALIZATION:
══════════════════════════════════════════════
KV cache per token per layer:
8 KV heads x 128 dims = 1,024 values
Quantized values (signed INT4 = 4 bits each):
1,024 values x 4 bits = 4,096 bits = 512 bytes per token per layer
Scale factors (one FP16 per dimension):
1,024 dimensions x 2 bytes (FP16) = 2,048 bytes per token per layer
Total per token per layer:
512 bytes (quantized) + 2,048 bytes (scales) = 2,560 bytes
Compare to BF16 (no quantization):
1,024 values x 2 bytes = 2,048 bytes per token per layer
Wait — the "compressed" version (2,560 bytes) is LARGER than the
original (2,048 bytes)! The scale factors alone (2,048 bytes) are
as large as the entire uncompressed KV cache.
Per-dimension normalization at 4-bit doesn't compress at all.
It makes things WORSE.
In practice, systems use coarser normalization — one scale per group of 32 or 128 values instead of per-dimension. This reduces overhead but introduces quantization error within each group. The tradeoff: less overhead but more error. TurboQuant eliminates this tradeoff entirely by removing the need for scale factors altogether.
PolarQuant rotates every KV cache vector by a random orthogonal matrix Pi before quantizing. But what is an orthogonal matrix, and how do we build one? This subsection answers both questions from scratch.
An orthogonal matrix is a square matrix where every row is a unit vector (length 1) and every pair of rows is perpendicular (dot product = 0). The defining property is PiT @ Pi = I. Geometrically, multiplying a vector by an orthogonal matrix rotates it — the direction changes but the length stays exactly the same. This is why it's useful: we can rearrange the values in a vector without losing any information.
To build a random orthogonal matrix, we start with a random matrix and orthogonalize it using a process called Gram-Schmidt. The idea: take a set of random vectors and make them perpendicular to each other, one at a time.
Imagine you have a set of arrows (vectors) pointing in random directions in 8D space. Some arrows might be nearly parallel, others nearly perpendicular. Gram-Schmidt takes these messy arrows and produces a new set where every arrow is exactly perpendicular to every other arrow, and every arrow has length exactly 1.
The idea is simple: process the arrows one at a time. For each new arrow, subtract out any component that points along an arrow you've already processed. What's left is the part that's genuinely new — perpendicular to everything before it. Then scale it to length 1.
GRAM-SCHMIDT: THE ALGORITHM
═══════════════════════════
Input: d vectors g0, g1, ..., g_{d-1} in R^d (the rows of a random matrix G).
Output: d orthonormal vectors e0, e1, ..., e_{d-1} (the rows of Pi).
e0 = g0 / ||g0||
(Just normalize the first vector to unit length.)
For each subsequent vector g_k (k = 1, 2, ..., d-1):
1. Compute projections onto all previous e's:
proj_j = <g_k, e_j> for j = 0, 1, ..., k-1
(This measures "how much of g_k points along e_j".)
2. Subtract all projections:
v_k = g_k - proj_0 * e0 - proj_1 * e1 - ... - proj_{k-1} * e_{k-1}
(Remove every component that overlaps with previous vectors.
What remains is ONLY the part perpendicular to all of them.)
3. Normalize:
e_k = v_k / ||v_k||
(Scale to unit length.)
Result: e0, e1, ..., e_{d-1} are mutually perpendicular unit vectors.
Stack them as rows → orthogonal matrix Pi.
Why does step 2 guarantee perpendicularity? Because we subtracted out exactly the component along each previous vector. After subtraction, the dot product with any previous vector is zero:
WHY SUBTRACTION MAKES VECTORS PERPENDICULAR:
════════════════════════════════════════════
After step 2: v_k = g_k - <g_k, e0> * e0 - <g_k, e1> * e1 - ...
Check <v_k, e0>:
<v_k, e0> = <g_k, e0> - <g_k, e0> * <e0, e0> - <g_k, e1> * <e1, e0> - ...
= <g_k, e0> - <g_k, e0> * 1 - <g_k, e1> * 0 - ...
───────────────── ─────────────────
(e0 is unit: <e0,e0>=1) (e0 ⊥ e1: <e1,e0>=0)
= <g_k, e0> - <g_k, e0>
= 0 ✓
The same argument works for every previous vector.
This is why Gram-Schmidt always produces perpendicular vectors.
Let's trace the process on the actual matrix we'll use throughout this article. We start by drawing a random 8×8 Gaussian matrix G:
STEP 1: RANDOM GAUSSIAN MATRIX G
═════════════════════════════════
Draw each entry independently from N(0, 1) (mean 0, std dev 1).
G (first 2 of 8 rows shown):
g0 = [ 1.70, -0.35, 1.56, -0.80, 1.06, 0.01, 0.67, 1.03]
g1 = [ 0.39, 1.19, -1.16, 1.49, -1.33, 1.01, -1.03, -1.20]
... (6 more rows)
These rows point in random directions and have random lengths.
They are NOT perpendicular: <g0, g1> ≠ 0 in general.
STEP 2: GRAM-SCHMIDT, TRACED
═════════════════════════════
Row 0 — just normalize:
||g0|| = sqrt(1.70^2 + 0.35^2 + 1.56^2 + 0.80^2 + 1.06^2
+ 0.01^2 + 0.67^2 + 1.03^2)
= sqrt(2.89 + 0.12 + 2.43 + 0.64 + 1.12 + 0.00 + 0.45 + 1.06)
= sqrt(8.72) = 2.9524
e0 = g0 / 2.9524
= [0.5749, -0.1185, 0.5297, -0.2713, 0.3579, 0.0038, 0.2283, 0.3478]
||e0|| = 1.0 ✓
Row 1 — subtract projection onto e0, then normalize:
proj = <g1, e0>
= 0.39*0.5749 + 1.19*(-0.1185) + (-1.16)*0.5297 + 1.49*(-0.2713)
+ (-1.33)*0.3579 + 1.01*0.0038 + (-1.03)*0.2283 + (-1.20)*0.3478
= 0.224 - 0.141 - 0.614 - 0.404 - 0.476 + 0.004 - 0.235 - 0.417
= -2.060
v1 = g1 - (-2.060) * e0 = g1 + 2.060 * e0
(Since proj is negative, we're ADDING a scaled e0 to g1.)
||v1|| = 2.4853
e1 = v1 / 2.4853
= [0.6323, 0.3805, -0.0269, 0.3757, -0.2374, 0.4113, -0.2233, -0.1953]
Verify: <e0, e1> = 0.5749*0.6323 + (-0.1185)*0.3805 + ... = 2.0 x 10^-16 ≈ 0 ✓
Rows 2-7: same process. Each row k subtracts projections onto
e0, e1, ..., e_{k-1}, then normalizes. After all 8 rows, we have Pi.
THE ACTUAL Pi MATRIX (8x8, used throughout this article):
═════════════════════════════════════════════════════════
dim0 dim1 dim2 dim3 dim4 dim5 dim6 dim7
┌────────────────────────────────────────────────────────────────────────┐
│ 0.5749 -0.1185 0.5297 -0.2713 0.3579 0.0038 0.2283 0.3478│ e0
│ 0.6323 0.3805 -0.0269 0.3757 -0.2374 0.4113 -0.2233 -0.1953│ e1
│ 0.0485 0.0090 0.0018 0.7615 0.0409 -0.4966 0.1353 0.3886│ e2
│ 0.0279 -0.7642 0.2188 0.2120 -0.4788 0.2508 0.1521 -0.0843│ e3
│ 0.3035 0.0251 0.0184 -0.3433 -0.5648 -0.6470 -0.2196 -0.0566│ e4
│ 0.2532 -0.0127 -0.6730 -0.1686 -0.1433 0.0855 0.6179 0.2114│ e5
│ -0.3153 0.3869 0.2877 -0.0948 -0.4841 0.2719 0.0736 0.5877│ e6
│ -0.1040 0.3265 0.3672 0.0713 -0.1037 -0.1463 0.6489 -0.5380│ e7
└────────────────────────────────────────────────────────────────────────┘
VERIFICATION — Pi^T @ Pi = I:
Every diagonal entry = 1.000000 (each row has unit length)
Every off-diagonal entry < 10^-15 (every pair of rows is perpendicular)
Notice: every entry is roughly ±0.1 to ±0.7. No entry is huge or tiny.
This is what makes the rotation spread energy evenly — each output
coordinate is a balanced weighted sum of ALL input coordinates.
In practice, PolarQuant uses d=128 (Llama's d_head), so Pi is 128x128.
The process is identical — just more rows to orthogonalize.
Let's see what a random rotation does to a KV cache vector. We'll use d=8 (simplified from the real d=128) with two outlier dimensions:
BEFORE ROTATION (8D vector with outliers):
══════════════════════════════════════════
x = [0.02, -0.05, 0.03, 87.4, 0.01, -0.04, 0.06, -42.1]
||x|| = sqrt(0.02^2 + 0.05^2 + 0.03^2 + 87.4^2 + 0.01^2
+ 0.04^2 + 0.06^2 + 42.1^2)
= sqrt(7638.76 + 1772.41 + 0.01) = sqrt(9411.18) = 97.01
"Energy" of a vector = sum of squared values (this is ||x||^2).
||x||^2 = 0.02^2 + 0.05^2 + 0.03^2 + 87.4^2 + 0.01^2
+ 0.04^2 + 0.06^2 + 42.1^2
= 0.0004 + 0.0025 + 0.0009 + 7638.76 + 0.0001
+ 0.0016 + 0.0036 + 1772.41
= 9411.18
Dim 3 alone: 87.4^2 = 7638.76. Share: 7638.76 / 9411.18 = 81.2%
Dim 7 alone: 42.1^2 = 1772.41. Share: 1772.41 / 9411.18 = 18.8%
Dims 3+7 together: 81.2% + 18.8% = 100.0% of the energy.
Other 6 dims combined: 0.0091 / 9411.18 = 0.0001% — negligible.
Range ratio: max |value| / min |value| = 87.4 / 0.01 = 8,740x
AFTER RANDOM ROTATION (y = Pi @ x, using the Pi from Section 4.0):
═══════════════════════════════════════════════════════════════════
Each output y_i = dot product of Pi's row i with x.
Let's trace y[0] to see exactly how the outlier energy gets spread:
y[0] = Pi[0][0]*x[0] + Pi[0][1]*x[1] + ... + Pi[0][7]*x[7]
= 0.5749*0.02 + (-0.1185)*(-0.05) + 0.5297*0.03 + (-0.2713)*87.4
+ 0.3579*0.01 + 0.0038*(-0.04) + 0.2283*0.06 + 0.3478*(-42.1)
= 0.01 + 0.01 + 0.02 + (-23.71) + 0.00 + (-0.00) + 0.01 + (-14.64)
= -38.3
The outlier x[3]=87.4 contributes -23.71 (via weight -0.2713).
The outlier x[7]=-42.1 contributes -14.64 (via weight 0.3478).
The small dims contribute ~0.05 total — negligible.
The two outliers are MIXED into this single output coordinate.
All 8 output coordinates are computed the same way — each is a
different weighted sum of all inputs. The result:
y = [-38.3, 41.0, 50.2, 22.1, -27.6, -23.6, -33.1, 28.9]
||y|| = sqrt(38.3^2 + 41.0^2 + 50.2^2 + 22.1^2 + 27.6^2
+ 23.6^2 + 33.1^2 + 28.9^2)
= sqrt(9411.18) = 97.01 ✓ (norm preserved)
Range of |y|: [22.1, 50.2]. Range ratio: 50.2 / 22.1 = 2.27x.
Compare to before: 8,740x → 2.27x.
Why are all values roughly the same magnitude? Each y_i^2 gets
an equal share of the total energy (by symmetry of the rotation):
E[y_i^2] = ||x||^2 / d = 9411.18 / 8 = 1176.4
Expected |y_i| ≈ sqrt(1176.4) = 34.3
Our actual values (22.1 to 50.2) scatter around this expected 34.3.
The outliers are gone. Every dimension carries similar energy.
A uniform quantization grid now works for ALL dimensions.
Now let's quantize both versions with 3 bits (8 levels) and compare the attention score error — because that's what actually matters for model quality.
First, we need a query vector to compute attention scores against. Let's pick one:
SETUP:
══════
Query vector (full precision, not quantized):
Q = [0.5, -0.3, 0.8, 0.1, -0.6, 0.4, -0.2, 0.7]
True attention score: Q . x = sum(Q[i] * x[i])
= 0.5*0.02 + (-0.3)*(-0.05) + 0.8*0.03 + 0.1*87.4
+ (-0.6)*0.01 + 0.4*(-0.04) + (-0.2)*0.06 + 0.7*(-42.1)
= 0.01 + 0.015 + 0.024 + 8.74 - 0.006 - 0.016 - 0.012 - 29.47
= -20.715
Now quantize x without rotation and see what happens to the score:
WITHOUT ROTATION — quantize x directly (3-bit):
════════════════════════════════════════════════
3 bits = 8 levels. For signed symmetric quantization, we use
levels: -3.5, -2.5, -1.5, -0.5, +0.5, +1.5, +2.5, +3.5
(8 levels centered on zero, spaced 1 apart, ranging from -3.5 to +3.5).
Scale = max_abs / 3.5 = 87.4 / 3.5 = 24.97
(We divide by 3.5 because the largest level is 3.5.)
To quantize: level = clamp(round(value / scale), -3.5, +3.5)
To dequantize: reconstructed = level x scale
Dim 0: round(0.02 / 24.97) = round(0.001) = 0 → 0 x 24.97 = 0.00
Dim 3: round(87.4 / 24.97) = round(3.50) = 3.5 → 3.5 x 24.97 = 87.40
Dim 7: round(-42.1 / 24.97) = round(-1.69) = -2 → -2 x 24.97 = -49.94
x_hat = [0, 0, 0, 87.4, 0, 0, 0, -49.9]
Six of eight dimensions round to zero — their information is gone.
Estimated score: Q . x_hat
= 0.5*0 + (-0.3)*0 + 0.8*0 + 0.1*87.4
+ (-0.6)*0 + 0.4*0 + (-0.2)*0 + 0.7*(-49.9)
= 8.74 - 34.93 = -26.19
Error: |-20.715 - (-26.19)| = 5.475
Relative error: 5.475 / 20.715 = 26.4%
The 26.4% error comes from dim 7: the true value was -42.1 but it quantized to -49.9 (an error of 7.8). That error gets multiplied by Q[7] = 0.7, contributing 0.7 × 7.8 = 5.46 to the score error. One badly quantized dimension dominates the entire score.
Now quantize the rotated vector y instead:
WITH ROTATION — quantize y = Pi@x (3-bit):
═══════════════════════════════════════════
Same 8 levels (-3.5 to +3.5). Scale = 50.2 / 3.5 = 14.34.
(The range is smaller because rotation spread the outliers.)
y = [-38.3, 41.0, 50.2, 22.1, -27.6, -23.6, -33.1, 28.9]
y[0]: round(-38.3 / 14.34) = round(-2.67) = -3 → -3 x 14.34 = -43.03
y[1]: round(41.0 / 14.34) = round(2.86) = 3 → 3 x 14.34 = 43.03
y[2]: round(50.2 / 14.34) = round(3.50) = 3.5 → 3.5 x 14.34 = 50.20
y[3]: round(22.1 / 14.34) = round(1.54) = 2 → 2 x 14.34 = 28.69
y[4]: round(-27.6 / 14.34) = round(-1.92) = -2 → -2 x 14.34 = -28.69
y[5]: round(-23.6 / 14.34) = round(-1.65) = -2 → -2 x 14.34 = -28.69
y[6]: round(-33.1 / 14.34) = round(-2.31) = -2 → -2 x 14.34 = -28.69
y[7]: round(28.9 / 14.34) = round(2.02) = 2 → 2 x 14.34 = 28.69
y_hat = [-43.03, 43.03, 50.20, 28.69, -28.69, -28.69, -28.69, 28.69]
Every dimension quantizes to a nearby level. No dimension is destroyed.
To compute the attention score in the rotated space, we rotate Q too: Q_rot = Pi @ Q. Then Q_rot · y_hat ≈ Q · x (inner products are preserved under rotation — we prove this in Section 4.3). The key question is: how large are the per-coordinate errors?
PER-COORDINATE ERROR COMPARISON:
════════════════════════════════
The maximum quantization error for any coordinate is half a step size:
max_error = scale / 2
Without rotation (scale = 24.97):
max_error = 24.97 / 2 = 12.49 per coordinate
But 6 of 8 dims have value < 0.06, which rounds to 0.
Those dims have ~100% relative error — their signal is erased.
With rotation (scale = 14.34):
max_error = 14.34 / 2 = 7.17 per coordinate
Every dim has a value between 22 and 50, well above the step size.
No dimension is erased. Every dim contributes meaningful signal.
The step size is smaller (14.34 vs 24.97) because the range is smaller
(50.2 vs 87.4). The range is smaller because rotation spread the outliers.
And crucially: all 8 dimensions now USE the quantization grid,
instead of 6 dims rounding to zero and wasting their grid levels.
At d=8, rotation reduced the range ratio from 8,740x to 2.27x. At d=128 (Llama's actual d_head), the effect is even stronger. The error we care about is the attention score error: the difference between the true inner product <Q, K> and the estimated inner product <Q, K_hat> after quantization. Let's trace how this error behaves as d grows.
First, we need to understand how tightly the rotated coordinates cluster. When we multiply a unit vector by a random orthogonal matrix, the total energy (sum of squares) is preserved and split equally across all d output coordinates:
WHY EACH COORDINATE HAS VARIANCE 1/d:
═════════════════════════════════════
For a unit vector x (||x|| = 1) rotated by random orthogonal Pi:
||Pi @ x||^2 = ||x||^2 = 1 (orthogonality preserves length)
sum(y_i^2, i=0..d-1) = 1 (expanding the norm)
By symmetry of a random orthogonal matrix, each y_i^2 has the
same expected value. There are d of them, and they sum to 1:
E[y_0^2] = E[y_1^2] = ... = E[y_{d-1}^2]
d x E[y_i^2] = 1
E[y_i^2] = 1/d
So: Var(y_i) = 1/d, and Std(y_i) = 1/sqrt(d).
Higher d means each coordinate is more tightly concentrated around zero. This directly controls the quantization step size — a tighter range means finer steps with the same number of levels:
COORDINATE SPREAD AND STEP SIZE vs DIMENSION:
══════════════════════════════════════════════
(For a unit vector, 3-bit quantization = 8 levels)
d Std = 1/sqrt(d) 99% range Step size Max error/coord
── ─────────────── ────────── ───────── ───────────────
8 0.354 +-0.912 0.228 0.114
32 0.177 +-0.456 0.114 0.057
128 0.0884 +-0.228 0.057 0.029
512 0.0442 +-0.114 0.029 0.014
99% range = +-2.576 x std (from the near-Gaussian distribution).
Step size = (2 x 2.576 x std) / 8 levels.
Max error per coordinate = step size / 2 (worst case: value is
exactly between two levels).
At d=128, the max error per coordinate (0.029) is 3.9x smaller than at d=8 (0.114). But we care about the attention score error, which is an inner product over all d coordinates. More coordinates means more individual errors being summed. Does the total error grow?
FROM PER-COORDINATE ERROR TO ATTENTION SCORE ERROR:
═══════════════════════════════════════════════════
The attention score error is:
|<Q, K> - <Q, K_hat>| = |<Q, K - K_hat>| = |sum(Q_i * error_i)|
Each error_i is bounded by the max error per coordinate.
If the errors were all the same sign, the worst case would be:
d x max_error x max(|Q_i|)
But after rotation, the errors are roughly random in sign
(some positive, some negative). Random errors partially cancel
when summed. The expected magnitude of the sum of d random
terms, each with std σ, is:
Expected |sum| ≈ sqrt(d) x σ
This is the "square root of d" law from probability.
Now plug in σ = max_error ≈ std / 8 ≈ (1/sqrt(d)) / 8:
Attention score error ≈ sqrt(d) x (1/sqrt(d)) / 8
= sqrt(d) / sqrt(d) / 8
= 1/8
= 0.125
The sqrt(d) from summing d terms EXACTLY CANCELS the 1/sqrt(d)
from each term being smaller. The total attention score error
stays bounded at roughly 1/(number of quantization levels),
regardless of dimension.
This is the key insight: higher d doesn't make things worse.
The per-coordinate error shrinks, and the number of coordinates
grows, and these two effects cancel perfectly.
An orthogonal rotation preserves two things exactly:
WHAT ORTHOGONAL ROTATION PRESERVES:
═══════════════════════════════════
1. VECTOR LENGTH: ||Pi @ x|| = ||x||
2. INNER PRODUCTS: <Pi @ x, Pi @ y> = <x, y>
Proof of (2):
(Pi @ x)^T @ (Pi @ y) = x^T @ Pi^T @ Pi @ y
= x^T @ I @ y (Pi^T @ Pi = I for orthogonal Pi)
= x^T @ y
= <x, y>
This means: if we rotate BOTH Q and K by the same Pi,
the attention score Q @ K^T is UNCHANGED.
The rotation is invisible to the attention computation.
No information is lost. No approximation is introduced by the rotation itself.
The only error comes from the quantization step.
Here's the complete pipeline for Llama-70B's KV cache. It has three parts: a one-time setup, a write path (when new tokens enter the cache), and a read path (when computing attention).
SETUP (done once, shared across all tokens and layers):
═══════════════════════════════════════════════════════
Generate random orthogonal matrix: Pi (128 x 128).
Method: Gram-Schmidt orthogonalization of a random Gaussian matrix
(same process from Section 4.0, but 128x128 instead of 8x8).
Storage: 128 x 128 x 2 bytes (BF16) = 32 KB. Negligible.
Precompute Lloyd-Max centroids for the Beta(d=128) distribution.
(Lloyd-Max finds the quantization levels minimizing MSE for a given
distribution. Since we know the distribution of rotated coordinates
— it's a Beta distribution determined by d — we can precompute the
optimal centroids once, offline, without seeing any data.)
8 centroids for 3-bit. Stored as 8 x 2 bytes = 16 bytes. Negligible.
WRITE PATH (when a new K or V vector enters the cache):
═══════════════════════════════════════════════════════
1. Separate magnitude: r = ||k||, k_unit = k / r.
2. Rotate: y = Pi @ k_unit.
Cost: 128 x 128 = 16,384 multiply-adds. Negligible.
3. Quantize: for each of 128 coords, find nearest centroid (3 comparisons).
Cost: 128 x 3 = 384 comparisons. Negligible.
4. Store: 128 indices x 3 bits + 1 norm x 16 bits = 400 bits per vector.
READ PATH (when computing attention for a new query token):
═══════════════════════════════════════════════════════════
1. Rotate the query: q_rot = Pi @ q.
Cost: 16,384 multiply-adds. Done ONCE per query token.
2. For each cached key: score = q_rot . k_quantized_rotated.
The dequantization (index → centroid lookup) is fused into the dot product.
No inverse rotation needed. Keys stay in rotated form.
How much storage does this save? Let's derive it from the per-vector cost:
STORAGE (Llama-70B, 1M tokens):
═══════════════════════════════
Per vector: 128 coords x 3 bits + 16 bits norm = 400 bits = 50 bytes
Per token per layer (K+V): 8 heads x 2 (K,V) x 50 bytes = 800 bytes
Per layer (1M tokens): 1M x 800 = 800 MB
Compare to BF16: 1M x 8 x 128 x 2 x 2 = 4,096 MB per layer
Compression: 4,096 / 800 = 5.12x
All 80 layers: 800 MB x 80 = 64,000 MB = 64.0 GB
BF16 total: 327.68 GB
Ratio: 327.68 / 64.0 = 5.12x
PolarQuant minimizes the mean squared error between the original and quantized vectors. That sounds like exactly what we want. But attention doesn't use vectors directly — it computes inner products between Q and K vectors. And a quantizer that minimizes MSE can introduce systematic bias in inner products. Let's see this happen with actual numbers.
TRACED EXAMPLE: HOW QUANTIZATION BIASES ATTENTION SCORES
════════════════════════════════════════════════════════
Setup: 2D vectors (for clarity). 2-bit quantization (4 levels).
Lloyd-Max centroids for uniform distribution on [-1, 1]:
[-0.75, -0.25, +0.25, +0.75]
(Lloyd-Max finds the centroids that minimize MSE for a given
distribution. For a uniform distribution on [-1, 1] with 4 levels,
the optimal centroids split the range into 4 equal bins:
[-1, -0.5], [-0.5, 0], [0, 0.5], [0.5, 1].
Each centroid is the midpoint of its bin:
(-1 + -0.5)/2 = -0.75, (-0.5 + 0)/2 = -0.25, etc.)
True query: Q = [0.70, 0.30]
True key 1: K1 = [-0.10, 0.80] (should score higher — more aligned with Q)
True key 2: K2 = [0.10, 0.10] (should score lower)
TRUE ATTENTION SCORES:
score(Q, K1) = 0.70 x (-0.10) + 0.30 x 0.80 = -0.07 + 0.24 = 0.17
score(Q, K2) = 0.70 x 0.10 + 0.30 x 0.10 = 0.07 + 0.03 = 0.10
K1 wins: 0.17 > 0.10. Correct — K1 is more aligned with Q.
Now quantize K1 and K2 (Q stays in full precision during inference):
QUANTIZE K1 AND K2:
═══════════════════
K1 = [-0.10, 0.80]:
-0.10 → nearest centroid: -0.25 (rounded DOWN by 0.15)
0.80 → nearest centroid: 0.75 (rounded DOWN by 0.05)
K1_hat = [-0.25, 0.75]
K2 = [0.10, 0.10]:
0.10 → nearest centroid: 0.25 (rounded UP by 0.15)
0.10 → nearest centroid: 0.25 (rounded UP by 0.15)
K2_hat = [0.25, 0.25]
ESTIMATED ATTENTION SCORES (Q full precision, K quantized):
score(Q, K1_hat) = 0.70 x (-0.25) + 0.30 x 0.75 = -0.175 + 0.225 = 0.05
score(Q, K2_hat) = 0.70 x 0.25 + 0.30 x 0.25 = 0.175 + 0.075 = 0.25
K2 wins: 0.25 > 0.05. ← THE RANKING FLIPPED.
The model now thinks K2 is more relevant than K1.
In reality, K1 was the better match (0.17 vs 0.10).
Quantization didn't just add noise — it reversed the answer.
Why did this happen? The bias comes from a fundamental property of rounding: it's deterministic, not random. K1's values both got rounded down (negative bias), while K2's values both got rounded up (positive bias). When you take inner products, these correlated errors don't cancel — they accumulate in the same direction.
WHY THE RANKING FLIPPED — TRACING THE BIAS:
════════════════════════════════════════════
K1 quantization errors:
e1[0] = -0.25 - (-0.10) = -0.15 (rounded down)
e1[1] = 0.75 - 0.80 = -0.05 (rounded down)
K1 bias = Q[0]*e1[0] + Q[1]*e1[1]
= 0.70*(-0.15) + 0.30*(-0.05)
= -0.105 - 0.015
= -0.12 (K1's score pushed DOWN)
K2 quantization errors:
e2[0] = 0.25 - 0.10 = +0.15 (rounded up)
e2[1] = 0.25 - 0.10 = +0.15 (rounded up)
K2 bias = Q[0]*e2[0] + Q[1]*e2[1]
= 0.70*(+0.15) + 0.30*(+0.15)
= +0.105 + 0.045
= +0.15 (K2's score pushed UP)
Combined swing: 0.15 - (-0.12) = 0.27.
True gap was only 0.17 - 0.10 = 0.07.
The bias (0.27) is 3.9x larger than the true gap (0.07).
The ranking never had a chance.
The errors are DETERMINISTIC: -0.10 always rounds to -0.25 (always -0.15),
and 0.10 always rounds to 0.25 (always +0.15). If the errors were random,
they'd cancel in expectation. But quantization errors are systematic —
that's the source of the bias.
QJL (Quantized Johnson-Lindenstrauss) fixes the bias by adding a correction term based on the quantization residual. The name comes from the Johnson-Lindenstrauss lemma, a classical result in mathematics: random projections from high dimensions to low dimensions preserve pairwise distances (and therefore inner products) approximately. QJL applies this idea with an extreme twist — project down and keep only the signs (1 bit each).
THE JOHNSON-LINDENSTRAUSS IDEA (simplified):
════════════════════════════════════════════
If you project a vector from d dimensions to m dimensions using
a random matrix S, the projection preserves inner products
in expectation:
E[<S @ x, S @ y>] = <x, y>
The error decreases as m increases (proportional to 1/sqrt(m)).
QJL goes further: instead of storing the m projected values in
full precision, store only their SIGNS (+1 or -1). Each sign
is 1 bit. The inner product can still be estimated from the
pattern of sign agreements.
Here's the plan. QJL adds a cheap correction to PolarQuant's biased scores. The full process has three stages, and it helps to see the big picture before we trace the math:
The problem: PolarQuant's quantized keys give biased attention scores (we just saw a ranking flip). The bias equals <Q, e>, where e is the quantization residual (the error PolarQuant introduced). If we could compute <Q, e> exactly, we'd subtract it and get the true score. But we don't store e — that would cost as many bits as the original.
QJL's trick: at cache time, project the residual e through a random sign matrix and store only the signs (1 bit each). At inference time, project Q through the same matrix, then combine Q's projections with the stored signs to estimate <Q, e>. The estimate is noisy but unbiased — the expected value is exactly <Q, e>. The noise shrinks as you use more sign bits.
The three steps:
Let's trace each step on our 2D example from Section 5.1.
First, compute the quantization residual — the error that PolarQuant introduced:
QJL STEP 1: COMPUTE THE RESIDUAL
═════════════════════════════════
After PolarQuant:
True K1 = [-0.10, 0.80]
Quantized K1_hat = [-0.25, 0.75]
Residual e1 = K1 - K1_hat = [-0.10-(-0.25), 0.80-0.75] = [+0.15, +0.05]
This residual is the information PolarQuant lost.
If we could store it exactly, we'd have perfect reconstruction.
But storing it exactly costs as many bits as the original.
QJL's trick: compress the residual to just 1 bit per projection.
Next, project the residual through a random sign matrix and keep only the signs:
QJL STEP 2: PROJECT AND STORE SIGNS
════════════════════════════════════
Choose m = 4 random projections (in practice, m = 48-128).
Random sign matrix S (4 x 2), each entry is +1 or -1:
S = [ +1 -1 ]
[ -1 -1 ]
[ +1 +1 ]
[ -1 +1 ]
Project the residual: z = S @ e1
z[0] = (+1)(+0.15) + (-1)(+0.05) = +0.15 - 0.05 = +0.10
z[1] = (-1)(+0.15) + (-1)(+0.05) = -0.15 - 0.05 = -0.20
z[2] = (+1)(+0.15) + (+1)(+0.05) = +0.15 + 0.05 = +0.20
z[3] = (-1)(+0.15) + (+1)(+0.05) = -0.15 + 0.05 = -0.10
Store only the signs (using sign(0) = -1 by convention):
b_K1 = sign(z) = [+1, -1, +1, -1]
Storage: 4 bits. That's all we keep from the residual.
At inference time, when a query arrives, we estimate the inner product correction using the stored signs:
QJL STEP 3: ESTIMATE THE CORRECTION AT INFERENCE
═════════════════════════════════════════════════
We need to estimate: <Q, e1> = Q[0]*e1[0] + Q[1]*e1[1]
= 0.70*(+0.15) + 0.30*(+0.05)
= +0.105 + 0.015 = +0.12
Project Q with the SAME matrix S:
z_Q = S @ Q:
z_Q[0] = (+1)(0.70) + (-1)(0.30) = +0.40
z_Q[1] = (-1)(0.70) + (-1)(0.30) = -1.00
z_Q[2] = (+1)(0.70) + (+1)(0.30) = +1.00
z_Q[3] = (-1)(0.70) + (+1)(0.30) = -0.40
Compute the correction estimate:
correction = (1/m) * sum(z_Q[i] * b_K1[i])
= (1/4) * (0.40*(+1) + (-1.00)*(-1) + 1.00*(+1) + (-0.40)*(-1))
= (1/4) * (0.40 + 1.00 + 1.00 + 0.40)
= (1/4) * 2.80
= +0.70
How good is this estimate?
QJL ACCURACY:
═════════════
True correction needed: +0.12
QJL estimate: +0.70
Error: |0.70 - 0.12| = 0.58
That's a big error. But we only used m=4 projections. Why does more projections help? Think of it like polling: if you ask 4 random people a yes/no question, the result is noisy. Ask 100 people, and the average is much closer to the true answer. Each projection is like one "vote" — individually noisy, but the average converges to the truth.
WHY MORE PROJECTIONS = LESS ERROR:
══════════════════════════════════
Each of the m projections gives an independent, noisy estimate.
The QJL correction averages all m estimates (the (1/m) * sum formula).
When you average m independent random variables:
- The variance of the average = (variance of one) / m
- The std dev of the average = (std dev of one) / sqrt(m)
So the typical error shrinks as 1/sqrt(m):
m=4: error ~ 0.58 (what we just computed)
m=16: error ~ 0.58 / sqrt(16/4) = 0.58 / 2 = 0.29
m=64: error ~ 0.58 / sqrt(64/4) = 0.58 / 4 = 0.15
m=128: error ~ 0.58 / sqrt(128/4) = 0.58 / 5.7 = 0.10
At m=64 (a practical setting), the error is ~4x smaller than at m=4.
THE KEY PROPERTY: THE ESTIMATOR IS UNBIASED
════════════════════════════════════════════
E[correction] = <Q, e_K> exactly.
"Unbiased" means: if you could repeat the QJL estimate many times
with different random sign matrices S, the AVERAGE of all estimates
would equal the true correction exactly. Any single estimate may
overshoot or undershoot, but there's no systematic lean in one direction.
This is the critical difference from PolarQuant's bias:
PolarQuant: every estimate is biased the SAME way (always too high or too low)
QJL: each estimate has random error that averages to zero
The systematic bias from PolarQuant is replaced by zero-mean random
noise from QJL. Over many tokens in the KV cache, these random
errors average out.
The final attention score estimate combines PolarQuant and QJL:
TURBOQUANT ATTENTION SCORE:
═══════════════════════════
True score: <Q, K>
TurboQuant: <Q, K_hat> + QJL_correction
─────────── ──────────────
from PolarQuant from QJL sign bits
(biased) (unbiased correction)
The PolarQuant term <Q, K_hat> is biased (as we showed: -0.12 for K1).
The QJL correction estimates <Q, K - K_hat> = <Q, e_K> without bias.
In expectation:
E[TurboQuant score] = <Q, K_hat> + E[QJL_correction]
= <Q, K_hat> + <Q, e_K>
= <Q, K_hat + e_K>
= <Q, K>
TurboQuant recovers the TRUE inner product in expectation.
The remaining error is zero-mean noise that decreases as m grows.
MEMORY BUDGET AT d = 128:
═════════════════════════
PolarQuant (3-bit):
128 coordinates x 3 bits = 384 bits
1 norm (FP16) = 16 bits
Subtotal: 400 bits
QJL (m = 64 sign bits):
64 sign bits = 64 bits
Total per vector: 400 + 64 = 464 bits
Per coordinate: 464 / 128 = 3.625 bits
The paper reports 3.5 bits achieves quality neutrality.
This is consistent with m ~ 48-64 sign bits
(384 + 16 + 48 = 448 bits → 448/128 = 3.5 bits exactly).
The paper tests TurboQuant on Gemma, Mistral, and Llama-3.1-8B-Instruct across five long-context benchmarks:
BENCHMARK RESULTS:
══════════════════
TurboQuant TurboQuant Full Cache
3.5-bit 2.5-bit (FP16)
───────── ───────── ──────────
LongBench 50.06 49.44 50.06
Needle in Haystack 100.0 99.8 100.0
ZeroSCROLLS matches near-match baseline
RULER matches near-match baseline
L-Eval matches near-match baseline
At 3.5 bits: IDENTICAL to full-precision on LongBench (50.06 = 50.06)
and perfect on Needle in a Haystack (100.0 = 100.0).
At 2.5 bits: marginal degradation.
LongBench: 49.44 vs 50.06 = -1.2% relative.
NIAH: 99.8 vs 100.0 = one missed needle in 500 tests.
How does TurboQuant compare to existing KV cache compression methods? A key distinction is whether the method is data-oblivious: meaning it doesn't need to see any data to set its parameters. TurboQuant's rotation matrix is random (no data needed), and its quantizer centroids are precomputed from the known mathematical distribution (no calibration needed). Compare this to methods that require running the model on representative data to calibrate their parameters.
COMPARISON VS EXISTING KV CACHE COMPRESSION METHODS:
════════════════════════════════════════════════════
Method Needs training? Unbiased? Compression Attn speedup
────── ─────────────── ───────── ─────────── ────────────
TurboQuant No (data-oblivious) Yes (QJL) 4.6-6.4x 8x (logits)
KIVI Calibration data No ~4x ~4x
SnapKV Finetuning No 2-4x 2-4x
DuQuant Calibration data Partial ~4x ~4x
Does this mean we don't need 32 GPUs anymore? Let's revisit the memory budget from Parts 1-5 with TurboQuant applied. Same setup: Llama-70B, 1M tokens, 32 A100-80GB GPUs.
REVISED MEMORY BUDGET (32 GPUs, Llama-70B, 1M tokens):
═══════════════════════════════════════════════════════
FP16 TurboQuant 3.5-bit
(Parts 1-5) (with TQ applied)
─────────── ─────────────────
Weights per GPU (FSDP) 4.38 GB 4.38 GB (unchanged)
KV cache per GPU 10.24 GB 2.24 GB (4.57x smaller)
Activations per GPU 0.50 GB 0.50 GB (unchanged)
Overhead 2.00 GB 2.00 GB
──────────────────────────────────────────────────────────
Total per GPU 17.12 GB 9.12 GB
Headroom (80 GB) 62.88 GB 70.88 GB
Derivation of KV cache per GPU with TQ:
71.68 GB total / 32 GPUs = 2.24 GB per GPU ✓
BATCH CAPACITY (how many concurrent 1M-token requests):
Fixed cost per GPU: 4.38 + 2.00 = 6.38 GB
Per-request cost:
FP16: 10.24 + 0.50 = 10.74 GB/request
TQ 3.5: 2.24 + 0.50 = 2.74 GB/request
Max batch = floor((80 - 6.38) / per_request_cost)
FP16: floor(73.62 / 10.74) = floor(6.85) = 6 requests
TQ 3.5: floor(73.62 / 2.74) = floor(26.87) = 26 requests
4.3x more concurrent requests on the same 32-GPU hardware.
Or flip it: how many GPUs do we need for the same workload?
GPU COUNT FOR LLAMA-70B, 1M TOKENS, BATCH=1:
═════════════════════════════════════════════
FP16 TQ 3.5-bit
──── ──────────
Weights 140.00 GB 140.00 GB
KV cache 327.68 GB 71.68 GB
Total 467.68 GB 211.68 GB
Min GPUs (80 GB) 6 3
For KV cache alone:
FP16: 327.68 / 80 = 4.10 → need 5 GPUs
TQ: 71.68 / 80 = 0.90 → fits on 1 GPU
The KV cache fits on a SINGLE GPU with TurboQuant.
At moderate context lengths (128K tokens):
FP16 KV cache: 327.68 x (128K/1M) = 41.94 GB → 1 GPU (tight)
TQ KV cache: 71.68 x (128K/1M) = 9.18 GB → 1 GPU (easy)
The results are impressive. But several important limitations deserve attention before you rewrite your infrastructure plans.
The paper makes a striking theoretical claim: TurboQuant is within a factor of ~2.7 of the information-theoretic lower bound on distortion. What does this mean?
In 1948, Claude Shannon proved that for any communication channel (or compression scheme), there is a fundamental limit on how much you can compress data before the error becomes unavoidable. This is the Shannon limit — a hard floor set by the laws of information theory, not by any particular algorithm.
For quantization, the Shannon limit says: if you store a value using b bits, the minimum possible mean squared error (MSE) is bounded below by a function of b and the distribution of the values. No algorithm — no matter how clever — can beat this bound.
THE SHANNON LIMIT FOR QUANTIZATION:
═══════════════════════════════════
For a d-dimensional vector quantized to b bits per coordinate:
MSE_min ≥ c(d) x 2^(-2b)
where c(d) depends on the distribution of the input values.
At b = 3.5 bits (TurboQuant's operating point):
2^(-2 x 3.5) = 2^(-7) = 1/128 = 0.0078
The paper shows TurboQuant achieves MSE ≈ 2.7 x MSE_min.
This means:
- TurboQuant is 2.7x away from the theoretical best.
- A perfect (impossible-to-build) quantizer would be 2.7x better.
- The remaining room for improvement is at most 2.7x — not 10x or 100x.
In practical terms: going from TurboQuant to the Shannon limit
would save at most log2(2.7) ≈ 1.4 additional bits per value.
That's the difference between 3.5 bits and ~2.1 bits.
The low-hanging fruit has been picked.
| Metric | Value |
|---|---|
| KV cache compression | 4.57x (3.5-bit) to 6.40x (2.5-bit) |
| Attention logit speedup (H100) | 8x (4-bit vs 32-bit keys) |
| Accuracy loss (3.5-bit) | Zero on LongBench, NIAH, and 3 other benchmarks |
| Accuracy loss (2.5-bit) | -1.2% relative on LongBench |
| Fine-tuning required | None (data-oblivious) |
| Calibration data required | None |
| Effective bits (PolarQuant 3-bit + QJL) | ~3.5 bits per value |
| Distance from Shannon limit | ~2.7x factor |
| Models tested | Gemma, Mistral, Llama-3.1-8B only |
| Venue | ICLR 2026 (April 23-25) |
| Concept | What it does | Key insight |
|---|---|---|
| Outlier problem | KV cache dims have 1000x range differences | Naive quantization destroys small-value dimensions |
| Normalization overhead | Per-dim scale factors cost 2 bytes each | At low bit-widths, scales dominate storage (89% at 2-bit) |
| PolarQuant | Random rotation + scalar quantization | Rotation spreads outliers; no per-dim constants needed |
| QJL | 1-bit sign projection of residual | Creates unbiased inner product estimator; ~0.5 bits overhead |
| TurboQuant | PolarQuant + QJL combined | Near-optimal compression at 3.5 bits with zero accuracy loss |
| Impact on our setup | KV cache: 10.24 → 2.24 GB/GPU | Max batch jumps from 6 to 26 on same 32-GPU hardware |
TurboQuant is a genuinely elegant algorithm. The random rotation trick is simple (one matrix multiply), the math is clean (known distribution, precomputed quantizer), and the results are strong (zero loss at 3.5 bits). Whether it reshapes the memory industry depends on factors the paper doesn't address: scaling to large models, production integration, and whether the AI industry's appetite for memory grows faster than compression can shrink it. But as a technical contribution, it's the most significant KV cache compression result of 2026 — and it's within striking distance of the theoretical limit.
@misc{fofadiya2026turboquant,
author = {Darshan Fofadiya},
title = {An Illustrated Deep Dive into TurboQuant: PolarQuant, QJL, and KV Cache Compression from First Principles},
year = {2026},
url = {https://darshanfofadiya.com/research-papers/turboquant/}
}
Get notified when I publish new deep dives
Illustrated guides on LLM inference, quantum computing, and AI research papers.
⚠ After subscribing, check your inbox for a confirmation email. You won't receive posts until you confirm.