Created by Darshan Fofadiya

← Back to all articles

TurboQuant: The KV Cache Compression That Crashed Memory Stocks

6x memory reduction, 8x attention speedup, zero accuracy loss — and what it actually means

By Darshan Fofadiya · March 25, 2026

1. Why Memory Stocks Crashed

On March 24, 2026, four researchers at Google — Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni — published a blog post about a compression algorithm called TurboQuant. Within hours, memory stocks were falling. Micron dropped 3%. Western Digital lost 4.7%. SanDisk fell 5.7%.

The investor logic was simple: if AI models need 6x less memory, the industry buys 6x fewer memory chips. The AI memory boom — which has driven record revenue for memory manufacturers — suddenly looked smaller.

But what actually is TurboQuant? Is the 6x claim real? Does it actually mean the AI industry needs less memory? And how does a single matrix multiply achieve what years of quantization research couldn't? Let's trace through the algorithm, the math, and the implications — every step derived, every number verified.

TurboQuant in Plain Language

Imagine you're packing books into boxes. Some books are huge (encyclopedias) and some are tiny (pocket guides). If you use one box size for everything, the tiny books rattle around wasting space, and the huge books don't fit. That's the problem with compressing the KV cache — some values are 1,000x larger than others, so one-size-fits-all compression wastes precision on the small values.

TurboQuant's trick: before packing, shuffle all the books randomly. After shuffling, every box gets a mix of big and small books, so they're all roughly the same size. Now one box size works perfectly. That's PolarQuant — a random rotation that makes all values similar in magnitude.

The second trick: after packing, check for any books that got slightly damaged in the process. Record just one bit per book — "slightly too big" or "slightly too small." That's QJL — a 1-bit error correction that fixes the compression artifacts.

Result: the KV cache shrinks from 16 bits per value to 3.5 bits per value. No accuracy loss. No training needed. The paper (arxiv 2504.19874) will be presented at ICLR 2026.


2. The KV Cache Problem

To understand whether TurboQuant's claims are real, we first need to understand what it's compressing. If you've been following our LLM Inference at Scale series, you know the KV cache intimately. Let's recap the numbers from Part 1.

During inference, every token the model has processed gets stored as Key and Value vectors in the KV cache. For Llama-70B with a 1M-token context:

KV CACHE SIZE (from Part 1 of our series):
══════════════════════════════════════════
Model: Llama-70B (80 layers, 8 KV heads, d_head = 128)
Sequence: 1,000,000 tokens
Precision: BF16 (2 bytes per value)

Per layer, per token:
  K: 8 heads x 128 dims = 1,024 values x 2 bytes = 2,048 bytes
  V: same = 2,048 bytes
  K + V per token per layer: 4,096 bytes = 4 KB

Per layer, all tokens:
  1,000,000 tokens x 4,096 bytes = 4,096,000,000 bytes = 4.096 GB

All 80 layers:
  4.096 GB x 80 = 327.68 GB

The model weights are 140 GB (70B params x 2 bytes).
KV cache / weights = 327.68 / 140 = 2.34x
The KV cache is 2.34x LARGER than the model itself.

This is why we spent Parts 2-5 building FSDP, Ulysses, and Ring Attention — to distribute this 327.68 GB across 32 GPUs (10.24 GB per GPU). TurboQuant attacks the problem from a completely different angle: instead of distributing the data, compress it.

What does the KV cache look like at different bit-widths? Let's derive it from the per-layer number:

KV CACHE AT DIFFERENT BIT-WIDTHS:
═════════════════════════════════
Starting point: 4.096 GB per layer at 16-bit (BF16).

To convert to b-bit: multiply by (b / 16).

                    Per layer       80 layers       Per GPU (32 GPUs)
                    ─────────       ─────────       ─────────────────
BF16 (16-bit)       4.096 GB        327.68 GB       10.24 GB
INT4 (4-bit)        1.024 GB         81.92 GB        2.56 GB
3.5-bit (TQ)        0.896 GB         71.68 GB        2.24 GB
2.5-bit (TQ)        0.640 GB         51.20 GB        1.60 GB

Derivation for 3.5-bit:
  4.096 GB x (3.5 / 16) = 4.096 x 0.21875 = 0.896 GB per layer
  0.896 x 80 = 71.68 GB total
  71.68 / 32 = 2.24 GB per GPU

Compression ratio:
  BF16 to 3.5-bit: 327.68 / 71.68 = 4.57x
  BF16 to 2.5-bit: 327.68 / 51.20 = 6.40x

At 3.5 bits, the entire KV cache is 71.68 GB — it fits on a single H100 (80 GB HBM). At 2.5 bits, it's 51.20 GB, leaving 28.8 GB of headroom for weights and activations.

What TurboQuant compresses and what it doesn't. TurboQuant targets the KV cache only, not model weights. Weights still need FSDP or tensor parallelism to distribute (140 GB for Llama-70B). And the attention computation itself (O(n²)) still needs sequence parallelism for long contexts. But the KV cache — which was the single largest memory consumer at 327.68 GB — shrinks by 4.57x to 6.40x.

But there's a catch. You can't just slap 3.5-bit quantization on the KV cache and call it a day. Standard quantization at low bit-widths destroys accuracy. Let's see why.


3. Why Standard Quantization Fails for KV Cache

The KV cache has a specific structure that makes naive quantization destructive: outliers. In transformer KV caches, the value distributions across dimensions are wildly different. Some dimensions have values in the range [-100, 100]. Others are in [-0.1, 0.1]. This is a well-documented phenomenon — it's why methods like SmoothQuant and AWQ exist.

THE OUTLIER PROBLEM:
════════════════════
KV cache values for one token, one KV head, across 8 dimensions
(simplified from the real d_head = 128):

  Dim:      0       1       2       3       4       5       6       7
  Value:  0.02   -0.05    0.03   87.4    0.01   -0.04    0.06   -42.1

Dims 3 and 7 are outliers (100-1000x larger than the rest).

Now let's try to quantize this vector using signed 4-bit integers (INT4). Signed INT4 has 16 levels: integers from -8 to +7. To map floating-point values to these 16 levels, we need a scale factor and a formula.

NAIVE INT4 QUANTIZATION (signed, 16 levels: -8 to +7):
═══════════════════════════════════════════════════════
Step 1: Find the range.
  max_abs = max(|87.4|, |-42.1|, ...) = 87.4
  We map the range [-87.4, +87.4] to integer levels [-8, +7].

Step 2: Compute the scale factor.
  scale = max_abs / 7 = 87.4 / 7 = 12.486
  (We use 7, not 8, because the positive range of signed INT4 is 0 to +7.
   This gives us symmetric quantization around zero.)

Step 3: Quantize each value.
  Formula: index = clamp(round(value / scale), -8, +7)

  Dim 0 (value = 0.02):
    index = round(0.02 / 12.486) = round(0.0016) = 0

  Dim 3 (value = 87.4):
    index = round(87.4 / 12.486) = round(6.999) = 7

  Dim 7 (value = -42.1):
    index = round(-42.1 / 12.486) = round(-3.372) = -3

Step 4: Dequantize (to see the error).
  Formula: reconstructed = index x scale

  Dim 0: reconstructed = 0 x 12.486 = 0.000
    Error: |0.02 - 0.000| = 0.020. Relative error: 100%.
    The value 0.02 is completely lost.

  Dim 3: reconstructed = 7 x 12.486 = 87.402
    Error: |87.4 - 87.402| = 0.002. Relative error: 0.002%.
    The outlier is preserved perfectly.

  Dim 7: reconstructed = -3 x 12.486 = -37.457
    Error: |-42.1 - (-37.457)| = 4.643. Relative error: 11.0%.
    Moderate error.

SUMMARY:
  Dim    Value    Index    Reconstructed    Abs Error    Rel Error
  ───    ─────    ─────    ─────────────    ─────────    ─────────
  0       0.02     0        0.000           0.020        100.0%
  1      -0.05     0        0.000           0.050        100.0%
  2       0.03     0        0.000           0.030        100.0%
  3      87.40     7       87.402           0.002          0.0%
  4       0.01     0        0.000           0.010        100.0%
  5      -0.04     0        0.000           0.040        100.0%
  6       0.06     0        0.000           0.060        100.0%
  7     -42.10    -3      -37.457           4.643         11.0%

6 out of 8 dimensions have 100% error — their information is destroyed.
The scale factor (12.486) is set by the outlier (87.4), so small values
all round to zero. This is catastrophic for model quality.

The standard fix is per-dimension normalization: give each dimension its own scale factor, quantize the normalized values, and store the scale factors alongside the quantized data.

PER-DIMENSION NORMALIZATION:
════════════════════════════
For each dimension d in the KV vector:
  scale_d = max(|values across all tokens in dim d|)
  normalized = value / scale_d          (now in [-1, 1])
  index = round(normalized x 7)         (maps [-1,1] to [-7,+7])
  reconstructed = (index / 7) x scale_d (to dequantize)

This gives each dimension its own range. Small dimensions get
a small scale factor, so their values don't round to zero.

But we must STORE one scale factor per dimension.
Let's compute the overhead for our Llama-70B setup.
STORAGE COST WITH PER-DIMENSION NORMALIZATION:
══════════════════════════════════════════════
KV cache per token per layer:
  8 KV heads x 128 dims = 1,024 values

Quantized values (signed INT4 = 4 bits each):
  1,024 values x 4 bits = 4,096 bits = 512 bytes per token per layer

Scale factors (one FP16 per dimension):
  1,024 dimensions x 2 bytes (FP16) = 2,048 bytes per token per layer

Total per token per layer:
  512 bytes (quantized) + 2,048 bytes (scales) = 2,560 bytes

Compare to BF16 (no quantization):
  1,024 values x 2 bytes = 2,048 bytes per token per layer

Wait — the "compressed" version (2,560 bytes) is LARGER than the
original (2,048 bytes)! The scale factors alone (2,048 bytes) are
as large as the entire uncompressed KV cache.

Per-dimension normalization at 4-bit doesn't compress at all.
It makes things WORSE.
The normalization overhead trap. Per-dimension normalization stores one FP16 scale factor (2 bytes) per dimension per token. With 1,024 dimensions per token, that's 2,048 bytes of overhead — exactly equal to the uncompressed BF16 storage. At 4-bit quantization, the overhead exceeds the savings. At 2-bit, the quantized values shrink to 1,024 × 2 bits = 256 bytes, but the 2,048 bytes of scales remain — scales are 2,048 / (256 + 2,048) = 89% of total storage. You can't compress below the overhead floor.

In practice, systems use coarser normalization — one scale per group of 32 or 128 values instead of per-dimension. This reduces overhead but introduces quantization error within each group. The tradeoff: less overhead but more error. TurboQuant eliminates this tradeoff entirely by removing the need for scale factors altogether.


4. PolarQuant: The Random Rotation Trick

4.0 Building the Rotation Matrix

PolarQuant rotates every KV cache vector by a random orthogonal matrix Pi before quantizing. But what is an orthogonal matrix, and how do we build one? This subsection answers both questions from scratch.

An orthogonal matrix is a square matrix where every row is a unit vector (length 1) and every pair of rows is perpendicular (dot product = 0). The defining property is PiT @ Pi = I. Geometrically, multiplying a vector by an orthogonal matrix rotates it — the direction changes but the length stays exactly the same. This is why it's useful: we can rearrange the values in a vector without losing any information.

To build a random orthogonal matrix, we start with a random matrix and orthogonalize it using a process called Gram-Schmidt. The idea: take a set of random vectors and make them perpendicular to each other, one at a time.

The Gram-Schmidt Intuition

Imagine you have a set of arrows (vectors) pointing in random directions in 8D space. Some arrows might be nearly parallel, others nearly perpendicular. Gram-Schmidt takes these messy arrows and produces a new set where every arrow is exactly perpendicular to every other arrow, and every arrow has length exactly 1.

The idea is simple: process the arrows one at a time. For each new arrow, subtract out any component that points along an arrow you've already processed. What's left is the part that's genuinely new — perpendicular to everything before it. Then scale it to length 1.

GRAM-SCHMIDT: THE ALGORITHM
═══════════════════════════
Input: d vectors g0, g1, ..., g_{d-1} in R^d (the rows of a random matrix G).
Output: d orthonormal vectors e0, e1, ..., e_{d-1} (the rows of Pi).

  e0 = g0 / ||g0||
    (Just normalize the first vector to unit length.)

  For each subsequent vector g_k (k = 1, 2, ..., d-1):
    1. Compute projections onto all previous e's:
         proj_j = <g_k, e_j>    for j = 0, 1, ..., k-1
       (This measures "how much of g_k points along e_j".)

    2. Subtract all projections:
         v_k = g_k - proj_0 * e0 - proj_1 * e1 - ... - proj_{k-1} * e_{k-1}
       (Remove every component that overlaps with previous vectors.
        What remains is ONLY the part perpendicular to all of them.)

    3. Normalize:
         e_k = v_k / ||v_k||
       (Scale to unit length.)

  Result: e0, e1, ..., e_{d-1} are mutually perpendicular unit vectors.
  Stack them as rows → orthogonal matrix Pi.

Why does step 2 guarantee perpendicularity? Because we subtracted out exactly the component along each previous vector. After subtraction, the dot product with any previous vector is zero:

WHY SUBTRACTION MAKES VECTORS PERPENDICULAR:
════════════════════════════════════════════
After step 2: v_k = g_k - <g_k, e0> * e0 - <g_k, e1> * e1 - ...

Check <v_k, e0>:
  <v_k, e0> = <g_k, e0> - <g_k, e0> * <e0, e0> - <g_k, e1> * <e1, e0> - ...
             = <g_k, e0> - <g_k, e0> * 1       - <g_k, e1> * 0       - ...
                            ─────────────────       ─────────────────
                            (e0 is unit: <e0,e0>=1)  (e0 ⊥ e1: <e1,e0>=0)
             = <g_k, e0> - <g_k, e0>
             = 0  ✓

The same argument works for every previous vector.
This is why Gram-Schmidt always produces perpendicular vectors.

Building Our Actual 8×8 Pi

Let's trace the process on the actual matrix we'll use throughout this article. We start by drawing a random 8×8 Gaussian matrix G:

STEP 1: RANDOM GAUSSIAN MATRIX G
═════════════════════════════════
Draw each entry independently from N(0, 1) (mean 0, std dev 1).

G (first 2 of 8 rows shown):
  g0 = [ 1.70, -0.35,  1.56, -0.80,  1.06,  0.01,  0.67,  1.03]
  g1 = [ 0.39,  1.19, -1.16,  1.49, -1.33,  1.01, -1.03, -1.20]
  ... (6 more rows)

These rows point in random directions and have random lengths.
They are NOT perpendicular: <g0, g1> ≠ 0 in general.
STEP 2: GRAM-SCHMIDT, TRACED
═════════════════════════════
Row 0 — just normalize:
  ||g0|| = sqrt(1.70^2 + 0.35^2 + 1.56^2 + 0.80^2 + 1.06^2
               + 0.01^2 + 0.67^2 + 1.03^2)
         = sqrt(2.89 + 0.12 + 2.43 + 0.64 + 1.12 + 0.00 + 0.45 + 1.06)
         = sqrt(8.72) = 2.9524

  e0 = g0 / 2.9524
     = [0.5749, -0.1185, 0.5297, -0.2713, 0.3579, 0.0038, 0.2283, 0.3478]
  ||e0|| = 1.0  ✓

Row 1 — subtract projection onto e0, then normalize:
  proj = <g1, e0>
       = 0.39*0.5749 + 1.19*(-0.1185) + (-1.16)*0.5297 + 1.49*(-0.2713)
         + (-1.33)*0.3579 + 1.01*0.0038 + (-1.03)*0.2283 + (-1.20)*0.3478
       = 0.224 - 0.141 - 0.614 - 0.404 - 0.476 + 0.004 - 0.235 - 0.417
       = -2.060

  v1 = g1 - (-2.060) * e0 = g1 + 2.060 * e0
     (Since proj is negative, we're ADDING a scaled e0 to g1.)

  ||v1|| = 2.4853
  e1 = v1 / 2.4853
     = [0.6323, 0.3805, -0.0269, 0.3757, -0.2374, 0.4113, -0.2233, -0.1953]

  Verify: <e0, e1> = 0.5749*0.6323 + (-0.1185)*0.3805 + ... = 2.0 x 10^-16 ≈ 0  ✓

Rows 2-7: same process. Each row k subtracts projections onto
  e0, e1, ..., e_{k-1}, then normalizes. After all 8 rows, we have Pi.
THE ACTUAL Pi MATRIX (8x8, used throughout this article):
═════════════════════════════════════════════════════════
       dim0     dim1     dim2     dim3     dim4     dim5     dim6     dim7
  ┌────────────────────────────────────────────────────────────────────────┐
  │  0.5749  -0.1185   0.5297  -0.2713   0.3579   0.0038   0.2283  0.3478│ e0
  │  0.6323   0.3805  -0.0269   0.3757  -0.2374   0.4113  -0.2233 -0.1953│ e1
  │  0.0485   0.0090   0.0018   0.7615   0.0409  -0.4966   0.1353  0.3886│ e2
  │  0.0279  -0.7642   0.2188   0.2120  -0.4788   0.2508   0.1521 -0.0843│ e3
  │  0.3035   0.0251   0.0184  -0.3433  -0.5648  -0.6470  -0.2196 -0.0566│ e4
  │  0.2532  -0.0127  -0.6730  -0.1686  -0.1433   0.0855   0.6179  0.2114│ e5
  │ -0.3153   0.3869   0.2877  -0.0948  -0.4841   0.2719   0.0736  0.5877│ e6
  │ -0.1040   0.3265   0.3672   0.0713  -0.1037  -0.1463   0.6489 -0.5380│ e7
  └────────────────────────────────────────────────────────────────────────┘

VERIFICATION — Pi^T @ Pi = I:
  Every diagonal entry = 1.000000 (each row has unit length)
  Every off-diagonal entry < 10^-15 (every pair of rows is perpendicular)

Notice: every entry is roughly ±0.1 to ±0.7. No entry is huge or tiny.
This is what makes the rotation spread energy evenly — each output
coordinate is a balanced weighted sum of ALL input coordinates.

In practice, PolarQuant uses d=128 (Llama's d_head), so Pi is 128x128.
The process is identical — just more rows to orthogonalize.

4.1 What Rotation Does — An 8D Example

Let's see what a random rotation does to a KV cache vector. We'll use d=8 (simplified from the real d=128) with two outlier dimensions:

BEFORE ROTATION (8D vector with outliers):
══════════════════════════════════════════
x = [0.02, -0.05, 0.03, 87.4, 0.01, -0.04, 0.06, -42.1]

  ||x|| = sqrt(0.02^2 + 0.05^2 + 0.03^2 + 87.4^2 + 0.01^2
               + 0.04^2 + 0.06^2 + 42.1^2)
        = sqrt(7638.76 + 1772.41 + 0.01) = sqrt(9411.18) = 97.01

  "Energy" of a vector = sum of squared values (this is ||x||^2).
  ||x||^2 = 0.02^2 + 0.05^2 + 0.03^2 + 87.4^2 + 0.01^2
           + 0.04^2 + 0.06^2 + 42.1^2
         = 0.0004 + 0.0025 + 0.0009 + 7638.76 + 0.0001
           + 0.0016 + 0.0036 + 1772.41
         = 9411.18

  Dim 3 alone: 87.4^2 = 7638.76.  Share: 7638.76 / 9411.18 = 81.2%
  Dim 7 alone: 42.1^2 = 1772.41.  Share: 1772.41 / 9411.18 = 18.8%
  Dims 3+7 together: 81.2% + 18.8% = 100.0% of the energy.
  Other 6 dims combined: 0.0091 / 9411.18 = 0.0001% — negligible.

  Range ratio: max |value| / min |value| = 87.4 / 0.01 = 8,740x
AFTER RANDOM ROTATION (y = Pi @ x, using the Pi from Section 4.0):
═══════════════════════════════════════════════════════════════════
Each output y_i = dot product of Pi's row i with x.
Let's trace y[0] to see exactly how the outlier energy gets spread:

  y[0] = Pi[0][0]*x[0] + Pi[0][1]*x[1] + ... + Pi[0][7]*x[7]
       = 0.5749*0.02 + (-0.1185)*(-0.05) + 0.5297*0.03 + (-0.2713)*87.4
         + 0.3579*0.01 + 0.0038*(-0.04) + 0.2283*0.06 + 0.3478*(-42.1)
       = 0.01 + 0.01 + 0.02 + (-23.71) + 0.00 + (-0.00) + 0.01 + (-14.64)
       = -38.3

  The outlier x[3]=87.4 contributes -23.71 (via weight -0.2713).
  The outlier x[7]=-42.1 contributes -14.64 (via weight 0.3478).
  The small dims contribute ~0.05 total — negligible.
  The two outliers are MIXED into this single output coordinate.

All 8 output coordinates are computed the same way — each is a
different weighted sum of all inputs. The result:
  y = [-38.3, 41.0, 50.2, 22.1, -27.6, -23.6, -33.1, 28.9]

  ||y|| = sqrt(38.3^2 + 41.0^2 + 50.2^2 + 22.1^2 + 27.6^2
               + 23.6^2 + 33.1^2 + 28.9^2)
        = sqrt(9411.18) = 97.01  ✓ (norm preserved)

  Range of |y|: [22.1, 50.2]. Range ratio: 50.2 / 22.1 = 2.27x.
  Compare to before: 8,740x → 2.27x.

  Why are all values roughly the same magnitude? Each y_i^2 gets
  an equal share of the total energy (by symmetry of the rotation):
    E[y_i^2] = ||x||^2 / d = 9411.18 / 8 = 1176.4
    Expected |y_i| ≈ sqrt(1176.4) = 34.3
  Our actual values (22.1 to 50.2) scatter around this expected 34.3.

  The outliers are gone. Every dimension carries similar energy.
  A uniform quantization grid now works for ALL dimensions.

Now let's quantize both versions with 3 bits (8 levels) and compare the attention score error — because that's what actually matters for model quality.

First, we need a query vector to compute attention scores against. Let's pick one:

SETUP:
══════
Query vector (full precision, not quantized):
  Q = [0.5, -0.3, 0.8, 0.1, -0.6, 0.4, -0.2, 0.7]

True attention score: Q . x = sum(Q[i] * x[i])
  = 0.5*0.02 + (-0.3)*(-0.05) + 0.8*0.03 + 0.1*87.4
    + (-0.6)*0.01 + 0.4*(-0.04) + (-0.2)*0.06 + 0.7*(-42.1)
  = 0.01 + 0.015 + 0.024 + 8.74 - 0.006 - 0.016 - 0.012 - 29.47
  = -20.715

Now quantize x without rotation and see what happens to the score:

WITHOUT ROTATION — quantize x directly (3-bit):
════════════════════════════════════════════════
3 bits = 8 levels. For signed symmetric quantization, we use
levels: -3.5, -2.5, -1.5, -0.5, +0.5, +1.5, +2.5, +3.5
(8 levels centered on zero, spaced 1 apart, ranging from -3.5 to +3.5).

Scale = max_abs / 3.5 = 87.4 / 3.5 = 24.97
(We divide by 3.5 because the largest level is 3.5.)
To quantize: level = clamp(round(value / scale), -3.5, +3.5)
To dequantize: reconstructed = level x scale

  Dim 0: round(0.02 / 24.97) = round(0.001) = 0   → 0 x 24.97 = 0.00
  Dim 3: round(87.4 / 24.97) = round(3.50)  = 3.5 → 3.5 x 24.97 = 87.40
  Dim 7: round(-42.1 / 24.97) = round(-1.69) = -2 → -2 x 24.97 = -49.94

x_hat = [0, 0, 0, 87.4, 0, 0, 0, -49.9]
Six of eight dimensions round to zero — their information is gone.

Estimated score: Q . x_hat
  = 0.5*0 + (-0.3)*0 + 0.8*0 + 0.1*87.4
    + (-0.6)*0 + 0.4*0 + (-0.2)*0 + 0.7*(-49.9)
  = 8.74 - 34.93 = -26.19

Error: |-20.715 - (-26.19)| = 5.475
Relative error: 5.475 / 20.715 = 26.4%

The 26.4% error comes from dim 7: the true value was -42.1 but it quantized to -49.9 (an error of 7.8). That error gets multiplied by Q[7] = 0.7, contributing 0.7 × 7.8 = 5.46 to the score error. One badly quantized dimension dominates the entire score.

Now quantize the rotated vector y instead:

WITH ROTATION — quantize y = Pi@x (3-bit):
═══════════════════════════════════════════
Same 8 levels (-3.5 to +3.5). Scale = 50.2 / 3.5 = 14.34.
(The range is smaller because rotation spread the outliers.)

  y = [-38.3, 41.0, 50.2, 22.1, -27.6, -23.6, -33.1, 28.9]

  y[0]: round(-38.3 / 14.34) = round(-2.67) = -3  → -3 x 14.34 = -43.03
  y[1]: round(41.0 / 14.34)  = round(2.86)  = 3   →  3 x 14.34 = 43.03
  y[2]: round(50.2 / 14.34)  = round(3.50)  = 3.5 →  3.5 x 14.34 = 50.20
  y[3]: round(22.1 / 14.34)  = round(1.54)  = 2   →  2 x 14.34 = 28.69
  y[4]: round(-27.6 / 14.34) = round(-1.92) = -2  → -2 x 14.34 = -28.69
  y[5]: round(-23.6 / 14.34) = round(-1.65) = -2  → -2 x 14.34 = -28.69
  y[6]: round(-33.1 / 14.34) = round(-2.31) = -2  → -2 x 14.34 = -28.69
  y[7]: round(28.9 / 14.34)  = round(2.02)  = 2   →  2 x 14.34 = 28.69

  y_hat = [-43.03, 43.03, 50.20, 28.69, -28.69, -28.69, -28.69, 28.69]

Every dimension quantizes to a nearby level. No dimension is destroyed.

To compute the attention score in the rotated space, we rotate Q too: Q_rot = Pi @ Q. Then Q_rot · y_hat ≈ Q · x (inner products are preserved under rotation — we prove this in Section 4.3). The key question is: how large are the per-coordinate errors?

PER-COORDINATE ERROR COMPARISON:
════════════════════════════════
The maximum quantization error for any coordinate is half a step size:
  max_error = scale / 2

Without rotation (scale = 24.97):
  max_error = 24.97 / 2 = 12.49 per coordinate
  But 6 of 8 dims have value < 0.06, which rounds to 0.
  Those dims have ~100% relative error — their signal is erased.

With rotation (scale = 14.34):
  max_error = 14.34 / 2 = 7.17 per coordinate
  Every dim has a value between 22 and 50, well above the step size.
  No dimension is erased. Every dim contributes meaningful signal.

The step size is smaller (14.34 vs 24.97) because the range is smaller
(50.2 vs 87.4). The range is smaller because rotation spread the outliers.
And crucially: all 8 dimensions now USE the quantization grid,
instead of 6 dims rounding to zero and wasting their grid levels.
We never "rotate back." Notice we didn't reconstruct the original vector x from the quantized y_hat. We don't need to. We rotated Q to match (Q_rot = Pi @ Q), then computed the score directly in the rotated space. The inner product is preserved: <Pi@Q, Pi@K> = <Q, K>. The cached keys stay in the rotated, quantized form forever. The rotation matrix Pi is stored once (d×d × 2 bytes = 128×128×2 = 32 KB for Llama) and shared across all tokens and layers.

4.2 Why Higher Dimensions Make It Even Better

At d=8, rotation reduced the range ratio from 8,740x to 2.27x. At d=128 (Llama's actual d_head), the effect is even stronger. The error we care about is the attention score error: the difference between the true inner product <Q, K> and the estimated inner product <Q, K_hat> after quantization. Let's trace how this error behaves as d grows.

First, we need to understand how tightly the rotated coordinates cluster. When we multiply a unit vector by a random orthogonal matrix, the total energy (sum of squares) is preserved and split equally across all d output coordinates:

WHY EACH COORDINATE HAS VARIANCE 1/d:
═════════════════════════════════════
For a unit vector x (||x|| = 1) rotated by random orthogonal Pi:

  ||Pi @ x||^2 = ||x||^2 = 1          (orthogonality preserves length)
  sum(y_i^2, i=0..d-1) = 1            (expanding the norm)

By symmetry of a random orthogonal matrix, each y_i^2 has the
same expected value. There are d of them, and they sum to 1:

  E[y_0^2] = E[y_1^2] = ... = E[y_{d-1}^2]
  d x E[y_i^2] = 1
  E[y_i^2] = 1/d

So: Var(y_i) = 1/d, and Std(y_i) = 1/sqrt(d).

Higher d means each coordinate is more tightly concentrated around zero. This directly controls the quantization step size — a tighter range means finer steps with the same number of levels:

COORDINATE SPREAD AND STEP SIZE vs DIMENSION:
══════════════════════════════════════════════
(For a unit vector, 3-bit quantization = 8 levels)

  d      Std = 1/sqrt(d)    99% range        Step size       Max error/coord
  ──     ───────────────    ──────────       ─────────       ───────────────
  8      0.354              +-0.912          0.228           0.114
  32     0.177              +-0.456          0.114           0.057
  128    0.0884             +-0.228          0.057           0.029
  512    0.0442             +-0.114          0.029           0.014

  99% range = +-2.576 x std (from the near-Gaussian distribution).
  Step size = (2 x 2.576 x std) / 8 levels.
  Max error per coordinate = step size / 2 (worst case: value is
  exactly between two levels).

At d=128, the max error per coordinate (0.029) is 3.9x smaller than at d=8 (0.114). But we care about the attention score error, which is an inner product over all d coordinates. More coordinates means more individual errors being summed. Does the total error grow?

FROM PER-COORDINATE ERROR TO ATTENTION SCORE ERROR:
═══════════════════════════════════════════════════
The attention score error is:
  |<Q, K> - <Q, K_hat>| = |<Q, K - K_hat>| = |sum(Q_i * error_i)|

Each error_i is bounded by the max error per coordinate.
If the errors were all the same sign, the worst case would be:
  d x max_error x max(|Q_i|)

But after rotation, the errors are roughly random in sign
(some positive, some negative). Random errors partially cancel
when summed. The expected magnitude of the sum of d random
terms, each with std σ, is:

  Expected |sum| ≈ sqrt(d) x σ

This is the "square root of d" law from probability.

Now plug in σ = max_error ≈ std / 8 ≈ (1/sqrt(d)) / 8:

  Attention score error ≈ sqrt(d) x (1/sqrt(d)) / 8
                        = sqrt(d) / sqrt(d) / 8
                        = 1/8
                        = 0.125

The sqrt(d) from summing d terms EXACTLY CANCELS the 1/sqrt(d)
from each term being smaller. The total attention score error
stays bounded at roughly 1/(number of quantization levels),
regardless of dimension.

This is the key insight: higher d doesn't make things worse.
The per-coordinate error shrinks, and the number of coordinates
grows, and these two effects cancel perfectly.
Why PolarQuant achieves near-zero error at d=128. The attention score error is bounded by roughly 1/(number of levels), independent of d. At 3 bits (8 levels), the error is ~0.125 for unit vectors. At d=128, the coordinates are so tightly clustered (std = 0.088) that 8 quantization levels resolve them with high precision. The high dimensionality works for us, not against us.

4.3 Why Rotation Preserves All Information

An orthogonal rotation preserves two things exactly:

WHAT ORTHOGONAL ROTATION PRESERVES:
═══════════════════════════════════
1. VECTOR LENGTH:  ||Pi @ x|| = ||x||
2. INNER PRODUCTS: <Pi @ x, Pi @ y> = <x, y>

Proof of (2):
  (Pi @ x)^T @ (Pi @ y) = x^T @ Pi^T @ Pi @ y
                         = x^T @ I @ y     (Pi^T @ Pi = I for orthogonal Pi)
                         = x^T @ y
                         = <x, y>

This means: if we rotate BOTH Q and K by the same Pi,
the attention score Q @ K^T is UNCHANGED.
The rotation is invisible to the attention computation.
No information is lost. No approximation is introduced by the rotation itself.
The only error comes from the quantization step.

4.4 The Full PolarQuant Pipeline at d=128

Here's the complete pipeline for Llama-70B's KV cache. It has three parts: a one-time setup, a write path (when new tokens enter the cache), and a read path (when computing attention).

SETUP (done once, shared across all tokens and layers):
═══════════════════════════════════════════════════════
Generate random orthogonal matrix: Pi (128 x 128).
  Method: Gram-Schmidt orthogonalization of a random Gaussian matrix
  (same process from Section 4.0, but 128x128 instead of 8x8).
  Storage: 128 x 128 x 2 bytes (BF16) = 32 KB. Negligible.

Precompute Lloyd-Max centroids for the Beta(d=128) distribution.
  (Lloyd-Max finds the quantization levels minimizing MSE for a given
   distribution. Since we know the distribution of rotated coordinates
   — it's a Beta distribution determined by d — we can precompute the
   optimal centroids once, offline, without seeing any data.)
  8 centroids for 3-bit. Stored as 8 x 2 bytes = 16 bytes. Negligible.
WRITE PATH (when a new K or V vector enters the cache):
═══════════════════════════════════════════════════════
1. Separate magnitude: r = ||k||, k_unit = k / r.
2. Rotate: y = Pi @ k_unit.
   Cost: 128 x 128 = 16,384 multiply-adds. Negligible.
3. Quantize: for each of 128 coords, find nearest centroid (3 comparisons).
   Cost: 128 x 3 = 384 comparisons. Negligible.
4. Store: 128 indices x 3 bits + 1 norm x 16 bits = 400 bits per vector.
READ PATH (when computing attention for a new query token):
═══════════════════════════════════════════════════════════
1. Rotate the query: q_rot = Pi @ q.
   Cost: 16,384 multiply-adds. Done ONCE per query token.
2. For each cached key: score = q_rot . k_quantized_rotated.
   The dequantization (index → centroid lookup) is fused into the dot product.
   No inverse rotation needed. Keys stay in rotated form.

How much storage does this save? Let's derive it from the per-vector cost:

STORAGE (Llama-70B, 1M tokens):
═══════════════════════════════
Per vector: 128 coords x 3 bits + 16 bits norm = 400 bits = 50 bytes
Per token per layer (K+V): 8 heads x 2 (K,V) x 50 bytes = 800 bytes
Per layer (1M tokens): 1M x 800 = 800 MB

Compare to BF16: 1M x 8 x 128 x 2 x 2 = 4,096 MB per layer
Compression: 4,096 / 800 = 5.12x

All 80 layers: 800 MB x 80 = 64,000 MB = 64.0 GB
BF16 total: 327.68 GB
Ratio: 327.68 / 64.0 = 5.12x
PolarQuant alone gets ~5x compression at 3 bits. No per-dimension scale factors. No calibration data. No training. Just one random rotation matrix (32 KB, shared everywhere) and 8 precomputed centroids (16 bytes). The remaining question: is the inner product error from 3-bit quantization small enough for attention? That's where QJL comes in.

5. QJL: Why Reducing MSE Isn't Enough

5.1 The Problem: MSE-Optimal Doesn't Mean Attention-Optimal

PolarQuant minimizes the mean squared error between the original and quantized vectors. That sounds like exactly what we want. But attention doesn't use vectors directly — it computes inner products between Q and K vectors. And a quantizer that minimizes MSE can introduce systematic bias in inner products. Let's see this happen with actual numbers.

TRACED EXAMPLE: HOW QUANTIZATION BIASES ATTENTION SCORES
════════════════════════════════════════════════════════
Setup: 2D vectors (for clarity). 2-bit quantization (4 levels).
Lloyd-Max centroids for uniform distribution on [-1, 1]:
  [-0.75, -0.25, +0.25, +0.75]

  (Lloyd-Max finds the centroids that minimize MSE for a given
   distribution. For a uniform distribution on [-1, 1] with 4 levels,
   the optimal centroids split the range into 4 equal bins:
   [-1, -0.5], [-0.5, 0], [0, 0.5], [0.5, 1].
   Each centroid is the midpoint of its bin:
   (-1 + -0.5)/2 = -0.75, (-0.5 + 0)/2 = -0.25, etc.)

True query:  Q = [0.70, 0.30]
True key 1:  K1 = [-0.10, 0.80]  (should score higher — more aligned with Q)
True key 2:  K2 = [0.10, 0.10]   (should score lower)

TRUE ATTENTION SCORES:
  score(Q, K1) = 0.70 x (-0.10) + 0.30 x 0.80 = -0.07 + 0.24 = 0.17
  score(Q, K2) = 0.70 x 0.10 + 0.30 x 0.10 = 0.07 + 0.03 = 0.10

  K1 wins: 0.17 > 0.10. Correct — K1 is more aligned with Q.

Now quantize K1 and K2 (Q stays in full precision during inference):

QUANTIZE K1 AND K2:
═══════════════════
  K1 = [-0.10, 0.80]:
    -0.10 → nearest centroid: -0.25 (rounded DOWN by 0.15)
     0.80 → nearest centroid:  0.75 (rounded DOWN by 0.05)
    K1_hat = [-0.25, 0.75]

  K2 = [0.10, 0.10]:
     0.10 → nearest centroid:  0.25 (rounded UP by 0.15)
     0.10 → nearest centroid:  0.25 (rounded UP by 0.15)
    K2_hat = [0.25, 0.25]

ESTIMATED ATTENTION SCORES (Q full precision, K quantized):
  score(Q, K1_hat) = 0.70 x (-0.25) + 0.30 x 0.75 = -0.175 + 0.225 = 0.05
  score(Q, K2_hat) = 0.70 x 0.25 + 0.30 x 0.25 = 0.175 + 0.075 = 0.25

  K2 wins: 0.25 > 0.05.  ← THE RANKING FLIPPED.

  The model now thinks K2 is more relevant than K1.
  In reality, K1 was the better match (0.17 vs 0.10).
  Quantization didn't just add noise — it reversed the answer.

Why did this happen? The bias comes from a fundamental property of rounding: it's deterministic, not random. K1's values both got rounded down (negative bias), while K2's values both got rounded up (positive bias). When you take inner products, these correlated errors don't cancel — they accumulate in the same direction.

WHY THE RANKING FLIPPED — TRACING THE BIAS:
════════════════════════════════════════════
K1 quantization errors:
  e1[0] = -0.25 - (-0.10) = -0.15  (rounded down)
  e1[1] =  0.75 - 0.80    = -0.05  (rounded down)

K1 bias = Q[0]*e1[0] + Q[1]*e1[1]
        = 0.70*(-0.15) + 0.30*(-0.05)
        = -0.105 - 0.015
        = -0.12  (K1's score pushed DOWN)

K2 quantization errors:
  e2[0] = 0.25 - 0.10 = +0.15  (rounded up)
  e2[1] = 0.25 - 0.10 = +0.15  (rounded up)

K2 bias = Q[0]*e2[0] + Q[1]*e2[1]
        = 0.70*(+0.15) + 0.30*(+0.15)
        = +0.105 + 0.045
        = +0.15  (K2's score pushed UP)

Combined swing: 0.15 - (-0.12) = 0.27.
True gap was only 0.17 - 0.10 = 0.07.
The bias (0.27) is 3.9x larger than the true gap (0.07).
The ranking never had a chance.

The errors are DETERMINISTIC: -0.10 always rounds to -0.25 (always -0.15),
and 0.10 always rounds to 0.25 (always +0.15). If the errors were random,
they'd cancel in expectation. But quantization errors are systematic —
that's the source of the bias.
Without correction, quantization flips the attention ranking. The true scores said K1 wins (0.17 > 0.10). After quantization, K2 wins (0.25 > 0.05). The model now attends to the wrong token. The bias (0.27 combined swing) is 3.9× larger than the true gap (0.07). This isn't noise — it's a systematic error that happens every time these values are quantized. PolarQuant alone is not enough.

5.2 QJL: Making the Estimator Unbiased

QJL (Quantized Johnson-Lindenstrauss) fixes the bias by adding a correction term based on the quantization residual. The name comes from the Johnson-Lindenstrauss lemma, a classical result in mathematics: random projections from high dimensions to low dimensions preserve pairwise distances (and therefore inner products) approximately. QJL applies this idea with an extreme twist — project down and keep only the signs (1 bit each).

THE JOHNSON-LINDENSTRAUSS IDEA (simplified):
════════════════════════════════════════════
If you project a vector from d dimensions to m dimensions using
a random matrix S, the projection preserves inner products
in expectation:

  E[<S @ x, S @ y>] = <x, y>

The error decreases as m increases (proportional to 1/sqrt(m)).

QJL goes further: instead of storing the m projected values in
full precision, store only their SIGNS (+1 or -1). Each sign
is 1 bit. The inner product can still be estimated from the
pattern of sign agreements.

Here's the plan. QJL adds a cheap correction to PolarQuant's biased scores. The full process has three stages, and it helps to see the big picture before we trace the math:

QJL at a Glance

The problem: PolarQuant's quantized keys give biased attention scores (we just saw a ranking flip). The bias equals <Q, e>, where e is the quantization residual (the error PolarQuant introduced). If we could compute <Q, e> exactly, we'd subtract it and get the true score. But we don't store e — that would cost as many bits as the original.

QJL's trick: at cache time, project the residual e through a random sign matrix and store only the signs (1 bit each). At inference time, project Q through the same matrix, then combine Q's projections with the stored signs to estimate <Q, e>. The estimate is noisy but unbiased — the expected value is exactly <Q, e>. The noise shrinks as you use more sign bits.

The three steps:

  1. At cache time: compute residual e = K - K_hat.
  2. At cache time: project e through random sign matrix S, store only the signs. Cost: m bits per key (m = 48-128).
  3. At inference time: project Q through the same S, combine with stored signs to estimate <Q, e>. Add this correction to PolarQuant's score.

Let's trace each step on our 2D example from Section 5.1.

First, compute the quantization residual — the error that PolarQuant introduced:

QJL STEP 1: COMPUTE THE RESIDUAL
═════════════════════════════════
After PolarQuant:
  True K1 = [-0.10, 0.80]
  Quantized K1_hat = [-0.25, 0.75]
  Residual e1 = K1 - K1_hat = [-0.10-(-0.25), 0.80-0.75] = [+0.15, +0.05]

This residual is the information PolarQuant lost.
If we could store it exactly, we'd have perfect reconstruction.
But storing it exactly costs as many bits as the original.
QJL's trick: compress the residual to just 1 bit per projection.

Next, project the residual through a random sign matrix and keep only the signs:

QJL STEP 2: PROJECT AND STORE SIGNS
════════════════════════════════════
Choose m = 4 random projections (in practice, m = 48-128).
Random sign matrix S (4 x 2), each entry is +1 or -1:

  S = [ +1  -1 ]
      [ -1  -1 ]
      [ +1  +1 ]
      [ -1  +1 ]

Project the residual: z = S @ e1
  z[0] = (+1)(+0.15) + (-1)(+0.05) = +0.15 - 0.05 = +0.10
  z[1] = (-1)(+0.15) + (-1)(+0.05) = -0.15 - 0.05 = -0.20
  z[2] = (+1)(+0.15) + (+1)(+0.05) = +0.15 + 0.05 = +0.20
  z[3] = (-1)(+0.15) + (+1)(+0.05) = -0.15 + 0.05 = -0.10

Store only the signs (using sign(0) = -1 by convention):
  b_K1 = sign(z) = [+1, -1, +1, -1]
  Storage: 4 bits. That's all we keep from the residual.

At inference time, when a query arrives, we estimate the inner product correction using the stored signs:

QJL STEP 3: ESTIMATE THE CORRECTION AT INFERENCE
═════════════════════════════════════════════════
We need to estimate: <Q, e1> = Q[0]*e1[0] + Q[1]*e1[1]
                              = 0.70*(+0.15) + 0.30*(+0.05)
                              = +0.105 + 0.015 = +0.12

Project Q with the SAME matrix S:
  z_Q = S @ Q:
    z_Q[0] = (+1)(0.70) + (-1)(0.30) = +0.40
    z_Q[1] = (-1)(0.70) + (-1)(0.30) = -1.00
    z_Q[2] = (+1)(0.70) + (+1)(0.30) = +1.00
    z_Q[3] = (-1)(0.70) + (+1)(0.30) = -0.40

Compute the correction estimate:
  correction = (1/m) * sum(z_Q[i] * b_K1[i])
             = (1/4) * (0.40*(+1) + (-1.00)*(-1) + 1.00*(+1) + (-0.40)*(-1))
             = (1/4) * (0.40 + 1.00 + 1.00 + 0.40)
             = (1/4) * 2.80
             = +0.70

How good is this estimate?

QJL ACCURACY:
═════════════
  True correction needed: +0.12
  QJL estimate:           +0.70
  Error: |0.70 - 0.12| = 0.58

That's a big error. But we only used m=4 projections. Why does more projections help? Think of it like polling: if you ask 4 random people a yes/no question, the result is noisy. Ask 100 people, and the average is much closer to the true answer. Each projection is like one "vote" — individually noisy, but the average converges to the truth.

WHY MORE PROJECTIONS = LESS ERROR:
══════════════════════════════════
Each of the m projections gives an independent, noisy estimate.
The QJL correction averages all m estimates (the (1/m) * sum formula).

When you average m independent random variables:
  - The variance of the average = (variance of one) / m
  - The std dev of the average  = (std dev of one) / sqrt(m)

So the typical error shrinks as 1/sqrt(m):
  m=4:    error ~ 0.58  (what we just computed)
  m=16:   error ~ 0.58 / sqrt(16/4) = 0.58 / 2 = 0.29
  m=64:   error ~ 0.58 / sqrt(64/4) = 0.58 / 4 = 0.15
  m=128:  error ~ 0.58 / sqrt(128/4) = 0.58 / 5.7 = 0.10

At m=64 (a practical setting), the error is ~4x smaller than at m=4.
THE KEY PROPERTY: THE ESTIMATOR IS UNBIASED
════════════════════════════════════════════
E[correction] = <Q, e_K> exactly.

"Unbiased" means: if you could repeat the QJL estimate many times
with different random sign matrices S, the AVERAGE of all estimates
would equal the true correction exactly. Any single estimate may
overshoot or undershoot, but there's no systematic lean in one direction.

This is the critical difference from PolarQuant's bias:
  PolarQuant: every estimate is biased the SAME way (always too high or too low)
  QJL:        each estimate has random error that averages to zero

The systematic bias from PolarQuant is replaced by zero-mean random
noise from QJL. Over many tokens in the KV cache, these random
errors average out.

5.3 Putting It Together: The Full TurboQuant Estimator

The final attention score estimate combines PolarQuant and QJL:

TURBOQUANT ATTENTION SCORE:
═══════════════════════════
True score:      <Q, K>
TurboQuant:      <Q, K_hat> + QJL_correction
                 ───────────   ──────────────
                 from PolarQuant   from QJL sign bits
                 (biased)          (unbiased correction)

The PolarQuant term <Q, K_hat> is biased (as we showed: -0.12 for K1).
The QJL correction estimates <Q, K - K_hat> = <Q, e_K> without bias.

In expectation:
  E[TurboQuant score] = <Q, K_hat> + E[QJL_correction]
                      = <Q, K_hat> + <Q, e_K>
                      = <Q, K_hat + e_K>
                      = <Q, K>

TurboQuant recovers the TRUE inner product in expectation.
The remaining error is zero-mean noise that decreases as m grows.
MEMORY BUDGET AT d = 128:
═════════════════════════
PolarQuant (3-bit):
  128 coordinates x 3 bits = 384 bits
  1 norm (FP16)            = 16 bits
  Subtotal: 400 bits

QJL (m = 64 sign bits):
  64 sign bits             = 64 bits

Total per vector: 400 + 64 = 464 bits
Per coordinate: 464 / 128 = 3.625 bits

The paper reports 3.5 bits achieves quality neutrality.
This is consistent with m ~ 48-64 sign bits
(384 + 16 + 48 = 448 bits → 448/128 = 3.5 bits exactly).
The asymmetric estimator. During inference, Q is in full precision (just computed from the current token) while K is quantized (from the cache). TurboQuant exploits this: the full-precision Q is projected through S on-the-fly (cheap — one matrix multiply for a single vector), while K's sign bits are pre-stored. This asymmetry is why the estimator works so well — the high-precision Q side compensates for the low-precision K side.
No rotation needed at read time. As we showed in Section 4.4, the cached keys stay in the rotated, quantized form. The query is rotated to match (one cheap matrix multiply). The QJL correction is computed from the stored sign bits and the on-the-fly projection of Q. The entire read path is: rotate Q (32K ops) + dequantize K centroids (128 lookups) + compute score + QJL correction. All negligible compared to the attention computation itself.

6. The Numbers

The paper tests TurboQuant on Gemma, Mistral, and Llama-3.1-8B-Instruct across five long-context benchmarks:

BENCHMARK RESULTS:
══════════════════
                    TurboQuant    TurboQuant    Full Cache
                    3.5-bit       2.5-bit       (FP16)
                    ─────────     ─────────     ──────────
LongBench          50.06         49.44         50.06
Needle in Haystack 100.0         99.8          100.0
ZeroSCROLLS        matches       near-match    baseline
RULER              matches       near-match    baseline
L-Eval             matches       near-match    baseline

At 3.5 bits: IDENTICAL to full-precision on LongBench (50.06 = 50.06)
and perfect on Needle in a Haystack (100.0 = 100.0).

At 2.5 bits: marginal degradation.
  LongBench: 49.44 vs 50.06 = -1.2% relative.
  NIAH: 99.8 vs 100.0 = one missed needle in 500 tests.

How does TurboQuant compare to existing KV cache compression methods? A key distinction is whether the method is data-oblivious: meaning it doesn't need to see any data to set its parameters. TurboQuant's rotation matrix is random (no data needed), and its quantizer centroids are precomputed from the known mathematical distribution (no calibration needed). Compare this to methods that require running the model on representative data to calibrate their parameters.

COMPARISON VS EXISTING KV CACHE COMPRESSION METHODS:
════════════════════════════════════════════════════
Method        Needs training?    Unbiased?    Compression    Attn speedup
──────        ───────────────    ─────────    ───────────    ────────────
TurboQuant    No (data-oblivious) Yes (QJL)  4.6-6.4x       8x (logits)
KIVI          Calibration data   No          ~4x             ~4x
SnapKV        Finetuning         No          2-4x            2-4x
DuQuant       Calibration data   Partial     ~4x             ~4x
The 8x speedup needs context. The 8x number measures attention logit computation specifically: computing Q·K scores with 4-bit TurboQuant keys vs 32-bit unquantized keys on H100. This is NOT 8x end-to-end inference speedup. The full inference pipeline includes weight loading (often the bottleneck during decode), FFN computation, softmax, and value aggregation. For long-context prefill where attention dominates, the wall-clock improvement could be 2-4x. For short-context decode where weight loading dominates, the improvement is smaller.

7. What It Means for Our LLM Inference Setup

Does this mean we don't need 32 GPUs anymore? Let's revisit the memory budget from Parts 1-5 with TurboQuant applied. Same setup: Llama-70B, 1M tokens, 32 A100-80GB GPUs.

REVISED MEMORY BUDGET (32 GPUs, Llama-70B, 1M tokens):
═══════════════════════════════════════════════════════
                        FP16           TurboQuant 3.5-bit
                        (Parts 1-5)    (with TQ applied)
                        ───────────    ─────────────────
Weights per GPU (FSDP)  4.38 GB        4.38 GB  (unchanged)
KV cache per GPU        10.24 GB       2.24 GB  (4.57x smaller)
Activations per GPU     0.50 GB        0.50 GB  (unchanged)
Overhead                2.00 GB        2.00 GB
──────────────────────────────────────────────────────────
Total per GPU           17.12 GB       9.12 GB
Headroom (80 GB)        62.88 GB       70.88 GB

Derivation of KV cache per GPU with TQ:
  71.68 GB total / 32 GPUs = 2.24 GB per GPU  ✓

BATCH CAPACITY (how many concurrent 1M-token requests):
  Fixed cost per GPU: 4.38 + 2.00 = 6.38 GB
  Per-request cost:
    FP16:    10.24 + 0.50 = 10.74 GB/request
    TQ 3.5:  2.24 + 0.50 =  2.74 GB/request

  Max batch = floor((80 - 6.38) / per_request_cost)
    FP16:    floor(73.62 / 10.74) = floor(6.85) = 6 requests
    TQ 3.5:  floor(73.62 / 2.74)  = floor(26.87) = 26 requests

  4.3x more concurrent requests on the same 32-GPU hardware.

Or flip it: how many GPUs do we need for the same workload?

GPU COUNT FOR LLAMA-70B, 1M TOKENS, BATCH=1:
═════════════════════════════════════════════
                    FP16        TQ 3.5-bit
                    ────        ──────────
Weights             140.00 GB   140.00 GB
KV cache            327.68 GB    71.68 GB
Total               467.68 GB   211.68 GB
Min GPUs (80 GB)    6           3

For KV cache alone:
  FP16: 327.68 / 80 = 4.10 → need 5 GPUs
  TQ:    71.68 / 80 = 0.90 → fits on 1 GPU

The KV cache fits on a SINGLE GPU with TurboQuant.
At moderate context lengths (128K tokens):
  FP16 KV cache: 327.68 x (128K/1M) = 41.94 GB → 1 GPU (tight)
  TQ KV cache:   71.68 x (128K/1M)  = 9.18 GB  → 1 GPU (easy)
The bottleneck shifts. TurboQuant doesn't eliminate the need for parallelism — you still need FSDP or TP for the 140 GB of weights, and sequence parallelism for the O(n²) attention computation at long contexts. But it removes the KV cache as the binding memory constraint. Before TurboQuant, the question was "where do I store 328 GB of KV data?" After TurboQuant, the question is "how fast can I compute attention?" That's a fundamentally different — and more tractable — engineering problem.

8. The Caveats

The results are impressive. But several important limitations deserve attention before you rewrite your infrastructure plans.

Only tested on 8B models. The published benchmarks cover Gemma, Mistral, and Llama-3.1-8B-Instruct. These are small models with d_head = 128. Whether "zero accuracy loss" holds on 70B+ models (which may have different outlier distributions), mixture-of-experts architectures, or models with 1M+ token context windows is undemonstrated. The paper's own results are limited to models 10x smaller than the Llama-70B we've been analyzing throughout this series.
The 8x speedup is for attention logits only. As discussed in Section 6, this measures Q·K computation specifically, not end-to-end inference. The full inference pipeline includes weight loading (often the bottleneck during decode), FFN computation, softmax, and value aggregation. For long-context prefill where attention dominates, the wall-clock improvement could be 2-4x. For short-context decode where weight loading dominates, the improvement is smaller.
No official code released. Google published the paper and blog post but has not released an implementation. Community developers have built working versions from the math (PyTorch with Triton kernels, MLX for Apple Silicon, C/CUDA for llama.cpp), and early results match the paper's claims. But these are early efforts, not production-ready integrations with vLLM, TensorRT-LLM, or SGLang.
Not deployed in any Google product (that we know of). The blog post is from Google Research, not Google Cloud or Gemini. Research papers from Google frequently describe techniques that take months or years to ship, if they ship at all.

8.1 Near the Shannon Limit

The paper makes a striking theoretical claim: TurboQuant is within a factor of ~2.7 of the information-theoretic lower bound on distortion. What does this mean?

In 1948, Claude Shannon proved that for any communication channel (or compression scheme), there is a fundamental limit on how much you can compress data before the error becomes unavoidable. This is the Shannon limit — a hard floor set by the laws of information theory, not by any particular algorithm.

For quantization, the Shannon limit says: if you store a value using b bits, the minimum possible mean squared error (MSE) is bounded below by a function of b and the distribution of the values. No algorithm — no matter how clever — can beat this bound.

THE SHANNON LIMIT FOR QUANTIZATION:
═══════════════════════════════════
For a d-dimensional vector quantized to b bits per coordinate:

  MSE_min ≥ c(d) x 2^(-2b)

  where c(d) depends on the distribution of the input values.

At b = 3.5 bits (TurboQuant's operating point):
  2^(-2 x 3.5) = 2^(-7) = 1/128 = 0.0078

The paper shows TurboQuant achieves MSE ≈ 2.7 x MSE_min.
This means:
  - TurboQuant is 2.7x away from the theoretical best.
  - A perfect (impossible-to-build) quantizer would be 2.7x better.
  - The remaining room for improvement is at most 2.7x — not 10x or 100x.

In practical terms: going from TurboQuant to the Shannon limit
would save at most log2(2.7) ≈ 1.4 additional bits per value.
That's the difference between 3.5 bits and ~2.1 bits.
The low-hanging fruit has been picked.
What this means for the field. KV cache compression is approaching its theoretical limit. TurboQuant has captured most of the available gains. Future improvements will be incremental — fractions of a bit — not the 4-5x leaps we saw going from FP16 to TurboQuant. The next major reduction in inference memory is more likely to come from architectural changes (fewer KV heads, shorter effective contexts, retrieval-augmented generation) than from better compression of the same data.
The memory stock reaction may be overblown. TurboQuant compresses the KV cache, which is one component of AI memory usage. Model weights (140 GB for Llama-70B), activations, optimizer states (for training), and the growing demand for longer contexts and larger batches all still consume memory. A 6x reduction in one component doesn't translate to 6x less total memory demand. And the industry's response to cheaper inference is typically to serve more requests, not to buy fewer GPUs. The demand curve may shift: less memory per request, but more requests served on the same hardware.

9. Summary

Key Numbers

MetricValue
KV cache compression4.57x (3.5-bit) to 6.40x (2.5-bit)
Attention logit speedup (H100)8x (4-bit vs 32-bit keys)
Accuracy loss (3.5-bit)Zero on LongBench, NIAH, and 3 other benchmarks
Accuracy loss (2.5-bit)-1.2% relative on LongBench
Fine-tuning requiredNone (data-oblivious)
Calibration data requiredNone
Effective bits (PolarQuant 3-bit + QJL)~3.5 bits per value
Distance from Shannon limit~2.7x factor
Models testedGemma, Mistral, Llama-3.1-8B only
VenueICLR 2026 (April 23-25)

What We Covered

ConceptWhat it doesKey insight
Outlier problemKV cache dims have 1000x range differencesNaive quantization destroys small-value dimensions
Normalization overheadPer-dim scale factors cost 2 bytes eachAt low bit-widths, scales dominate storage (89% at 2-bit)
PolarQuantRandom rotation + scalar quantizationRotation spreads outliers; no per-dim constants needed
QJL1-bit sign projection of residualCreates unbiased inner product estimator; ~0.5 bits overhead
TurboQuantPolarQuant + QJL combinedNear-optimal compression at 3.5 bits with zero accuracy loss
Impact on our setupKV cache: 10.24 → 2.24 GB/GPUMax batch jumps from 6 to 26 on same 32-GPU hardware

TurboQuant is a genuinely elegant algorithm. The random rotation trick is simple (one matrix multiply), the math is clean (known distribution, precomputed quantizer), and the results are strong (zero loss at 3.5 bits). Whether it reshapes the memory industry depends on factors the paper doesn't address: scaling to large models, production integration, and whether the AI industry's appetite for memory grows faster than compression can shrink it. But as a technical contribution, it's the most significant KV cache compression result of 2026 — and it's within striking distance of the theoretical limit.

Beyond KV cache. The paper also applies TurboQuant to nearest neighbor search (NNS), where it outperforms product quantization — the standard method used in vector databases like FAISS — while reducing indexing time to near zero (no training pass over the data needed). The same random-rotation trick that makes KV cache values uniform also makes embedding vectors uniform, enabling efficient approximate search. This is a significant second application that we haven't covered here.

References

  1. A. Zandieh, M. Daliri, M. Hadian, V. Mirrokni. TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. ICLR 2026. arxiv.org/abs/2504.19874
  2. I. Han, P. Kacham, A. Karbasi, V. Mirrokni, A. Zandieh. PolarQuant: Quantizing KV Caches with Polar Transformation. AISTATS 2026. arxiv.org/abs/2502.02617
  3. A. Zandieh, M. Daliri, I. Han. QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead. AAAI 2025. arxiv.org/abs/2406.03482
  4. Google Research Blog. TurboQuant: Compressing KV-Cache for Efficient LLM Inference. March 2026. turboquant.net
  5. D. Fofadiya. The Illustrated Guide to LLM Inference at Scale. Parts 1–4. darshanfofadiya.com/llm-inference

Cite this article

@misc{fofadiya2026turboquant,
  author       = {Darshan Fofadiya},
  title        = {An Illustrated Deep Dive into TurboQuant: PolarQuant, QJL, and KV Cache Compression from First Principles},
  year         = {2026},
  url          = {https://darshanfofadiya.com/research-papers/turboquant/}
}

Get notified when I publish new deep dives

Illustrated guides on LLM inference, quantum computing, and AI research papers.

⚠ After subscribing, check your inbox for a confirmation email. You won't receive posts until you confirm.


Comments