

Part 3: Sequence Sharding (Ulysses)

Splitting 1M tokens across 8 GPUs with All-to-All

By Darshan Fofadiya


In Part 2, we solved the weight problem: FSDP distributes 140 GB of weights across 8 GPUs at 17.5 GB each. But two problems from Part 1 remain unsolved: activation memory (20.5 GB of Q, K, V at 1M tokens) and the ever-growing KV cache.

FSDP shards weights. Can we shard the sequence instead? That's exactly what Ulysses does. In this part, we'll build up the idea step by step:

  1. First, we'll see how splitting the sequence reduces activation memory
  2. Then, we'll understand why attention breaks when tokens are on different GPUs
  3. Next, we'll learn the All-to-All trick that fixes it
  4. Then, we'll compare this to Tensor Parallelism — the "obvious" alternative — and see why Ulysses wins
  5. Finally, we'll see the full Ulysses pipeline and its limitations
Where we are: We identified 3 bottlenecks in Part 1. FSDP solved #1 (weights). Now we tackle #2 (activations). Ring Attention (Part 4) will handle #3 (KV cache).

3.1 Sequence Sharding: The Core Idea

The idea is simple: instead of every GPU processing all 1M tokens, each GPU processes only 1/N of the tokens.

3.1.1 How the Sequence Gets Split

With 8 GPUs and 1M tokens, each GPU gets a contiguous chunk:

Total sequence: 1,000,000 tokens
Number of GPUs: 8

GPU 0: tokens [0,       124,999]     → 125,000 tokens
GPU 1: tokens [125,000, 249,999]     → 125,000 tokens
GPU 2: tokens [250,000, 374,999]     → 125,000 tokens
...
GPU 7: tokens [875,000, 999,999]     → 125,000 tokens
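
The chunk boundaries follow from plain integer division. A minimal sketch (the helper `shard_ranges` is illustrative, not from any framework):

```python
def shard_ranges(seq_len: int, num_gpus: int) -> list[tuple[int, int]]:
    """Return the inclusive [start, end] token range for each GPU rank."""
    chunk = seq_len // num_gpus  # assumes seq_len is divisible by num_gpus
    return [(rank * chunk, (rank + 1) * chunk - 1) for rank in range(num_gpus)]

ranges = shard_ranges(1_000_000, 8)
print(ranges[0])  # (0, 124999)
print(ranges[7])  # (875000, 999999)
```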

3.1.2 Memory Savings

Each GPU now computes Q, K, V only for its local 125K tokens:

BEFORE (single GPU, full sequence):
  Q: [1, 1,000,000, 8192] × 2 bytes = 16.38 GB   (64 query heads)
  K: [1, 1,000,000, 1024] × 2 bytes = 2.05 GB    (8 KV heads, GQA)
  V: [1, 1,000,000, 1024] × 2 bytes = 2.05 GB    (8 KV heads, GQA)
  Total: 20.48 GB  ← on top of weights that already don't fit!

AFTER (8 GPUs, sharded sequence):
  Q: [1, 125,000, 8192] × 2 bytes = 2.05 GB
  K: [1, 125,000, 1024] × 2 bytes = 0.26 GB
  V: [1, 125,000, 1024] × 2 bytes = 0.26 GB
  Total per GPU: 2.56 GB  ← fits easily!
8× memory reduction: Activation memory drops from 20.5 GB to ~2.6 GB per GPU. Combined with FSDP's 17.5 GB weight shard, each GPU now uses ~20 GB — well within the A100's 80 GB budget.

3.1.3 What About the FFN?

The feed-forward network (FFN) processes each token independently — there's no interaction between tokens. This means sequence sharding works perfectly for FFN with zero communication:

FFN computation (per token, independent):
  hidden = SiLU(x @ W_gate) * (x @ W_up)
  output = hidden @ W_down

GPU 0 processes tokens 0-124,999 through FFN → no communication needed
GPU 1 processes tokens 125,000-249,999 through FFN → no communication needed
...
GPU 7 processes tokens 875,000-999,999 through FFN → no communication needed

FFN is "embarrassingly parallel" across the sequence dimension. Each GPU applies the same weights to its local tokens independently. This is a huge advantage of sequence sharding.
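
This independence is easy to verify: running the FFN shard-by-shard and concatenating matches running it on the full sequence. A NumPy sketch with toy dimensions (weight names mirror the pseudocode above):

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_model, d_ff, n_gpus = 16, 8, 32, 4

x = rng.standard_normal((seq, d_model))
w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))

def silu(z):
    return z / (1.0 + np.exp(-z))

def ffn(x):
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Full sequence on one device vs. shards processed independently:
full = ffn(x)
sharded = np.concatenate([ffn(chunk) for chunk in np.split(x, n_gpus)])
assert np.allclose(full, sharded)  # identical: no cross-token interaction
```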

But there's a problem. The FFN is fine — but what about attention?


3.2 The Attention Problem

Attention is fundamentally different from FFN. In attention, every token needs to attend to every other token:

Attention(Q, K, V) = softmax(Q @ Kᵀ / √d) @ V

For token 0 to compute its attention output:
  - It needs Q₀ (its own query)           ← GPU 0 has this
  - It needs K for ALL 1M tokens           ← spread across 8 GPUs!
  - It needs V for ALL 1M tokens           ← spread across 8 GPUs!

3.2.1 What Happens with Local-Only Attention

If each GPU only uses its local K and V, we get partial attention — each token only attends to tokens on the same GPU:

LOCAL-ONLY ATTENTION (WRONG):
─────────────────────────────
GPU 0: tokens 0-124,999 attend to tokens 0-124,999 only
GPU 1: tokens 125,000-249,999 attend to tokens 125,000-249,999 only
...

Token 0 CANNOT see token 999,999!
Token 124,999 CANNOT see token 125,000 (its immediate neighbor)!

This is fundamentally wrong. The whole point of attention is that any token can attend to any other token. Local-only attention would be like reading a book but only being able to reference the current chapter — you'd lose all cross-chapter context.

3.2.2 The Attention Matrix View

Let's visualize what the attention matrix looks like with sequence sharding:

FULL ATTENTION MATRIX (what we need):
─────────────────────────────────────
         K₀      K₁      K₂      K₃      K₄      K₅      K₆      K₇
        (GPU0)  (GPU1)  (GPU2)  (GPU3)  (GPU4)  (GPU5)  (GPU6)  (GPU7)
Q₀ (GPU0) [████    ████    ████    ████    ████    ████    ████    ████]
Q₁ (GPU1) [████    ████    ████    ████    ████    ████    ████    ████]
Q₂ (GPU2) [████    ████    ████    ████    ████    ████    ████    ████]
Q₃ (GPU3) [████    ████    ████    ████    ████    ████    ████    ████]
Q₄ (GPU4) [████    ████    ████    ████    ████    ████    ████    ████]
Q₅ (GPU5) [████    ████    ████    ████    ████    ████    ████    ████]
Q₆ (GPU6) [████    ████    ████    ████    ████    ████    ████    ████]
Q₇ (GPU7) [████    ████    ████    ████    ████    ████    ████    ████]

LOCAL-ONLY ATTENTION (what each GPU can compute):
─────────────────────────────────────────────────
         K₀      K₁      K₂      K₃      K₄      K₅      K₆      K₇
Q₀ (GPU0) [████    ----    ----    ----    ----    ----    ----    ----]
Q₁ (GPU1) [----    ████    ----    ----    ----    ----    ----    ----]
Q₂ (GPU2) [----    ----    ████    ----    ----    ----    ----    ----]
...
Q₇ (GPU7) [----    ----    ----    ----    ----    ----    ----    ████]

Only the diagonal blocks! We're missing 7/8 of the attention matrix.

We need a way for each GPU to compute attention across the full sequence. There are two approaches:

  1. Gather all K, V to every GPU — but that defeats the purpose of sharding (back to 20.5 GB per GPU)
  2. Reshape the problem — redistribute the data so each GPU has what it needs

Ulysses takes approach #2 with a clever trick: the All-to-All operation.


3.3 The All-to-All Trick

Here's Ulysses's key insight: we don't need every GPU to have all K and V. We just need each GPU to compute complete attention for some heads.

Remember, multi-head attention splits the computation into independent heads:

Llama-70B has 64 query heads.
Each head computes attention independently:
  Head 0: Attention(Q_head0, K_head0, V_head0)
  Head 1: Attention(Q_head1, K_head1, V_head1)
  ...
  Head 63: Attention(Q_head63, K_head63, V_head63)

These heads don't interact with each other during attention!

So instead of each GPU having all heads for partial tokens, we can rearrange so each GPU has partial heads for all tokens:

BEFORE All-to-All (sequence-sharded):
─────────────────────────────────────
GPU 0: Q for tokens 0-124K,     ALL 64 heads
GPU 1: Q for tokens 125K-249K,  ALL 64 heads
...
GPU 7: Q for tokens 875K-999K,  ALL 64 heads

Each GPU: [batch, 125K tokens, 64 heads, 128 dim]

AFTER All-to-All (head-sharded):
────────────────────────────────
GPU 0: Q for ALL 1M tokens,  heads 0-7
GPU 1: Q for ALL 1M tokens,  heads 8-15
...
GPU 7: Q for ALL 1M tokens,  heads 56-63

Each GPU: [batch, 1M tokens, 8 heads, 128 dim]
Key insight: The total data doesn't change — it's just redistributed. Before: each GPU has 1/8 of tokens, all heads. After: each GPU has all tokens, 1/8 of heads. Same memory, different layout.
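
The redistribution can be simulated in NumPy by treating one array axis as the "GPU" axis (toy sizes; a real implementation would use a collective such as PyTorch's `all_to_all_single`):

```python
import numpy as np

n_gpus, tok_per_gpu, n_heads, d = 4, 6, 8, 2
heads_per_gpu = n_heads // n_gpus
rng = np.random.default_rng(1)

# q[g] is GPU g's local tensor: [tokens_per_gpu, all heads, head_dim]
q = rng.standard_normal((n_gpus, tok_per_gpu, n_heads, d))

def all_to_all_seq_to_head(q):
    """Simulated All-to-All: GPU g ends up with ALL tokens for its head slice."""
    out = []
    for g in range(n_gpus):
        h0, h1 = g * heads_per_gpu, (g + 1) * heads_per_gpu
        # gather head slice [h0:h1] from every GPU's chunk, concat along tokens
        out.append(np.concatenate([q[src][:, h0:h1] for src in range(n_gpus)], axis=0))
    return np.stack(out)  # [n_gpus, all tokens, heads_per_gpu, head_dim]

q_head_sharded = all_to_all_seq_to_head(q)
assert q_head_sharded.shape == (n_gpus, n_gpus * tok_per_gpu, heads_per_gpu, d)
assert q_head_sharded.size == q.size  # pure redistribution: same total data
```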

3.3.1 Why This Works

After the All-to-All, each GPU has the complete sequence for its assigned heads. Now it can compute full attention:

GPU 0 after All-to-All:
  Q: [1, 1M, 8, 128]   ← all tokens, query heads 0-7
  K: [1, 1M, 1, 128]   ← all tokens, KV head 0 (GQA)
  V: [1, 1M, 1, 128]   ← all tokens, KV head 0 (GQA)

  Attention for head 0: softmax(Q₀ @ K₀ᵀ / √128) @ V₀
    → Q₀ is [1M, 128], K₀ is [1M, 128]
    → Q₀ @ K₀ᵀ = [1M, 1M]  ← full attention matrix for this head!
    → Uses FlashAttention, so never materialized

  Each GPU computes COMPLETE attention for its 8 heads. ✓

3.3.2 A Concrete Example with 8 GPUs

Let's trace through with our Llama-70B setup. We have 8 GPUs, 8 KV heads, and 1M tokens (125K per GPU):

SETUP:
  8 GPUs, 8 KV heads (GQA), 64 query heads
  Each GPU starts with 125K tokens, all 8 KV heads

BEFORE All-to-All (for K tensor — same logic applies to V):
────────────────────────────────────────────────────────────
GPU 0: tokens [0-124K],   KV heads [H0, H1, H2, H3, H4, H5, H6, H7]
GPU 1: tokens [125K-249K], KV heads [H0, H1, H2, H3, H4, H5, H6, H7]
GPU 2: tokens [250K-374K], KV heads [H0, H1, H2, H3, H4, H5, H6, H7]
...
GPU 7: tokens [875K-999K], KV heads [H0, H1, H2, H3, H4, H5, H6, H7]

Each GPU has: [1, 125K, 8 heads, 128 dim]

Now the All-to-All redistributes: GPU 0 collects H0 from every GPU, GPU 1 collects H1, and so on:

ALL-TO-ALL COMMUNICATION:
─────────────────────────
GPU 0 keeps H0 for its tokens, sends H1→GPU1, H2→GPU2, ... H7→GPU7
GPU 1 sends H0→GPU0, keeps H1 for its tokens, sends H2→GPU2, ... H7→GPU7
GPU 2 sends H0→GPU0, H1→GPU1, keeps H2 for its tokens, ... H7→GPU7
...
GPU 7 sends H0→GPU0, H1→GPU1, ... H6→GPU6, keeps H7 for its tokens

Each GPU sends 7/8 of its data and receives 7/8 from others.

AFTER All-to-All:
─────────────────
GPU 0: ALL tokens [0-999K], head H0 only
  K = | tokens 0-124K H0   |  ← kept locally
      | tokens 125K-249K H0 |  ← received from GPU 1
      | tokens 250K-374K H0 |  ← received from GPU 2
      | ...                  |
      | tokens 875K-999K H0 |  ← received from GPU 7

GPU 1: ALL tokens [0-999K], head H1 only
GPU 2: ALL tokens [0-999K], head H2 only
...
GPU 7: ALL tokens [0-999K], head H7 only

Each GPU has: [1, 1M, 1 head, 128 dim]

3.3.3 Memory Stays the Same

Let's verify that the total memory per GPU doesn't change:

BEFORE All-to-All (per GPU):
  Q: [1, 125K tokens, 64 heads, 128 dim] × 2 bytes
  Elements: 125,000 × 64 × 128 = 1,024,000,000
  Memory: 1,024,000,000 × 2 = 2.05 GB

AFTER All-to-All (per GPU):
  Q: [1, 1M tokens, 8 heads, 128 dim] × 2 bytes
  Elements: 1,000,000 × 8 × 128 = 1,024,000,000
  Memory: 1,024,000,000 × 2 = 2.05 GB

Same! 125K × 64 = 1M × 8 = 8,000,000 head-token pairs per GPU

For K and V (GQA — 8 KV heads):
  Before: [1, 125K, 8, 128] = 0.26 GB per GPU
  After:  [1, 1M, 1, 128]   = 0.26 GB per GPU
  Also the same!

All-to-All is a pure redistribution — no data is created or destroyed. It's like rearranging cards in a deck: the total number of cards stays the same, but they're organized differently.
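
The equality in the element counts above is two lines of arithmetic:

```python
# Per-GPU Q elements before and after the All-to-All (Llama-70B, 8 GPUs)
before = 125_000 * 64 * 128    # 1/8 of the tokens, all 64 query heads
after = 1_000_000 * 8 * 128    # all tokens, 8 of the 64 query heads
assert before == after == 1_024_000_000
print(before * 2 / 1e9, "GB")  # 2.048 GB in BF16
```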


3.4 The Full Ulysses Forward Pass

Now let's trace through an entire transformer layer with Ulysses. We'll track the tensor shapes on each GPU at every step.

3.4.1 Step 1: Input (Sequence-Sharded)

Each GPU starts with its chunk of the input:

Input per GPU: X_local
  Shape: [batch, seq_len/N, d_model] = [1, 125K, 8192]
  Memory: 125,000 × 8192 × 2 bytes = 2.05 GB

3.4.2 Step 2: Compute Q, K, V Locally

Each GPU projects its local tokens into Q, K, V using the full weight matrices (gathered via FSDP). Let's trace the shapes:

Q_local = X_local @ Wq
  [1, 125K, 8192] @ [8192, 8192] = [1, 125K, 8192]
  This gives us 64 query heads × 128 dim = 8192

K_local = X_local @ Wk
  [1, 125K, 8192] @ [8192, 1024] = [1, 125K, 1024]
  This gives us 8 KV heads × 128 dim = 1024 (GQA — smaller!)

V_local = X_local @ Wv
  [1, 125K, 8192] @ [8192, 1024] = [1, 125K, 1024]
  Same as K — 8 KV heads × 128 dim = 1024

Reshape for multi-head:
Q_local: [1, 125K, 64 heads, 128 dim]
K_local: [1, 125K, 8 heads, 128 dim]
V_local: [1, 125K, 8 heads, 128 dim]

Memory per GPU:
  Q: 125K × 8192 × 2 bytes = 2.05 GB
  K: 125K × 1024 × 2 bytes = 0.26 GB
  V: 125K × 1024 × 2 bytes = 0.26 GB
  Total: 2.57 GB

3.4.3 Step 3: All-to-All (Sequence → Head Split)

Redistribute Q, K, V so each GPU has all tokens for a subset of heads:

All-to-All on Q:
  Before: [1, 125K, 64 heads, 128]  per GPU (partial tokens, all heads)
  After:  [1, 1M,   8 heads,  128]  per GPU (all tokens, partial heads)

All-to-All on K (with GQA — 8 KV heads across 8 GPUs):
  Before: [1, 125K, 8 heads, 128]   per GPU
  After:  [1, 1M,   1 head,  128]   per GPU (each GPU gets 1 KV head)

All-to-All on V:
  Before: [1, 125K, 8 heads, 128]   per GPU
  After:  [1, 1M,   1 head,  128]   per GPU

Memory per GPU after All-to-All:
  Q: 1M × 8 × 128 × 2 = 2.05 GB
  K: 1M × 1 × 128 × 2 = 0.26 GB
  V: 1M × 1 × 128 × 2 = 0.26 GB
  Total: 2.57 GB  ← same as before!

3.4.4 Step 4: Compute Attention (Per Head)

Now each GPU computes full attention for its assigned heads. With GQA, each KV head is shared by 8 query heads:

GPU 0 has: Q for heads 0-7, K for KV-head 0, V for KV-head 0

For each query head (0 through 7):
  Attention_h = softmax(Q_h @ K₀ᵀ / √128) @ V₀
  
  Q_h: [1M, 128]
  K₀:  [1M, 128]
  
  Using FlashAttention (tiled, never materializes [1M, 1M]):
    Output_h: [1M, 128]

All 8 query heads share the same K₀, V₀ (this is GQA!)

Output per GPU: [1, 1M, 8 heads, 128 dim]
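
Step 4 in miniature: a toy NumPy sketch of GQA-style attention where all query heads on a GPU share one K/V head. Unlike FlashAttention, this materializes the score matrix, which is only viable at toy sizes:

```python
import numpy as np

n_q_heads, seq, d = 8, 12, 4   # toy sizes: one GPU's share after All-to-All
rng = np.random.default_rng(2)
q = rng.standard_normal((n_q_heads, seq, d))  # 8 query heads
k = rng.standard_normal((seq, d))             # the single shared KV head (GQA)
v = rng.standard_normal((seq, d))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# All 8 query heads attend against the SAME k and v; broadcasting does the sharing.
scores = q @ k.T / np.sqrt(d)   # [8, seq, seq]; FlashAttention would tile this
out = softmax(scores) @ v       # [8, seq, d]
assert out.shape == (n_q_heads, seq, d)
```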

3.4.5 Step 5: All-to-All (Head → Sequence Split)

Now we need to reverse the redistribution — go back to sequence-sharded layout for the output projection and FFN:

All-to-All on attention output:
  Before: [1, 1M,   8 heads,  128]  per GPU (all tokens, partial heads)
  After:  [1, 125K, 64 heads, 128]  per GPU (partial tokens, all heads)

Reshape: [1, 125K, 8192]  (merge heads back into d_model)

3.4.6 Step 6: Output Projection + FFN (Local)

The rest of the layer is purely local — no communication needed:

Output projection (local):
  attn_out = reshaped_output @ Wo   → [1, 125K, 8192]

Add residual + LayerNorm (local)

FFN (local, per token):
  gate = x @ W_gate    → [1, 125K, 28672]
  up   = x @ W_up      → [1, 125K, 28672]
  hidden = SiLU(gate) * up
  output = hidden @ W_down  → [1, 125K, 8192]

Add residual + LayerNorm (local)

→ Pass to next layer (still sequence-sharded)

3.4.7 Complete Pipeline Summary

ULYSSES FORWARD PASS — ONE LAYER:
═════════════════════════════════

Step 1: Input                    [1, 125K, 8192]     ← sequence-sharded
Step 2: Compute Q, K, V         local matmuls        ← no communication
Step 3: All-to-All #1           seq→head split       ← COMMUNICATION
Step 4: Attention (FlashAttn)   per-head, full seq   ← no communication
Step 5: All-to-All #2           head→seq split       ← COMMUNICATION
Step 6: Output proj + FFN       local matmuls        ← no communication

Communication per layer: 2 × All-to-All
Everything else: purely local computation
Only 2 communication points per layer. The Q/K/V projections, attention computation, output projection, and entire FFN are all local. Communication is limited to the two All-to-All operations that reshape the data for attention.
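
The whole pipeline can be checked end-to-end at toy sizes: simulate the two All-to-Alls with array slicing and confirm the sharded computation reproduces single-device attention exactly. A NumPy sketch (one head per "GPU" for simplicity, no GQA):

```python
import numpy as np

# Toy Ulysses layer: 4 "GPUs", 4 heads (1 per GPU), 8 tokens per GPU.
n_gpus, tok, n_heads, d = 4, 8, 4, 2
seq = n_gpus * tok
rng = np.random.default_rng(3)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Vanilla per-head attention; q, k, v: [heads, seq, head_dim]."""
    return softmax(q @ np.swapaxes(k, -1, -2) / np.sqrt(d)) @ v

# Full tensors as a single device would hold them.
Q = rng.standard_normal((n_heads, seq, d))
K = rng.standard_normal((n_heads, seq, d))
V = rng.standard_normal((n_heads, seq, d))

# Steps 1-2: sequence-sharded, GPU g holds all heads for its token chunk.
def shard_seq(t):
    return [t[:, g * tok:(g + 1) * tok] for g in range(n_gpus)]

# Step 3: All-to-All #1, GPU g gathers head g for ALL tokens.
def a2a_seq_to_head(shards):
    return [np.concatenate([s[g:g + 1] for s in shards], axis=1)
            for g in range(n_gpus)]

q_h = a2a_seq_to_head(shard_seq(Q))
k_h = a2a_seq_to_head(shard_seq(K))
v_h = a2a_seq_to_head(shard_seq(V))

# Step 4: each "GPU" computes full attention for its head.
out_h = [attention(q_h[g], k_h[g], v_h[g]) for g in range(n_gpus)]

# Step 5: All-to-All #2, back to the sequence-sharded layout.
def a2a_head_to_seq(heads):
    return [np.concatenate([h[:, g * tok:(g + 1) * tok] for h in heads], axis=0)
            for g in range(n_gpus)]

out_seq = a2a_head_to_seq(out_h)

# Reassemble and compare: matches single-device attention.
assert np.allclose(np.concatenate(out_seq, axis=1), attention(Q, K, V))
```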

3.5 Why Not Split by Heads? (Tensor Parallelism)

You might be wondering: if we end up splitting by heads for attention anyway, why not split by heads from the start? That's exactly what Tensor Parallelism (Megatron-style) does. Let's understand it in detail, then see why Ulysses is better.

3.5.1 Tensor Parallelism: Column Split and Row Split

Tensor parallelism splits weight matrices across GPUs. There are two ways to split a matrix, and each has different consequences:

3.5.2 Column Split: No Communication Needed

When we split a weight matrix by columns, each GPU computes a different slice of the output. Let's trace through with concrete numbers:

COLUMN SPLIT EXAMPLE:
─────────────────────
Full computation: Y = X @ W
  X: [4, 8]    W: [8, 6]    Y: [4, 6]

Split W by columns across 2 GPUs:
  GPU 0 gets W₀ = W[:, :3]   shape [8, 3]  (first 3 columns)
  GPU 1 gets W₁ = W[:, 3:]   shape [8, 3]  (last 3 columns)

Both GPUs have the FULL input X: [4, 8]

GPU 0 computes: Y₀ = X @ W₀ = [4, 3]   ← first 3 columns of Y
GPU 1 computes: Y₁ = X @ W₁ = [4, 3]   ← last 3 columns of Y

To get full Y: just concatenate!
  Y = [Y₀ | Y₁] = [4, 6]   ← NO COMMUNICATION NEEDED

Why does this work? Each column of W produces one column of Y. If GPU 0 has columns 0-2 of W, it produces columns 0-2 of Y. Simple concatenation gives the full result.
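
The column-split example above can be checked directly in NumPy:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 6))

W0, W1 = W[:, :3], W[:, 3:]           # column split across 2 "GPUs"
Y0, Y1 = X @ W0, X @ W1               # each GPU: full X, half of W's columns
Y = np.concatenate([Y0, Y1], axis=1)  # concatenation only, no summation
assert np.allclose(Y, X @ W)
```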

3.5.3 Row Split: AllReduce Required

When we split a weight matrix by rows, something different happens. Each GPU computes a partial sum of the output:

ROW SPLIT EXAMPLE:
──────────────────
Full computation: Y = X @ W
  X: [4, 8]    W: [8, 6]    Y: [4, 6]

Split W by rows across 2 GPUs:
  GPU 0 gets W₀ = W[:4, :]   shape [4, 6]  (first 4 rows)
  GPU 1 gets W₁ = W[4:, :]   shape [4, 6]  (last 4 rows)

Each GPU also needs a DIFFERENT slice of X:
  GPU 0 uses X₀ = X[:, :4]   shape [4, 4]  (first 4 columns)
  GPU 1 uses X₁ = X[:, 4:]   shape [4, 4]  (last 4 columns)

GPU 0 computes: Y₀ = X₀ @ W₀ = [4, 6]   ← PARTIAL result
GPU 1 computes: Y₁ = X₁ @ W₁ = [4, 6]   ← PARTIAL result

To get full Y: must SUM!
  Y = Y₀ + Y₁ = [4, 6]   ← REQUIRES ALLREDUCE

When you split the inner dimension (rows of W = columns of X), each GPU computes a partial dot product. You must sum them to get the correct result. This sum across GPUs is an AllReduce.
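
The row-split claim can be checked the same way in NumPy; note the sum replacing concatenation:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 6))

W0, W1 = W[:4, :], W[4:, :]   # row split of W across 2 "GPUs"
X0, X1 = X[:, :4], X[:, 4:]   # matching column split of the input
Y0, Y1 = X0 @ W0, X1 @ W1     # each GPU holds a PARTIAL [4, 6] result
assert np.allclose(Y0 + Y1, X @ W)  # this sum is what the AllReduce computes
```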


3.6 Tensor Parallelism in a Transformer Layer

Now let's see how column split and row split are used together in a real transformer layer. This is the Megatron-LM approach.

3.6.1 Attention Block with Tensor Parallelism

The attention block uses column split for Q, K, V projections. This is natural because each head is independent — column split gives each GPU its own heads:

ATTENTION BLOCK (Tensor Parallel, 8 GPUs):
══════════════════════════════════════════

Input X: [1, 1M, 8192]  ← REPLICATED on all GPUs

STEP 1: Q, K, V Projections (COLUMN SPLIT)
──────────────────────────────────────────
  Wq is [8192, 8192]. Split by columns into 8 shards:
    GPU 0: Wq₀ = Wq[:, :1024]     shape [8192, 1024]  → 8 heads
    GPU 1: Wq₁ = Wq[:, 1024:2048] shape [8192, 1024]  → 8 heads
    ...
    GPU 7: Wq₇ = Wq[:, 7168:]     shape [8192, 1024]  → 8 heads

  Each GPU computes:
    Q_local = X @ Wq_local   → [1, 1M, 1024]  (8 of 64 heads)

  Same for K and V (column split).
  NO COMMUNICATION needed — column split produces independent slices!

STEP 2: Attention Computation (LOCAL)
─────────────────────────────────────
  Each GPU computes attention for its 8 heads. Heads are independent.

Now here's where it gets interesting. After attention, each GPU has an output of shape [1, 1M, 1024] — only 1/8 of the full hidden dimension. We need to project this back to the full d_model = 8192 using the output projection Wo.

Why is Wo a row split? Because of dimensional compatibility:

STEP 3: Output Projection — WHY ROW SPLIT?
──────────────────────────────────────────
  After attention, each GPU has: attn_out_local = [1, 1M, 1024]
  Wo is [8192, 8192]. We need to map back to d_model = 8192.

  Option A — Column split Wo:
    GPU 0 would get Wo[:, :1024] = [8192, 1024]
    Input needed: [1, 1M, 8192]  ← GPU 0 only has [1, 1M, 1024]!
    DOESN'T WORK without gathering first.

  Option B — Row split Wo (what we actually do):
    GPU 0 gets Wo[:1024, :] = [1024, 8192]
    Input needed: [1, 1M, 1024]  ← GPU 0 already has this! ✓

  So row split is the natural choice:
    GPU 0: O₀ = attn_out₀ @ Wo₀   → [1, 1M, 8192]  (PARTIAL result)
    GPU 1: O₁ = attn_out₁ @ Wo₁   → [1, 1M, 8192]  (PARTIAL result)
    ...
    GPU 7: O₇ = attn_out₇ @ Wo₇   → [1, 1M, 8192]  (PARTIAL result)

  Full output = O₀ + O₁ + O₂ + ... + O₇

  ⚠️  ALLREDUCE #1: Sum [1, 1M, 8192] across 8 GPUs
  Data: 1M × 8192 × 2 bytes = 16.38 GB
The column→row pattern: Column split on Q, K, V produces partial hidden dimensions that naturally feed into row split on Wo. The AllReduce at the end is the unavoidable cost of recombining the partial results.

3.6.2 FFN Block with Tensor Parallelism

The FFN block follows the same pattern — column split for the up-projections, row split for the down-projection:

FFN BLOCK (Tensor Parallel, 8 GPUs, SwiGLU):
═════════════════════════════════════════════

Input: [1, 1M, 8192]  ← result from attention AllReduce (replicated)

STEP 4: Gate and Up Projections (COLUMN SPLIT)
───────────────────────────────────────────────
  W_gate is [8192, 28672]. Split by columns:
    GPU 0: W_gate₀ = W_gate[:, :3584]  shape [8192, 3584]
    ...

  Each GPU computes:
    gate_local = X @ W_gate_local   → [1, 1M, 3584]
    up_local   = X @ W_up_local     → [1, 1M, 3584]
    hidden_local = SiLU(gate_local) * up_local  → [1, 1M, 3584]

  NO COMMUNICATION — column split!

STEP 5: Down Projection (ROW SPLIT) ← PROBLEM AGAIN
─────────────────────────────────────────────────────
  W_down is [28672, 8192]. Split by ROWS:
    GPU 0: W_down₀ = W_down[:3584, :]  shape [3584, 8192]
    ...

  Each GPU computes:
    ffn_out_local = hidden_local @ W_down_local  → [1, 1M, 8192]  (PARTIAL!)

  To get the correct output:
    ffn_out = ffn_out₀ + ffn_out₁ + ... + ffn_out₇

  ⚠️  ALLREDUCE #2: Sum [1, 1M, 8192] across 8 GPUs
  Data: 1M × 8192 × 2 bytes = 16.38 GB
Tensor Parallelism requires 2 AllReduces per layer. One after the attention output projection, one after the FFN down projection. Each AllReduce communicates the full hidden state: [1, 1M, 8192] = 16.38 GB.

3.7 Ulysses vs Tensor Parallelism: The Numbers

Now let's do the detailed math to compare communication costs. This is where Ulysses really shines.

3.7.1 Tensor Parallelism Communication Cost

Each AllReduce needs to sum partial results across 8 GPUs and give every GPU the full sum. The efficient way to do this is in two phases:

HOW ALLREDUCE WORKS (for the output projection sum):
════════════════════════════════════════════════════

Each GPU has a partial result: O_local = [1, 1M, 8192]  (16.38 GB)
We need: O_full = O₀ + O₁ + O₂ + ... + O₇ on EVERY GPU.

NAIVE APPROACH (Reduce + Broadcast):
  Step 1: Send all data to GPU 0, sum there → GPU 0 has O_full
  Step 2: GPU 0 broadcasts O_full to everyone
  Problem: Only 1 GPU active at a time. Sequential. Slow.

EFFICIENT APPROACH (ReduceScatter + AllGather):
  Phase 1 — ReduceScatter:
    Split each GPU's 16.38 GB into 8 chunks of 2.05 GB.
    GPU 0 is responsible for summing chunk 0 from all GPUs.
    GPU 1 is responsible for summing chunk 1 from all GPUs.
    ...
    All GPUs work in PARALLEL — each one sums a different chunk.
    
    After N-1 ring steps:
      GPU 0 has: fully summed chunk 0 (2.05 GB)
      GPU 1 has: fully summed chunk 1 (2.05 GB)
      ...
    
    Data sent per GPU: 16.38 × (N-1)/N = 16.38 × 7/8 = 14.33 GB

  Phase 2 — AllGather:
    Now each GPU shares its summed chunk with all others.
    (Same as FSDP's AllGather from Part 2!)
    
    After N-1 ring steps:
      Every GPU has all 8 summed chunks = full O_full (16.38 GB)
    
    Data sent per GPU: 14.33 GB

  Total per AllReduce: 14.33 + 14.33 = 28.67 GB per GPU

Now let's calculate the full cost:

TENSOR PARALLELISM — Communication per layer:
══════════════════════════════════════════════

AllReduce #1 (after attention output projection):
  Data: [1, 1M, 8192] × 2 bytes = 16.38 GB
  Volume per GPU: 28.67 GB

AllReduce #2 (after FFN down projection):
  Data: [1, 1M, 8192] × 2 bytes = 16.38 GB
  Volume per GPU: 28.67 GB

Total per layer: 28.67 + 28.67 = 57.34 GB
Total for 80 layers: 57.34 × 80 = 4,587 GB ≈ 4.6 TB

Time (NVLink 300 GB/s per direction):
  Per AllReduce: 28.67 GB ÷ 300 GB/s = 95.6 ms
  Per layer: 2 × 95.6 = 191.2 ms
  Total (80 layers): 191.2 × 80 = 15.3 seconds

3.7.2 Ulysses Communication Cost

ULYSSES — Communication per layer:
══════════════════════════════════

All-to-All #1 (before attention, on Q, K, V):
  Q per GPU before: [1, 125K, 64, 128] × 2 bytes = 2.05 GB
  K per GPU before: [1, 125K, 8, 128] × 2 bytes = 0.26 GB
  V per GPU before: [1, 125K, 8, 128] × 2 bytes = 0.26 GB
  
  Each GPU sends (N-1)/N of its data:
  Q sent: 2.05 × 7/8 = 1.79 GB
  K sent: 0.26 × 7/8 = 0.22 GB
  V sent: 0.26 × 7/8 = 0.22 GB
  Total sent: 2.23 GB

All-to-All #2 (after attention, on output):
  Output per GPU: [1, 1M, 8, 128] × 2 bytes = 2.05 GB
  Sent: 2.05 × 7/8 = 1.79 GB

Total per layer: 2.23 + 1.79 = 4.02 GB
Total for 80 layers: 4.02 × 80 = 322 GB

Time (NVLink 300 GB/s per direction):
  Per All-to-All: ~2 GB ÷ 300 GB/s = 6.7 ms
  Per layer: 2 × 6.7 = 13.4 ms
  Total (80 layers): 13.4 × 80 = 1.07 seconds
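
Both derivations can be reproduced with a short calculator. This is a bandwidth-only estimate (it ignores latency and compute overlap); the 2(N-1)/N ring-AllReduce factor comes from 3.7.1:

```python
GB = 1e9
N = 8                        # GPUs
seq, d_model = 1_000_000, 8192
bf16 = 2                     # bytes per element
layers = 80
bw = 300 * GB                # NVLink, per direction

# Tensor parallelism: two AllReduces of the full hidden state per layer.
# A ring AllReduce moves 2 * (N-1)/N of the payload per GPU.
hidden = seq * d_model * bf16
tp_per_layer = 2 * (2 * (N - 1) / N) * hidden

# Ulysses: All-to-All on Q, K, V going in, and on the attention output
# coming back. Each GPU sends (N-1)/N of its local tensor.
q = seq // N * 64 * 128 * bf16   # 64 query heads
kv = seq // N * 8 * 128 * bf16   # 8 KV heads (GQA)
out = seq // N * 64 * 128 * bf16  # attention output, same size as Q
ulysses_per_layer = (N - 1) / N * (q + 2 * kv + out)

print(f"TP:      {tp_per_layer / GB:.1f} GB/layer, {tp_per_layer * layers / bw:.1f} s total")
print(f"Ulysses: {ulysses_per_layer / GB:.2f} GB/layer, {ulysses_per_layer * layers / bw:.2f} s total")
```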

3.7.3 Side-by-Side Comparison

Metric                     Tensor Parallelism       Ulysses                Ratio
────────────────────────────────────────────────────────────────────────────────
Comm ops per layer         2 × AllReduce            2 × All-to-All
Data per layer             57.3 GB                  4.0 GB                 14× less
Total (80 layers)          4,587 GB                 322 GB                 14× less
Time per layer             191 ms                   13.4 ms                14× faster
Total time                 15.3 sec                 1.07 sec               14× faster
FFN communication          Required (AllReduce)     None (local)
Activation memory per GPU  20.5 GB (full sequence)  2.6 GB (1/8 sequence)  8× less

3.7.4 Why is Ulysses 14× Better?

The difference comes from what gets communicated:

TENSOR PARALLELISM communicates:
  The full hidden state: [1, 1M, 8192] = 16.38 GB
  This is the FULL d_model dimension for ALL tokens.
  And it does this TWICE per layer (attention + FFN).

ULYSSES communicates:
  Q, K, V tensors: Q(2.05 GB) + K(0.26 GB) + V(0.26 GB) = 2.57 GB
  This is much smaller because:
  1. K and V are 8× smaller due to GQA (8 heads vs 64)
  2. FFN requires ZERO communication (tokens are independent)
  3. All-to-All is a pure redistribution (no reduction/summation)
Ulysses wins because: (1) GQA makes K, V much smaller than the full hidden state, (2) FFN is embarrassingly parallel across the sequence dimension — no communication needed, and (3) All-to-All is a pure shuffle, not a reduction.

3.7.5 When Does Tensor Parallelism Win?

Tensor parallelism isn't always worse. The key factor is latency vs bandwidth. Let's see a concrete example:

VERY SHORT SEQUENCE: seq_len = 128 tokens (single prompt, decode phase)
═══════════════════════════════════════════════════════════════════════

TENSOR PARALLELISM:
  AllReduce data: [1, 128, 8192] × 2 bytes = 2.1 MB
  Per AllReduce: 2 × 2.1 MB × 7/8 = 3.7 MB
  Per layer: 2 × 3.7 MB = 7.4 MB
  Time per layer: 7.4 MB ÷ 300 GB/s = 0.025 ms
  BUT: AllReduce startup latency ≈ 5 μs × 2 = 10 μs = 0.01 ms
  Effective time per layer: ~0.035 ms

ULYSSES:
  Q per GPU: [1, 16, 64, 128] × 2 bytes = 0.26 MB
  K per GPU: [1, 16, 8, 128] × 2 bytes = 0.033 MB
  V per GPU: [1, 16, 8, 128] × 2 bytes = 0.033 MB
  All-to-All data: 0.33 MB × 7/8 = 0.29 MB
  Per layer: 2 × 0.29 MB = 0.58 MB
  Time per layer: 0.58 MB ÷ 300 GB/s = 0.002 ms
  BUT: All-to-All startup latency ≈ 15 μs × 2 = 30 μs = 0.03 ms
  Effective time per layer: ~0.032 ms

At 128 tokens, they're roughly equal! The data is so small that
startup latency dominates, not bandwidth.

The real advantage of TP at short sequences is simplicity — no need to shard/unshard the sequence, and it's battle-tested in production systems like vLLM and TensorRT-LLM.

ALSO: Models with very few KV heads
════════════════════════════════════
  Some models use Multi-Query Attention (MQA): only 1 KV head.
  
  Ulysses with 1 KV head: max 1 GPU for KV All-to-All → can't parallelize!
  TP with 1 KV head: works fine — splits by query heads instead.

RULE OF THUMB:
  Long sequences (>1K tokens) + enough KV heads (≥ N GPUs) → Ulysses
  Short sequences (<1K tokens) OR too few KV heads → Tensor Parallelism

For our use case — Llama-70B with 1M tokens and 8 KV heads on 8 GPUs — Ulysses is the clear winner by a wide margin.

Communication budget: FSDP adds ~376 ms for weight gathering (from Part 2). Ulysses adds ~1.1 seconds for All-to-All. Total: ~1.5 seconds of communication per forward pass. With pipelining (overlapping communication and compute), much of this can be hidden.

3.8 Limitations of Ulysses

Ulysses is powerful, but it has important constraints.

3.8.1 Limited by Number of Heads

After All-to-All, each GPU needs at least one head. This limits the number of GPUs:

Llama-70B head counts:
  Query heads: 64
  KV heads: 8 (GQA)

Maximum GPUs for Ulysses:
  If splitting by query heads: max 64 GPUs
  If splitting by KV heads: max 8 GPUs  ← this is the real limit!

Why KV heads are the bottleneck:
  After All-to-All, each GPU needs at least 1 KV head.
  With 8 KV heads and 8 GPUs: each GPU gets exactly 1 KV head. ✓
  With 8 KV heads and 16 GPUs: each GPU gets 0.5 KV heads. ✗ Can't split!
GQA limits Ulysses scalability. Llama-70B's 8 KV heads mean Ulysses can use at most 8 GPUs. To scale beyond 8 GPUs, we need a different approach for the sequence dimension — that's Ring Attention.

3.8.2 KV Cache Still Grows Linearly

Ulysses reduces activation memory during computation, but the KV cache for inference still needs to store all tokens:

KV Cache with Ulysses (8 GPUs):
  After All-to-All, each GPU has full sequence for its KV heads.
  
  Per GPU: [1, 1M, 1 KV head, 128] × 2 (K+V) × 80 layers × 2 bytes
         = 1M × 1 × 128 × 2 × 80 × 2
         = 41 GB per GPU

  Total across 8 GPUs: 41 × 8 = 328 GB (same total as before)

Let's check: does everything actually fit on one A100?

FULL MEMORY BUDGET PER GPU (Ulysses + FSDP):
═════════════════════════════════════════════

Weight shard (FSDP):           17.5 GB
KV cache (1 KV head, 80 layers): 41.0 GB
Activations (Q,K,V for 125K):    2.6 GB
Buffers & overhead:               ~2.0 GB
─────────────────────────────────────────
Total:                          63.1 GB

A100 capacity:                  80.0 GB
Headroom:                       16.9 GB  ✓
It fits! With FSDP + Ulysses, each GPU uses ~63 GB of its 80 GB budget. The 17 GB headroom is enough for temporary buffers, FlashAttention workspace, and CUDA overhead.
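
The budget is simple to reproduce (all numbers from the text; the 2 GB overhead is the rough estimate above):

```python
weights_shard = 140 / 8                        # GB, FSDP shard (Part 2)
kv_cache = 1_000_000 * 128 * 2 * 80 * 2 / 1e9  # 1 KV head, K+V, 80 layers, BF16
activations = 2.6                              # Q, K, V for 125K local tokens
overhead = 2.0                                 # buffers, rough estimate
total = weights_shard + kv_cache + activations + overhead
print(f"{total:.1f} GB used, {80 - total:.1f} GB headroom")
assert total < 80
```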

3.8.3 All-to-All Requires All GPUs to Communicate

Unlike Ring AllGather (where each GPU only talks to its neighbors), All-to-All requires every GPU to send data to every other GPU simultaneously:

All-to-All communication pattern:
  - Every GPU sends to every other GPU
  - Requires full bisection bandwidth
  - Works well within a single node (NVLink)
  - Becomes expensive across nodes

WITHIN A NODE (8 GPUs, NVLink):
  Bandwidth: 300 GB/s per direction
  All-to-All time per layer: ~13.5 ms
  Latency: ~0.5 μs per message

ACROSS NODES (InfiniBand):
  Bandwidth: ~25 GB/s per direction (HDR InfiniBand)
  All-to-All time per layer: ~13.5 ms × (300/25) ≈ 162 ms
  Latency: ~1-5 μs per message

  80 layers: 162 ms × 80 = 13 seconds (vs 1.1 seconds intra-node)
  That's 12× slower!

This is why Ulysses is typically used within a single NVLink-connected node. For scaling beyond one node, Ring Attention (Part 4) is a better fit — it only communicates with neighbors, not all GPUs.


3.9 Summary

Key Concepts

Concept            What it does                  Key insight
────────────────────────────────────────────────────────────────────────────
Sequence Sharding  Split tokens 1/N per GPU      Reduces activation memory from 20.5 GB to 2.6 GB
All-to-All         Redistribute seq↔head split   Pure shuffle — no data created or destroyed
Column Split       Split weight columns          Produces partial output — just concatenate
Row Split          Split weight rows             Produces partial sums — needs AllReduce
Ulysses vs TP      All-to-All vs AllReduce       14× less communication for long sequences

Key Numbers (Llama-70B, 8 GPUs, 1M tokens)

Activation memory per GPU        ~2.6 GB (down from 20.5 GB)
All-to-All time per layer        ~13.5 ms (two All-to-Alls)
Total Ulysses comm (80 layers)   ~1.1 seconds
TP AllReduce equivalent          ~15.3 seconds (14× worse)
Max GPUs (KV head limited)       8 (due to GQA)
KV cache per GPU                 ~41 GB
Ulysses solves Problem #2: Activation memory drops from 20.5 GB to 2.6 GB per GPU. Combined with FSDP (17.5 GB weights), each GPU uses ~20 GB for weights + activations — well within the 80 GB budget. But the KV cache (41 GB per GPU) is still a challenge, and we're limited to 8 GPUs by GQA's head count. To scale further and handle the KV cache, we need Ring Attention.

What's Next

← Part 2: Weight Sharding (FSDP)
  How we distributed 140 GB of weights

Part 4: Ring Attention
  Distributed attention without gathering the full sequence — scaling beyond head count

Part 5: Putting It Together
  The complete USP picture: FSDP + Ulysses + Ring Attention

Weights are sharded. Sequence is sharded. Now let's shard the attention itself.