← Part 2: Weight Sharding (FSDP)
Splitting 1M tokens across 8 GPUs with All-to-All
In Part 2, we solved the weight problem: FSDP distributes 140 GB of weights across 8 GPUs at 17.5 GB each. But two problems from Part 1 remain unsolved: the activations for 1M tokens still don't fit on one GPU, and attention still needs the full sequence.
FSDP shards weights. Can we shard the sequence instead? That's exactly what Ulysses does. In this part, we'll build up the idea step by step.
The idea is simple: instead of every GPU processing all 1M tokens, each GPU processes only 1/N of the tokens.
With 8 GPUs and 1M tokens, each GPU gets a contiguous chunk:
Total sequence: 1,000,000 tokens
Number of GPUs: 8
GPU 0: tokens [0, 124,999] → 125,000 tokens
GPU 1: tokens [125,000, 249,999] → 125,000 tokens
GPU 2: tokens [250,000, 374,999] → 125,000 tokens
...
GPU 7: tokens [875,000, 999,999] → 125,000 tokens
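The chunking above is plain integer arithmetic. A minimal sketch (the helper name `shard_ranges` is ours, not from any library):

```python
# Hypothetical helper: each GPU's contiguous (start, end) token range.
def shard_ranges(seq_len: int, n_gpus: int):
    chunk = seq_len // n_gpus  # assumes seq_len divides evenly
    return [(r * chunk, (r + 1) * chunk - 1) for r in range(n_gpus)]

ranges = shard_ranges(1_000_000, 8)
print(ranges[0])  # (0, 124999)      ← GPU 0
print(ranges[7])  # (875000, 999999) ← GPU 7
```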
Each GPU now computes Q, K, V only for its local 125K tokens:
BEFORE (single GPU, full sequence):
Q: [1, 1,000,000, 8192] × 2 bytes = 16.38 GB (64 query heads)
K: [1, 1,000,000, 1024] × 2 bytes = 2.05 GB (8 KV heads, GQA)
V: [1, 1,000,000, 1024] × 2 bytes = 2.05 GB (8 KV heads, GQA)
Total: 20.48 GB ← on top of weights that already don't fit!
AFTER (8 GPUs, sharded sequence):
Q: [1, 125,000, 8192] × 2 bytes = 2.05 GB
K: [1, 125,000, 1024] × 2 bytes = 0.26 GB
V: [1, 125,000, 1024] × 2 bytes = 0.26 GB
Total per GPU: 2.56 GB ← fits easily!
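The before/after numbers can be checked with a few lines. A sketch assuming bf16 (2 bytes) and the Llama-70B projection dims used above:

```python
# Activation memory for Q, K, V at a given token count (bf16 = 2 bytes).
# q_dim = 64 heads x 128, kv_dim = 8 KV heads x 128 (GQA).
def qkv_gb(tokens, q_dim=8192, kv_dim=1024, bytes_per=2):
    gb = 1e9
    q = tokens * q_dim * bytes_per / gb
    k = tokens * kv_dim * bytes_per / gb
    v = tokens * kv_dim * bytes_per / gb
    return q, k, v, q + k + v

print(qkv_gb(1_000_000))  # single GPU, full sequence: total ~20.48 GB
print(qkv_gb(125_000))    # per GPU after sharding:    total ~2.56 GB
```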
The feed-forward network (FFN) processes each token independently — there's no interaction between tokens. This means sequence sharding works perfectly for FFN with zero communication:
FFN computation (per token, independent):
hidden = SiLU(x @ W_gate) * (x @ W_up)
output = hidden @ W_down
GPU 0 processes tokens 0-124,999 through FFN → no communication needed
GPU 1 processes tokens 125,000-249,999 through FFN → no communication needed
...
GPU 7 processes tokens 875,000-999,999 through FFN → no communication needed
FFN is "embarrassingly parallel" across the sequence dimension. Each GPU applies the same weights to its local tokens independently. This is a huge advantage of sequence sharding.
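This token-independence is easy to verify at toy scale. A numpy sketch (tiny stand-in dims, not the real 8192/28672) showing that running the SwiGLU FFN on sequence chunks and concatenating matches the full-sequence result exactly:

```python
import numpy as np

# Toy check: the FFN applied per chunk == the FFN applied to the full sequence.
rng = np.random.default_rng(0)
d_model, d_ff, seq, n_gpus = 8, 16, 32, 4
W_gate = rng.normal(size=(d_model, d_ff))
W_up = rng.normal(size=(d_model, d_ff))
W_down = rng.normal(size=(d_ff, d_model))
silu = lambda z: z / (1 + np.exp(-z))

def ffn(x):
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

x = rng.normal(size=(seq, d_model))
full = ffn(x)                                               # one "GPU"
sharded = np.concatenate([ffn(c) for c in np.split(x, n_gpus)])  # n "GPUs"
print(np.allclose(full, sharded))  # True — no cross-token interaction
```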
But there's a problem. The FFN is fine — but what about attention?
Attention is fundamentally different from FFN. In attention, every token needs to attend to every other token:
Attention(Q, K, V) = softmax(Q @ Kᵀ / √d) @ V
For token 0 to compute its attention output:
- It needs Q₀ (its own query) ← GPU 0 has this
- It needs K for ALL 1M tokens ← spread across 8 GPUs!
- It needs V for ALL 1M tokens ← spread across 8 GPUs!
If each GPU only uses its local K and V, we get partial attention — each token only attends to tokens on the same GPU:
LOCAL-ONLY ATTENTION (WRONG):
─────────────────────────────
GPU 0: tokens 0-124,999 attend to tokens 0-124,999 only
GPU 1: tokens 125,000-249,999 attend to tokens 125,000-249,999 only
...
Token 0 CANNOT see token 999,999!
Token 124,999 CANNOT see token 125,000 (its immediate neighbor)!
This is fundamentally wrong. The whole point of attention is that any token can attend to any other token. Local-only attention would be like reading a book but only being able to reference the current chapter — you'd lose all cross-chapter context.
Let's visualize what the attention matrix looks like with sequence sharding:
FULL ATTENTION MATRIX (what we need):
─────────────────────────────────────
K₀ K₁ K₂ K₃ K₄ K₅ K₆ K₇
(GPU0) (GPU1) (GPU2) (GPU3) (GPU4) (GPU5) (GPU6) (GPU7)
Q₀ (GPU0) [████ ████ ████ ████ ████ ████ ████ ████]
Q₁ (GPU1) [████ ████ ████ ████ ████ ████ ████ ████]
Q₂ (GPU2) [████ ████ ████ ████ ████ ████ ████ ████]
Q₃ (GPU3) [████ ████ ████ ████ ████ ████ ████ ████]
Q₄ (GPU4) [████ ████ ████ ████ ████ ████ ████ ████]
Q₅ (GPU5) [████ ████ ████ ████ ████ ████ ████ ████]
Q₆ (GPU6) [████ ████ ████ ████ ████ ████ ████ ████]
Q₇ (GPU7) [████ ████ ████ ████ ████ ████ ████ ████]
LOCAL-ONLY ATTENTION (what each GPU can compute):
─────────────────────────────────────────────────
K₀ K₁ K₂ K₃ K₄ K₅ K₆ K₇
Q₀ (GPU0) [████ ---- ---- ---- ---- ---- ---- ----]
Q₁ (GPU1) [---- ████ ---- ---- ---- ---- ---- ----]
Q₂ (GPU2) [---- ---- ████ ---- ---- ---- ---- ----]
...
Q₇ (GPU7) [---- ---- ---- ---- ---- ---- ---- ████]
Only the diagonal blocks! We're missing 7/8 of the attention matrix.
We need a way for each GPU to compute attention across the full sequence. There are two approaches: (1) gather all K and V onto every GPU, which brings back the very memory problem we just solved, or (2) rearrange the data so that each GPU holds the full sequence for a subset of heads.
Ulysses takes approach #2 with a clever trick: the All-to-All operation.
Here's Ulysses's key insight: we don't need every GPU to have all K and V. We just need each GPU to compute complete attention for some heads.
Remember, multi-head attention splits the computation into independent heads:
Llama-70B has 64 query heads.
Each head computes attention independently:
Head 0: Attention(Q_head0, K_head0, V_head0)
Head 1: Attention(Q_head1, K_head1, V_head1)
...
Head 63: Attention(Q_head63, K_head63, V_head63)
These heads don't interact with each other during attention!
So instead of each GPU having all heads for partial tokens, we can rearrange so each GPU has partial heads for all tokens:
BEFORE All-to-All (sequence-sharded):
─────────────────────────────────────
GPU 0: Q for tokens 0-124K, ALL 64 heads
GPU 1: Q for tokens 125K-249K, ALL 64 heads
...
GPU 7: Q for tokens 875K-999K, ALL 64 heads
Each GPU: [batch, 125K tokens, 64 heads, 128 dim]
AFTER All-to-All (head-sharded):
────────────────────────────────
GPU 0: Q for ALL 1M tokens, heads 0-7
GPU 1: Q for ALL 1M tokens, heads 8-15
...
GPU 7: Q for ALL 1M tokens, heads 56-63
Each GPU: [batch, 1M tokens, 8 heads, 128 dim]
After the All-to-All, each GPU has the complete sequence for its assigned heads. Now it can compute full attention:
GPU 0 after All-to-All:
Q: [1, 1M, 8, 128] ← all tokens, heads 0-7
K: [1, 1M, 1, 128] ← all tokens, KV head 0 (GQA: 1 KV head serves query heads 0-7)
V: [1, 1M, 1, 128] ← all tokens, KV head 0
Attention for head 0: softmax(Q₀ @ K₀ᵀ / √128) @ V₀
→ Q₀ is [1M, 128], K₀ is [1M, 128]
→ Q₀ @ K₀ᵀ = [1M, 1M] ← full attention matrix for this head!
→ Uses FlashAttention, so never materialized
Each GPU computes COMPLETE attention for its 8 heads. ✓
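The redistribution itself can be simulated with array slicing. A toy numpy sketch (4 "GPUs" as list entries, not real devices) showing that the All-to-All is a pure relabeling of the same data — after the exchange, "GPU" g holds exactly the full-sequence slice for its head group:

```python
import numpy as np

# Toy All-to-All: seq-sharded → head-sharded, simulated on lists.
seq, heads, dim, n = 16, 8, 4, 4
Q = np.arange(seq * heads * dim).reshape(seq, heads, dim)

# BEFORE: GPU r holds its token chunk, ALL heads.
seq_sharded = np.split(Q, n, axis=0)           # n tensors [seq/n, heads, dim]

# All-to-All: GPU r slices its chunk into n head-groups, sends group g to
# GPU g; GPU g concatenates what it receives along the token axis.
head_sharded = [
    np.concatenate([np.split(seq_sharded[r], n, axis=1)[g] for r in range(n)],
                   axis=0)
    for g in range(n)
]

# AFTER: GPU g holds ALL tokens for heads [g*2, g*2+1] — same data, new layout.
print(np.array_equal(head_sharded[0], Q[:, : heads // n]))  # True
```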
Let's trace through with our Llama-70B setup. We have 8 GPUs, 8 KV heads, and 1M tokens (125K per GPU):
SETUP:
8 GPUs, 8 KV heads (GQA), 64 query heads
Each GPU starts with 125K tokens, all 8 KV heads
BEFORE All-to-All (for K tensor — same logic applies to V):
────────────────────────────────────────────────────────────
GPU 0: tokens [0-124K], KV heads [H0, H1, H2, H3, H4, H5, H6, H7]
GPU 1: tokens [125K-249K], KV heads [H0, H1, H2, H3, H4, H5, H6, H7]
GPU 2: tokens [250K-374K], KV heads [H0, H1, H2, H3, H4, H5, H6, H7]
...
GPU 7: tokens [875K-999K], KV heads [H0, H1, H2, H3, H4, H5, H6, H7]
Each GPU has: [1, 125K, 8 heads, 128 dim]
Now the All-to-All redistributes. GPU 0 collects H0 from every GPU (keeping its own H0 slice), GPU 1 collects H1, and so on:
ALL-TO-ALL COMMUNICATION:
─────────────────────────
GPU 0 keeps H0 for its tokens, sends H1→GPU1, H2→GPU2, ... H7→GPU7
GPU 1 sends H0→GPU0, keeps H1 for its tokens, sends H2→GPU2, ... H7→GPU7
GPU 2 sends H0→GPU0, H1→GPU1, keeps H2 for its tokens, ... H7→GPU7
...
GPU 7 sends H0→GPU0, H1→GPU1, ... H6→GPU6, keeps H7 for its tokens
Each GPU sends 7/8 of its data and receives 7/8 from others.
AFTER All-to-All:
─────────────────
GPU 0: ALL tokens [0-999K], head H0 only
K = | tokens 0-124K H0 | ← kept locally
| tokens 125K-249K H0 | ← received from GPU 1
| tokens 250K-374K H0 | ← received from GPU 2
| ... |
| tokens 875K-999K H0 | ← received from GPU 7
GPU 1: ALL tokens [0-999K], head H1 only
GPU 2: ALL tokens [0-999K], head H2 only
...
GPU 7: ALL tokens [0-999K], head H7 only
Each GPU has: [1, 1M, 1 head, 128 dim]
Let's verify that the total memory per GPU doesn't change:
BEFORE All-to-All (per GPU):
Q: [1, 125K tokens, 64 heads, 128 dim] × 2 bytes
Elements: 125,000 × 64 × 128 = 1,024,000,000
Memory: 1,024,000,000 × 2 = 2.05 GB
AFTER All-to-All (per GPU):
Q: [1, 1M tokens, 8 heads, 128 dim] × 2 bytes
Elements: 1,000,000 × 8 × 128 = 1,024,000,000
Memory: 1,024,000,000 × 2 = 2.05 GB
Same! 125K × 64 = 1M × 8 = 8,000,000 head-token pairs per GPU
For K and V (GQA — 8 KV heads):
Before: [1, 125K, 8, 128] = 0.26 GB per GPU
After: [1, 1M, 1, 128] = 0.26 GB per GPU
Also the same!
All-to-All is a pure redistribution — no data is created or destroyed. It's like rearranging cards in a deck: the total number of cards stays the same, but they're organized differently.
Now let's trace through an entire transformer layer with Ulysses. We'll track the tensor shapes on each GPU at every step.
Each GPU starts with its chunk of the input:
Input per GPU: X_local
Shape: [batch, seq_len/N, d_model] = [1, 125K, 8192]
Memory: 125,000 × 8192 × 2 bytes = 2.05 GB
Each GPU projects its local tokens into Q, K, V using the full weight matrices (gathered via FSDP). Let's trace the shapes:
Q_local = X_local @ Wq
[1, 125K, 8192] @ [8192, 8192] = [1, 125K, 8192]
This gives us 64 query heads × 128 dim = 8192
K_local = X_local @ Wk
[1, 125K, 8192] @ [8192, 1024] = [1, 125K, 1024]
This gives us 8 KV heads × 128 dim = 1024 (GQA — smaller!)
V_local = X_local @ Wv
[1, 125K, 8192] @ [8192, 1024] = [1, 125K, 1024]
Same as K — 8 KV heads × 128 dim = 1024
Reshape for multi-head:
Q_local: [1, 125K, 64 heads, 128 dim]
K_local: [1, 125K, 8 heads, 128 dim]
V_local: [1, 125K, 8 heads, 128 dim]
Memory per GPU:
Q: 125K × 8192 × 2 bytes = 2.05 GB
K: 125K × 1024 × 2 bytes = 0.26 GB
V: 125K × 1024 × 2 bytes = 0.26 GB
Total: 2.57 GB
Redistribute Q, K, V so each GPU has all tokens for a subset of heads:
All-to-All on Q:
Before: [1, 125K, 64 heads, 128] per GPU (partial tokens, all heads)
After: [1, 1M, 8 heads, 128] per GPU (all tokens, partial heads)
All-to-All on K (with GQA — 8 KV heads across 8 GPUs):
Before: [1, 125K, 8 heads, 128] per GPU
After: [1, 1M, 1 head, 128] per GPU (each GPU gets 1 KV head)
All-to-All on V:
Before: [1, 125K, 8 heads, 128] per GPU
After: [1, 1M, 1 head, 128] per GPU
Memory per GPU after All-to-All:
Q: 1M × 8 × 128 × 2 = 2.05 GB
K: 1M × 1 × 128 × 2 = 0.26 GB
V: 1M × 1 × 128 × 2 = 0.26 GB
Total: 2.57 GB ← same as before!
Now each GPU computes full attention for its assigned heads. With GQA, each KV head is shared by 8 query heads:
GPU 0 has: Q for heads 0-7, K for KV-head 0, V for KV-head 0
For each query head (0 through 7):
Attention_h = softmax(Q_h @ K₀ᵀ / √128) @ V₀
Q_h: [1M, 128]
K₀: [1M, 128]
Using FlashAttention (tiled, never materializes [1M, 1M]):
Output_h: [1M, 128]
All 8 query heads share the same K₀, V₀ (this is GQA!)
Output per GPU: [1, 1M, 8 heads, 128 dim]
Now we need to reverse the redistribution — go back to sequence-sharded layout for the output projection and FFN:
All-to-All on attention output:
Before: [1, 1M, 8 heads, 128] per GPU (all tokens, partial heads)
After: [1, 125K, 64 heads, 128] per GPU (partial tokens, all heads)
Reshape: [1, 125K, 8192] (merge heads back into d_model)
The rest of the layer is purely local — no communication needed:
Output projection (local):
attn_out = reshaped_output @ Wo → [1, 125K, 8192]
Add residual + LayerNorm (local)
FFN (local, per token):
gate = x @ W_gate → [1, 125K, 28672]
up = x @ W_up → [1, 125K, 28672]
hidden = SiLU(gate) * up
output = hidden @ W_down → [1, 125K, 8192]
Add residual + LayerNorm (local)
→ Pass to next layer (still sequence-sharded)
ULYSSES FORWARD PASS — ONE LAYER:
═════════════════════════════════
Step 1: Input [1, 125K, 8192] ← sequence-sharded
Step 2: Compute Q, K, V local matmuls ← no communication
Step 3: All-to-All #1 seq→head split ← COMMUNICATION
Step 4: Attention (FlashAttn) per-head, full seq ← no communication
Step 5: All-to-All #2 head→seq split ← COMMUNICATION
Step 6: Output proj + FFN local matmuls ← no communication
Communication per layer: 2 × All-to-All
Everything else: purely local computation
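The whole pipeline can be checked end-to-end at toy scale. A numpy sketch — the `all_to_all` function here is our stand-in for the real collective (e.g. `torch.distributed.all_to_all`), and all dims are toy-sized — verifying that the Ulysses route reproduces plain multi-head attention exactly:

```python
import numpy as np

# Toy end-to-end check: plain MHA vs. the Ulysses route (shard tokens,
# All-to-All to head-sharded, attend over the full sequence, All-to-All back).
rng = np.random.default_rng(1)
n, seq, heads, dim = 4, 16, 8, 4
Q, K, V = (rng.normal(size=(seq, heads, dim)) for _ in range(3))

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attend(q, k, v):  # q, k, v: [seq, heads, dim] — heads are independent
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(dim)
    return np.einsum("hqk,khd->qhd", softmax(scores), v)

reference = attend(Q, K, V)  # single-device result

def all_to_all(shards, axis_split, axis_cat):
    # Pure reshuffle: slice each rank's shard n ways, regroup, concatenate.
    pieces = [np.split(s, n, axis=axis_split) for s in shards]
    return [np.concatenate([pieces[r][g] for r in range(n)], axis=axis_cat)
            for g in range(n)]

# Steps 1-2: sequence-sharded Q, K, V per "GPU".
q_s, k_s, v_s = np.split(Q, n), np.split(K, n), np.split(V, n)
# Step 3: All-to-All #1, seq → head.
q_h, k_h, v_h = (all_to_all(t, 1, 0) for t in (q_s, k_s, v_s))
# Step 4: each "GPU" attends over the FULL sequence for its heads.
out_h = [attend(q_h[g], k_h[g], v_h[g]) for g in range(n)]
# Step 5: All-to-All #2, head → seq; reassemble for comparison.
ulysses = np.concatenate(all_to_all(out_h, 0, 1), axis=0)
print(np.allclose(reference, ulysses))  # True
```

The same exchange function serves both directions; only the split/concat axes swap, which is why the two All-to-Alls are symmetric in cost.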
You might be wondering: if we end up splitting by heads for attention anyway, why not split by heads from the start? That's exactly what Tensor Parallelism (Megatron-style) does. Let's understand it in detail, then see why Ulysses usually wins for long sequences.
Tensor parallelism splits weight matrices across GPUs. There are two ways to split a matrix, and each has different consequences:
When we split a weight matrix by columns, each GPU computes a different slice of the output. Let's trace through with concrete numbers:
COLUMN SPLIT EXAMPLE:
─────────────────────
Full computation: Y = X @ W
X: [4, 8] W: [8, 6] Y: [4, 6]
Split W by columns across 2 GPUs:
GPU 0 gets W₀ = W[:, :3] shape [8, 3] (first 3 columns)
GPU 1 gets W₁ = W[:, 3:] shape [8, 3] (last 3 columns)
Both GPUs have the FULL input X: [4, 8]
GPU 0 computes: Y₀ = X @ W₀ = [4, 3] ← first 3 columns of Y
GPU 1 computes: Y₁ = X @ W₁ = [4, 3] ← last 3 columns of Y
To get full Y: just concatenate!
Y = [Y₀ | Y₁] = [4, 6] ← NO COMMUNICATION NEEDED
Why does this work? Each column of W produces one column of Y. If GPU 0 has columns 0-2 of W, it produces columns 0-2 of Y. Simple concatenation gives the full result.
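That claim is a two-line check in numpy, using the toy sizes from the example above:

```python
import numpy as np

# Column split: concatenating per-GPU outputs == the full matmul. No reduction.
rng = np.random.default_rng(2)
X = rng.normal(size=(4, 8))
W = rng.normal(size=(8, 6))
W0, W1 = W[:, :3], W[:, 3:]               # column shards: "GPU 0" / "GPU 1"
Y = np.concatenate([X @ W0, X @ W1], axis=1)
print(np.allclose(Y, X @ W))  # True
```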
When we split a weight matrix by rows, something different happens. Each GPU computes a partial sum of the output:
ROW SPLIT EXAMPLE:
──────────────────
Full computation: Y = X @ W
X: [4, 8] W: [8, 6] Y: [4, 6]
Split W by rows across 2 GPUs:
GPU 0 gets W₀ = W[:4, :] shape [4, 6] (first 4 rows)
GPU 1 gets W₁ = W[4:, :] shape [4, 6] (last 4 rows)
Each GPU also needs a DIFFERENT slice of X:
GPU 0 uses X₀ = X[:, :4] shape [4, 4] (first 4 columns)
GPU 1 uses X₁ = X[:, 4:] shape [4, 4] (last 4 columns)
GPU 0 computes: Y₀ = X₀ @ W₀ = [4, 6] ← PARTIAL result
GPU 1 computes: Y₁ = X₁ @ W₁ = [4, 6] ← PARTIAL result
To get full Y: must SUM!
Y = Y₀ + Y₁ = [4, 6] ← REQUIRES ALLREDUCE
When you split the inner dimension (rows of W = columns of X), each GPU computes a partial dot product. You must sum them to get the correct result. This sum across GPUs is an AllReduce.
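The row-split case checks out the same way — each shard alone is wrong, and only the sum matches:

```python
import numpy as np

# Row split: each shard yields a PARTIAL [4, 6] result; the sum (the
# AllReduce) reproduces the full matmul.
rng = np.random.default_rng(3)
X = rng.normal(size=(4, 8))
W = rng.normal(size=(8, 6))
Y0 = X[:, :4] @ W[:4, :]                  # GPU 0: first 4 inner dims
Y1 = X[:, 4:] @ W[4:, :]                  # GPU 1: last 4 inner dims
print(np.allclose(Y0 + Y1, X @ W))        # True — the sum IS the AllReduce
```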
Now let's see how column split and row split are used together in a real transformer layer. This is the Megatron-LM approach.
The attention block uses column split for Q, K, V projections. This is natural because each head is independent — column split gives each GPU its own heads:
ATTENTION BLOCK (Tensor Parallel, 8 GPUs):
══════════════════════════════════════════
Input X: [1, 1M, 8192] ← REPLICATED on all GPUs
STEP 1: Q, K, V Projections (COLUMN SPLIT)
──────────────────────────────────────────
Wq is [8192, 8192]. Split by columns into 8 shards:
GPU 0: Wq₀ = Wq[:, :1024] shape [8192, 1024] → 8 heads
GPU 1: Wq₁ = Wq[:, 1024:2048] shape [8192, 1024] → 8 heads
...
GPU 7: Wq₇ = Wq[:, 7168:] shape [8192, 1024] → 8 heads
Each GPU computes:
Q_local = X @ Wq_local → [1, 1M, 1024] (8 of 64 heads)
Same for K and V (column split).
NO COMMUNICATION needed — column split produces independent slices!
STEP 2: Attention Computation (LOCAL)
─────────────────────────────────────
Each GPU computes attention for its 8 heads. Heads are independent.
Now here's where it gets interesting. After attention, each GPU has an output of shape [1, 1M, 1024] — only 1/8 of the full hidden dimension. We need to project this back to the full d_model = 8192 using the output projection Wo.
Why is Wo a row split? Because of dimensional compatibility:
STEP 3: Output Projection — WHY ROW SPLIT?
──────────────────────────────────────────
After attention, each GPU has: attn_out_local = [1, 1M, 1024]
Wo is [8192, 8192]. We need to map back to d_model = 8192.
Option A — Column split Wo:
GPU 0 would get Wo[:, :1024] = [8192, 1024]
Input needed: [1, 1M, 8192] ← GPU 0 only has [1, 1M, 1024]!
DOESN'T WORK without gathering first.
Option B — Row split Wo (what we actually do):
GPU 0 gets Wo[:1024, :] = [1024, 8192]
Input needed: [1, 1M, 1024] ← GPU 0 already has this! ✓
So row split is the natural choice:
GPU 0: O₀ = attn_out₀ @ Wo₀ → [1, 1M, 8192] (PARTIAL result)
GPU 1: O₁ = attn_out₁ @ Wo₁ → [1, 1M, 8192] (PARTIAL result)
...
GPU 7: O₇ = attn_out₇ @ Wo₇ → [1, 1M, 8192] (PARTIAL result)
Full output = O₀ + O₁ + O₂ + ... + O₇
⚠️ ALLREDUCE #1: Sum [1, 1M, 8192] across 8 GPUs
Data: 1M × 8192 × 2 bytes = 16.38 GB
The FFN block follows the same pattern — column split for the up-projections, row split for the down-projection:
FFN BLOCK (Tensor Parallel, 8 GPUs, SwiGLU):
═════════════════════════════════════════════
Input: [1, 1M, 8192] ← result from attention AllReduce (replicated)
STEP 4: Gate and Up Projections (COLUMN SPLIT)
───────────────────────────────────────────────
W_gate is [8192, 28672]. Split by columns:
GPU 0: W_gate₀ = W_gate[:, :3584] shape [8192, 3584]
...
Each GPU computes:
gate_local = X @ W_gate_local → [1, 1M, 3584]
up_local = X @ W_up_local → [1, 1M, 3584]
hidden_local = SiLU(gate_local) * up_local → [1, 1M, 3584]
NO COMMUNICATION — column split!
STEP 5: Down Projection (ROW SPLIT) ← PROBLEM AGAIN
─────────────────────────────────────────────────────
W_down is [28672, 8192]. Split by ROWS:
GPU 0: W_down₀ = W_down[:3584, :] shape [3584, 8192]
...
Each GPU computes:
ffn_out_local = hidden_local @ W_down_local → [1, 1M, 8192] (PARTIAL!)
To get the correct output:
ffn_out = ffn_out₀ + ffn_out₁ + ... + ffn_out₇
⚠️ ALLREDUCE #2: Sum [1, 1M, 8192] across 8 GPUs
Data: 1M × 8192 × 2 bytes = 16.38 GB
That's two AllReduces per layer, each moving a full [1, 1M, 8192] = 16.38 GB hidden state.
Now let's do the detailed math to compare communication costs. This is where Ulysses really shines.
Each AllReduce needs to sum partial results across 8 GPUs and give every GPU the full sum. The efficient way to do this is in two phases:
HOW ALLREDUCE WORKS (for the output projection sum):
════════════════════════════════════════════════════
Each GPU has a partial result: O_local = [1, 1M, 8192] (16.38 GB)
We need: O_full = O₀ + O₁ + O₂ + ... + O₇ on EVERY GPU.
NAIVE APPROACH (Reduce + Broadcast):
Step 1: Send all data to GPU 0, sum there → GPU 0 has O_full
Step 2: GPU 0 broadcasts O_full to everyone
Problem: Only 1 GPU active at a time. Sequential. Slow.
EFFICIENT APPROACH (ReduceScatter + AllGather):
Phase 1 — ReduceScatter:
Split each GPU's 16.38 GB into 8 chunks of 2.05 GB.
GPU 0 is responsible for summing chunk 0 from all GPUs.
GPU 1 is responsible for summing chunk 1 from all GPUs.
...
All GPUs work in PARALLEL — each one sums a different chunk.
After N-1 ring steps:
GPU 0 has: fully summed chunk 0 (2.05 GB)
GPU 1 has: fully summed chunk 1 (2.05 GB)
...
Data sent per GPU: 16.38 × (N-1)/N = 16.38 × 7/8 = 14.33 GB
Phase 2 — AllGather:
Now each GPU shares its summed chunk with all others.
(Same as FSDP's AllGather from Part 2!)
After N-1 ring steps:
Every GPU has all 8 summed chunks = full O_full (16.38 GB)
Data sent per GPU: 14.33 GB
Total per AllReduce: 14.33 + 14.33 = 28.67 GB per GPU
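The (N−1)/N accounting collapses to a one-line formula. A sketch (using the exact 16.384 GB rather than the rounded 16.38):

```python
# Per-GPU AllReduce traffic: ReduceScatter + AllGather each move (N-1)/N
# of the tensor, so the total is 2 x (N-1)/N x tensor size.
def allreduce_gb_per_gpu(tensor_gb: float, n_gpus: int) -> float:
    return 2 * tensor_gb * (n_gpus - 1) / n_gpus

print(round(allreduce_gb_per_gpu(16.384, 8), 2))  # 28.67
```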
Now let's calculate the full cost:
TENSOR PARALLELISM — Communication per layer:
══════════════════════════════════════════════
AllReduce #1 (after attention output projection):
Data: [1, 1M, 8192] × 2 bytes = 16.38 GB
Volume per GPU: 28.67 GB
AllReduce #2 (after FFN down projection):
Data: [1, 1M, 8192] × 2 bytes = 16.38 GB
Volume per GPU: 28.67 GB
Total per layer: 28.67 + 28.67 = 57.34 GB
Total for 80 layers: 57.34 × 80 = 4,587 GB ≈ 4.5 TB
Time (NVLink 300 GB/s per direction):
Per AllReduce: 28.67 GB ÷ 300 GB/s = 95.6 ms
Per layer: 2 × 95.6 = 191.2 ms
Total (80 layers): 191.2 × 80 = 15.3 seconds
ULYSSES — Communication per layer:
══════════════════════════════════
All-to-All #1 (before attention, on Q, K, V):
Q per GPU before: [1, 125K, 64, 128] × 2 bytes = 2.05 GB
K per GPU before: [1, 125K, 8, 128] × 2 bytes = 0.26 GB
V per GPU before: [1, 125K, 8, 128] × 2 bytes = 0.26 GB
Each GPU sends (N-1)/N of its data:
Q sent: 2.05 × 7/8 = 1.79 GB
K sent: 0.26 × 7/8 = 0.22 GB
V sent: 0.26 × 7/8 = 0.22 GB
Total sent: 2.23 GB
All-to-All #2 (after attention, on output):
Output per GPU: [1, 1M, 8, 128] × 2 bytes = 2.05 GB
Sent: 2.05 × 7/8 = 1.79 GB
Total per layer: 2.23 + 1.79 = 4.02 GB
Total for 80 layers: 4.02 × 80 = 322 GB
Time (NVLink 300 GB/s per direction):
Per All-to-All: ~2 GB ÷ 300 GB/s = 6.7 ms
Per layer: 2 × 6.7 = 13.4 ms
Total (80 layers): 13.4 × 80 = 1.07 seconds
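Both per-layer totals fall out of a few multiplications. A sketch reproducing the numbers above (dims assumed from this series — bf16, 8 GPUs, GQA):

```python
# TP: two AllReduces per layer, each moving the full hidden state
# with the ReduceScatter+AllGather factor of 2 x (N-1)/N.
def tp_gb_per_layer(seq, d_model=8192, n=8):
    hidden = seq * d_model * 2 / 1e9
    return 2 * (2 * hidden * (n - 1) / n)

# Ulysses: All-to-All #1 on local Q, K, V + All-to-All #2 on the
# attention output; each sends (N-1)/N of the local data.
def ulysses_gb_per_layer(seq, n=8, q_heads=64, kv_heads=8, head_dim=128):
    local = seq // n
    qkv = local * (q_heads + 2 * kv_heads) * head_dim * 2 / 1e9
    out = local * q_heads * head_dim * 2 / 1e9
    return (qkv + out) * (n - 1) / n

print(round(tp_gb_per_layer(1_000_000), 1))       # 57.3
print(round(ulysses_gb_per_layer(1_000_000), 1))  # 4.0
```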
| Metric | Tensor Parallelism | Ulysses | Ratio |
|---|---|---|---|
| Comm ops per layer | 2 × AllReduce | 2 × All-to-All | — |
| Data per layer | 57.3 GB | 4.0 GB | 14× less |
| Total (80 layers) | 4,587 GB | 322 GB | 14× less |
| Time per layer | 191 ms | 13.4 ms | 14× faster |
| Total time | 15.3 sec | 1.07 sec | 14× faster |
| FFN communication | Required (AllReduce) | None (local) | ∞ |
| Activation memory per GPU | 20.5 GB (full sequence) | 2.6 GB (1/8 sequence) | 8× less |
The difference comes from what gets communicated:
TENSOR PARALLELISM communicates:
The full hidden state: [1, 1M, 8192] = 16.38 GB
This is the FULL d_model dimension for ALL tokens.
And it does this TWICE per layer (attention + FFN).
ULYSSES communicates:
Q, K, V tensors: Q(2.05 GB) + K(0.26 GB) + V(0.26 GB) = 2.57 GB
This is much smaller because:
1. K and V are 8× smaller due to GQA (8 heads vs 64)
2. FFN requires ZERO communication (tokens are independent)
3. All-to-All is a pure redistribution (no reduction/summation)
Tensor parallelism isn't always worse. The key factor is latency vs bandwidth. Let's see a concrete example:
VERY SHORT SEQUENCE: seq_len = 128 tokens (single prompt, decode phase)
═══════════════════════════════════════════════════════════════════════
TENSOR PARALLELISM:
AllReduce data: [1, 128, 8192] × 2 bytes = 2.1 MB
Per AllReduce: 2 × 2.1 MB × 7/8 = 3.7 MB
Per layer: 2 × 3.7 MB = 7.4 MB
Time per layer: 7.4 MB ÷ 300 GB/s = 0.025 ms
BUT: AllReduce startup latency ≈ 5 μs × 2 = 10 μs = 0.01 ms
Effective time per layer: ~0.035 ms
ULYSSES:
Q per GPU: [1, 16, 64, 128] × 2 bytes = 0.26 MB
K per GPU: [1, 16, 8, 128] × 2 bytes = 0.033 MB
V per GPU: [1, 16, 8, 128] × 2 bytes = 0.033 MB
All-to-All data: 0.33 MB × 7/8 = 0.29 MB
Per layer: 2 × 0.29 MB = 0.58 MB
Time per layer: 0.58 MB ÷ 300 GB/s = 0.002 ms
BUT: All-to-All startup latency ≈ 15 μs × 2 = 30 μs = 0.03 ms
Effective time per layer: ~0.032 ms
At 128 tokens, they're roughly equal! The data is so small that
startup latency dominates, not bandwidth.
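A sketch of this estimate — the 300 GB/s bandwidth and the 5 µs / 15 µs startup latencies are the assumed constants stated above, not measured values:

```python
# Transfer time = data / bandwidth + fixed startup latency, per collective.
def time_ms(data_mb, startup_us, bw_gbs=300):
    return data_mb / 1e3 / bw_gbs * 1e3 + startup_us / 1e3

tp = 2 * time_ms(3.7, 5)         # two AllReduces per layer
ulysses = 2 * time_ms(0.29, 15)  # two All-to-Alls per layer
print(round(tp, 3), round(ulysses, 3))  # 0.035 0.032 — roughly equal
```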
The real advantage of TP at short sequences is simplicity — no need to shard/unshard the sequence, and it's battle-tested in production systems like vLLM and TensorRT-LLM.
ALSO: Models with very few KV heads
════════════════════════════════════
Some models use Multi-Query Attention (MQA): only 1 KV head.
Ulysses with 1 KV head: max 1 GPU for KV All-to-All → can't parallelize!
TP with 1 KV head: works fine — splits by query heads instead.
RULE OF THUMB:
Long sequences (>1K tokens) + enough KV heads (≥ N GPUs) → Ulysses
Short sequences (<1K tokens) OR too few KV heads → Tensor Parallelism
For our use case — Llama-70B with 1M tokens and 8 KV heads on 8 GPUs — Ulysses is the clear winner by a wide margin.
Ulysses is powerful, but it has important constraints.
After All-to-All, each GPU needs at least one head. This limits the number of GPUs:
Llama-70B head counts:
Query heads: 64
KV heads: 8 (GQA)
Maximum GPUs for Ulysses:
If splitting by query heads: max 64 GPUs
If splitting by KV heads: max 8 GPUs ← this is the real limit!
Why KV heads are the bottleneck:
After All-to-All, each GPU needs at least 1 KV head.
With 8 KV heads and 8 GPUs: each GPU gets exactly 1 KV head. ✓
With 8 KV heads and 16 GPUs: each GPU gets 0.5 KV heads. ✗ Can't split!
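A divisibility check captures the constraint (the helper name is ours):

```python
# Ulysses needs whole heads per GPU: both query and KV head counts
# must divide evenly by the GPU count.
def can_run_ulysses(q_heads: int, kv_heads: int, n_gpus: int) -> bool:
    return q_heads % n_gpus == 0 and kv_heads % n_gpus == 0

print(can_run_ulysses(64, 8, 8))   # True
print(can_run_ulysses(64, 8, 16))  # False — 0.5 KV heads per GPU
```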
Ulysses reduces activation memory during computation, but the KV cache for inference still needs to store all tokens:
KV Cache with Ulysses (8 GPUs):
After All-to-All, each GPU has full sequence for its KV heads.
Per GPU: [1, 1M, 1 KV head, 128] × 2 (K+V) × 80 layers × 2 bytes
= 1M × 1 × 128 × 2 × 80 × 2
= 41 GB per GPU
Total across 8 GPUs: 41 × 8 = 328 GB (same total as before)
Let's check: does everything actually fit on one A100?
FULL MEMORY BUDGET PER GPU (Ulysses + FSDP):
═════════════════════════════════════════════
Weight shard (FSDP): 17.5 GB
KV cache (1 KV head, 80 layers): 41.0 GB
Activations (Q,K,V for 125K): 2.6 GB
Buffers & overhead: ~2.0 GB
─────────────────────────────────────────
Total: 63.1 GB
A100 capacity: 80.0 GB
Headroom: 16.9 GB ✓
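The budget is a straight sum. A sketch using the rounded figures from the table above:

```python
# Per-GPU memory budget for Ulysses + FSDP (values from this series, GB).
budget = {
    "weights (FSDP shard)": 17.5,
    "KV cache (1 KV head, 80 layers)": 41.0,
    "activations (Q,K,V for 125K)": 2.6,
    "buffers & overhead": 2.0,
}
total = sum(budget.values())
print(round(total, 1), "GB used,", round(80 - total, 1), "GB headroom")
```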
Unlike Ring AllGather (where each GPU only talks to its neighbors), All-to-All requires every GPU to send data to every other GPU simultaneously:
All-to-All communication pattern:
- Every GPU sends to every other GPU
- Requires full bisection bandwidth
- Works well within a single node (NVLink)
- Becomes expensive across nodes
WITHIN A NODE (8 GPUs, NVLink):
Bandwidth: 300 GB/s per direction
All-to-All time per layer: ~13.5 ms
Latency: ~0.5 μs per message
ACROSS NODES (InfiniBand):
Bandwidth: ~25 GB/s per direction (HDR InfiniBand)
All-to-All time per layer: ~13.5 ms × (300/25) ≈ 162 ms
Latency: ~1-5 μs per message
80 layers: 162 ms × 80 = 13 seconds (vs 1.1 seconds intra-node)
That's 12× slower!
This is why Ulysses is typically used within a single NVLink-connected node. For scaling beyond one node, Ring Attention (Part 4) is a better fit — it only communicates with neighbors, not all GPUs.
| Concept | What it does | Key insight |
|---|---|---|
| Sequence Sharding | Split tokens 1/N per GPU | Reduces activation memory from 20.5 GB to 2.6 GB |
| All-to-All | Redistribute seq↔head split | Pure shuffle — no data created or destroyed |
| Column Split | Split weight columns | Produces partial output — just concatenate |
| Row Split | Split weight rows | Produces partial sums — needs AllReduce |
| Ulysses vs TP | All-to-All vs AllReduce | 14× less communication for long sequences |
| Number to remember | Value |
|---|---|
| Activation memory per GPU | ~2.6 GB (down from 20.5 GB) |
| All-to-All time per layer | ~13.5 ms (two All-to-Alls) |
| Total Ulysses comm (80 layers) | ~1.1 seconds |
| TP AllReduce equivalent | ~15.3 seconds (14× worse) |
| Max GPUs (KV head limited) | 8 (due to GQA) |
| KV cache per GPU | ~41 GB |
| Part | What it covers |
|---|---|
| ← Part 2: Weight Sharding (FSDP) | How we distributed 140 GB of weights |
| Part 4: Ring Attention | Distributed attention without gathering the full sequence — scaling beyond head count |
| Part 5: Putting It Together | The complete USP picture: FSDP + Ulysses + Ring Attention |
Weights are sharded. Sequence is sharded. Now let's shard the attention itself.