Future leakage in block-quantized attention

Quantizing attention improves efficiency on two fronts: compute throughput is higher, and fewer bytes are loaded per key/value. However, training with block-quantized attention can break causal modeling. We present a fix that enables training with MXFP4 in both attention and the attention gradient.

Causal modeling

In causal language modeling, the final logits at position $i$ must depend only on tokens at positions $\le i$. Future leakage occurs when information from positions $> i$ can influence the logits at position $i$. It poses an issue because it causes a skew between training and decoding. In typical setups, the causal mask prevents leakage in attention. But block quantization can introduce a subtle new path for future leakage.

Block quantization

Modern accelerators require using block-quantized matrix multiplications for the highest throughput. To use these instructions to compute $A \times B$, the row vectors of $A$ and the column vectors of $B$ must be split into blocks of size $k$ and quantized. The specific value of $k$ depends on the format being used. For example, the microscaling formats use $k = 32$.

Within a block, let $x_0, x_1, \ldots, x_{k-1}$ denote the precise (unquantized) elements. When quantizing, we approximate each element $x_i$ with a quantized value $q_i$ and a scale $s$ shared across the block:

$$
x_i \approx q_i \cdot s
$$

There are various approaches to selecting $s$ with different tradeoffs. But in general they all choose $s$ as a function of all elements in the block.

Given $s$, the quantized elements are:

$$
q_i = \text{round}\left(\frac{x_i}{s}\right)
$$

With this procedure, $s$ and consequently $q_i$ depend on all the pre-quantized elements in the block. So if the quantization block spans different token positions, quantization enables the later tokens in the block to influence earlier tokens.
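
As a concrete illustration, here is a minimal sketch of shared-scale block quantization in numpy, assuming maxabs scale selection and round-to-nearest onto a grid whose largest magnitude is 6 (the real MXFP4 element grid is non-uniform, but the dependence of $s$ and $q_i$ on the whole block is the same):

```python
import numpy as np

K = 32       # block size used by the microscaling formats
QMAX = 6.0   # assumed largest representable magnitude of the element grid

def quantize_block(x):
    """Quantize one block of K elements with a single shared scale s."""
    s = np.max(np.abs(x)) / QMAX   # s is a function of *all* elements in the block
    q = np.round(x / s)            # q_i = round(x_i / s)
    return q, s

# If the block spans token positions, every q_i depends on every token in the
# block, including tokens that come after it.
x = np.random.randn(K)             # one block of pre-quantized elements
q, s = quantize_block(x)
x_hat = q * s                      # x_i ≈ q_i * s
```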

Quantizing attention

Causal attention for a single head takes queries $\mathbf{Q}$, keys $\mathbf{K}$, and values $\mathbf{V}$ as inputs, and produces:

$$
\begin{array}{lcl}
\mathbf{P} & = & \text{softmax}(\mathbf{Q} \times \mathbf{K}^T + \mathbf{M}) \\
\text{output} & = & \mathbf{P} \times \mathbf{V}
\end{array}
$$

where $\mathbf{M}$ is the causal mask.
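
For reference, a direct (unquantized) numpy implementation of these equations for a single head could look like the sketch below; the usual $1/\sqrt{d_\text{head}}$ scaling, not shown in the formula above, is included for concreteness:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head causal attention: softmax(Q K^T + M) V."""
    T, d_head = Q.shape
    scores = (Q @ K.T) / np.sqrt(d_head)
    M = np.where(np.tri(T, dtype=bool), 0.0, -np.inf)    # 0 on/below the diagonal
    scores = scores + M
    P = np.exp(scores - scores.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)                    # row-wise softmax
    return P @ V
```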

Our goal is to use block-quantized matrix multiplications for $\mathbf{Q} \times \mathbf{K}^T$ and $\mathbf{P} \times \mathbf{V}$. So $\mathbf{Q}$ and $\mathbf{K}$ need to be quantized in blocks formed along the head dimension, while $\mathbf{P}$ and $\mathbf{V}$ need to be quantized in blocks formed along different token positions.

Quantizing $\mathbf{P}$ is safe despite blocking along token positions, since the causal mask zeros out future probabilities. However, the quantized $\mathbf{V}$ at position $j$ can depend on values at positions $> j$, which can cause future leakage.

When does quantized $\mathbf{V}$ cause future leakage?

Consider query position $i$ and value position $j$, with block indices:

$$
b_i = \left\lfloor \frac{i}{k} \right\rfloor, \quad b_j = \left\lfloor \frac{j}{k} \right\rfloor
$$

Leakage occurs only when query position $i$ and value position $j$ share a quantization block ($b_i = b_j$). Working through the three cases (sketched in code after this list):

  • $b_i = b_j$ (block-diagonal): query $i$ attends to value $j$ if $j \le i$. But the quantized value at $j$ is computed from all positions in the block (including positions greater than $i$). This breaks causality, since the attention output at position $i$ can depend on positions greater than $i$.

  • $b_i > b_j$ (past blocks): all positions in block $b_j$ precede the first position in block $b_i$. No leakage.

  • $b_i < b_j$ (future blocks): the causal mask zeros $\mathbf{P}[i,j]$ for $j > i$. No leakage.
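
The case analysis above can be condensed into a small predicate. The sketch below assumes 0-indexed positions and a hypothetical BLOCK constant for the quantization block size along the token dimension; it is only meant to make the three cases concrete:

```python
BLOCK = 32   # quantization block size along the token dimension

def leaks_future_information(i: int, j: int, seq_len: int) -> bool:
    """Can quantizing V in token blocks let a position > i influence the
    attention output at i via the product P[i, j] * V_quantized[j]?"""
    if j > i:                       # future positions: zeroed by the causal mask
        return False
    b_i, b_j = i // BLOCK, j // BLOCK
    if b_i > b_j:                   # past blocks: every position in b_j is <= i
        return False
    # Block-diagonal case (b_i == b_j): V's shared scale for this block is computed
    # from every position in the block, so leakage occurs if the block extends past i.
    last_position_in_block = min((b_j + 1) * BLOCK, seq_len) - 1
    return last_position_in_block > i
```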

Solution and validation

Leakage only occurs when the $i$-th block-diagonal tile of $\mathbf{P}$ is multiplied with the $i$-th quantized block of $\mathbf{V}$. So we can prevent future leakage by using unquantized $\mathbf{P}$ and $\mathbf{V}$ when multiplying the $i$-th tile of $\mathbf{P}$ with the $i$-th block of $\mathbf{V}$, while using block-quantized matrix multiplications everywhere else.
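
A minimal numpy sketch of this scheme for the $\mathbf{P} \times \mathbf{V}$ product is shown below. The `fake_quant` and `mx_matmul` helpers are stand-ins for a real MXFP4 kernel (they quantize and immediately dequantize in float, and assume dimensions that are multiples of the block size); the point is only where quantization is and is not applied:

```python
import numpy as np

BLOCK, QMAX = 32, 6.0   # MX block size; assumed max magnitude of the element grid

def fake_quant(X, axis):
    """Fake-quantize X along `axis` in blocks of BLOCK with shared maxabs scales.
    Assumes X.shape[axis] is a multiple of BLOCK."""
    Xm = np.moveaxis(X, axis, -1)
    blocks = Xm.reshape(*Xm.shape[:-1], -1, BLOCK)
    s = np.abs(blocks).max(axis=-1, keepdims=True) / QMAX
    s = np.where(s == 0, 1.0, s)                     # avoid dividing by zero
    deq = np.round(blocks / s) * s
    return np.moveaxis(deq.reshape(Xm.shape), -1, axis)

def mx_matmul(A, B):
    """Stand-in for a block-quantized matmul: blocks along the inner dimension."""
    return fake_quant(A, axis=1) @ fake_quant(B, axis=0)

def pv_without_future_leakage(P, V):
    """P @ V with quantized matmuls everywhere except the block-diagonal tiles."""
    out = mx_matmul(P, V)
    for b in range(0, P.shape[0], BLOCK):
        tile = slice(b, b + BLOCK)
        # Swap the quantized block-diagonal contribution for an unquantized one.
        out[tile] += P[tile, tile] @ V[tile] - mx_matmul(P[tile, tile], V[tile])
    return out
```

A real kernel would accumulate the unquantized block-diagonal tile directly rather than patching it in afterwards; the sketch only makes the mixed-precision structure explicit.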

To validate this fix, we trained a pair of 1B-parameter models on the C4 dataset, with MXFP4 for attention and the attention gradient. Both models share the following configuration:

Parameter                Value
Layers                   8
d_model                  2048
d_head                   128
d_ff                     16384
Attention heads          16
Context length           1024
Scale selection          maxabs calibration
Quantized ops            matrix multiplies in attention + matrix multiplies in attention gradient
Forward pass rounding    round-to-nearest
Backward pass rounding   stochastic rounding

We trained a “Leaky” model that used MXFP4 for all of $\mathbf{P} \times \mathbf{V}$, and a “Fixed” model that used the proposed solution.

The Fixed model remained well behaved throughout training. The Leaky model, however, showed training dynamics associated with future leakage: its gradient norms grew rapidly, and then its loss started to improve suspiciously fast.

To confirm that the Leaky model was only doing better because of future leakage, we evaluated loss on a held-out set in two modes. In parallel mode, we ran a single prefill per sequence. In autoregressive mode, we fed the sequence one token at a time and averaged the loss for predicting each next token.
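
Concretely, the two modes can be evaluated along the lines of the sketch below, where `model(tokens)` is a hypothetical function returning per-position next-token log-probabilities. A truly causal model gives the same number in both modes; a gap indicates that parallel mode is benefiting from future tokens:

```python
import numpy as np

def parallel_loss(model, tokens):
    """One prefill over the full sequence; the quantizer sees every token at once."""
    logprobs = model(tokens)                          # shape: (len(tokens), vocab)
    nll = [-logprobs[t, tokens[t + 1]] for t in range(len(tokens) - 1)]
    return float(np.mean(nll))

def autoregressive_loss(model, tokens):
    """Score each token given only its prefix, as during decoding."""
    nll = []
    for t in range(1, len(tokens)):
        logprobs = model(tokens[:t])                  # only past tokens are visible
        nll.append(-logprobs[-1, tokens[t]])
    return float(np.mean(nll))
```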

Model   Parallel   Autoregressive   Gap
Leaky   2.56       2.66             +0.10
Fixed   2.64       2.64             0.00

Although the Leaky model’s parallel loss was lower than the Fixed model’s, its autoregressive loss was worse. The gap between the Leaky model’s parallel and autoregressive losses indicates that it relied on future signal. The Fixed model had no gap, demonstrating that our solution works.

In parallel mode, the Leaky model was using quantization error to encode information about upcoming tokens. Padding a block with zeros (as we do in autoregressive mode) does not change the selected scale. But adding outliers to the end of the block (which can happen in parallel mode) makes the scale larger, and a larger scale can cause earlier values to underflow.

Figure: example where a future outlier causes the second value in the block to underflow.

The model can then infer information about upcoming tokens from whether earlier values in a block underflowed.
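
A small numeric illustration of this mechanism, assuming maxabs calibration onto a grid whose largest magnitude is 6 and round-to-nearest onto an integer grid as a stand-in for the real non-uniform FP4 grid:

```python
import numpy as np

QMAX = 6.0   # assumed largest magnitude representable by the element format

def fake_quant_block(x):
    """Shared maxabs scale, then round each element onto the grid."""
    s = np.max(np.abs(x)) / QMAX
    return np.round(x / s) * s

block = np.array([1.0, 0.5, -0.75, 0.25])   # values seen so far
zero_padded = np.append(block, 0.0)         # autoregressive mode: pad with zeros
with_outlier = np.append(block, 48.0)       # parallel mode: a future outlier arrives

print(fake_quant_block(zero_padded)[1])     # 0.5 -- scale unchanged, value survives
print(fake_quant_block(with_outlier)[1])    # 0.0 -- larger scale, value underflows
```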

Final thoughts

We hope the methodology here provides a starting point for researchers interested in quantizing attention during training. In future work, we aim to demonstrate that quantized attention can match the quality of float baselines while still providing end-to-end speedups.

If these research problems sound interesting, consider working with us!