Future leakage in block-quantized attention
Quantizing attention improves efficiency on two fronts: compute throughput is higher, and fewer bytes are loaded per key/value. However, training with block-quantized attention can break causal modeling. We present a fix that enables training with MXFP4 in both attention and the attention gradient.
Causal modeling
In causal language modeling, the logits at position $t$ must depend only on tokens at positions $\le t$. Future leakage is when information from positions $> t$ influences the logits at position $t$. It is a problem because it creates a skew between training and decoding: signal that is available when the full sequence is processed in parallel is missing when decoding one token at a time. In typical setups, causal masks prevent leakage in attention. But block quantization can introduce a subtle new path for future leakage.
Block quantization
Modern accelerators require block-quantized matrix multiplications to reach their highest throughput. To use these instructions to compute a product $AB$, the row vectors of $A$ and the column vectors of $B$ must be split into blocks of size $b$ and quantized. The specific value of $b$ depends on the format being used; for example, the microscaling (MX) formats use $b = 32$.
Within a block, let $x_1, \dots, x_b$ denote the precise (unquantized) elements. When quantizing, we approximate each element $x_i$ with a quantized value $q_i$ and a scale $s$ shared across the block:

$$x_i \approx s \cdot q_i$$
There are various approaches to selecting $s$, with different tradeoffs, but in general $s$ is chosen as a function of all the elements in the block. For example, maxabs calibration sets $s$ to the block's largest absolute value divided by the largest magnitude representable in the quantized format.
Given $s$, the quantized elements are

$$q_i = \operatorname{round}\!\left(\frac{x_i}{s}\right),$$

where $\operatorname{round}$ maps to a representable value in the quantized format.
With this procedure, $s$, and consequently every $q_i$, depends on all of the pre-quantized elements in the block. So if a quantization block spans different token positions, quantization allows the later tokens in the block to influence the earlier ones.
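As a concrete illustration, here is a minimal NumPy sketch of maxabs block quantization. The grid values are the non-negative magnitudes representable in an FP4 (E2M1) element, and `quantize_block` is a toy helper of our own, not the training kernel. Changing only the last element of a block changes the dequantized values of the earlier elements.

```python
# A minimal sketch of maxabs block quantization. FP4_GRID holds the non-negative
# magnitudes representable in an FP4 (E2M1) element; quantize_block is a toy
# helper, not the actual training kernel.
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x):
    """Quantize one block with a shared maxabs scale; returns (dequantized, scale)."""
    s = np.abs(x).max() / FP4_GRID[-1]       # the scale is a function of the whole block
    if s == 0.0:
        return np.zeros_like(x), 0.0
    mags = np.abs(x) / s
    # Round each magnitude to the nearest representable value, then restore the sign.
    idx = np.abs(mags[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(x) * FP4_GRID[idx] * s, s

block = np.array([0.3, 0.2, 0.1, 0.4])
block_with_outlier = np.array([0.3, 0.2, 0.1, 8.0])      # only the LAST element differs

print(quantize_block(block)[0][:3])               # dequantized first three elements
print(quantize_block(block_with_outlier)[0][:3])  # they change, because the scale changed
```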
Quantizing attention
Causal attention for a single head takes queries $Q$, keys $K$, and values $V$ as inputs, and produces

$$O = PV, \qquad P = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d}} + M\right),$$

where $d$ is the head dimension and $M$ is the causal mask, with $M_{ij} = 0$ for $j \le i$ and $M_{ij} = -\infty$ for $j > i$.
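For reference, here is a plain (unquantized) NumPy version of this computation, with the shapes and mask convention above (the helper name is ours):

```python
# Reference single-head causal attention, used by the later sketches.
import numpy as np

def causal_attention(Q, K, V):
    """Single-head causal attention; Q, K, V have shape (n, d)."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    # Causal mask M: 0 on and below the diagonal, -inf strictly above it.
    M = np.where(np.arange(n)[None, :] > np.arange(n)[:, None], -np.inf, 0.0)
    scores = scores + M
    P = np.exp(scores - scores.max(axis=-1, keepdims=True))
    P = P / P.sum(axis=-1, keepdims=True)        # row-wise softmax
    return P @ V
```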
Our goal is to use block-quantized matrix multiplications for $QK^\top$ and $PV$. So $Q$ and $K$ need to be quantized in blocks formed along the head dimension, while $P$ and $V$ need to be quantized in blocks formed along token positions.
Quantizing $P$ is safe despite blocking along token positions, since the causal mask zeros out future probabilities. However, the quantized $V$ at position $j$ can depend on values at positions $> j$, which can cause future leakage.
When does quantized $V$ cause future leakage?
Consider query position $i$ and value position $j$, with block indices

$$B_i = \lfloor i / b \rfloor, \qquad B_j = \lfloor j / b \rfloor.$$

Leakage can occur only when the query position and the value position fall in the same quantization block ($B_i = B_j$). To see why, consider the three cases:
$B_i = B_j$ (block-diagonal): query $i$ attends to value $j$ if $j \le i$. But the quantized value at $j$ is computed from all positions in the block, including positions greater than $i$. This breaks causality, since the attention output at position $i$ can depend on positions greater than $i$.

$B_j < B_i$ (past blocks): all positions in block $B_j$ precede the first position in block $B_i$, and therefore precede $i$. No leakage.

$B_j > B_i$ (future blocks): every position in block $B_j$ exceeds $i$, so the causal mask zeros $P_{ij}$ for all of them. No leakage.
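To make the case analysis concrete, here is a small empirical leak check that reuses `quantize_block` and `causal_attention` from the sketches above: quantize $V$ in token blocks, perturb a future token that shares a block with an earlier position, and watch the attention output at the earlier position move. The exact (unquantized) computation does not move at all.

```python
# A small empirical leak check, reusing quantize_block and causal_attention
# from the sketches above.
import numpy as np

def quantize_tokens_in_blocks(V, b=32):
    """Quantize each column of V in blocks of b consecutive token positions."""
    Vq = V.astype(float)
    for c in range(V.shape[1]):
        for start in range(0, V.shape[0], b):
            Vq[start:start + b, c] = quantize_block(V[start:start + b, c])[0]
    return Vq

rng = np.random.default_rng(0)
n, d, b = 64, 16, 32
Q, K, V = rng.normal(size=(3, n, d))

i, k = 5, 20                      # k > i, but both fall in the same 32-token block
V_perturbed = V.copy()
V_perturbed[k] += 100.0           # change a FUTURE token only

# Exact attention is causal: the output at position i does not move.
print(np.abs(causal_attention(Q, K, V)[i] -
             causal_attention(Q, K, V_perturbed)[i]).max())          # 0.0

# With V quantized in token blocks, the output at position i does move: leakage.
print(np.abs(causal_attention(Q, K, quantize_tokens_in_blocks(V, b))[i] -
             causal_attention(Q, K, quantize_tokens_in_blocks(V_perturbed, b))[i]).max())
```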
Solution and validation
Leakage only occurs when the $k$-th block-diagonal tile of $P$ is multiplied with the $k$-th quantized block of $V$. So we can prevent future leakage by using unquantized $P$ and $V$ when multiplying the $k$-th tile of $P$ with the $k$-th block of $V$, while using block-quantized matrix multiplications everywhere else.
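Here is a sketch of this fix for the $PV$ product, reusing the helpers above. It is an illustration of the idea rather than the production kernel: off-diagonal tiles use the block-quantized operands, and each block-diagonal tile falls back to unquantized $P$ and $V$.

```python
# A sketch of the fix: tile the PV product, use block-quantized operands for
# off-diagonal tiles and unquantized P, V for the block-diagonal tiles.
import numpy as np

def pv_without_leakage(P, V, b=32):
    n, d = V.shape
    Pq = quantize_tokens_in_blocks(P.T, b).T   # P's rows blocked along value positions
    Vq = quantize_tokens_in_blocks(V, b)       # V blocked along token positions
    out = np.zeros((n, d))
    for qs in range(0, n, b):                  # tile of query positions
        qe = min(qs + b, n)
        for vs in range(0, n, b):              # block of value positions
            ve = min(vs + b, n)
            if vs == qs:
                # Block-diagonal tile: the only place leakage can occur,
                # so use the unquantized operands here.
                out[qs:qe] += P[qs:qe, vs:ve] @ V[vs:ve]
            elif vs < qs:
                # Past block: block-quantized matmul is safe.
                out[qs:qe] += Pq[qs:qe, vs:ve] @ Vq[vs:ve]
            # vs > qs: future block, the masked P is all zeros there, so skip it.
    return out
```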
To validate this fix, we trained a pair of 1B-parameter models on the C4 dataset, with MXFP4 for attention and the attention gradient. Both models share the following configuration:
| Parameter | Value |
|---|---|
| Layers | 8 |
| d_model | 2048 |
| d_head | 128 |
| d_ff | 16384 |
| Attention heads | 16 |
| Context length | 1024 |
| Scale selection | maxabs calibration |
| Quantized ops | Matrix multiplies in attention and in the attention gradient |
| Forward pass rounding | Round-to-nearest |
| Backward pass rounding | Stochastic rounding |
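For completeness, the two rounding modes in the table can be sketched on the toy `FP4_GRID` from the first code example. `quantize_block` above already implements round-to-nearest; stochastic rounding instead picks one of the two neighboring grid points with probabilities chosen so the result is unbiased in expectation. This is a generic illustration, not the exact kernel.

```python
# A sketch of stochastic rounding onto the toy FP4_GRID defined earlier.
import numpy as np

def stochastic_round(mags, rng):
    """Round non-negative magnitudes onto FP4_GRID, unbiased in expectation."""
    mags = np.clip(mags, FP4_GRID[0], FP4_GRID[-1])
    hi = np.clip(np.searchsorted(FP4_GRID, mags), 1, len(FP4_GRID) - 1)
    lo = hi - 1
    p_up = (mags - FP4_GRID[lo]) / (FP4_GRID[hi] - FP4_GRID[lo])
    up = rng.random(mags.shape) < p_up         # round up with probability p_up
    return np.where(up, FP4_GRID[hi], FP4_GRID[lo])

rng = np.random.default_rng(0)
samples = stochastic_round(np.full(10_000, 2.4), rng)
print(samples.mean())    # ~2.4: unbiased, whereas round-to-nearest always gives 2.0
```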
We trained a “Leaky” model that used MXFP4 everywhere, including the block-diagonal tiles of $PV$, and a “Fixed” model that used the proposed solution.
The Fixed model remained well behaved throughout training. The Leaky model, however, showed training dynamics associated with future leakage: its gradient norms grew rapidly before the loss started to improve suspiciously fast.
To confirm that the Leaky model was only doing better because of future leakage, we evaluated the loss in two modes on a held-out set. In parallel mode, we ran prefill once per sequence and scored every position at once. In autoregressive mode, we re-ran the model one token at a time, as in decoding, and averaged the per-token loss.
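The two evaluation modes can be sketched as follows, where `model` is a hypothetical callable mapping a length-$t$ integer token array to logits of shape $(t, \text{vocab})$. For a truly causal, deterministic network the two losses agree; future leakage shows up as a gap between them.

```python
# Sketch of the two evaluation modes; `model` is a hypothetical stand-in.
import numpy as np

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of integer `targets` under row-wise softmax."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def parallel_loss(model, tokens):
    # One prefill over the whole sequence; every next-token prediction scored at once.
    logits = model(tokens)                              # shape (len(tokens), vocab)
    return cross_entropy(logits[:-1], tokens[1:])

def autoregressive_loss(model, tokens):
    # Re-run the model on each prefix and score only the next token,
    # exactly as the model is used when decoding.
    losses = [
        cross_entropy(model(tokens[:t])[-1:], tokens[t:t + 1])
        for t in range(1, len(tokens))
    ]
    return float(np.mean(losses))
```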
| Model | Parallel loss | Autoregressive loss | Gap |
|---|---|---|---|
| Leaky | 2.56 | 2.66 | +0.10 |
| Fixed | 2.64 | 2.64 | 0.00 |
Despite the Leaky model’s parallel loss being lower than the Fixed model’s, its autoregressive loss was worse. The Leaky model’s gap between parallel and autoregressive loss indicates that it was relying on future signal. The Fixed model had no gap, demonstrating that our solution works.
In parallel mode, the Leaky model was using quantization error to encode information about upcoming tokens. With maxabs calibration, padding a block with zeros (as happens in autoregressive mode) does not change the selected scale. But an outlier later in the block (which can happen in parallel mode) makes the scale larger, and a larger scale can cause earlier small values to underflow to zero.
Example where a future outlier causes the second value to underflow.
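A numeric version of that example, reusing `quantize_block` from the first sketch (toy numbers): the block holds two real values, followed either by zero padding (autoregressive mode) or by a future outlier (parallel mode).

```python
# Numeric sketch of the underflow example, reusing quantize_block from above.
import numpy as np

autoregressive_block = np.array([1.0, 0.1, 0.0, 0.0])   # zero padding: scale unchanged
parallel_block       = np.array([1.0, 0.1, 8.0, 0.0])   # future outlier: scale inflated

print(quantize_block(autoregressive_block)[0])  # second value survives (~0.08)
print(quantize_block(parallel_block)[0])        # second value underflows to 0.0
```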
The model can then use whether earlier values in a block underflowed to infer something about the upcoming tokens.
Final thoughts
We hope the methodology here provides a starting point for researchers interested in quantizing attention during training. In future work, we aim to demonstrate that quantized attention can match the quality of float baselines while still providing end-to-end speedups.
If these research problems sound interesting, consider working with us!