Future leakage in block-quantized attention

Quantizing attention improves efficiency on two fronts: compute throughput is higher, and fewer bytes are loaded per key/value. However, training with block-quantized attention can break causal modeling. We present a fix that enables training with MXFP4 in both attention and the attention gradient.

Causal modeling

In causal language modeling, the final logits at position $i$ must depend only on tokens at positions $\le i$. Future leakage occurs when information from positions $> i$ can influence the logits at position $i$. It poses an issue because it causes a skew between training and decoding. In typical setups, the causal mask prevents leakage in attention. But block quantization can introduce a subtle new path for future leakage.

Block quantization

Modern accelerators require using block-quantized matrix multiplications for the highest throughput. To use these instructions to compute $A \times B$, the row vectors of $A$ and the column vectors of $B$ must be split into blocks of size $k$ and quantized. The specific value of $k$ depends on the format being used. For example, the microscaling formats use $k = 32$.

Within a block, let $x_0, x_1, \ldots, x_{k-1}$ denote the precise (unquantized) elements. When quantizing, we approximate each element $x_i$ with a quantized value $q_i$ and a scale $s$ shared across the block:

$$
x_i \approx q_i \cdot s
$$

There are various approaches to selecting $s$ with different tradeoffs. But in general they all choose $s$ as a function of all elements in the block.

Given $s$, the quantized elements are:

$$
q_i = \text{round}\left(\frac{x_i}{s}\right)
$$

With this procedure, $s$ and consequently $q_i$ depend on all the pre-quantized elements in the block. So if the quantization block spans different token positions, quantization enables the later tokens in the block to influence earlier tokens.
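
As a concrete illustration, here is a minimal sketch of shared-scale block quantization in numpy, assuming maxabs scale selection and round-to-nearest onto a grid whose largest magnitude is 6 (the real MXFP4 element grid is non-uniform, but the dependence of $s$ and $q_i$ on the whole block is the same):

```python
import numpy as np

K = 32       # block size used by the microscaling formats
QMAX = 6.0   # assumed largest representable magnitude of the element grid

def quantize_block(x):
    """Quantize one block of K elements with a single shared scale s."""
    s = np.max(np.abs(x)) / QMAX   # s is a function of *all* elements in the block
    q = np.round(x / s)            # q_i = round(x_i / s)
    return q, s

# If the block spans token positions, every q_i depends on every token in the
# block, including tokens that come after it.
x = np.random.randn(K)             # one block of pre-quantized elements
q, s = quantize_block(x)
x_hat = q * s                      # x_i ≈ q_i * s
```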

Quantizing attention

Causal attention for a single head takes queries $\mathbf{Q}$, keys $\mathbf{K}$, and values $\mathbf{V}$ as inputs, and produces:

$$
\begin{array}{lcl}
\mathbf{P} & = & \text{softmax}(\mathbf{Q} \times \mathbf{K}^T + \mathbf{M}) \\
\text{output} & = & \mathbf{P} \times \mathbf{V}
\end{array}
$$

where $\mathbf{M}$ is the causal mask.
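
For reference, a direct (unquantized) numpy implementation of these equations for a single head could look like the sketch below; the usual $1/\sqrt{d_\text{head}}$ scaling, not shown in the formula above, is included for concreteness:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head causal attention: softmax(Q K^T + M) V."""
    T, d_head = Q.shape
    scores = (Q @ K.T) / np.sqrt(d_head)
    M = np.where(np.tri(T, dtype=bool), 0.0, -np.inf)    # 0 on/below the diagonal
    scores = scores + M
    P = np.exp(scores - scores.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)                    # row-wise softmax
    return P @ V
```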

Our goal is to use block-quantized matrix multiplications for $\mathbf{Q} \times \mathbf{K}^T$ and $\mathbf{P} \times \mathbf{V}$. So $\mathbf{Q}$ and $\mathbf{K}$ need to be quantized in blocks formed along the head dimension, while $\mathbf{P}$ and $\mathbf{V}$ need to be quantized in blocks formed along different token positions.

Quantizing $\mathbf{P}$ is safe despite blocking along token positions, since the causal mask zeros out future probabilities. However, the quantized $\mathbf{V}$ at position $j$ can depend on values at positions $> j$, which can cause future leakage.

When does quantized $\mathbf{V}$ cause future leakage?

Consider query position $i$ and value position $j$, with block indices:

$$
b_i = \left\lfloor \frac{i}{k} \right\rfloor, \quad b_j = \left\lfloor \frac{j}{k} \right\rfloor
$$

Leakage occurs only when query position $i$ and value position $j$ share a quantization block ($b_i = b_j$). Working through the three cases (sketched in code after this list):

  • $b_i = b_j$ (block-diagonal): query $i$ attends to value $j$ if $j \le i$. But the quantized value at $j$ is computed from all positions in the block (including positions greater than $i$). This breaks causality, since the attention output at position $i$ can depend on positions greater than $i$.

  • $b_i > b_j$ (past blocks): all positions in block $b_j$ precede the first position in block $b_i$. No leakage.

  • $b_i < b_j$ (future blocks): the causal mask zeros $\mathbf{P}[i,j]$ for $j > i$. No leakage.
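
The case analysis above can be condensed into a small predicate. The sketch below assumes 0-indexed positions and a hypothetical BLOCK constant for the quantization block size along the token dimension; it is only meant to make the three cases concrete:

```python
BLOCK = 32   # quantization block size along the token dimension

def leaks_future_information(i: int, j: int, seq_len: int) -> bool:
    """Can quantizing V in token blocks let a position > i influence the
    attention output at i via the product P[i, j] * V_quantized[j]?"""
    if j > i:                       # future positions: zeroed by the causal mask
        return False
    b_i, b_j = i // BLOCK, j // BLOCK
    if b_i > b_j:                   # past blocks: every position in b_j is <= i
        return False
    # Block-diagonal case (b_i == b_j): V's shared scale for this block is computed
    # from every position in the block, so leakage occurs if the block extends past i.
    last_position_in_block = min((b_j + 1) * BLOCK, seq_len) - 1
    return last_position_in_block > i
```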

Solution and validation

Leakage only occurs when the $i$-th block-diagonal tile of $\mathbf{P}$ is multiplied with the $i$-th quantized block of $\mathbf{V}$. So we can prevent future leakage by using unquantized $\mathbf{P}$ and $\mathbf{V}$ when multiplying the $i$-th tile of $\mathbf{P}$ with the $i$-th block of $\mathbf{V}$, while using block-quantized matrix multiplications everywhere else.
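
A minimal numpy sketch of this scheme for the $\mathbf{P} \times \mathbf{V}$ product is shown below. The `fake_quant` and `mx_matmul` helpers are stand-ins for a real MXFP4 kernel (they quantize and immediately dequantize in float, and assume dimensions that are multiples of the block size); the point is only where quantization is and is not applied:

```python
import numpy as np

BLOCK, QMAX = 32, 6.0   # MX block size; assumed max magnitude of the element grid

def fake_quant(X, axis):
    """Fake-quantize X along `axis` in blocks of BLOCK with shared maxabs scales.
    Assumes X.shape[axis] is a multiple of BLOCK."""
    Xm = np.moveaxis(X, axis, -1)
    blocks = Xm.reshape(*Xm.shape[:-1], -1, BLOCK)
    s = np.abs(blocks).max(axis=-1, keepdims=True) / QMAX
    s = np.where(s == 0, 1.0, s)                     # avoid dividing by zero
    deq = np.round(blocks / s) * s
    return np.moveaxis(deq.reshape(Xm.shape), -1, axis)

def mx_matmul(A, B):
    """Stand-in for a block-quantized matmul: blocks along the inner dimension."""
    return fake_quant(A, axis=1) @ fake_quant(B, axis=0)

def pv_without_future_leakage(P, V):
    """P @ V with quantized matmuls everywhere except the block-diagonal tiles."""
    out = mx_matmul(P, V)
    for b in range(0, P.shape[0], BLOCK):
        tile = slice(b, b + BLOCK)
        # Swap the quantized block-diagonal contribution for an unquantized one.
        out[tile] += P[tile, tile] @ V[tile] - mx_matmul(P[tile, tile], V[tile])
    return out
```

A real kernel would accumulate the unquantized block-diagonal tile directly rather than patching it in afterwards; the sketch only makes the mixed-precision structure explicit.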

To validate this fix, we trained a pair of 1B-parameter models on the C4 dataset, with MXFP4 for attention and the attention gradient. Both models share the following configuration:

Parameter                Value
Layers                   8
d_model                  2048
d_head                   128
d_ff                     16384
Attention heads          16
Context length           1024
Scale selection          maxabs calibration
Quantized ops            matrix multiplies in attention + matrix multiplies in attention gradient
Forward pass rounding    round-to-nearest
Backward pass rounding   stochastic rounding

We trained a “Leaky” model that used MXFP4 for all of $\mathbf{P} \times \mathbf{V}$, and a “Fixed” model that used the proposed solution.

The Fixed model remained well behaved throughout training. The Leaky model, however, showed training dynamics associated with future leakage: its gradient norms grew rapidly, and then its loss started to improve suspiciously fast.

To confirm that the Leaky model was only doing better because of future leakage, we evaluated loss on a held-out set in two modes. In parallel mode, we ran a single prefill per sequence. In autoregressive mode, we fed the sequence one token at a time and averaged the loss for predicting each next token.
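
Concretely, the two modes can be evaluated along the lines of the sketch below, where `model(tokens)` is a hypothetical function returning per-position next-token log-probabilities. A truly causal model gives the same number in both modes; a gap indicates that parallel mode is benefiting from future tokens:

```python
import numpy as np

def parallel_loss(model, tokens):
    """One prefill over the full sequence; the quantizer sees every token at once."""
    logprobs = model(tokens)                          # shape: (len(tokens), vocab)
    nll = [-logprobs[t, tokens[t + 1]] for t in range(len(tokens) - 1)]
    return float(np.mean(nll))

def autoregressive_loss(model, tokens):
    """Score each token given only its prefix, as during decoding."""
    nll = []
    for t in range(1, len(tokens)):
        logprobs = model(tokens[:t])                  # only past tokens are visible
        nll.append(-logprobs[-1, tokens[t]])
    return float(np.mean(nll))
```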

Model   Parallel   Autoregressive   Gap
Leaky   2.56       2.66             +0.10
Fixed   2.64       2.64             0.00

Although the Leaky model’s parallel loss was lower than the Fixed model’s, its autoregressive loss was worse. The gap between the Leaky model’s parallel and autoregressive losses indicates that it relied on future signal. The Fixed model had no gap, demonstrating that our solution works.

In parallel mode, the Leaky model was using quantization error to encode information about upcoming tokens. Padding a block with zeros (as we do in autoregressive mode) does not change the selected scale. But adding outliers to the end of the block (which can happen in parallel mode) makes the scale larger, and a larger scale can cause earlier values to underflow.

Figure: example where a future outlier causes the second value in the block to underflow.

The model can then infer information about upcoming tokens from whether earlier values in a block underflowed.
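
A small numeric illustration of this mechanism, assuming maxabs calibration onto a grid whose largest magnitude is 6 and round-to-nearest onto an integer grid as a stand-in for the real non-uniform FP4 grid:

```python
import numpy as np

QMAX = 6.0   # assumed largest magnitude representable by the element format

def fake_quant_block(x):
    """Shared maxabs scale, then round each element onto the grid."""
    s = np.max(np.abs(x)) / QMAX
    return np.round(x / s) * s

block = np.array([1.0, 0.5, -0.75, 0.25])   # values seen so far
zero_padded = np.append(block, 0.0)         # autoregressive mode: pad with zeros
with_outlier = np.append(block, 48.0)       # parallel mode: a future outlier arrives

print(fake_quant_block(zero_padded)[1])     # 0.5 -- scale unchanged, value survives
print(fake_quant_block(with_outlier)[1])    # 0.0 -- larger scale, value underflows
```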

Final thoughts

We hope the methodology here provides a starting point for researchers interested in quantizing attention during training. In future work, we aim to demonstrate that quantized attention can match the quality of float baselines while still providing end-to-end speedups.

If these research problems sound interesting, consider working with us!