
High throughput chips for LLMs

SPIRe: Boosting LLM Inference Throughput with Speculative Decoding
April 08, 2025
Speculative decoding (SD) has been shown to reduce the latency of autoregressive decoding (AD) by 2-3× for small batch sizes. However, increasing throughput and therefore reducing the cost per token requires decoding with large batch sizes. Recent work shows that SD can accelerate decoding with large batch sizes too if the context is sufficiently long and the draft model’s KV cache is sparse. We i… Continue reading
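For context, here is a minimal sketch of the basic speculative decoding loop that work like SPIRe builds on: a cheap draft model proposes a few tokens, and the large target model verifies them in a single forward pass. The greedy acceptance rule and the function names (`draft_logits_fn`, `target_logits_fn`) are illustrative assumptions for this sketch, not SPIRe's actual algorithm.

```python
import numpy as np

def speculative_decode_step(target_logits_fn, draft_logits_fn, prefix, k=4):
    """One round of greedy speculative decoding (minimal sketch).

    draft_logits_fn / target_logits_fn map a token list to per-position
    next-token logits of shape [len(seq), vocab]. Both are placeholders.
    """
    # 1. Draft model proposes k tokens autoregressively (cheap).
    draft = list(prefix)
    proposed = []
    for _ in range(k):
        logits = draft_logits_fn(draft)[-1]
        tok = int(np.argmax(logits))          # greedy draft for simplicity
        proposed.append(tok)
        draft.append(tok)

    # 2. Target model scores the prefix plus all proposals in one forward pass.
    target_logits = target_logits_fn(list(prefix) + proposed)

    # 3. Accept the longest run of proposals the target agrees with.
    accepted = []
    for i, tok in enumerate(proposed):
        # Logits at position p predict the token at position p + 1.
        target_tok = int(np.argmax(target_logits[len(prefix) - 1 + i]))
        if target_tok == tok:
            accepted.append(tok)
        else:
            accepted.append(target_tok)       # target's correction ends the round
            break
    else:
        # All k proposals accepted; the target also yields one bonus token.
        accepted.append(int(np.argmax(target_logits[-1])))
    return accepted
```

Each round therefore emits between 1 and k+1 tokens for a single target-model pass, which is where the latency win at small batch sizes comes from.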

Prioritize values over keys: faster attention with many sparsely accessed value heads
April 08, 2025
During Transformer decoding, KV cache size and memory bandwidth requirements can limit overall throughput. Multi Query Attention is a powerful technique to mitigate this, but some models such as the Llama family have not deployed it due to quality concerns, opting for the less aggressive Grouped Query Attention instead. An alternative approach is to sparsely access the attention values, but this ha… Continue reading
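To see why KV cache bandwidth is the bottleneck, here is a back-of-envelope sketch comparing per-token KV cache size under full multi-head, grouped-query, and multi-query attention. The Llama-70B-like dimensions (80 layers, 64 query heads, head dimension 128, fp16 cache) are illustrative assumptions, not taken from the post.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache written per token (and re-read every decode step):
    2 tensors (K and V) * layers * kv_heads * head_dim * dtype size."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Illustrative 70B-class dimensions (assumed for this sketch).
n_layers, head_dim = 80, 128

mha = kv_cache_bytes_per_token(n_layers, n_kv_heads=64, head_dim=head_dim)  # full multi-head
gqa = kv_cache_bytes_per_token(n_layers, n_kv_heads=8,  head_dim=head_dim)  # grouped-query
mqa = kv_cache_bytes_per_token(n_layers, n_kv_heads=1,  head_dim=head_dim)  # multi-query

print(f"MHA: {mha/1e6:.2f} MB/token, GQA: {gqa/1e6:.2f} MB/token, MQA: {mqa/1e6:.3f} MB/token")
# -> roughly 2.62 MB vs 0.33 MB vs 0.04 MB per token of context
```

Every decode step streams the whole cache through memory, so shrinking it by 8x (GQA) or 64x (MQA) translates almost directly into higher decoding throughput at long contexts.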

Optimize for inference too, not just training FLOPs
January 08, 2025
We discuss why it is important to balance training and inference costs when selecting LLM architectures, and strategies for doing so. Continue reading
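As a rough illustration of why inference can rival or exceed training cost, here is a sketch using the common approximations of ~6ND FLOPs for training and ~2N FLOPs per generated token at inference. The model size and token counts below are assumed for illustration, not figures from the post.

```python
def total_flops(n_params, train_tokens, inference_tokens):
    """Standard approximations: ~6*N*D FLOPs for training, ~2*N FLOPs per
    decoded token at inference (ignoring long-context attention costs)."""
    train = 6 * n_params * train_tokens
    infer = 2 * n_params * inference_tokens
    return train, infer

# Assumed scenario: a 70B-parameter model trained on 2T tokens, then used
# to serve 10T tokens over its deployment lifetime.
train, infer = total_flops(70e9, 2e12, 10e12)
print(f"training: {train:.2e} FLOPs, inference: {infer:.2e} FLOPs, "
      f"inference share: {infer / (train + infer):.0%}")
# -> inference accounts for well over half of lifetime compute in this scenario
```

Under assumptions like these, an architecture that is slightly more expensive to train but cheaper to serve can easily win on total cost.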

Introducing seqax: A Simple and Efficient LLM Research Codebase
May 06, 2024
We’re excited to announce seqax, a research-focused LLM codebase that is simple and efficient, and performs well on up to 100 GPUs or TPUs. Everything you need to edit, from the math to the parallelism to the memory footprint, is there in 500 lines of JAX code. Continue reading