Research
Future leakage in block-quantized attention
Simple and fast Rust deriving using macro_rules
Speculative Decoding with Blockwise Sparse Attention
SPIRe: Boosting LLM Inference Throughput with Speculative Decoding
Prioritize values over keys: faster attention with many sparsely accessed value heads
Optimize for inference too, not just training FLOPs
Introducing seqax: A Simple and Efficient LLM Research Codebase