
IndexCache: A 1.82x Speedup for Long-Context Inference by Breaking the Sparse Attention Bottleneck

Researchers from Tsinghua University and Z.ai have released IndexCache, a novel optimizer for sparse attention. The technique exploits the redundancy in key-token selection across model layers by dividing them into full and shared layers, cutting up to 75% of redundant calculations. When processing 200,000-token contexts, IndexCache improves a model's time-to-first-token by 1.82x and throughput by 1.48x. The technique targets models with DSA-style architectures, such as DeepSeek and GLM, and is designed to solve the indexer computation bottleneck in long-context inference rather than traditional KV cache compression.


Processing contexts of up to 200,000 tokens is an expensive and slow ordeal for any large language model (LLM). The longer the context, the higher the cost and latency. Researchers from Tsinghua University and Z.ai have introduced a new technique called IndexCache that slashes up to 75% of redundant computations in sparse attention models, improving time-to-first-token (TTFT) by 1.82x and generation throughput by 1.48x when handling long texts.

The technique has been initially validated on the 744-billion-parameter GLM-5 model and is applicable to models using the DeepSeek Sparse Attention (DSA) architecture, such as the latest DeepSeek and GLM series. This means that enterprises deploying production-grade, long-context applications can now offer users a much faster and more responsive experience.

The 'Quadratic Curse' of Self-Attention

The core capability of large language models stems from the self-attention mechanism. In simple terms, to predict the next token, the model must calculate the relationship between the current token and all preceding ones. While powerful, this mechanism has a critical flaw: its computational complexity and memory consumption grow quadratically (O(n²)) with the sequence length.

When applications need to process long documents, execute multi-step AI agent workflows, or perform long chain-of-thought reasoning, this 'quadratic curse' causes inference speeds to plummet and computational and memory costs to soar. Sparse attention was developed precisely to solve this problem. Instead of having each token attend to all previous tokens, it optimizes the process by selecting and focusing on only a small, most relevant subset, thereby breaking the quadratic scaling barrier.
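The difference between the two regimes can be seen in a toy sketch (our illustration, not the paper's code): dense attention scores every prior token for each query, while sparse attention scores only a small pre-selected subset.

```python
import numpy as np

# Toy sketch (our illustration): dense attention scores every prior token,
# while sparse attention attends only to a small top-k subset.
rng = np.random.default_rng(0)
n, d, k = 1024, 64, 32          # sequence length, head dim, sparse budget

q = rng.standard_normal((n, d))
keys = rng.standard_normal((n, d))

# Dense: the last query scores all n keys -> O(n) per token, O(n^2) overall.
dense_scores = q[-1] @ keys.T                  # shape (n,)

# Sparse: score only k pre-selected keys -> O(k) per token, O(n*k) overall.
selected = np.argsort(dense_scores)[-k:]       # stand-in for an indexer's picks
sparse_scores = q[-1] @ keys[selected].T       # shape (k,)

print(dense_scores.shape, sparse_scores.shape)
```

Here the top-k selection is a placeholder; in a real DSA model, a dedicated indexer module makes that choice, as described below.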

The 'Hidden Tax' of Sparse Attention

DeepSeek Sparse Attention (DSA), first introduced in DeepSeek-V3.2, is an efficient implementation of this concept. To identify the most important tokens, DSA incorporates a lightweight 'lightning indexer module' at every layer of the model. This indexer scores all previous tokens and then selects a small batch to be processed by the core attention mechanism. By doing so, DSA successfully reduces the computational complexity of the core attention from quadratic to linear, significantly speeding up the model while maintaining performance.
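A minimal sketch of that indexer pattern might look like the following (function and variable names are ours, not the paper's): a cheap, low-dimensional scoring pass ranks all previous tokens, and only the top-k indices are handed to the core attention.

```python
import numpy as np

# Hypothetical sketch of a DSA-style indexer (identifiers are ours):
# a lightweight scoring pass ranks all prior tokens and keeps the top-k.
def lightning_index(query, indexer_keys, k):
    """Score every prior token with a cheap projection and keep top-k indices."""
    scores = indexer_keys @ query       # one small dot product per prior token
    return np.argsort(scores)[-k:]      # indices of the k highest-scoring tokens

rng = np.random.default_rng(1)
n, d_small, k = 2000, 16, 64            # low-dim keys keep the scoring pass cheap
query = rng.standard_normal(d_small)
indexer_keys = rng.standard_normal((n, d_small))   # per-token indexer keys

idx = lightning_index(query, indexer_keys, k)
print(len(idx))   # 64 indices forwarded to core attention
```

Note that even though each scoring pass is cheap, it still touches every prior token, which is why the indexer's total cost across a long prefill grows quadratically, as the next section explains.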

However, the researchers identified an overlooked bottleneck: the DSA indexer itself still has a quadratic computational complexity at each layer. Although the indexer's computational load is much smaller than that of the core attention, the time spent running these indexers climbs sharply as context lengths explode. This 'indexing tax' severely slows down the model during the prefill stage, which is when the initial input prompt is processed.

IndexCache's Solution: Exploiting Inter-Layer Redundancy

To solve the indexer bottleneck, the research team discovered a key property: as data passes through the layers of a Transformer model, the subset of important tokens selected by the indexer shows remarkable stability. Empirical tests show that the overlap of selected tokens between adjacent layers in DSA models is as high as 70% to 100%.
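The measurement behind that 70% to 100% figure can be sketched as a simple set overlap between the top-k index sets chosen by adjacent layers (a toy illustration with made-up selections, not the paper's data):

```python
# Toy sketch of the inter-layer overlap measurement (made-up selections).
def index_overlap(indices_a, indices_b):
    """Fraction of layer A's selected tokens that layer B also selected."""
    a, b = set(indices_a), set(indices_b)
    return len(a & b) / len(a)

layer_3 = [5, 17, 42, 99, 120, 256, 301, 512]   # hypothetical top-k picks
layer_4 = [5, 17, 42, 99, 120, 256, 301, 700]

print(index_overlap(layer_3, layer_4))  # 0.875 -> high inter-layer redundancy
```

When this overlap is consistently high, recomputing the indices at every layer is mostly wasted work, which is exactly the redundancy IndexCache removes.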

IndexCache was born from this insight. The technique divides the model's layers into two types: a few 'Full Layers' (F-layers) and the remaining 'Shared Layers' (S-layers). F-layers retain their indexers to actively compute and cache the indices of the most important tokens. In contrast, S-layers skip the indexing computation entirely and directly reuse the cached indices from the nearest preceding F-layer.

During inference, the model simply checks the type of the current layer. If it's an F-layer, it performs the computation and updates the cache. If it's an S-layer, it reads directly from the cache, skipping the computation. This mechanism cleverly exploits inter-layer redundancy to drastically reduce the total computational load.
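The dispatch described above can be sketched in a few lines (a simplified illustration with our own identifiers): F-layers run their indexer and refresh the cache, while S-layers reuse the most recent cached indices and skip the computation entirely.

```python
# Hypothetical sketch of IndexCache's per-layer dispatch (identifiers are ours).
def run_layers(layer_types, compute_indices):
    """layer_types: list of 'F'/'S'; compute_indices(layer) -> token indices."""
    cached = None
    used = []
    for layer, kind in enumerate(layer_types):
        if kind == "F":
            cached = compute_indices(layer)   # pay the indexing cost, refresh cache
        # S-layers fall through and reuse `cached` from the nearest preceding F-layer
        used.append(cached)
    return used

# 1 F-layer followed by 3 S-layers: only 1 of 4 indexer passes actually runs,
# matching the "up to 75% of redundant calculations" saving in the article.
calls = []
indices = run_layers(["F", "S", "S", "S"],
                     lambda l: calls.append(l) or [10, 20, 30])
print(len(calls), indices[3])   # 1 indexer call; last layer reused [10, 20, 30]
```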

Computational Optimization, Not Memory Compression

It is important to note that IndexCache is fundamentally different from common KV cache compression techniques. Traditional KV cache optimizations aim to reduce the memory footprint required to store attention results, whereas IndexCache directly targets the computational bottleneck itself.

"IndexCache is not a traditional KV cache compression or sharing technique," Yushi Bai, a co-author of the paper, explained to VentureBeat. "It reduces computation, not just memory, by reusing indices across layers to eliminate redundancy. It’s complementary to existing methods and can be used in conjunction with them."

For ease of deployment, the researchers also developed a training-free method. A 'greedy layer selection' algorithm can automatically determine the optimal distribution of F-layers and S-layers by running the model on a small amount of calibration data, without needing to update any model weights. This means developers can easily apply IndexCache to off-the-shelf DSA models that cannot be retrained, or where retraining is impractical, breathing new life into them and making long-context AI applications truly smooth and efficient.
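A greedy selection of this kind might be sketched as follows; the budget, the toy quality metric, and all identifiers here are our assumptions for illustration, not the paper's exact recipe. The idea: start with every layer as an F-layer, then repeatedly demote the layer whose demotion hurts a calibration score the least, until only a budget of F-layers remains.

```python
# Hypothetical sketch of training-free greedy layer selection (our assumptions).
def greedy_select(num_layers, budget, score):
    """score(f_layers) -> calibration quality for a given set of F-layers."""
    f_layers = set(range(num_layers))
    while len(f_layers) > budget:
        # Layer 0 must stay an F-layer: S-layers need a preceding cache to read.
        candidates = [l for l in f_layers if l != 0]
        # Demote the layer whose removal costs the least calibration quality.
        best = max(candidates, key=lambda l: score(f_layers - {l}))
        f_layers.remove(best)
    return sorted(f_layers)

# Toy score: pretend earlier layers matter more on the calibration data.
toy_score = lambda fs: sum(1.0 / (l + 1) for l in fs)
print(greedy_select(8, 2, toy_score))   # -> [0, 1]
```

Because only forward passes on calibration data are needed to evaluate each candidate split, no model weights are ever updated, which is what makes the method applicable to off-the-shelf DSA checkpoints.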
