Yao Fu | Website | Blog | Twitter / X

University of Edinburgh | [email protected]

Released on Mar 05 2024


Update Mar 15 2024: Check the recent DMC paper, which implements KV cache compression closely related to our analysis.

We are interested in the problem of lossless KV cache compression: making the KV cache take less memory without sacrificing the language model’s capability during inference. Just like lossless data compression is the number one principle for scaling, we view lossless KV cache compression as the number one challenge for democratizing and deploying long-context (100K - 10M) language models in the real world, simply because their KV caches are too large.

But sorry, we won’t discuss any techniques related to KV cache compression in this post 😅. Instead, we look at its prerequisite, i.e., the attention patterns inside the transformer architecture, because only an in-depth understanding of the attention mechanism allows us to find out which parts of the KV cache are compressible and which are not.

In this post, we discuss six typical attention patterns over long-context input, across all the transformer layers and heads, aiming to provide an intuitive understanding of what’s happening inside the transformer’s long-context attention, and potentially to identify which parts of the KV cache are compressible.

Table of Contents

1 - The Attention Probability Tensor

Say you do Needle-in-a-Haystack. Your input is a document of 100K length, and in the middle there is a sentence: “The best thing to do in San Francisco is sitting in Dolores park and eating a sandwich on a sunny day”. Your prompt is “The best thing to do in San Francisco is”. Then you want to see if your model can retrieve the needle “sitting in Dolores park and eating a sandwich on a sunny day”. When the model generates the response, what does its attention over the 100K input look like?
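For concreteness, here is a minimal sketch of how such a Needle-in-a-Haystack test case can be constructed. The filler text, the `build_niah_input` helper, and the `depth` parameter are illustrative assumptions, not the exact setup used in the experiment.

```python
# A toy sketch of a Needle-in-a-Haystack test case (illustrative, not the
# exact experimental harness): insert a needle sentence into long filler
# text, then prompt the model with the needle's prefix.

def build_niah_input(haystack: str, needle: str, depth: float = 0.5) -> str:
    """Insert the needle at a relative depth (0 = start, 1 = end) of the haystack."""
    pos = int(len(haystack) * depth)
    # Snap back to the nearest sentence boundary so the needle is not
    # spliced into the middle of a word.
    cut = haystack.rfind(". ", 0, pos)
    if cut != -1:
        pos = cut + 2
    return haystack[:pos] + needle + " " + haystack[pos:]

needle = ("The best thing to do in San Francisco is sitting in Dolores park "
          "and eating a sandwich on a sunny day.")
haystack = "Some long filler document. " * 8  # stands in for ~100K tokens of text
prompt = build_niah_input(haystack, needle, depth=0.5)
question = "The best thing to do in San Francisco is"
```

In the real experiment, the haystack is on the order of 100K tokens and the model is scored on whether its continuation of `question` reproduces the needle.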

We perform this experiment on our recently released LLaMA-2-7B-80K checkpoint and retrieve its attention tensor. The attention tensor has three dimensions / ranks: depth, heads, and context length, as shown below:
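The shape of this tensor can be illustrated with a self-contained sketch: for the last generated token, each layer and head produces one attention distribution over the full context, and stacking them gives a depth × heads × context-length tensor. The toy sizes below stand in for the real model (LLaMA-2-7B has 32 layers and 32 heads, and the context here is 100K); random Q/K projections replace the actual model activations.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy stand-ins for 32 layers, 32 heads, 100K context, head dim 128.
num_layers, num_heads, seq_len, head_dim = 4, 8, 128, 16

attn_per_layer = []
for _ in range(num_layers):
    q = rng.standard_normal((num_heads, 1, head_dim))        # query of the last token
    k = rng.standard_normal((num_heads, seq_len, head_dim))  # keys over the full context
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)    # scaled dot-product
    attn_per_layer.append(softmax(scores, axis=-1)[:, 0, :]) # [num_heads, seq_len]

# The attention probability tensor: depth x heads x context length.
attn = np.stack(attn_per_layer)
print(attn.shape)  # (4, 8, 128)
```

With a real model, the analogous tensor can be obtained by running generation with attention outputs enabled and slicing out the last query position.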


1.1 - Network Depth

We first note that the attention distributions in layers 0, 1, and 31 are quite different from the attention within layers 2 - 30.

Layer 0 and layer 1: mostly uniform