Yao Fu | Website | Blog | Twitter / X

University of Edinburgh | [email protected]

Released on May 13 2024, Updated on Jun 28 2024

Paper version


Update Jun 2024: we strongly recommend reading the Mooncake paper and Character AI’s blog for real-world long-context deployment solutions.

Why are long-context models important? Because they are the foundation for advanced AI applications such as hour-long video understanding, repository-level coding agents, and life-long AI companions. Our research objective is to foster an AI-based application ecosystem. For this to happen, we have to reduce the deployment cost of long-context transformers.

This is the second season of our transformer inference optimization posts. In our first post, we discussed generic short-context inference optimization. This post focuses on long-context optimization. We aim to address an ambitious research challenge:

<aside> 💡 How can we reduce the deployment cost of 1M-context production-level transformers to be as cheap as 4K?


To tackle a problem, we first need to understand it. To that end, we describe a concurrent programming framework for quantitatively analyzing the efficiency challenges of serving multiple long-context requests when GPU high-bandwidth memory (HBM) is limited. We give a detailed analysis of how all of the additional computational cost, compared to 4K context, traces back to one single source: the large size of the KV cache. We use a 34B GPT-3.5 level model with 50K context on A100 NVLink as a running example, and describe how its large KV cache causes four types of deployment challenges:

  1. prefilling long inputs takes much longer compute time and GPU memory than short inputs;
  2. after prefilling, the large KV cache residing on the GPU HBM substantially restricts the number of concurrent users being served;
  3. during decoding, repeatedly reading the KV cache from HBM to SM significantly increases latency;
  4. when KV cache memory overflows, swapping it from HBM to DDR causes significant context switching latency.

We further analyze how existing efforts address the deployment challenges from these four perspectives and identify possibilities of combining them to build end-to-end efficient systems. We hope our work offers a foundational framework for analyzing long-context transformer deployment and identifies important directions towards reducing the inference cost of 1M context to be as cheap as 4K.
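To make the "large KV cache" claim concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from our analysis verbatim: the default layer count, number of KV heads, and head dimension are assumptions chosen to resemble a 34B grouped-query-attention model, and the two-GPU, 160 GB HBM budget is purely hypothetical. The sketch also ignores activations, fragmentation, and framework overhead, so treat the numbers as an upper bound on concurrency.

```python
def kv_cache_bytes(seq_len, n_layers=60, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    """Bytes of KV cache held for one request.

    The defaults are assumptions (a 34B-class model with grouped-query
    attention stored in bf16), not values confirmed by the text above.
    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_concurrent_requests(hbm_bytes, weight_bytes, seq_len, **model_kwargs):
    """How many requests' KV caches fit in HBM next to the model weights.

    Ignores activations and allocator fragmentation, so this is optimistic.
    """
    free_bytes = hbm_bytes - weight_bytes
    if free_bytes <= 0:
        return 0
    return free_bytes // kv_cache_bytes(seq_len, **model_kwargs)


# Hypothetical hardware budget: two 80 GB GPUs, 34B parameters in bf16.
HBM_BYTES = 2 * 80 * 10**9
WEIGHT_BYTES = 34 * 10**9 * 2

for seq_len in (4_000, 50_000, 100_000):
    size_gb = kv_cache_bytes(seq_len) / 10**9
    users = max_concurrent_requests(HBM_BYTES, WEIGHT_BYTES, seq_len)
    print(f"{seq_len:>7} tokens: {size_gb:6.2f} GB per request, "
          f"at most {users} concurrent requests")
```

Under these assumptions the per-request cache grows linearly with context length, so going from 4K to 100K multiplies it by 25 and shrinks the number of requests that fit in HBM accordingly. That is the concurrency bottleneck described in challenge 2.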

1 - A concurrent programming framework under limited GPU HBM size

Consider a 30+B, 100K-context, GPT-3.5-quality open-source model such as QWen or Yi. The difference between the KV cache for 4K vs. 100K context is: