Yao Fu | Website | Blog | Twitter / X
University of Edinburgh | [email protected]
Released on May 13 2024, Updated on Jun 28 2024
Update Jun 2024: we strongly recommend reading the Mooncake paper and Character AI’s blog for real-world long-context deployment solutions.
Why are long-context models important? Because they are the foundation for advanced AI applications such as hour-long video understanding, repository-level coding agents, and lifelong AI companions. Our research objective is to foster an ecosystem of such AI-based applications. For this to happen, we have to reduce the deployment cost of long-context transformers.
This is the second season of our transformer inference optimization posts. In our first post, we discussed generic short-context inference optimization. This post focuses on long-context optimization. We aim to address an ambitious research challenge:
<aside> 💡 How to reduce the deployment cost of 1M-context production-level transformers to be as cheap as 4K?
</aside>
To tackle a problem, we first need to understand it. To that end, we describe a concurrent programming framework for quantitatively analyzing the efficiency challenges of serving multiple long-context requests under a limited amount of GPU high-bandwidth memory (HBM). We give a detailed analysis of how all the additional computational cost, compared to 4K context, traces back to one single source: the large size of the KV cache. We use a 34B GPT-3.5-level model with 50K context on A100 NVLink as a running example, and describe how its large KV cache causes four types of deployment challenges:
- Prefilling: long inputs take much more compute time and GPU memory to prefill than short inputs.
- Concurrency: after prefilling, the large KV cache residing on HBM substantially restricts the number of concurrent users that can be served.
- Decoding: repeatedly reading the KV cache from HBM at every decoding step significantly increases latency.
- Context switching: when the KV cache overflows HBM and must be swapped to DDR memory, context switching incurs significant latency.
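To make this accounting concrete, here is a minimal back-of-the-envelope sketch of how KV cache size is computed and how it bounds concurrency under limited HBM. The architecture and hardware numbers (a Yi-34B-like configuration with 60 layers, 8 KV heads under GQA, head dimension 128, bf16 precision, served on 4× A100 80GB) are illustrative assumptions, not figures taken from this post.

```python
# Back-of-the-envelope KV cache accounting (assumed Yi-34B-like config, bf16, 4x A100 80GB).

GB = 10**9

def kv_cache_bytes(context_len, n_layers=60, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes of K and V cached for a single request of `context_len` tokens."""
    # 2x for keys and values; one head_dim vector per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

weights = 34e9 * 2                 # ~68 GB of bf16 parameters
hbm = 4 * 80 * GB                  # ~320 GB of HBM across 4x A100 80GB
per_req = kv_cache_bytes(50_000)   # KV cache for one 50K-context request

print(f"KV cache per 4K request : {kv_cache_bytes(4_000) / GB:.1f} GB")   # ~1.0 GB
print(f"KV cache per 50K request: {per_req / GB:.1f} GB")                 # ~12.3 GB
print(f"Concurrent 50K requests after loading weights: {int((hbm - weights) // per_req)}")  # ~20
```

Under these assumptions, a single 50K-context request holds roughly 12× the KV cache of a 4K request, which is exactly why all four challenges above trace back to KV cache size.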
We further analyze how existing efforts address the deployment challenges from these four perspectives and identify possibilities of combining them to build end-to-end efficient systems. We hope our work offers a foundational framework for analyzing long-context transformer deployment and identifies important directions towards reducing the inference cost of 1M context to be as cheap as 4K.
Consider a 30B+, 100K-context, GPT-3.5-quality open-source model like Qwen or Yi. The difference between the KV cache for 4K vs. 100K context is: