Yao Fu | Website | Blog | Twitter / X

University of Edinburgh | [email protected]

Started writing in Sep 2023, released on Dec 11 2023, last updated on Dec 13 2023

Thanks to Hao Zhang @ LMSys, Hao Peng @ UIUC, and Swaroop Mishra @ GDM for insightful discussions


Imagine two companies have equally powerful models. With one GPU, Company A can serve the model to 10 users, but Company B can serve 20. Who will win in the long run?

Imagine a researcher has come up with a super smart decoding method: clever algorithm, solid math, but not compatible with FlashAttention. Can this method be used in production?

An in-depth understanding of transformer inference can be extremely beneficial for both research and production. Yet in the real world, large-scale production is usually not so close to cutting-edge research, so people who know the algorithms may not know MLSys, and vice versa.

In this post, we discuss full-stack transformer inference optimization, from hardware specs like the A100 memory hierarchy, to MLSys methods like FlashAttention and vLLM, to model architectures like Mixture of Experts, to decoding algorithms like Speculative Decoding and its variants. We identify the most fundamental fact that transformer inference is memory bound, and most of the optimizations, whether from MLSys or from modeling, are based on / exploit this fact. Like adding buffs in an RPG game, we see how transformer inference is scaled and sped up, step by step.

Table of Contents

1 - Hardware: inference on GPUs

We start with a discussion of GPU architecture, particularly its memory hierarchy. We identify two important performance regimes, compute bound and memory bound, and discuss why large transformer inference is memory bound. Most of the subsequent optimization builds on this fundamental fact: as long as we improve flop utilization, we improve efficiency.
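To make the compute-bound / memory-bound distinction concrete, here is a minimal back-of-the-envelope sketch. The A100 80GB SXM figures (312 TFLOPS bf16, roughly 2 TB/s HBM bandwidth) and the 13B model size are assumptions of this example, not numbers from this post:

```python
# Back-of-the-envelope: is transformer decoding compute or memory bound?
# Assumed A100 80GB SXM specs (illustrative, not from the post):
PEAK_FLOPS = 312e12      # bf16 tensor-core throughput, flops/s
HBM_BANDWIDTH = 2.0e12   # HBM memory bandwidth, bytes/s
RIDGE_POINT = PEAK_FLOPS / HBM_BANDWIDTH  # ~156 flops per byte moved

def arithmetic_intensity(n_params: float, batch_size: int) -> float:
    """Flops per byte for one decoding step of a dense bf16 transformer.

    Each step reads every weight once (2 bytes per bf16 param) and does
    roughly 2 * n_params flops per token in the batch (one multiply-add
    per weight), so intensity grows linearly with batch size.
    """
    flops = 2 * n_params * batch_size
    bytes_moved = 2 * n_params  # weight traffic dominates at small batches
    return flops / bytes_moved

for bs in (1, 64, 256):
    ai = arithmetic_intensity(13e9, bs)  # e.g. a 13B model
    regime = "memory bound" if ai < RIDGE_POINT else "compute bound"
    print(f"batch={bs:4d}  intensity={ai:6.1f} flops/byte -> {regime}")
```

At batch size 1 the intensity is about 1 flop/byte, two orders of magnitude below the ridge point, which is why single-stream decoding is memory bound: the GPU spends its time moving weights, not doing math.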

1.1 - Preliminary

1.1.1 - GPU architecture

Overall, it looks like this:

[Figure: overview of the GPU architecture]
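As a rough mental model of the memory hierarchy the rest of this post relies on, the sketch below lists order-of-magnitude capacity and bandwidth figures. These numbers are assumptions for illustration (they roughly match the figures cited in the FlashAttention paper), not measurements from this post:

```python
# Approximate A100-era memory hierarchy: small-and-fast on top,
# large-and-slow at the bottom. Illustrative figures only.
memory_hierarchy = [
    # (tier,                          capacity,     bandwidth)
    ("SRAM (on-chip shared memory)", "~20 MB",     "~19 TB/s"),
    ("HBM (GPU main memory)",        "40-80 GB",   "~1.5-2 TB/s"),
    ("DRAM (CPU memory, via PCIe)",  ">1 TB",      "~12.8 GB/s"),
]
for tier, capacity, bandwidth in memory_hierarchy:
    print(f"{tier:32s} {capacity:>10s} {bandwidth:>14s}")
```

The roughly 10x bandwidth gap between each level is what makes data movement, not arithmetic, the bottleneck for inference, and it is the gap that FlashAttention-style kernels exploit by keeping the working set in SRAM.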