Yao Fu | Website | Blog | Twitter / X

University of Edinburgh

[email protected]


💭 Yao: I want to write a punch line saying deep communication is through writing. Can you think of some sentences?

GPT-4: True depth is found not in speech, but in the quiet dance of pen on paper.

Table of Contents

Mar 2024 | How Do Language Models put Attention Weights over Long Context?

Yao Fu. University of Edinburgh

We are interested in the problem of lossless KV cache compression: making the KV cache take less memory without sacrificing the language model’s capability during inference. We view lossless KV cache compression as the number one challenge for democratizing and deploying long-context (100K - 10M) language models in the real world.

But sorry, we won’t discuss any techniques related to KV cache compression in this post 😅. Instead, we look at its prerequisite, i.e., the attention patterns inside the transformer architecture, because only an in-depth understanding of the attention mechanism allows us to find out which parts of the KV cache are compressible and which are not.

In this post, we discuss six typical attention patterns over long-context input, across all transformer layers and heads, aiming to provide an intuitive understanding of what happens inside long-context attention and to identify which parts of the KV cache may be compressible.
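To get a rough sense of scale (a back-of-the-envelope sketch; the configuration below is an illustrative assumption, roughly a 70B GQA model with an fp16 cache, not a number taken from the post), the KV cache at 100K tokens already rivals the memory of a single A100:

```python
# Back-of-the-envelope KV cache size:
# 2 (K and V) * layers * kv_heads * head_dim * context_length * bytes_per_element.
# The defaults are illustrative assumptions, roughly a LLaMA-2-70B-like GQA model
# with an fp16 KV cache; they are not numbers from the post.
def kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                   context_len=100_000, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

print(f"{kv_cache_bytes() / 1e9:.1f} GB per sequence")  # ~32.8 GB at 100K tokens
```

At 100K - 10M context, this is exactly why it matters to know which parts of the cache can be dropped or compressed without hurting the model.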


Dec 2023 | Towards 100x Speedup: Full Stack Transformer Inference Optimization

Yao Fu. University of Edinburgh

Imagine two companies with equally powerful models. Company A can serve the model to 10 users with 1 GPU, while company B can serve 20 users. Who will win in the long run?

Imagine a researcher has come up with a super smart decoding method: clever algorithm, solid math, but not compatible with FlashAttention. Can this method be used in production?

An in-depth understanding of transformer inference can be extremely beneficial for both research and production. Yet in the real world, large-scale production is usually not so close to cutting-edge research: people who know the algorithms may not know MLSys, and vice versa.

In this post, we discuss full-stack transformer inference optimization, from hardware specs like the A100 memory hierarchy, to MLSys methods like FlashAttention and vLLM, to model architectures like Mixture of Experts, to decoding algorithms like Speculative Decoding and its variants. Like adding buffs in an RPG game, we see how transformer inference is scaled and sped up, step by step.
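As a taste of the back-of-the-envelope reasoning involved (a hedged sketch: the ~2 TB/s figure is the public A100 80GB HBM bandwidth spec, and the 70B fp16 model is an illustrative assumption), small-batch autoregressive decoding is memory-bandwidth bound, so a simple lower bound on per-token latency is the model’s weight bytes divided by memory bandwidth:

```python
# Memory-bandwidth lower bound on per-token decoding latency:
# each decoded token must read all model weights from HBM at least once.
# A100 80GB HBM bandwidth is roughly 2.0 TB/s (public spec); the 70B fp16
# model size is an illustrative assumption, not a number from the post.
def min_latency_ms_per_token(n_params=70e9, bytes_per_param=2, hbm_bw=2.0e12):
    weight_bytes = n_params * bytes_per_param   # fp16 weights
    return weight_bytes / hbm_bw * 1e3          # seconds -> milliseconds

print(f"{min_latency_ms_per_token():.0f} ms/token lower bound")  # ~70 ms for 70B fp16
```

Roughly speaking, each technique in the list above improves a different term of estimates like this one.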


Sep 2023 | An Initial Exploration of Theoretical Support for Language Model Data Engineering. Part 1: Pretraining

Yao Fu. University of Edinburgh

Recently, the focus of the research and open-source communities has been gradually shifting from model engineering to data engineering, as people realize the crucial importance of data quality. However, when we say “we want better data”, what does “better data” mean precisely? When we say “we optimize the data composition”, what is the objective that we are optimizing?

We would like to study the theoretical support for language model data engineering. We believe that an in-depth understanding of the problem is as important as developing methods to solve it, and that theoretical analysis will lead us to predictable scaling: predicting the eventual performance on every single task before we actually run the experiments.
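As a minimal illustration of what “predictable scaling” could look like (a sketch only; the power-law form follows the common scaling-law literature, and the data points are made-up placeholders rather than results from this post), one can fit a loss curve on small runs and extrapolate to larger data:

```python
import numpy as np
from scipy.optimize import curve_fit

# Minimal sketch of "predictable scaling": fit a power-law loss curve
# L(D) = E + A / D**alpha on small-scale runs, then extrapolate to more data.
# The functional form is the standard one from the scaling-law literature;
# the data points below are made-up placeholders, not real measurements.
def loss_curve(d_billion, E, A, alpha):
    return E + A / d_billion**alpha   # d_billion: training tokens in billions

d = np.array([1.0, 3.0, 10.0, 30.0])        # training tokens (billions), placeholder
loss = np.array([3.10, 2.85, 2.62, 2.48])   # validation loss, placeholder

(E, A, alpha), _ = curve_fit(loss_curve, d, loss, p0=[2.0, 1.0, 0.3], maxfev=10000)
print(f"fitted: E={E:.2f}, A={A:.2f}, alpha={alpha:.2f}")
print(f"extrapolated loss at 1T tokens: {loss_curve(1000.0, E, A, alpha):.2f}")
```

The interesting (and much harder) part is doing this per task rather than for the aggregate pretraining loss, which is what this post sets out to explore.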