Home | Twitter / X | Google Scholar

Hugging Face | Github | About Yao Fu

[email protected]

<aside> 💭 Yao: I want to write a punch line saying deep communication is through writing. Can you think of some sentences?

Language Model: True depth is found not in speech, but in the quiet dance of pen on paper.

</aside>

<aside> 💡 All facts in this blog are based on existing public information, mostly from arXiv, Hugging Face, and GitHub. All opinions are my own.

</aside>

Table of Contents

May 2024 | Full Stack Transformer Inference Optimization Season 2: Deploying Long-Context Models

Yao Fu. University of Edinburgh

Why are long-context models important? Because they are the foundation for advanced AI applications such as hour-long video understanding, repository-level coding agents, and lifelong AI companions. Our research objective is to foster an AI-based application ecosystem. For this to happen, we have to reduce the deployment cost of long-context transformers.

This is the second season of our transformer inference optimization posts, focusing on long-context optimization. We aim to address an ambitious research challenge: how can we make deploying a 1M-context production-level transformer as cheap as deploying a 4K-context one?

We describe a concurrent programming framework for quantitatively analyzing the efficiency challenges of serving multiple long-context requests under limited GPU high-bandwidth memory (HBM). We give a detailed analysis of how all the additional computational cost, compared to 4K context, traces back to a single source: the large size of the KV cache. We further analyze how existing efforts address the deployment challenges from the perspectives of concurrency, prefilling, decoding, and context switching, and identify possibilities for combining them into end-to-end efficient serving systems.
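To get a feel for why the KV cache dominates, here is a back-of-the-envelope sketch (not from the post itself) assuming a LLaMA-2-70B-like configuration with grouped-query attention: 80 layers, 8 KV heads, head dimension 128, fp16 storage. These numbers are illustrative assumptions, not the post's exact setup.

```python
# Back-of-the-envelope KV cache size for one request,
# assuming a LLaMA-2-70B-like configuration (illustrative numbers).
NUM_LAYERS = 80
NUM_KV_HEADS = 8        # grouped-query attention
HEAD_DIM = 128
BYTES_PER_ELEM = 2      # fp16 / bf16

def kv_cache_bytes(context_len: int) -> int:
    # 2x for keys and values, stored at every layer for every KV head
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * context_len

for ctx in (4_096, 1_048_576):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"context {ctx:>9,d} tokens -> KV cache ≈ {gib:,.1f} GiB per request")
```

Under these assumptions, a single 4K-context request holds roughly 1.25 GiB of KV cache, while a 1M-context request holds on the order of 320 GiB, on top of the model weights. That gap is what the serving system has to absorb in HBM, which is why concurrency, prefilling, decoding, and context switching all become harder at long context.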


Apr 2024 | Llama 3 Opens the Second Chapter of the Game of Scale

Yao Fu. University of Edinburgh

The scaling of text data is likely reaching a ceiling, as most of the easy web text (Common Crawl, GitHub, arXiv, etc.) has now been used up. New text data may only incrementally improve model performance because it does not add another order of magnitude. The first chapter of the game of scale, namely scaling up text data, is coming to a conclusion, with frontier models converging at GPT-4 parity. Video data can be orders of magnitude larger than text data. It significantly improves the perception of language models and opens the possibility of large world models; however, it does not seem to improve reasoning. Reinforcement learning has not yet been scaled, and most existing work focuses only on single-step offline optimization. Scaling up exploration and exploitation with online, iterative RL from human, environment, and AI feedback could further improve the model's reasoning.


Mar 2024 | How Do Language Models put Attention Weights over Long Context?

Yao Fu. University of Edinburgh

We are interested in the problem of lossless KV cache compression: making the KV cache take less memory without sacrificing the language model's capability during inference. We tend to view lossless KV cache compression as the number one challenge for democratizing and deploying long-context (100K - 10M) language models in the real world.

But sorry, we won’t discuss any techniques related to KV cache compression in this post 😅. Instead, we look at its prerequisite: the attention patterns inside the transformer architecture, because only an in-depth understanding of the attention mechanism lets us find out which parts of the KV cache are compressible and which are not.

In this post, we discuss six typical attention patterns over long-context input, across all transformer layers and heads, aiming to provide an intuitive understanding of what happens inside long-context attention and to identify which parts of the KV cache are potentially compressible.
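For readers who want to poke at these patterns themselves, here is a minimal sketch (not from the post) of inspecting per-layer, per-head attention weights with Hugging Face transformers. It uses GPT-2 only because it is small; the post's analysis targets long-context models, and the "sink" and "local" masses below are just two illustrative statistics, not the post's six patterns.

```python
# Minimal sketch: inspect how the last token spreads attention over the context,
# per layer and per head. Model choice and statistics are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# eager attention is needed so that attention weights are materialized
model = AutoModelForCausalLM.from_pretrained("gpt2", attn_implementation="eager")
model.eval()

text = "Long-context transformers spread attention very unevenly over the input."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: one tensor per layer, shape (batch, heads, seq, seq)
for layer_idx, attn in enumerate(outputs.attentions):
    attn = attn[0]                          # drop batch dim -> (heads, seq, seq)
    last_row = attn[:, -1, :]               # how the last token attends to context
    sink_mass = last_row[:, 0]              # mass on the first token ("attention sink")
    local_mass = last_row[:, -4:].sum(-1)   # mass on the most recent 4 tokens
    print(f"layer {layer_idx:2d} | "
          f"sink per head: {[round(x, 2) for x in sink_mass.tolist()]} | "
          f"local per head: {[round(x, 2) for x in local_mass.tolist()]}")
```

Running this on a long input makes the head-level diversity visible immediately: some heads pile almost all of their mass on the first token or a local window, while others spread it broadly, which is exactly the kind of structure that determines what is and is not compressible.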