Home | Twitter / X | Google Scholar
Hugging Face | Github | About Yao Fu
<aside> 💭 Yao: I want to write a punch line saying deep communication is through writing. Can you think of some sentences?
Language Model: True depth is found not in speech, but in the quiet dance of pen on paper.
</aside>
<aside> 💡 All facts in this blog are based on existing public information, mostly from Arxiv, Huggingface and Github. All opinions are my own.
</aside>
Table of Contents
We present an information-theoretic framework for understanding vision-language models. Classical compression theory treats language modeling as source coding, with the objective of losslessly compressing text bits. In contrast, we treat vision-language modeling, specifically image understanding, as channel coding, with the objective of transmitting image bits from the raw image through the language model backbone. That is, we treat the language model as a lossy channel rather than the lossless compressor of the classical text-only case.
We derive describable information, the upper bound on the bits that can be transmitted through a language channel, and show that it is a special subset of the irreducible bits, i.e., the Kolmogorov complexity of the input image. We discuss the channel capacity of a language, which depends only on the expressivity of the language, not on the capability of the agent. Our framework forgoes the community's long-sought goal of extracting intelligence from images, and instead seeks to maximally transmit describable information through the language channel.
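A minimal sketch of how these quantities relate, in our own illustrative notation (the post defines the concepts in prose; the symbols $D_L$, $K$, and $C_L$ below are ours):

```latex
% Sketch only: D_L, K, C_L are illustrative shorthand, not notation from the post.
% D_L(x): describable information of image x under language L
% K(x):   Kolmogorov complexity of x, i.e., its irreducible bits
% C_L:    channel capacity of the language, independent of any particular agent
\[
  D_L(x) \;\le\; K(x),
  \qquad
  C_L \;=\; \max_{p(x)} I(X; Y).
\]
```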
Language modeling is compression, but vision-language modeling is transmission.
Why are long-context models important? Because they are the foundation for advanced AI applications such as hour-long video understanding, repository-level coding agents, and lifelong AI companions. Our research objective is to foster an AI-based application ecosystem. For this to happen, we have to reduce the deployment cost of long-context transformers.
This is the second season of our transformer inference optimization posts. This post focuses on long-context optimization and aims to address an ambitious research challenge: how can we make deploying production-level transformers with 1M context as cheap as deploying them with 4K context?
We describe a concurrent programming framework for quantitatively analyzing the efficiency challenges of serving multiple long-context requests under limited GPU high-bandwidth memory (HBM). We give a detailed analysis of how all of the additional computational cost, compared to 4K context, traces back to a single source: the large size of the KV cache. We further analyze how existing efforts address the deployment challenges from the perspectives of concurrency, prefilling, decoding, and context switching, and identify possibilities of combining them to build end-to-end efficient systems.
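To see why the KV cache dominates, here is a rough back-of-the-envelope sketch. The layer, head, and precision numbers are hypothetical (roughly a 70B-class model with grouped-query attention in fp16), not figures from the post:

```python
def kv_cache_bytes(context_len, n_layers=80, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """KV cache size for a single request, in bytes.

    The factor of 2 covers keys and values; fp16/bf16 gives 2 bytes per
    element. The layer/head/dim defaults are illustrative, not taken from
    any specific production model.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

for ctx in (4_096, 1_000_000):
    gb = kv_cache_bytes(ctx) / 1e9
    print(f"{ctx:>9,} tokens -> {gb:8.1f} GB of KV cache per request")

# Under these assumptions:
#     4,096 tokens ->      1.3 GB of KV cache per request
# 1,000,000 tokens ->    327.7 GB of KV cache per request
```

Even with these rough assumptions, a single 1M-context request needs hundreds of gigabytes of HBM for its KV cache alone, which is why concurrency, prefilling, decoding, and context switching all become harder at long context.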
The scaling of text data is likely reaching a ceiling, as most of the easy web text (Common Crawl, Github, Arxiv, etc.) has now been used up. New text data may only incrementally improve model performance because it is unlikely to add another order of magnitude. The first chapter of the game of scale, scaling up text data, is coming to a conclusion, with frontier models converging at GPT-4 parity. Video data can be orders of magnitude larger than text data. It significantly improves the perception of language models and opens the possibility of large world models; however, video data does not seem to improve reasoning. Reinforcement learning has not yet been scaled, and most existing work focuses only on single-step offline optimization. Scaling up exploration and exploitation with online, iterative RL from human, environment, and AI feedback could further improve the model's reasoning.