A Transmission Theory for Vision-Language Models

Yao Fu

Nov 2025


We present an information-theoretic framework for understanding vision-language models. Unlike classical compression theory, which treats language modeling as source coding with the objective of better lossless compression of text bits, we treat vision-language modeling, specifically image understanding, as channel coding with the objective of transmitting image bits from the raw image through the language model backbone. We treat the language model as a lossy channel, not as a lossless compressor as in the classical text-only case.

We derive describable information, the upper bound on the bits that can be transmitted through a language channel, and show that it is a special subset of the irreducible bits, i.e., the Kolmogorov complexity of the input image. We discuss the channel capacity of a language, which depends only on the expressivity of the language and not on the capability of the agent. Our framework forgoes the community’s long-sought goal of seeking intelligence from images and instead seeks to maximally transmit describable information via the language channel.

Language modeling is compression, but vision-language modeling is transmission.


1 - Motivating example: when a witness is asked to describe a suspect

1.1 - A description setup

Imagine one has witnessed a serious incident, and the police want to know what the suspect looks like. They bring in a sketch artist to draw the suspect’s face and ask the witness to describe the suspect to the artist.

Figure 1. Describing a suspect to the police, and the reconstruction from the textual description. Image from The Day of the Jackal.

Essentially, the witness is transmitting visual information using language. In this example, the description tries to establish one-to-one mappings between the image and textual spaces using phrases like “dense scattering of freckles”. The forensic artist then reconstructs a sketch. This is a lossy reconstruction: the description cannot cover every detail, but it can cover the essential information. From an information-theoretic perspective, the witness is the sender, the artist is the receiver, the language is the transmission channel, and the description is the transmitted bits.
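The sender/receiver picture above can be made concrete with a toy sketch. Everything in it is an assumption for illustration and not part of the framework itself: the face is reduced to a hypothetical attribute dictionary `FACE`, tokens are counted by whitespace splitting, and the names `describe`, `reconstruct`, and `recovered_fraction` are invented here. It only shows that a description with a limited token budget yields a lossy reconstruction.

```python
# Toy model of the witness/artist setup. All names and data are hypothetical.

# The "image": attributes of a face, ordered from most to least salient.
FACE = {
    "hair": "short brown",
    "eyes": "green",
    "freckles": "dense scattering",
    "jaw": "square",
    "scar": "left eyebrow",
}


def describe(image, token_budget):
    """Sender (witness): emit attribute phrases in order until the budget is spent."""
    message = []
    used = 0
    for attribute, value in image.items():
        phrase = f"{attribute}: {value}"
        cost = len(phrase.split())  # crude token count: whitespace-separated words
        if used + cost > token_budget:
            break  # the channel cannot carry any further phrases
        message.append(phrase)
        used += cost
    return message


def reconstruct(message):
    """Receiver (artist): rebuild only the attributes the message mentions."""
    sketch = {}
    for phrase in message:
        attribute, value = phrase.split(": ", 1)
        sketch[attribute] = value
    return sketch


def recovered_fraction(image, sketch):
    """Fraction of the original attributes that survive the transmission."""
    hits = sum(1 for key, value in image.items() if sketch.get(key) == value)
    return hits / len(image)


if __name__ == "__main__":
    for budget in (0, 6, 13):
        sketch = reconstruct(describe(FACE, budget))
        print(budget, sketch, recovered_fraction(FACE, sketch))
```

With a budget of 6 tokens in this toy setup, only the hair and eye attributes reach the artist. The remaining attributes are lost in transmission, which is the sense in which the reconstruction is lossy.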

1.2 - Using more tokens for better reconstruction

One can add new information to the reconstructed image by using more tokens that carry more information, as shown below.