The current 'AI boom' is driven by a key development: reformulating all kinds of NLP tasks as a single 'supertask', the chat dialogue.
What if we could extend this idea to images, audio, and video? Instead of separate encoders for each modality, let's reframe all inputs as a single 'supertask': video prediction.
Today's multimodal systems rely on separate encoders, custom architectures, and cross-modal fusion tricks. Instead, let's turn all inputs (text, images, audio, video) into the same format: a sequence of visual frames. Then we can train a single model to do one thing: predict what frame comes next.
Everything is a video.
A single interface. A single supertask. No special cases.
Let’s walk through each modality.
For text, we render each token as a frame of the video, using a fixed-width font scaled to fill each frame:
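As a rough illustration, here is a minimal sketch of this rendering step; the 64x64 frame size, greyscale output, and the DejaVuSansMono font are illustrative assumptions, not the paper's exact settings:

```python
# Minimal sketch: render whitespace-separated tokens as greyscale video frames.
# Frame size, greyscale mode, and the font path are illustrative assumptions.
import numpy as np
from PIL import Image, ImageDraw, ImageFont

FRAME_SIZE = (64, 64)  # (width, height) of every frame in the video
FONT_PATH = "DejaVuSansMono.ttf"  # any monospaced TTF installed on the system

def token_to_frame(token: str, frame_size=FRAME_SIZE) -> np.ndarray:
    """Render one token, scaling the font up until it just fills the frame."""
    img = Image.new("L", frame_size, color=0)
    draw = ImageDraw.Draw(img)
    size, font = 1, ImageFont.truetype(FONT_PATH, 1)
    while True:
        trial = ImageFont.truetype(FONT_PATH, size + 1)
        left, top, right, bottom = draw.textbbox((0, 0), token, font=trial)
        if right - left > frame_size[0] or bottom - top > frame_size[1]:
            break  # the next size would overflow the frame, keep the current one
        size, font = size + 1, trial
    draw.text((0, 0), token, fill=255, font=font)
    return np.asarray(img)

def text_to_video(text: str) -> np.ndarray:
    """One frame per token: returns an array of shape (num_tokens, H, W)."""
    return np.stack([token_to_frame(tok) for tok in text.split()])
```

For example, `text_to_video("What colour is the cube?")` yields a short clip of token frames that can be concatenated with frames from any other modality.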
This can also be applied to text generation, QA, and other text tasks.
For audio, we convert the sound into spectrograms — a visual stream of frequencies over time:
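A minimal sketch of this conversion is shown below, assuming librosa is available; the 16 kHz sample rate, 64 mel bands, and 64-column frames are illustrative choices rather than the paper's settings:

```python
# Minimal sketch: waveform -> log-mel spectrogram -> a stack of square greyscale frames.
import numpy as np
import librosa

def audio_to_frames(path: str, n_mels: int = 64, cols_per_frame: int = 64) -> np.ndarray:
    """Slice a log-mel spectrogram along time into frames of shape (n_mels, cols_per_frame)."""
    waveform, sr = librosa.load(path, sr=16000)
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # Normalise to [0, 255] so spectrogram slices look like ordinary greyscale frames.
    norm = 255 * (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
    n_frames = norm.shape[1] // cols_per_frame
    clipped = norm[:, : n_frames * cols_per_frame]                    # drop the ragged tail
    frames = clipped.reshape(n_mels, n_frames, cols_per_frame).transpose(1, 0, 2)
    return frames.astype(np.uint8)                                    # (num_frames, n_mels, cols_per_frame)
```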
For images, we simply treat each image as a single frame of the video:
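Concretely, this is just a resize to the shared frame resolution; a sketch, with greyscale conversion and the 64x64 resolution as assumptions:

```python
# Minimal sketch: a single image becomes a single frame at the shared resolution.
import numpy as np
from PIL import Image

def image_to_frame(path: str, frame_size=(64, 64)) -> np.ndarray:
    """Load an image, convert to greyscale, and resize it to the frame resolution."""
    return np.asarray(Image.open(path).convert("L").resize(frame_size))
```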
This can naturally be extended to image generation, inpainting, and other tasks.
We can combine any of the above modalities into a single video stream. For example, here is how the CLEVRER video question answering task (video + text -> text) can be formulated as video prediction:
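One way such an example could be assembled is sketched below, reusing the hypothetical `text_to_video` helper from the text sketch above; the exact prompt layout is an assumption:

```python
# Minimal sketch: CLEVRER-style VQA as next-frame prediction.
# The input stream is the video followed by the rendered question;
# the target continuation is the rendered answer.
import numpy as np

def build_vqa_example(video_frames: np.ndarray, question: str, answer: str):
    """video_frames: (T, H, W) greyscale frames already resized to the shared resolution."""
    question_frames = text_to_video(question)   # (Q, H, W)
    answer_frames = text_to_video(answer)       # (A, H, W)
    inputs = np.concatenate([video_frames, question_frames], axis=0)
    # Training teaches the model to continue this stream with the answer frames.
    return inputs, answer_frames
```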
This can also be applied to other modality combinations, such as video captioning (video + audio -> text).
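Whatever the mix of modalities, the resulting stream is trained with the same next-frame objective. A minimal sketch of one training step follows, using a toy convolutional predictor and MSE loss as illustrative stand-ins rather than the paper's actual model:

```python
# Minimal sketch: one next-frame-prediction training step on a batch of frame streams.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextFramePredictor(nn.Module):
    """Toy stand-in model: predicts frame t+1 from frame t with a small conv net.
    (The real model would condition on the full frame history.)"""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),
        )

    def forward(self, frame):  # frame: (N, 1, H, W), values scaled to [0, 1]
        return self.net(frame)

def training_step(model, frames, optimiser):
    """frames: (B, T, 1, H, W) float tensor built from any mix of modalities."""
    inputs, targets = frames[:, :-1], frames[:, 1:]
    preds = model(inputs.flatten(0, 1))              # predict each following frame
    loss = F.mse_loss(preds, targets.flatten(0, 1))
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```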
@inproceedings{hudson2025everything,
  title     = {Everything is a video: Unifying modalities through next-frame prediction},
  author    = {Hudson, G Thomas and Slack, Dean and Winterbottom, Thomas and Sterling, Jamie and Xiao, Chenghao and Shentu, Junjie and Moubayed, Noura Al},
  booktitle = {International Conference on Computer Vision (ICCV)},
  year      = {2025}
}