Everything is a Video

Unifying Modalities through Next-Frame Prediction

Durham University
ICCV 2025

The current 'AI boom' is driven by a key development: reformulating all kinds of NLP tasks as a single 'supertask' – chat dialogues.

What if we could extend this idea to images, audio, and video? Instead of separate encoders for each modality, let's reframe all inputs as a single 'supertask' – video prediction.

How it Works

Today’s multimodal systems rely on separate encoders, custom architectures, and cross-modal fusion tricks. Instead, let's turn all inputs – text, images, audio, and video – into the same format: a sequence of visual frames. Then we can train a single model to do one thing: predict what frame comes next.

Everything is a video.

A single interface. A single supertask. No special cases.
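
To make the supertask concrete, here is a minimal training-step sketch. The GRU model, greyscale frame tensors, and MSE loss are illustrative stand-ins, not the paper's actual architecture or objective:

import torch
import torch.nn as nn

class NextFramePredictor(nn.Module):
    """Toy stand-in model: predict frame t+1 from frames 1..t."""
    def __init__(self, size=64, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(size * size, hidden, batch_first=True)
        self.head = nn.Linear(hidden, size * size)

    def forward(self, frames):                     # frames: (B, T, H, W)
        b, t, h, w = frames.shape
        out, _ = self.rnn(frames.reshape(b, t, h * w))
        return self.head(out).reshape(b, t, h, w)  # predicted next frames

def training_step(model, frames, optimiser):
    """One step of the single supertask: next-frame prediction."""
    pred = model(frames[:, :-1])                   # predict frames 2..T
    loss = nn.functional.mse_loss(pred, frames[:, 1:])
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()

Whatever the modality, the model only ever sees frames in and frames out.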

Let’s walk through each modality.

Text

For text, we render each token as a frame of the video, using a fixed-width font scaled to fill each frame:

Example of reformulating movie review classification (SST-2) as video prediction
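
As a concrete illustration, here is a minimal sketch of the token-to-frame rendering, assuming 64x64 greyscale frames and Pillow's default font as a stand-in for a fixed-width font scaled to the frame (both are assumptions, not the paper's exact settings):

from PIL import Image, ImageDraw, ImageFont
import numpy as np

def tokens_to_frames(tokens, size=64):
    """Render each token as one greyscale frame of a 'text video'."""
    font = ImageFont.load_default()  # stand-in for a scaled fixed-width font
    frames = []
    for token in tokens:
        img = Image.new("L", (size, size), color=0)  # blank black frame
        ImageDraw.Draw(img).text((2, size // 2), token, fill=255, font=font)
        frames.append(np.asarray(img))
    return np.stack(frames)  # (num_tokens, size, size)

# e.g. an SST-2 review becomes a short video, one token per frame
video = tokens_to_frames("a gripping , funny film".split())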

This can also be applied to text generation, QA, and other text tasks.

Audio

For audio, we convert the sound into a spectrogram – a visual stream of frequencies over time:

Example of reformulating audio classification (AudioMNIST) as video prediction
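
As with text, here is a minimal sketch of the conversion, assuming 8 kHz mono input (AudioMNIST's sample rate) and 64x64 frames; the STFT parameters are illustrative, not the paper's:

import numpy as np
from scipy.signal import stft

def audio_to_frames(waveform, sample_rate=8000, size=64):
    """Slice a log-magnitude spectrogram into square video frames."""
    _, _, Z = stft(waveform, fs=sample_rate, nperseg=2 * size - 2)  # `size` freq bins
    spec = np.log1p(np.abs(Z))          # (size, num_windows)
    spec /= spec.max() + 1e-8           # normalise to [0, 1]
    num_frames = spec.shape[1] // size  # one frame per `size` STFT windows
    return np.stack([spec[:, i * size:(i + 1) * size]
                     for i in range(num_frames)])  # (num_frames, size, size)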

Images

We simply treat an image as a frame of the video:

Example of reformulating image classification (CIFAR-10) as video prediction
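
In sketch form (reusing tokens_to_frames from the text example, with the same caveat that the frame size and rendering are assumptions), classification then means predicting the frame that spells out the class name:

import numpy as np

def image_classification_example(image, label):
    """Pose image classification as next-frame prediction."""
    context = image[None]               # (1, H, W): the image as a one-frame video
    target = tokens_to_frames([label])  # (1, H, W): the class name as a frame
    return context, target

# e.g. a (resized, greyscale) CIFAR-10 image whose next frame should read "cat"
context, target = image_classification_example(np.zeros((64, 64)), "cat")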

This can naturally be extended to image generation, inpainting, and other tasks.

Multimodal

We can combine any of the above modalities into a single video stream. For example, the CLEVRER video question answering task (video + text -> text) can be formulated as video prediction by rendering the question text as frames, as described above, and appending them to the video sequence:

Example of reformulating Video Question Answering (CLEVRER) as video prediction
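
In sketch form, again reusing the helpers above and assuming every stream has already been converted to same-size frames:

import numpy as np

def vqa_example(video_frames, question, answer):
    """CLEVRER-style VQA as one video: clip, then question, then answer."""
    context = np.concatenate([video_frames,                  # (T, H, W) clip
                              tokens_to_frames(question.split())])
    target = tokens_to_frames(answer.split())                # frames to predict
    return context, target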

This can also be applied to other modality combinations, such as video captioning (video + audio -> text).

BibTeX

@inproceedings{hudson2025everything,
  title     = {Everything is a video: Unifying modalities through next-frame prediction},
  author    = {Hudson, G Thomas and Slack, Dean and Winterbottom, Thomas and Sterling, Jamie and Xiao, Chenghao and Shentu, Junjie and Moubayed, Noura Al},
  booktitle = {International Conference on Computer Vision (ICCV)},
  year      = {2025}
}