The current ‘AI boom’ is driven by a key development: reformulating all kinds of NLP tasks as a single ‘supertask’, chat dialogue.

What if we could extend this idea to images, audio and video? Instead of separate encoders for each modality, let’s reframe all inputs as a single ‘supertask’: video prediction.

The Idea

Today’s multimodal systems rely on separate encoders, custom architectures, and cross-modal fusion tricks. Instead, let’s turn all inputs (text, images, audio, video) into the same format: a sequence of visual frames. Then we can train a single model to do one thing: predict what frame comes next.

Everything is a video.

Let’s walk through each modality:

Text

For text, we render each token as a frame of the video, using a fixed-width font scaled to fill each frame:

Text as video

This means a sentence becomes a short video where each frame contains one token.

This setup works for:

  • classification
  • generation
  • question answering
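
Here is a minimal sketch of the token-to-frame rendering, assuming Pillow is available; the frame size, font, and whitespace tokeniser are illustrative choices, not the paper’s exact settings:

```python
from PIL import Image, ImageDraw, ImageFont
import numpy as np

FRAME_SIZE = (64, 64)  # assumed frame resolution, not the paper's exact setting

def token_to_frame(token: str, frame_size=FRAME_SIZE) -> np.ndarray:
    """Render a single token as a greyscale frame using a fixed-width font."""
    img = Image.new("L", frame_size, color=0)        # black background
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()                   # stand-in for a scaled monospace font
    # Roughly centre the token in the frame.
    bbox = draw.textbbox((0, 0), token, font=font)
    w, h = bbox[2] - bbox[0], bbox[3] - bbox[1]
    draw.text(((frame_size[0] - w) // 2, (frame_size[1] - h) // 2),
              token, fill=255, font=font)
    return np.asarray(img)

def text_to_video(sentence: str) -> np.ndarray:
    """Turn a sentence into a (num_tokens, H, W) stack of frames, one token per frame."""
    tokens = sentence.split()                         # whitespace tokeniser for illustration
    return np.stack([token_to_frame(t) for t in tokens])

frames = text_to_video("the cat sat on the mat")
print(frames.shape)  # (6, 64, 64)
```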

Audio

Audio is first converted into spectrograms. These represent frequency content over time and are naturally visual.

Each slice of the spectrogram becomes a frame in the sequence.

This makes audio tasks compatible with the same next-frame prediction objective.

Audio as video
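
A rough sketch of the audio path, assuming librosa for the mel spectrogram; the sample rate, number of mel bands, and per-frame slice width are illustrative, not the exact configuration:

```python
import librosa
import numpy as np

def audio_to_frames(path: str, n_mels: int = 64, slice_width: int = 64) -> np.ndarray:
    """Convert an audio file into a sequence of spectrogram 'frames'."""
    y, sr = librosa.load(path, sr=16000)                   # assumed sample rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)          # log scale, more image-like
    # Chop the (n_mels, time) spectrogram into fixed-width slices, one per video frame.
    n_slices = mel_db.shape[1] // slice_width
    frames = [mel_db[:, i * slice_width:(i + 1) * slice_width] for i in range(n_slices)]
    return np.stack(frames)                                # (num_frames, n_mels, slice_width)

frames = audio_to_frames("speech.wav")
```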


Images

Images are the simplest case. A single image is just a frame.

From there, tasks like generation or inpainting become extensions of predicting future frames or filling in missing ones.

Image as video
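
Treating an image as a one-frame video is trivial, and inpainting then amounts to filling in masked regions of that frame. A hypothetical sketch (mask coordinates are made up for illustration):

```python
import numpy as np

def image_to_video(image: np.ndarray) -> np.ndarray:
    """An image is just a video of length one: add a leading time dimension."""
    return image[np.newaxis, ...]          # (1, H, W) or (1, H, W, C)

def mask_for_inpainting(frame: np.ndarray, box=(16, 16, 48, 48)) -> np.ndarray:
    """Zero out a region; the model's job is to predict the missing pixels."""
    masked = frame.copy()
    y0, x0, y1, x1 = box                   # illustrative mask coordinates
    masked[y0:y1, x0:x1] = 0
    return masked
```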


Multimodal Inputs

Once everything is expressed as frames, combining modalities becomes straightforward.

You can concatenate sequences from different sources into one continuous stream.

For example:

  • video + text → text (video question answering)
  • video + audio → text (captioning)

Here is an example using CLEVRER, where a question is appended as rendered text frames after a video sequence.

Multimodal example
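
Concatenation really is the whole trick. Assuming the `text_to_video` helper from the text sketch above and that all frames share a resolution, a CLEVRER-style question-answering input might be assembled like this (a sketch, not the exact pipeline):

```python
import numpy as np

def build_multimodal_sequence(video_frames: np.ndarray, question: str) -> np.ndarray:
    """Append the rendered question after the video clip, forming one frame stream.

    video_frames: (T, H, W) clip, e.g. a CLEVRER video resized to the frame resolution.
    question:     natural-language question, rendered one token per frame.
    """
    question_frames = text_to_video(question)   # from the text sketch above
    assert video_frames.shape[1:] == question_frames.shape[1:], "frame sizes must match"
    return np.concatenate([video_frames, question_frames], axis=0)
```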


Intuition

The key idea is simple.

Instead of designing separate systems for each modality, we standardise the input format and train one model to handle everything.

The model does not need to know whether it is processing text, audio, or images. It only needs to understand how patterns evolve across frames.
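
Put differently, the training signal is the same regardless of where the frames came from. A minimal PyTorch-style sketch of that objective, with `model` standing in for whatever frame predictor is used (the real system’s architecture and loss details may differ):

```python
import torch
import torch.nn.functional as F

def next_frame_loss(model, frames: torch.Tensor) -> torch.Tensor:
    """frames: (B, T, C, H, W) batch of frame sequences, from any modality.

    The model sees frames[:, :t] and predicts frame t; here it is scored with a
    simple per-pixel reconstruction loss, teacher-forced across the sequence.
    """
    total = 0.0
    for t in range(1, frames.shape[1]):
        pred = model(frames[:, :t])              # predict the next frame from the prefix
        total = total + F.mse_loss(pred, frames[:, t])
    return total / (frames.shape[1] - 1)
```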


Video Overview


Citation

@inproceedings{hudson2025everything,
  title={Everything is a video: Unifying modalities through next-frame prediction},
  author={Hudson, G Thomas and Slack, Dean and Winterbottom, Thomas and Sterling, Jamie and Xiao, Chenghao and Shentu, Junjie and Moubayed, Noura Al},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2025}
}