The current 'AI boom' is driven by a key development: reformulating all kinds of NLP tasks as a single 'supertask', the chat dialogue.
What if we could extend this idea to images, audio, and video? Instead of separate encoders for each modality, let's reframe all inputs as a single 'supertask': video prediction.
Today's multimodal systems rely on separate encoders, custom architectures, and cross-modal fusion tricks. Instead, let's turn all inputs (text, images, audio, video) into the same format: a sequence of visual frames. Then we can train a single model to do one thing: predict what frame comes next.
Everything is a video.
A single interface. A single supertask. No special cases.
Let’s walk through each modality.
For text, we render each token as a frame of the video, using a fixed-width font scaled to fill each frame:
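As a rough illustration, here is a minimal sketch of this rendering step; the 64x64 frame size, greyscale output, and the DejaVuSansMono font are illustrative assumptions, not the paper's exact settings:

```python
# Minimal sketch: render whitespace-separated tokens as greyscale video frames.
# Frame size, greyscale mode, and the font path are illustrative assumptions.
import numpy as np
from PIL import Image, ImageDraw, ImageFont

FRAME_SIZE = (64, 64)  # (width, height) of every frame in the video
FONT_PATH = "DejaVuSansMono.ttf"  # any monospaced TTF installed on the system

def token_to_frame(token: str, frame_size=FRAME_SIZE) -> np.ndarray:
    """Render one token, scaling the font up until it just fills the frame."""
    img = Image.new("L", frame_size, color=0)
    draw = ImageDraw.Draw(img)
    size, font = 1, ImageFont.truetype(FONT_PATH, 1)
    while True:
        trial = ImageFont.truetype(FONT_PATH, size + 1)
        left, top, right, bottom = draw.textbbox((0, 0), token, font=trial)
        if right - left > frame_size[0] or bottom - top > frame_size[1]:
            break  # the next size would overflow the frame, keep the current one
        size, font = size + 1, trial
    draw.text((0, 0), token, fill=255, font=font)
    return np.asarray(img)

def text_to_video(text: str) -> np.ndarray:
    """One frame per token: returns an array of shape (num_tokens, H, W)."""
    return np.stack([token_to_frame(tok) for tok in text.split()])
```

For example, `text_to_video("What colour is the cube?")` yields a short clip of token frames that can be concatenated with frames from any other modality.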
This can also be applied to text generation, QA, and other text tasks.
For audio, we convert the sound into spectrograms — a visual stream of frequencies over time:
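A minimal sketch of this conversion is shown below, assuming librosa is available; the 16 kHz sample rate, 64 mel bands, and 64-column frames are illustrative choices rather than the paper's settings:

```python
# Minimal sketch: waveform -> log-mel spectrogram -> a stack of square greyscale frames.
import numpy as np
import librosa

def audio_to_frames(path: str, n_mels: int = 64, cols_per_frame: int = 64) -> np.ndarray:
    """Slice a log-mel spectrogram along time into frames of shape (n_mels, cols_per_frame)."""
    waveform, sr = librosa.load(path, sr=16000)
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # Normalise to [0, 255] so spectrogram slices look like ordinary greyscale frames.
    norm = 255 * (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
    n_frames = norm.shape[1] // cols_per_frame
    clipped = norm[:, : n_frames * cols_per_frame]                    # drop the ragged tail
    frames = clipped.reshape(n_mels, n_frames, cols_per_frame).transpose(1, 0, 2)
    return frames.astype(np.uint8)                                    # (num_frames, n_mels, cols_per_frame)
```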
For images, we simply treat each image as a single frame of the video:
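Concretely, this is just a resize to the shared frame resolution; a sketch, with greyscale conversion and the 64x64 resolution as assumptions:

```python
# Minimal sketch: a single image becomes a single frame at the shared resolution.
import numpy as np
from PIL import Image

def image_to_frame(path: str, frame_size=(64, 64)) -> np.ndarray:
    """Load an image, convert to greyscale, and resize it to the frame resolution."""
    return np.asarray(Image.open(path).convert("L").resize(frame_size))
```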
This can naturally be extended to image generation, inpainting, and other tasks.
We can combine any of the above modalities into a single video stream. For example, here is how the CLEVRER video question answering task (video + text -> text) can be formulated as video prediction:
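One way such an example could be assembled is sketched below, reusing the hypothetical `text_to_video` helper from the text sketch above; the exact prompt layout is an assumption:

```python
# Minimal sketch: CLEVRER-style VQA as next-frame prediction.
# The input stream is the video followed by the rendered question;
# the target continuation is the rendered answer.
import numpy as np

def build_vqa_example(video_frames: np.ndarray, question: str, answer: str):
    """video_frames: (T, H, W) greyscale frames already resized to the shared resolution."""
    question_frames = text_to_video(question)   # (Q, H, W)
    answer_frames = text_to_video(answer)       # (A, H, W)
    inputs = np.concatenate([video_frames, question_frames], axis=0)
    # Training teaches the model to continue this stream with the answer frames.
    return inputs, answer_frames
```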
This can also be applied to other modality combinations, such as video captioning (video + audio -> text).
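Whatever the mix of modalities, the resulting stream is trained with the same next-frame objective. A minimal sketch of one training step follows, using a toy convolutional predictor and MSE loss as illustrative stand-ins rather than the paper's actual model:

```python
# Minimal sketch: one next-frame-prediction training step on a batch of frame streams.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextFramePredictor(nn.Module):
    """Toy stand-in model: predicts frame t+1 from frame t with a small conv net.
    (The real model would condition on the full frame history.)"""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),
        )

    def forward(self, frame):  # frame: (N, 1, H, W), values scaled to [0, 1]
        return self.net(frame)

def training_step(model, frames, optimiser):
    """frames: (B, T, 1, H, W) float tensor built from any mix of modalities."""
    inputs, targets = frames[:, :-1], frames[:, 1:]
    preds = model(inputs.flatten(0, 1))              # predict each following frame
    loss = F.mse_loss(preds, targets.flatten(0, 1))
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```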
@inproceedings{hudson2025everything,
  title     = {Everything is a video: Unifying modalities through next-frame prediction},
  author    = {Hudson, G Thomas and Slack, Dean and Winterbottom, Thomas and Sterling, Jamie and Xiao, Chenghao and Shentu, Junjie and Moubayed, Noura Al},
  booktitle = {International Conference on Computer Vision (ICCV)},
  year      = {2025}
}