We present Unified-IO 2, the first autoregressive multimodal model capable of understanding and generating images, text, audio, and action. To unify the different modalities, we tokenize inputs and outputs -- images, text, audio, action, bounding boxes, etc. -- into a shared semantic space and then process them with a single encoder-decoder transformer model. Since training with such a diverse set of modalities is extremely challenging, we propose various architectural improvements to stabilize model training. We train our model from scratch on a large multimodal pre-training corpus drawn from diverse sources with a multimodal mixture-of-denoisers objective. To learn an expansive set of skills, such as following multimodal instructions, we construct and finetune on an ensemble of 120 existing datasets with prompts and augmentations. With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results on more than 30 benchmarks spanning image generation and understanding, text understanding, video and audio understanding, and robotic manipulation. We release all our models to the research community.
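To make the core idea concrete, here is a minimal sketch (not the released implementation) of tokenizing heterogeneous modalities into one shared token space and processing them with a single encoder-decoder transformer. All module names, vocabulary sizes, and the toy per-modality "tokenizers" below are illustrative assumptions; the actual model uses dedicated tokenizers (e.g., VQ-style image/audio tokenizers and a text tokenizer) and a much larger architecture.

```python
# Illustrative sketch only: a shared vocabulary + one encoder-decoder transformer
# for all modalities. Sizes and tokenizers are toy assumptions, not the paper's.
import torch
import torch.nn as nn

VOCAB_SIZE = 1024   # shared discrete vocabulary covering all modalities (assumed)
D_MODEL = 256

class ToyModalityTokenizer(nn.Module):
    """Maps raw per-modality features to discrete ids in the shared vocabulary."""
    def __init__(self, feature_dim: int, vocab_offset: int, vocab_span: int):
        super().__init__()
        self.proj = nn.Linear(feature_dim, vocab_span)
        self.vocab_offset = vocab_offset

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # (batch, seq, feature_dim) -> (batch, seq) of shared-vocab token ids
        return self.proj(feats).argmax(dim=-1) + self.vocab_offset

class UnifiedSeq2Seq(nn.Module):
    """One embedding table and one encoder-decoder transformer for every modality."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True)
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, src_ids: torch.Tensor, tgt_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.transformer(self.embed(src_ids), self.embed(tgt_ids))
        return self.lm_head(hidden)  # next-token logits over the shared vocabulary

# Usage: tokenize "text" and "image" features, concatenate into one input sequence,
# and predict target tokens that could decode into any output modality.
text_tok = ToyModalityTokenizer(feature_dim=32, vocab_offset=0, vocab_span=512)
image_tok = ToyModalityTokenizer(feature_dim=64, vocab_offset=512, vocab_span=512)

text_ids = text_tok(torch.randn(1, 8, 32))     # (1, 8)
image_ids = image_tok(torch.randn(1, 16, 64))  # (1, 16)
src = torch.cat([text_ids, image_ids], dim=1)  # single interleaved sequence
tgt = torch.randint(0, VOCAB_SIZE, (1, 10))    # target tokens (any output modality)

logits = UnifiedSeq2Seq()(src, tgt)
print(logits.shape)  # torch.Size([1, 10, 1024])
```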
@article{lu2023uio2,
  title   = {Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action},
  author  = {Jiasen Lu and Christopher Clark and Sangho Lee and Zichen Zhang and Savya Khosla and Ryan Marten and Derek Hoiem and Aniruddha Kembhavi},
  journal = {arXiv preprint arXiv:2312.17172},
  year    = {2023},
}