We present Unified-IO 2, the first autoregressive multimodal model capable of understanding and generating images, text, audio, and action. To unify the different modalities, we tokenize inputs and outputs -- images, text, audio, action, bounding boxes, etc. -- into a shared semantic space and then process them with a single encoder-decoder transformer model. Since training with such a diverse set of modalities is extremely challenging, we propose various architectural improvements to stabilize model training. We train our model from scratch on a large multimodal pre-training corpus drawn from diverse sources with a multimodal mixture-of-denoisers objective. To learn an expansive set of skills, such as following multimodal instructions, we construct and finetune on an ensemble of 120 existing datasets with prompts and augmentations. With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results on more than 30 benchmarks spanning image generation and understanding, text understanding, video and audio understanding, and robotic manipulation. We release all our models to the research community.
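To make the core idea concrete, here is a minimal sketch (not the released implementation) of tokenizing heterogeneous modalities into one shared token space and processing them with a single encoder-decoder transformer. All module names, vocabulary sizes, and the toy per-modality "tokenizers" below are illustrative assumptions; the actual model uses dedicated tokenizers (e.g., VQ-style image/audio tokenizers and a text tokenizer) and a much larger architecture.

```python
# Illustrative sketch only: a shared vocabulary + one encoder-decoder transformer
# for all modalities. Sizes and tokenizers are toy assumptions, not the paper's.
import torch
import torch.nn as nn

VOCAB_SIZE = 1024   # shared discrete vocabulary covering all modalities (assumed)
D_MODEL = 256

class ToyModalityTokenizer(nn.Module):
    """Maps raw per-modality features to discrete ids in the shared vocabulary."""
    def __init__(self, feature_dim: int, vocab_offset: int, vocab_span: int):
        super().__init__()
        self.proj = nn.Linear(feature_dim, vocab_span)
        self.vocab_offset = vocab_offset

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # (batch, seq, feature_dim) -> (batch, seq) of shared-vocab token ids
        return self.proj(feats).argmax(dim=-1) + self.vocab_offset

class UnifiedSeq2Seq(nn.Module):
    """One embedding table and one encoder-decoder transformer for every modality."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True)
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, src_ids: torch.Tensor, tgt_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.transformer(self.embed(src_ids), self.embed(tgt_ids))
        return self.lm_head(hidden)  # next-token logits over the shared vocabulary

# Usage: tokenize "text" and "image" features, concatenate into one input sequence,
# and predict target tokens that could decode into any output modality.
text_tok = ToyModalityTokenizer(feature_dim=32, vocab_offset=0, vocab_span=512)
image_tok = ToyModalityTokenizer(feature_dim=64, vocab_offset=512, vocab_span=512)

text_ids = text_tok(torch.randn(1, 8, 32))     # (1, 8)
image_ids = image_tok(torch.randn(1, 16, 64))  # (1, 16)
src = torch.cat([text_ids, image_ids], dim=1)  # single interleaved sequence
tgt = torch.randint(0, VOCAB_SIZE, (1, 10))    # target tokens (any output modality)

logits = UnifiedSeq2Seq()(src, tgt)
print(logits.shape)  # torch.Size([1, 10, 1024])
```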
@article{lu2023uio2,
  title   = {Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action},
  author  = {Jiasen Lu and Christopher Clark and Sangho Lee and Zichen Zhang and Savya Khosla and Ryan Marten and Derek Hoiem and Aniruddha Kembhavi},
  journal = {arXiv preprint arXiv:2312.17172},
  year    = {2023},
}