IJCV Special Issue — Multimodal Unified Comprehension and Generation (MUCG)

01

Overview

Multimodal large language models now process images, video, audio, 3D data, and embodied sensory inputs. Crucially, these systems are evolving beyond the traditional silos of perception (classification, detection, grounding) and synthesis (image and video generation), and increasingly aim to unify comprehension and generation within a single, cohesive architecture.

We define this emerging paradigm as MUCG — Multimodal Unified Comprehension and Generation. Its defining characteristic is not modality or task coverage alone, but the alignment among perceptual grounding, internal reasoning, and generative outputs — heterogeneous inputs in, heterogeneous outputs out, evaluated in a closed loop. This Special Issue consolidates the conceptual, methodological, and evaluative foundations of this shift.

i

Task-Specific → Unified

From narrow task pipelines toward foundation models that understand, reason, and generate in a unified manner.

ii

Static → Interactive

Models deployed in embodied, interactive, and creative settings — comprehension guides generation; generation reveals comprehension.

iii

Isolated → Closed-Loop

New evaluation protocols for understanding–generation consistency, grounded controllability, robustness, and reliability across tasks.

02

Aims & Scope

We seek principled frameworks, scalable learning paradigms, and reliable evaluation methodologies for unified multimodal systems. Topics of interest include, but are not limited to:

I

Unified Modeling Architectures

Single-backbone unified transformer frameworks
Encoder–LLM–decoder unified pipelines
Autoregressive–diffusion hybrid architectures
Tokenization & representation alignment across modalities
Structured intermediate representations (scene graphs, programs, plans)

II

Training & Alignment Strategies

Multi-task and curriculum learning strategies
Data mixture design for unified tasks
Instruction tuning & preference alignment
Reinforcement learning and feedback-driven refinement
Synthetic data generation & automatic annotation

III

Reasoning, Grounding & Controllability

Grounded multimodal reasoning
Planning-based generation
Closed-loop understanding–generation frameworks
Editable and controllable generation
Structured & explainable intermediate reasoning

IV

Evaluation & Benchmarks

Unified evaluation suites for comprehension & generation
Consistency and cycle-consistency metrics
Robustness, calibration, and reliability analysis

V

Efficiency & Scalability

Mixture-of-experts for multimodal systems
Efficient long-context modeling (video, multi-image, dialogue)
Compression, distillation, and deployment

VI

Applications

Embodied and robotic systems
Medical and scientific imaging
Remote sensing and industrial vision
Creative and interactive content systems

03

Submission & Reviewing

Prepare manuscripts according to the IJCV Submission Guidelines and select the article type “SI: Multimodal Unified Comprehension and Generation.” All papers are peer-reviewed following IJCV procedures by at least three independent reviewers.

Manuscripts must not be published or under review elsewhere. Each submission should explain, in a cover letter, how it relates to the topic of this Special Issue. Papers receiving a Major Revision decision should be resubmitted within 3 months, and papers receiving a Minor Revision decision within 1 month. Every revised submission must include a detailed response to reviewers.

04

Important Dates

1 November 2026: Manuscript submission deadline
1 February 2027: First review notification
15 March 2027: Revised manuscript submission
1 May 2027: Final review notification
May / June 2027: Special Issue publication

All deadlines are Anywhere on Earth (AOE, UTC−12).
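
For authors planning around the AoE convention, the following minimal sketch converts a deadline date to UTC. It assumes that each deadline runs until 23:59:59 AoE on the listed day, which the call does not state explicitly; the function name `aoe_end_of_day_in_utc` is hypothetical and used only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Anywhere on Earth (AoE) is the fixed offset UTC-12, as stated above.
AOE = timezone(timedelta(hours=-12), name="AoE")


def aoe_end_of_day_in_utc(year: int, month: int, day: int) -> datetime:
    """Return the last second of the given AoE calendar day, expressed in UTC.

    Assumption (not stated in the call): a deadline date runs until
    23:59:59 AoE on that day.
    """
    end_of_day_aoe = datetime(year, month, day, 23, 59, 59, tzinfo=AOE)
    return end_of_day_aoe.astimezone(timezone.utc)


if __name__ == "__main__":
    # Manuscript submission deadline listed above: 1 November 2026.
    deadline_utc = aoe_end_of_day_in_utc(2026, 11, 1)
    print(deadline_utc.isoformat())  # 2026-11-02T11:59:59+00:00
```

Under that assumption, the 1 November 2026 submission deadline corresponds to 11:59:59 UTC on 2 November 2026. Authors should still confirm the exact cut-off in Editorial Manager.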

05

Guest Editors

Hao Fei, University of Oxford
Shengqiong Wu, University of Oxford
Xiaohan Wang, Stanford University
Mike Zheng Shou, National University of Singapore
Ziwei Liu, Nanyang Technological University
Jianfei Cai, Monash University
Lu Jiang, Apple
Yong Jae Lee, UW–Madison & Adobe Research
Mohit Bansal, UNC Chapel Hill
Ming-Hsuan Yang, UC Merced

06

FAQ

How and where do I submit?

Submit through the IJCV Editorial Manager and select the article type “SI: Multimodal Unified Comprehension and Generation.” Follow the standard IJCV author guidelines.

Is my paper a good fit for this Special Issue?

If your work advances unified multimodal architectures, training and alignment, reasoning and controllability, evaluation, efficiency, or applications — see the Aims & Scope above — it is likely in scope. When in doubt, contact the guest editors.

Will you accept conference extensions?

Yes. Extended versions of conference papers are expected to contain at least 30% additional scientific contribution, for example new or improved algorithms or analysis, new experiments, or additional qualitative or quantitative comparisons. Such papers are acceptable provided the submission states that it extends a conference paper.

Do I need to wait until the deadline to submit my paper?

No. You are welcome to submit your manuscript at any time before the deadline. Once your submission is received, we will initiate the review process as soon as possible.

What are the revision timelines?

Papers with a Major Revision decision should be resubmitted within 3 months; Minor Revision within 1 month. Revised submissions must include a detailed response to reviewers.
