Call for Papers · International Journal of Computer Vision

Multimodal Unified Comprehension & Generation

An IJCV Special Issue on MUCG: multimodal large language models that jointly perceive, reason, and generate within a single cohesive architecture, closing the loop between understanding and synthesis.

01

Overview

Multimodal large language models now process images, video, audio, 3D data, and embodied sensory inputs. Crucially, these systems are evolving beyond the traditional silos of perception (classification, detection, grounding) and synthesis (image and video generation), and increasingly aim to unify comprehension and generation within a single, cohesive architecture.

We define this emerging paradigm as MUCG — Multimodal Unified Comprehension and Generation. Its defining characteristic is not modality or task coverage alone, but the alignment among perceptual grounding, internal reasoning, and generative outputs — heterogeneous inputs in, heterogeneous outputs out, evaluated in a closed loop. This Special Issue consolidates the conceptual, methodological, and evaluative foundations of this shift.

i

Task-Specific → Unified

From narrow task pipelines toward foundation models that understand, reason, and generate in a unified manner.

ii

Static → Interactive

Models deployed in embodied, interactive, and creative settings — comprehension guides generation; generation reveals comprehension.

iii

Isolated → Closed-Loop

New evaluation of understanding–generation consistency, grounded controllability, robustness, and reliability across tasks.

02

Aims & Scope

We seek principled frameworks, scalable learning paradigms, and reliable evaluation methodologies for unified multimodal systems. Topics of interest include, but are not limited to:

I

Unified Modeling Architectures

  • Single-backbone unified transformer frameworks
  • Encoder–LLM–decoder unified pipelines
  • Autoregressive–diffusion hybrid architectures
  • Tokenization & representation alignment across modalities
  • Structured intermediate representations (scene graphs, programs, plans)

II

Training & Alignment Strategies

  • Multi-task and curriculum learning strategies
  • Data mixture design for unified tasks
  • Instruction tuning & preference alignment
  • Reinforcement learning and feedback-driven refinement
  • Synthetic data generation & automatic annotation

III

Reasoning, Grounding & Controllability

  • Grounded multimodal reasoning
  • Planning-based generation
  • Closed-loop understanding–generation frameworks
  • Editable and controllable generation
  • Structured & explainable intermediate reasoning

IV

Evaluation & Benchmarks

  • Unified evaluation suites for comprehension & generation
  • Consistency and cycle-consistency metrics
  • Robustness, calibration, and reliability analysis

V

Efficiency & Scalability

  • Mixture-of-experts for multimodal systems
  • Efficient long-context modeling (video, multi-image, dialogue)
  • Compression, distillation, and deployment

VI

Applications

  • Embodied and robotic systems
  • Medical and scientific imaging
  • Remote sensing and industrial vision
  • Creative and interactive content systems

03

Submission & Reviewing

Prepare manuscripts according to the IJCV Submission Guidelines and select the article type “SI: Multimodal Unified Comprehension and Generation.” All papers are peer-reviewed following IJCV procedures by at least three independent reviewers.

Manuscripts must not have been published or be under review elsewhere. The cover letter should explain how the submission relates to the topic of this Special Issue. Papers receiving a Major Revision decision should be resubmitted within 3 months, and those receiving a Minor Revision decision within 1 month; every resubmission must include a detailed response to reviewers.

04

Important Dates

  1. 1 November 2026: Manuscript submission deadline
  2. 1 February 2027: First review notification
  3. 15 March 2027: Revised manuscript submission
  4. 1 May 2027: Final review notification
  5. May / June 2027: Special Issue publication

All deadlines are Anywhere on Earth (AOE, UTC−12).
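
For authors working in other time zones, the AOE convention can be converted to UTC with a few lines of Python. The sketch below is illustrative only: it assumes the deadline is read as 23:59:59 AOE on the listed date (the call does not state a time of day), and the helper name `aoe_deadline_in_utc` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Anywhere on Earth is a fixed offset of UTC-12 with no daylight saving time.
AOE = timezone(timedelta(hours=-12), name="AoE")


def aoe_deadline_in_utc(year: int, month: int, day: int) -> datetime:
    """Return the last second of the given AOE calendar day, expressed in UTC.

    Assumption: the deadline means 23:59:59 AOE on the listed date.
    """
    end_of_day_aoe = datetime(year, month, day, 23, 59, 59, tzinfo=AOE)
    return end_of_day_aoe.astimezone(timezone.utc)


# Manuscript submission deadline listed above: 1 November 2026 (AOE).
submission_deadline_utc = aoe_deadline_in_utc(2026, 11, 1)
print(submission_deadline_utc.isoformat())  # 2026-11-02T11:59:59+00:00
```

Under that assumption, the submission deadline falls at 11:59:59 UTC on 2 November 2026. To see it in local time, call `.astimezone()` on the result.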

05

Guest Editors

Mike Zheng Shou

National University of Singapore

06

FAQ

How and where do I submit?

Submit through the IJCV Editorial Manager and select the article type “SI: Multimodal Unified Comprehension and Generation.” Follow the standard IJCV author guidelines.

Is my paper a good fit for this Special Issue?

If your work advances unified multimodal architectures, training and alignment, reasoning and controllability, evaluation, efficiency, or applications — see the Aims & Scope above — it is likely in scope. When in doubt, contact the guest editors.

Will you accept conference extensions?

Yes. Conference-based extended papers are expected to contain at least 30% additional scientific contribution, for example new or improved algorithms or analysis, new experiments, or qualitative/quantitative comparisons. Such papers are acceptable provided the submission states that the manuscript extends a conference paper.

Do I need to wait until the deadline to submit my paper?

No. You are welcome to submit your manuscript at any time before the deadline. Once your submission is received, we will initiate the review process as soon as possible.

What are the revision timelines?

Papers with a Major Revision decision should be resubmitted within 3 months; Minor Revision within 1 month. Revised submissions must include a detailed response to reviewers.