MUCG @ ACM MM 2025


The 1st International Workshop on MLLM for Unified Comprehension and Generation


Dublin, Ireland

October 27, 2025

Workshop Introduction

As Multimodal Large Language Models (MLLMs) continue to advance, there is a growing need to bridge the gap between their comprehension and generation capabilities within unified frameworks. This workshop, MLLM for Unified Comprehension and Generation (MUCG), aims to explore and address the fundamental challenges in developing truly integrated MLLMs that can seamlessly understand and create multimodal content. We focus on three interconnected areas:

  • Sophisticated multimodal comprehension, targeting robust understanding of complex visual content and semantic relationships.
  • Controllable content generation, addressing challenges in high-fidelity synthesis and cross-modal consistency.
  • Unified frameworks that enable semantic alignment between understanding and generation tasks.

Unlike previous approaches that treat these capabilities separately, our workshop specifically targets their integration through MLLMs, fostering focused discussions on shared architectures, bidirectional knowledge transfer, and end-to-end training strategies.

WeChat Group

Scan to join our WeChat group


Call for Papers: Topics

We welcome paper submissions on all topics related to unified MLLMs, including but not limited to:

MLLM for Multimodal Comprehension & Reasoning

  • Single/Multiple Image Understanding
  • Short/Long Video Understanding
  • 3D Scene/Object Comprehension
  • Visual Document Understanding
  • Multi-view Scene Analysis
  • Complex Visual Reasoning
  • Temporal-Spatial Understanding
  • Cross-modal Knowledge Extraction
  • Visual Relationship Detection
  • Visual Question Answering
  • Visual Information Retrieval
  • Scene Graph Understanding
  • Cross-modal/Interleaved Reasoning
  • Multimodal Chain-of-Thought Reasoning

MLLM for Multimodal Content Generation

  • Text-to-Image/Video Synthesis
  • 3D Content Generation
  • Motion Sequence Generation
  • Visual Story Generation
  • Multi-image Coherent Generation
  • Auto-regressive Visual Generation
  • Layout-to-Image Generation
  • Cross-modal Style Transfer
  • Visual Content Editing
  • Multimodal Dialogue Generation
  • Sequential Image Generation
  • Conditioned Visual Generation

Unified MLLM Understanding and Generation

  • Unified Encoder-Decoder Frameworks
  • Joint Vision-Language Models
  • Agentic Systems
  • Autoregressive/Transfusion LLMs
  • Multi-task Learning Strategies
  • Cross-task Knowledge Transfer
  • Shared Representation Learning
  • Vision-Language Alignment
  • Multimodal Fusion Methods
  • End-to-end Training Approaches
  • Instruction Tuning for Unification
  • Unified Tokenization Strategies
  • Bidirectional Generation Methods

Schedule

We plan to hold the workshop in a hybrid format, i.e., both onsite and online. For the onsite part, at least three organizers will attend in person to host the workshop. The workshop includes two major activities: invited keynotes, followed by presentations of accepted papers.

Our workshop will be held on the morning of October 27 in the Higgins 2 room of the Dublin Royal Convention Centre (Royal CC). The session runs from 09:00 to 12:30.

Time | Session | Presenter
9:00-9:05 | Opening | Organizers
9:05-9:45 | Keynote Talk - I: Multimedia Analytics: Bringing together images, text, metadata, relations, knowledge, and users | Prof Marcel Worring
9:45-10:30 | Keynote Talk - II: Beyond Text Prompts: Consistency and Physics in Visual Generation | Prof Jianfei Cai
10:30-11:00 | Coffee Break & Poster Session |
11:00-11:15 | Oral - I: Boosting Temporal Sentence Grounding via Causal Inference | Kefan Tang
11:15-11:30 | Oral - II: HRSeg: High-Resolution Visual Perception and Enhancement for Reasoning Segmentation | Weihuang Lin
11:30-11:45 | Oral - III: FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents | Bobo Li
11:45-12:00 | Oral - IV: See, Localize and Verify: A GRPO-Powered Framework for Enhancing Factual Accuracy in Multimodal Models | Xuan Li
12:00-12:15 | Oral - V: Agent Network for Multimodal Video Understanding | Meiyi Lu
12:15-12:30 | Oral - VI: MIHBench: Benchmarking and Mitigating Multi-Image Hallucinations in Multimodal Large Language Models | Jiale Li
12:30-13:15 | Keynote Talk - III (to be confirmed): Multimodal LLMs as Social Media Analysis Engines | Prof Jiebo Luo

Keynote Speakers

We are pleased to have the following invited speakers give keynote talks!

Prof Marcel Worring

University of Amsterdam
Keynote Title:

Multimedia Analytics: Bringing together images, text, metadata, relations, knowledge, and users

Abstract:

Multimodal Language Models have blurred the distinction among modalities, yet the interaction with the models is still focused on conversations based on textual prompts. Visual Analytics on the other hand is a field that has its main focus on providing visual means to help users in understanding large data collections. Multimedia Analytics brings those two worlds together. In this talk we present a new model for multimedia analytics and illustrate its use by considering systems we have recently developed for understanding art in context and for incident investigation and how the model could help in bringing these applications to the next level. We conclude by considering some major research challenges that need to be addressed in realizing the proposed model.

Bio:

Marcel Worring is a full professor in the Informatics Institute of the University of Amsterdam, where he is leading the MultiX research group. The focus of the group is developing techniques for bringing together humans and multimodal AI, surpassing human and machine intelligence for responsible impact in public health, forensics and law enforcement, cultural heritage, and data-driven business. He is co-founder of the Innovation Center for AI and a fellow of ELLIS. He was co-chair of ACM Multimedia 2016 and MMM 2024, and will be co-chair of ICMR 2026, all in Amsterdam. He has been associate editor of ACM TOMM, IEEE Transactions on Multimedia, and IEEE Multimedia.

Prof Jianfei Cai

Monash University
Keynote Title:

Beyond Text Prompts: Consistency and Physics in Visual Generation

Abstract:

Recent advances in large language models (LLMs) and multimodal large language models (MLLMs) have significantly enhanced the understanding and encoding of textual information. Leveraging these capabilities, a growing number of diffusion-based generative models have emerged for text-conditioned visual generation, spanning text-to-image, text-to-video, and text-to-3D tasks. While these models offer remarkable flexibility and produce increasingly realistic content, they still face fundamental challenges: aligning precisely with user intent, maintaining spatial, view, and temporal consistency, and adhering to the laws of physics. In this talk, I will present several recent research projects from my group that attack these challenges: PanFusion enforces global consistency in text-to-panorama image generation; MVSplat360 uses image conditions and an explicit 3D representation to enhance view consistency in 3D generation; and VLIPP integrates physics-informed priors to ensure physically plausible text-to-video generation. I will conclude by pointing out the limitations and discussing future directions such as developing world models.

Bio:

Jianfei Cai is a Professor at the Faculty of IT, Monash University, where he served as the inaugural Head of the Data Science & AI Department. Before that, he was Head of the Visual and Interactive Computing Division and Head of the Computer Communications Division at Nanyang Technological University (NTU). His major research interests include computer vision, deep learning and multimedia. He is a co-recipient of paper awards in ACCV, ICCM, IEEE ICIP and MMSP, and a winner of Monash FIT's Dean's Researcher of the Year Award and Monash FIT Dean's Award for Excellence in Graduate Research Supervision. He serves or has served as an Associate Editor for TPAMI, IJCV, IEEE T-IP, T-MM, and T-CSVT, as well as Senior/Area Chair for CVPR, ICCV, ECCV, ACM Multimedia, ICLR and IJCAI. He was the Chair of the IEEE CAS VSPC-TC during 2016-2018. He served as the leading TPC Chair for IEEE ICME 2012, the Best Paper Award committee chair and co-chair for IEEE T-MM in 2020 and 2019, and the leading General Chair for ACM Multimedia 2024. He is a Fellow of IEEE.

Prof Jiebo Luo

University of Rochester
Keynote Title:

Multimodal LLMs as Social Media Analysis Engines

Abstract:

Recent research has offered insights into the extraordinary capabilities of Multimodal Large Language Models (MLLMs) in various general vision and language tasks. There is growing interest in how MLLMs perform in more specialized domains. Social media content, inherently multimodal, blends text, images, videos, and sometimes audio. To effectively understand such content, models need to interpret the intricate interactions between these diverse communication modalities and their impact on the conveyed message. Understanding social multimedia content remains a challenging problem for contemporary machine learning frameworks. To evaluate MLLMs' capabilities for social multimedia analysis, we select five representative tasks, including sentiment analysis, hate speech detection, fake news identification, demographic inference, and political ideology detection. Our investigation begins with a preliminary quantitative analysis for each task using existing benchmark datasets, followed by a careful review of the results and a selection of qualitative samples that illustrate GPT-4V's potential in understanding multimodal social media content. GPT-4V demonstrates remarkable efficacy in these tasks, showcasing strengths such as joint understanding of image-text pairs, contextual and cultural awareness, and extensive commonsense knowledge. In addition to the known hallucination problem, notable challenges remain as GPT-4V struggles with tasks involving multilingual social multimedia comprehension and has difficulties in generalizing to the latest trends in social media. We further present several attempts to improve the performance on some tasks. The insights gleaned from our findings underscore a promising future for MLLMs in enhancing our understanding of social media content and its users through the analysis of multimodal information.

Bio:

Jiebo Luo is the Albert Arendt Hopeman Professor of Engineering and Professor of Computer Science at the University of Rochester, which he joined in 2011 after a prolific career of fifteen years at Kodak Research Laboratories. He has authored over 600 technical papers and holds over 90 U.S. patents. His research interests include computer vision, NLP, machine learning, data mining, computational social science, and digital health. He has been involved in numerous technical conferences, including serving as program co-chair of ACM Multimedia 2010, IEEE CVPR 2012, ACM ICMR 2016, and IEEE ICIP 2017, and general co-chair of ACM Multimedia 2018 and IEEE ICME 2024. He has served on the editorial boards of the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), IEEE Transactions on Multimedia (TMM), IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), IEEE Transactions on Big Data (TBD), ACM Transactions on Intelligent Systems and Technology (TIST), Pattern Recognition, Knowledge and Information Systems (KAIS), Machine Vision and Applications (MVA), and Intelligent Medicine. He was the Editor-in-Chief of the IEEE Transactions on Multimedia (2020-2022). Professor Luo is a Fellow of ACM, AAAI, IEEE, SPIE, and IAPR, as well as a Member of Academia Europaea and the US National Academy of Inventors (NAI). Professor Luo received the ACM SIGMM Technical Achievement Award in 2021 and the William H. Riker University Award for Excellence in Graduate Teaching in 2024.

Paper Submission and Reviewing

Submission Types

  • Technical Papers

    Original research contributions that present novel ideas, methods, or results in the area of multimodal large language models for unified comprehension and generation. These papers should not exceed 8 pages (plus unlimited pages for references).

  • Perspective Papers

    Position papers that discuss new perspectives, challenges, or future directions in the field. These should also be up to 8 pages in length.

  • Demonstration Papers

    Descriptions of working systems, tools, or demonstrations that showcase innovative applications of MLLMs. These papers should be up to 4 pages long.

  • Extended Abstracts

    Non-archival extended abstracts of previously published work or work in progress. These should be up to 2 pages in length.

Important Guidelines

  • Formatting

    Submitted papers (PDF format) must use the ACM Article Template in the traditional double-column format, and please remember to add CCS Concepts and Keywords. Word users should use the Word Interim Template, and LaTeX users should use the sample-sigconf-authordraft template. When using the sample-sigconf-authordraft template, you may use the following documentclass command for submission and review instead of the one provided in the template: \documentclass[sigconf, screen, review, anonymous]{acmart}. A minimal LaTeX sketch is given after this list. Please ensure that your submission follows this format for full consideration during the review process.

  • Double-Blind Review

    Submissions must be anonymized to ensure a fair review process. Authors should not identify themselves in the paper.

  • Supplementary Material

    Authors may submit supplementary material (e.g., code, data, videos) up to 100MB. This should be referenced in the paper but not included in the page limit.
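
For LaTeX users, the snippet below is a minimal, illustrative sketch of a submission skeleton using the documentclass command given under Formatting above. The title, author placeholders, CCS concept, and keywords are hypothetical examples only; the official sample-sigconf-authordraft template remains the authoritative reference.

    \documentclass[sigconf, screen, review, anonymous]{acmart}
    % 'review' adds line numbers to aid reviewers; 'anonymous' suppresses author names in the PDF.

    \begin{document}

    \title{Your MUCG'25 Submission Title}  % placeholder title

    % Author information is entered as usual; the 'anonymous' option hides it in the review copy.
    \author{Anonymous Author(s)}
    \affiliation{
      \institution{Anonymous Institution}
      \country{Anonymous}
    }

    \begin{abstract}
      A one-paragraph summary of the contribution.
    \end{abstract}

    % CCS Concepts and Keywords, as required by the ACM template (illustrative choices).
    \ccsdesc[500]{Computing methodologies~Artificial intelligence}
    \keywords{multimodal large language models, unified comprehension and generation}

    \maketitle

    \section{Introduction}
    % Paper body: up to 8 pages for Technical and Perspective papers, plus references.

    \end{document}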

A Best Paper Award will be selected from the accepted papers and announced during the workshop.

Reviewing Process

All submissions will undergo a rigorous double-blind peer review process. Each paper will be evaluated by at least three members of the program committee based on:

  • Relevance to MLLM for unified comprehension and generation
  • Originality and significance of contributions
  • Technical quality and depth
  • Clarity of presentation
  • Potential impact on the field

Workshop Important Dates

  • Paper Submission Start: 08 April, 2025 (AoE)
  • Paper Submission Deadline: 27 July, 2025 (AoE)
  • Notification of Acceptance: 03 August, 2025 (AoE)
  • Camera-ready Submission: 11 August, 2025 (AoE)
  • Workshop dates: 27-28 October, 2025 (AoE)

Challenge

🏆 General-Level of Multimodal Generalist

We are excited to announce the General-Level of Multimodal Generalist Challenge, hosted in conjunction with the MUCG'25 workshop. This open challenge invites researchers and practitioners to develop and evaluate MLLMs/Agents that demonstrate generalist capabilities across diverse tasks and modalities.

The challenge is based on the General-Level platform, which provides a comprehensive evaluation suite (General-Level) and an extensive benchmark (General-Bench) for assessing the generalization abilities of MLLMs. Participants will have the opportunity to test their models on a wide range of tasks that require unified comprehension and generation across multiple modalities.

🎯 Challenge Tracks

The challenge comprises four distinct tracks, each focusing on different aspects of multimodal generalization:

  • 👑 Scope-A: Full-spectrum Hero: Full-spectrum leaderboard covering all modalities and tasks under General-Level, for highly capable, general-purpose multimodal models.
  • 💎 Scope-B: Modality-specific Unified Hero: Modality-specific leaderboards focusing on single modality or partially joint modality (e.g., image, video, audio, 3D) for modality-wise generalists.
  • 💪 Scope-C: Comprehension/Generation Hero: Leaderboards categorized by comprehension vs. generation tasks within each modality. Lower entry barrier for early-stage or lightweight models.
  • 🛠️ Scope-D: Skill-specific Hero: Fine-grained leaderboards focused on specific task clusters (e.g., VQA, Captioning, Speech Recognition), ideal for partial generalists.

Each track is designed to evaluate specific capabilities of MLLMs, encouraging the development of models that can generalize effectively across different types of tasks and data.

📅 Challenge Important Dates

  • Challenge Registration Start: 20 May, 2025 (AoE)
  • Challenge Registration End: 10 July, 2025 (AoE)
  • Notification of Rankings: 30 July, 2025 (AoE)
  • Challenge Submission Deadline: 5 August, 2025 (AoE)
  • Paper Camera-ready Submission: 11 August, 2025 (AoE)
  • Workshop dates: 27-28 October, 2025 (AoE)

📝 Participation Guidelines

  • Registration: Interested participants should register by filling out the Google Form.
  • Data and Tools: Visit the official Leaderboard site to obtain the datasets and the evaluation suite.
  • Submission: Participants must submit their model predictions via the Submit Page by the submission deadline for evaluation.

🏅 Awards and Recognition

Top-performing teams in each track will receive cash awards and certificates. Outstanding teams will be invited to write a technical paper for inclusion in the MUCG'25 workshop proceedings and to present it at the workshop.

Learn more about the challenge on the Multimodal-Generalist website and in the accompanying paper.

For further inquiries, please contact the challenge organizers at mugc-workshopmm25@googlegroups.com.

Organization Team

Organizing Committee

Jiayi Ji

National University of Singapore

Hao Fei

National University of Singapore

Gen Luo

Shanghai Artificial Intelligence Laboratory

Yaoting Wang

University of Edinburgh

Liang Zheng

Australian National University

Chia-Wen Lin

National Tsing Hua University

Shuicheng Yan

National University of Singapore

Rongrong Ji

Xiamen University

Tat-Seng Chua

National University of Singapore

Challenge Committee

Shengqiong Wu

National University of Singapore

Jinfa Huang

University of Rochester

Daoan Zhang

University of Rochester

Program Committee

Steering Committee:
  • Xuying Zhang (Nankai University)
  • Yiwei Ma (Xiamen University)
Program Committee:
  • Wei Ji (Nanjing University)
  • Zhenglin Zhou (Zhejiang University)
  • Changli Wu (Xiamen University)
  • Weihuang Lin (Xiamen University)
  • Qi Chen (Xiamen University)
  • Lvpan Cai (Xiamen University)
  • Ke Ye (Xiamen University)
  • Sunhao Dai (Renmin University of China)

Contact

Join and post at our Google Group!
Email the organizers at mugc-workshopmm25@googlegroups.com.