Advanced Scene Understanding Methods and Applications in Multi-Modal Data (Submission Deadline: Nov. 26, 2025)



Chair: Weiyao Lin, Shanghai Jiao Tong University, China

Co-chair: Xiankai Lu, Shandong University, China



Topics:

  • Multimodal Data Representation and Modeling


  • Large Multimodal Models and Applications


  • Multimodal Scene Understanding and Inference


  • Multimodal Data Alignment and Fusion


  • Cross-Modal Pattern Recognition Based on Text-Video and Text-Image Data


  • Object Segmentation, Detection, and Recognition Based on 2D, 3D, and Egocentric/Exocentric Video Data


  • Advanced Scene Understanding Methods and Applications in Multi-Modal Data



Summary:

Deep learning-based intelligent algorithms have demonstrated remarkable versatility and power across a wide range of domain-specific tasks. Despite these achievements, existing unimodal models still fall short of the diverse requirements of everyday applications. This has spurred researchers to explore multimodal pattern recognition, where models exemplified by CLIP have significantly advanced multimodal scene understanding. More recently, Large Multimodal Models (LMMs) such as Gemini (Google) and Sora (OpenAI) have demonstrated powerful abilities to comprehend or create realistic and imaginative videos. Although deep learning-based multimodal algorithms have attracted widespread attention, they face numerous challenges when processing dynamic visual scenes, including:

  • Integrating and aligning multimodal information (e.g., video, audio, 3D data)

  • Addressing domain-shift issues

  • Handling noisy data and labeling defects

  • Discovering novel objects or patterns

Furthermore, infusing temporal consistency and coherence into these algorithms remains a significant challenge for multimodal scene understanding. This special session aims to provide a platform for researchers to share the latest advances in multimodal model theories, methodologies, and applications. We also cordially invite submissions exploring the potential of multimodal data to enhance the diversity and inclusivity of scene understanding.
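For readers less familiar with the CLIP-style alignment referenced above, the following is a minimal sketch of the symmetric contrastive (InfoNCE) objective that underlies such text-image alignment. The random embeddings, dimensions, and temperature value are illustrative assumptions, not taken from any specific model; in practice the embeddings would come from trained image and text encoders.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a CLIP-style symmetric contrastive objective.
# Embeddings here are random stand-ins for encoder outputs; the batch
# size, dimension, and temperature below are illustrative assumptions.
batch_size, dim = 8, 512
image_emb = F.normalize(torch.randn(batch_size, dim), dim=-1)
text_emb = F.normalize(torch.randn(batch_size, dim), dim=-1)

temperature = 0.07  # assumed fixed here; CLIP learns it as a parameter
logits = image_emb @ text_emb.t() / temperature  # pairwise cosine similarities

# Matched image-text pairs lie on the diagonal of the similarity matrix,
# so the i-th image's target class is index i (and symmetrically for text).
targets = torch.arange(batch_size)
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
print(f"contrastive alignment loss: {loss.item():.4f}")
```

Minimizing this loss pulls matched image-text pairs together in the shared embedding space while pushing mismatched pairs apart, which is the basic mechanism behind the multimodal alignment and fusion topics listed above.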

