Recently, the Telecommunication Standardization Sector of the International Telecommunication Union (ITU-T) approved the release of ITU-T F.748.72, "Requirements and Framework for Multimodal Generative AI Empowered Multi-View Conversion Systems", an international standard for multimodal large models led by Alibaba Cloud. The standard is the first to systematically define the overall architecture, core functions, and application requirements of a multimodal-large-model-based multi-view transformation (MEMVT) system, giving the industry a unified technical specification. It addresses key problems such as spatial perception errors, target loss, and inaccurate predictions caused by occluded viewpoints, sensor failures, or fragmented information, and accelerates the large-scale deployment of multimodal large models in industrial scenarios.

Cracking the bottleneck of spatial perception in complex scenes
Scenarios such as smart highways, automated ports, and high-level autonomous driving rely heavily on a precise understanding of physical space. Traditional multi-view conversion systems are typically built on convolutional neural networks (CNNs); constrained by narrow sensing coverage and weak contextual reasoning, they struggle with complex conditions such as spatial occlusion, dense targets, low nighttime illumination, and sensor failures, leading to unstable perception results and unreliable decision-making.
Multimodal large models offer a new path. With their cross-modal fusion and generation capabilities, an MEMVT system can jointly process heterogeneous data such as images, video, LiDAR point clouds, millimeter-wave radar, and high-precision maps. Trained on massive data, it learns target motion patterns and spatial semantics, intelligently "completes" occluded areas, repairs missing information, and generates a high-fidelity, consistent unified perspective, such as a bird's-eye view (BEV), significantly improving the completeness, robustness, and accuracy of spatial perception.
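The compensation idea can be illustrated with a minimal sketch (all names here are hypothetical, not from the standard): two partially overlapping sensor views are merged into a single BEV occupancy grid, with each view covering cells the other cannot see. A real MEMVT system would use learned encoders rather than this toy max-pooling rule.

```python
import numpy as np

def fuse_to_bev(views, grid_size=(8, 8)):
    """Merge per-sensor occupancy estimates into one BEV grid.

    Each view maps (row, col) BEV cells to a confidence in [0, 1];
    cells a sensor cannot see are simply absent. Toy illustration:
    the strongest evidence from any view wins per cell.
    """
    bev = np.zeros(grid_size)
    for view in views:
        for (r, c), conf in view.items():
            bev[r, c] = max(bev[r, c], conf)  # keep strongest evidence
    return bev

# The camera covers near cells; the LiDAR sees a cell the camera misses.
camera = {(0, 0): 0.9, (0, 1): 0.8}
lidar = {(0, 1): 0.6, (5, 5): 0.95}  # fills the camera's blind spot

bev = fuse_to_bev([camera, lidar])
print(bev[0, 0], bev[0, 1], bev[5, 5])  # 0.9 0.8 0.95
```

The point of the sketch is only that independent views compensate for each other's occlusions once projected into a shared spatial frame.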
Standardization Framework: From Basic Abilities to Industrial Applications

The ITU-T F.748.72 standard specifies that an MEMVT system consists of three core modules: a multi-view source encoder, a view transformation encoder, and a multi-task decoder. During training, the system extracts features from general, empirical, and feedback data, generates standardized single-view tokens, and maps them into multi-view representations in a unified space. During inference, it delivers two levels of capability on this foundation:
- Basic capabilities: information compensation, multi-view fusion and completion, and multimodal temporal fusion.
- Application capabilities: target tracking and decision assistance, panoramic visualization, behavior prediction, enhanced target analysis, automatic generation of simulation scene libraries, real-time vehicle collaborative control optimization, and more.
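The three-module pipeline described above might be sketched as follows. All function names and data formats here are illustrative placeholders; the standard defines requirements and a framework, not an implementation, and real systems would use learned neural encoders and decoders.

```python
def source_encoder(raw_views):
    """Multi-view source encoder: turn each raw sensor view into a
    standardized single-view token (placeholder string encoding)."""
    return [f"tok({v})" for v in raw_views]

def view_transform_encoder(tokens):
    """View transformation encoder: map single-view tokens into one
    unified multi-view representation in a shared space."""
    return " + ".join(tokens)

def multi_task_decoder(unified, tasks):
    """Multi-task decoder: emit one output per requested task from the
    shared representation (e.g. tracking, panoramic visualization)."""
    return {t: f"{t}<{unified}>" for t in tasks}

unified = view_transform_encoder(source_encoder(["cam_front", "lidar"]))
outputs = multi_task_decoder(unified, ["tracking", "bev_panorama"])
print(outputs["tracking"])  # tracking<tok(cam_front) + tok(lidar)>
```

The design point the standard makes is that all downstream tasks share one unified multi-view representation, so adding a new application capability means adding a decoder head, not a new perception stack.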
In addition, the standard specifies the system's evaluation metrics, management mechanisms, and service interface requirements, ensuring the technology is verifiable, operable, and scalable.
Liu Yanbin of Alibaba Cloud, one of the standard's main drafters, said: "Highway operators, port enterprises, logistics service providers, and traffic management departments at home and abroad commonly face problems such as inconsistent architectures and unclear functional boundaries when building multimodal perception systems. As the world's first international standard focused on multimodal-large-model spatial perception, F.748.72 not only fills a technical gap but also gives the industry a 'construction guide', effectively driving improvements in both perception quality and application efficiency."
Application prospects: From transportation infrastructure to various industries

With the release and promotion of this standard, MEMVT technology is expected to deliver value across a wider range of fields:
- Smart transportation: build blind-spot-free intersection perception to support integrated vehicle-road-cloud collaborative decision-making, improving traffic efficiency and proactive safety.
- Automated ports: achieve full-lifecycle container tracking, enabling precise positioning and scheduling even when containers are stacked or occluded.
- Power and energy inspection: fuse visual and point-cloud data to automatically identify equipment defects and compute personnel safety distances, ensuring operational safety.
- Urban governance: integrate multi-source perception data to build a city-level digital twin foundation supporting emergency response, crowd diversion, and facility management.
- Medical imaging: cross-modal fusion of CT, MRI, ultrasound, and other data to assist doctors with 3D lesion reconstruction and surgical planning.
- Industrial manufacturing: accurate multi-angle recognition and pose estimation of components on flexible production lines to improve robot grasping success rates.
Going forward, Alibaba Cloud will continue to work with partners across industry, academia, and research to promote the adoption of MEMVT technology and standards in more countries and industries, helping the world move toward a new stage of digital intelligence featuring "global perception, AI-driven decision-making, and efficient collaboration".
