Next Steps for AI Coding: Latest Review of "Multimodal Code Intellige…

A review paper by teams from Meituan, the University of Hong Kong, the Chinese University of Hong Kong, and their collaborators systematically surveys the main tasks and bottlenecks of "Multimodal Code Intelligence," that is, code intelligence that can understand images, interfaces, and charts, and proposes four main directions for future research. Taking IWR-Bench as an example, the paper notes that current models reach a visual fidelity of up to 64.25%, while the correctness rate for interactive functions is only 24.39%. The authors therefore argue that evaluation of multimodal code intelligence should not focus on visual similarity alone, but should also examine correctness at the semantic, structural, execution, and interaction levels.

The research team groups the tasks into two major categories: multimodal code synthesis and "code-centered reasoning and action." In the GUI direction, web code generation has the clearest closed-loop verification, but existing evaluations focus too heavily on static visual similarity. Mobile evaluation is harder to standardize because mobile devices lack a unified execution and interaction environment. In scientific visualization, the core requirement is that generated code must render results correctly and accurately express data semantics, document structures, or the relevant scientific processes or mechanisms.
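To make the gap between the two IWR-Bench figures quoted above concrete, here is a minimal sketch of per-level score reporting. The names `DimensionScore`, `weakest`, and `gap` are hypothetical and chosen only for illustration; this is not IWR-Bench's actual scoring code, and only the two percentages come from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionScore:
    """One evaluation level and its score (hypothetical structure)."""
    name: str       # evaluation level, e.g. "visual_fidelity"
    percent: float  # score on a 0-100 scale


def weakest(scores):
    """Return the lowest-scoring evaluation level."""
    return min(scores, key=lambda s: s.percent)


def gap(scores, high, low):
    """Percentage-point difference between two named levels."""
    by_name = {s.name: s.percent for s in scores}
    return round(by_name[high] - by_name[low], 2)


# Only the two figures quoted in the article; the other levels
# (semantic, structural, execution) are not reported there.
iwr_bench = [
    DimensionScore("visual_fidelity", 64.25),
    DimensionScore("interaction_correctness", 24.39),
]

print(weakest(iwr_bench).name)
print(gap(iwr_bench, "visual_fidelity", "interaction_correctness"))
```

Reporting each level separately, rather than a single aggregate, is what exposes the roughly 40-point difference: a page can look right while its interactions fail.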