MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

1Peking University, 2ByteDance, 3Princeton University, 4CASIA, 5The University of Chicago
*Equal Contribution, Corresponding Authors

Abstract

While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis.

Parallel Multimodal Diffusion Framework

Parallel Generation Architecture: Our framework represents all modalities as discrete tokens in an interleaved sequence with bidirectional attention. (a) During Training, image and text responses are masked and predicted in parallel with a uniform mask predictor. (b) During Sampling, the model performs parallel decoding to generate both image and text responses jointly, enabling continuous cross-modal interaction.
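
To make the parallel masking scheme concrete, the sketch below shows, in PyTorch-style code, one masked training step and a simplified parallel decoding loop. This is an illustrative sketch only, not the released implementation: mask_predictor, MASK_ID, and the linear unmasking schedule are assumptions, and the real model uses its own tokenizers, vocabulary layout, and sampling schedule.

# Minimal sketch of the parallel masked training step and parallel decoding
# described above (illustrative only, not the released implementation).
# Assumes a bidirectional transformer `mask_predictor` mapping token ids to
# logits over a shared text+image vocabulary, and pre-tokenized responses.
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical id of the [MASK] token


def parallel_masked_training_step(mask_predictor, prompt_ids, text_ids, image_ids):
    """One SFT step: mask text and image response tokens at a shared rate
    and predict them jointly with bidirectional attention."""
    response = torch.cat([text_ids, image_ids], dim=1)           # interleaved response tokens
    t = torch.rand(response.size(0), 1, device=response.device)  # per-sample mask ratio
    mask = torch.rand_like(response, dtype=torch.float) < t      # mask both modalities uniformly
    noisy = torch.where(mask, torch.full_like(response, MASK_ID), response)

    inputs = torch.cat([prompt_ids, noisy], dim=1)               # condition stays unmasked
    logits = mask_predictor(inputs)[:, prompt_ids.size(1):]      # logits at response positions

    # cross-entropy only on masked positions (standard masked-diffusion loss)
    return F.cross_entropy(logits[mask], response[mask])


@torch.no_grad()
def parallel_decode(mask_predictor, prompt_ids, resp_len, steps=16):
    """Parallel sampling sketch: start from an all-[MASK] response and
    progressively commit the most confident text/image tokens together."""
    B, device = prompt_ids.size(0), prompt_ids.device
    resp = torch.full((B, resp_len), MASK_ID, dtype=torch.long, device=device)
    for s in range(steps):
        logits = mask_predictor(torch.cat([prompt_ids, resp], dim=1))[:, prompt_ids.size(1):]
        probs, preds = logits.softmax(-1).max(-1)
        masked = resp == MASK_ID
        if s == steps - 1:                                        # final step: fill everything left
            return torch.where(masked, preds, resp)
        conf = torch.where(masked, probs, torch.full_like(probs, -1.0))
        idx = conf.topk(max(1, resp_len // steps), dim=1).indices  # simple linear schedule
        commit = torch.zeros_like(masked).scatter_(1, idx, True) & masked
        resp = torch.where(commit, preds, resp)
    return resp

Because text and image tokens sit in one interleaved sequence with bidirectional attention, each decoding step lets the partially generated text condition the image tokens and vice versa.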

Parallel Reinforcement Learning (ParaRL)

Overview of Parallel Reinforcement Learning (ParaRL). Instead of optimizing only the final denoised outputs, ParaRL introduces dense reward signals along the entire denoising trajectory. We apply a Semantic Reward Function (R_t) at intermediate steps to reinforce semantic alignment (e.g., CLIP score) between the partially generated text and image, enforcing consistency throughout the generation process.
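
The sketch below illustrates, under simplifying assumptions, how dense semantic rewards could be attached to intermediate denoising steps. decode_text, decode_image, and clip_similarity are hypothetical helpers, the reward schedule is made up for illustration, and the reward-weighted log-likelihood term is a generic REINFORCE-style surrogate rather than the exact ParaRL objective.

# Illustrative sketch of trajectory-level semantic rewards (assumptions only;
# the actual ParaRL objective, estimator, and reward schedule may differ).
# Reuses the MASK_ID / mask_predictor conventions from the previous sketch;
# decode_text, decode_image, and clip_similarity are hypothetical helpers.
import torch

MASK_ID = 0  # hypothetical id of the [MASK] token


def pararl_trajectory_loss(mask_predictor, prompt_ids, resp_len,
                           decode_text, decode_image, clip_similarity,
                           steps=16, reward_steps=(4, 8, 12, 16)):
    """Accumulate semantic rewards R_t at intermediate denoising steps and
    weight the log-likelihood of the tokens committed at those steps
    (REINFORCE-style surrogate)."""
    B = prompt_ids.size(0)
    resp = torch.full((B, resp_len), MASK_ID, dtype=torch.long, device=prompt_ids.device)
    loss = 0.0
    for s in range(1, steps + 1):
        logits = mask_predictor(torch.cat([prompt_ids, resp], dim=1))[:, prompt_ids.size(1):]
        token_logp, preds = logits.log_softmax(-1).max(-1)
        masked = resp == MASK_ID
        # commit the most confident masked tokens (simplified linear schedule)
        conf = torch.where(masked, token_logp, torch.full_like(token_logp, -float("inf")))
        idx = conf.topk(max(1, resp_len // steps), dim=1).indices
        commit = torch.zeros_like(masked).scatter_(1, idx, True) & masked
        resp = torch.where(commit, preds, resp)

        if s in reward_steps:
            with torch.no_grad():
                text = decode_text(resp)             # partially denoised text tokens -> string
                image = decode_image(resp)           # partially denoised image tokens -> pixels
                r_t = clip_similarity(text, image)   # semantic reward R_t, shape (B,)
            step_logp = (token_logp * commit).sum(dim=1)  # log-prob of tokens committed this step
            loss = loss - (r_t * step_logp).mean()        # maximize reward-weighted likelihood
    return loss

Rewarding intermediate states rather than only the final sample is what enforces cross-modal consistency along the whole denoising trajectory.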

MMaDA-Parallel-M: Qualitative Results

Qualitative comparison between MMaDA-Parallel-M and Bagel (w/ think). MMaDA-Parallel-M produces more precise and descriptive reasoning traces, which lead to superior visual fidelity. Our model accurately renders complex instructions such as "three individuals" and "two clock faces" where Bagel often fails, demonstrating stronger compositional abilities.

MMaDA-Parallel-A: Qualitative Results

Qualitative comparison between MMaDA-Parallel-A (trained from Lumina-DiMOO) and Bagel (w/ think). To validate the scalability of our method, we extend our post-training framework to Lumina-DiMOO, which shares a similar architecture with MMaDA but is trained on larger-scale data. After applying our parallel framework and ParaRL post-training, MMaDA-Parallel-A surpasses Bagel and sets a new state of the art in thinking-aware synthesis, demonstrating the effectiveness and scalability of our approach.

Main Results on ParaBench

Main results on ParaBench. Evaluation across all editing and generation tasks. Our proposed method, MMaDA-Parallel (w/ ParaRL), achieves the highest Output Alignment (59.8) among all open-source models. This is a 6.9% improvement over the state-of-the-art model, Bagel (52.9), confirming the effectiveness of our parallel framework and trajectory-level optimization. MMaDA-Parallel-A achieves even better performance than Bagel (w/ think), demonstrating the scalability of our approach.

BibTeX

@article{tian2025mmadaparallel,
      title={MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation},
      author={Tian, Ye and Yang, Ling and Yang, Jiongfan and Wang, Anran and Tian, Yu and Zheng, Jiani and Wang, Haochen and Teng, Zhiyang and Wang, Zhuochen and Wang, Yinjie and Tong, Yunhai and Wang, Mengdi and Li, Xiangtai},
      journal={Preprint},
      year={2025}
}