Abstract: Denoising diffusion models struggle to adapt to varying noise levels, and conventional residual blocks have limited feature selection capability in image fusion tasks. To address these challenges, this paper proposes a multimodal image fusion network that integrates dynamic gated diffusion denoising with cross-layer attention. First, four groups of expert convolution kernels are designed and incorporated into a dynamic feature extractor module, which dynamically assembles the optimal convolution kernel from them based on the input content so that input features are processed adaptively. Second, an improved gated feature selection module generates gating signals that suppress irrelevant information, strengthening the model's diffusion denoising capability across noise levels and enabling precise feature control. Finally, R-Transformer blocks are adopted for feature adjustment, and a global-local spatial attention module is constructed to perform cross-layer feature fusion, producing fused images with rich texture information and high color fidelity. Experimental results on the MSRS, RoadScene, and Harvard datasets show that, compared with nine representative state-of-the-art image fusion methods from recent years, the proposed method achieves average improvements of 5.11% to 15.93% across seven objective evaluation metrics. It outperforms the competing methods in preserving texture detail and maintaining anatomical structure integrity, is consistent with human visual perception, and effectively handles multimodal image fusion in scenarios such as varied lighting environments and medical image diagnosis.
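As an illustration of the input-dependent kernel assembly described above, the following minimal sketch mixes four expert kernels into a single kernel using softmax weights computed from a global average of the input. This is an assumption about one plausible form of such a mechanism, not the paper's implementation: the abstract does not specify the gating function, and the names `assemble_kernel`, `EXPERTS`, `gate_weights`, and `gate_biases`, as well as the kernel values themselves, are hypothetical.

```python
import math

# Four hypothetical 3x3 expert kernels (illustrative values, not from the paper).
EXPERTS = [
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],                 # identity
    [[1 / 9] * 3, [1 / 9] * 3, [1 / 9] * 3],           # box blur
    [[-1, -2, -1], [0, 0, 0], [1, 2, 1]],              # horizontal-edge detector
    [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],              # vertical-edge detector
]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_average(feature_map):
    """Mean of all values in a 2-D feature map (list of rows)."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def assemble_kernel(feature_map, expert_kernels, gate_weights, gate_biases):
    """Mix the expert kernels with input-dependent softmax weights.

    Assumption: the gate is a per-expert affine map of the global average,
    which stands in for whatever learned gate the real network uses.
    """
    descriptor = global_average(feature_map)
    scores = [w * descriptor + b for w, b in zip(gate_weights, gate_biases)]
    mix = softmax(scores)
    size = len(expert_kernels[0])
    kernel = [
        [sum(m * k[i][j] for m, k in zip(mix, expert_kernels)) for j in range(size)]
        for i in range(size)
    ]
    return kernel, mix


def conv2d_valid(feature_map, kernel):
    """Plain 2-D cross-correlation without padding ('valid' mode)."""
    k = len(kernel)
    out_h = len(feature_map) - k + 1
    out_w = len(feature_map[0]) - k + 1
    return [
        [
            sum(feature_map[y + i][x + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


# Toy 4x4 input and hypothetical gate parameters.
feature_map = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5, 1.6],
]
gate_weights = [1.0, -1.0, 0.5, -0.5]
gate_biases = [0.0, 0.0, 0.0, 0.0]

kernel, mix = assemble_kernel(feature_map, EXPERTS, gate_weights, gate_biases)
out = conv2d_valid(feature_map, kernel)
```

Because the mixing weights depend on the input, two inputs with different statistics are filtered by different effective kernels, while only the four expert kernels and the small gate would need to be stored or learned.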