Teacher-student complementary mask autoencoder for self-supervised representation learning

doi:10.11918/202302029

Journal of Harbin Institute of Technology, 2026, Vol. 58, No. 3, pp. 74-87

Authors: HUANG Jing, YE Shaoxiong, WEN Yuanqiao, ZHU Lifu, HUANG Yamin
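Affiliations: (1. School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan 430063, China; 2. Intelligent Transportation System Research Center, Wuhan University of Technology, Wuhan 430063, China; 3. Research and Development Center of Transport Industry of New Generation of Artificial Intelligence Technology, Hangzhou 310013, China)
About the authors: HUANG Jing (b. 1977), male, associate professor, master's supervisor; YE Shaoxiong (b. 1999), male, master's student; WEN Yuanqiao (b. 1975), male, professor, doctoral supervisor; ZHU Lifu (b. 1999), male, master's student; HUANG Yamin (b. 1990), male, research fellow, doctoral supervisor
Corresponding author: HUANG Jing, huangjing@whut.edu.cn
CLC number: TP399
Funding: National Natural Science Foundation of China (52072287); Zhejiang Provincial Science and Technology Plan Project (2021C01010); Open Fund of the Research and Development Center of Transport Industry of New Generation of Artificial Intelligence Technology (202302H); Science and Technology Project of the Zhejiang Provincial Department of Transportation (2024006)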


Abstract:

Masked image modeling (MIM) methods for self-supervised representation learning suffer from a mismatch between upstream and downstream tasks. To address this, we propose a new pre-training model, the teacher-student complementary masked autoencoder (TSCAE). The model consists of a teacher module and a student module that use complementary masks. The teacher module is Transformer-based and predicts the masked region of an image (e.g., a randomly masked 75% of the input image), while the student module uses an encoder-only structure to predict the remaining region of the same image (e.g., with the remaining 25% of the input image masked). To learn richer visual representations from large amounts of unlabeled data, TSCAE performs two kinds of upstream tasks at once, a prediction task and a contrastive task, and is pre-trained on the COCO and Tiny-ImageNet datasets. Test results show that, on three public datasets (including VOC) and two private datasets, TSCAE outperforms the classical masked autoencoder (MAE) on downstream tasks such as image classification, object detection, and semantic segmentation. In particular, TSCAE also alleviates, to a certain extent, the impact of pre-training image quality on the visual representation learning encoder.
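The complementary masking described in the abstract can be illustrated with a small sketch. This is one possible reading, not the authors' implementation: it assumes the image is split into a flat list of patches, that the teacher branch hides a random 75% of them, and that the student branch hides exactly the patches the teacher can see. The function name `complementary_masks`, its parameters, and the 196-patch setup are hypothetical choices made for illustration.

```python
import random

def complementary_masks(num_patches, mask_ratio=0.75, seed=None):
    """Return (teacher_mask, student_mask) as lists of bools; True means the patch is hidden.

    Hypothetical helper: the abstract does not specify how the masks are built.
    """
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    # Randomly pick the patches hidden from the teacher branch.
    teacher_hidden = set(rng.sample(range(num_patches), num_masked))
    teacher_mask = [i in teacher_hidden for i in range(num_patches)]
    # The student branch hides the complement: the patches the teacher sees.
    student_mask = [not hidden for hidden in teacher_mask]
    return teacher_mask, student_mask

# Assumed setup: a 224 x 224 image cut into 16 x 16 patches gives 14 * 14 = 196 patches.
teacher_mask, student_mask = complementary_masks(196, mask_ratio=0.75, seed=0)
print(sum(teacher_mask), sum(student_mask))  # 147 49
```

Under these assumptions every patch is hidden in exactly one of the two branches, so the two prediction targets together cover the whole image.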

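Cite this article

HUANG Jing, YE Shaoxiong, WEN Yuanqiao, ZHU Lifu, HUANG Yamin. Teacher-student complementary mask autoencoder for self-supervised representation learning[J]. Journal of Harbin Institute of Technology, 2026, 58(3): 74. DOI:10.11918/202302029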

History

Received: 2023-02-16
Published online: 2026-03-31
