| Cite this article: | HUANG Jing, YE Shaoxiong, WEN Yuanqiao, ZHU Lifu, HUANG Yamin. Teacher-student complementary mask autoencoder for self-supervised representation learning[J]. Journal of Harbin Institute of Technology, 2026, 58(3): 74. DOI:10.11918/202302029 |
| DOI:10.11918/202302029 |
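| CLC number: TP399 |
| Document code: A |
| Funding: National Natural Science Foundation of China (52072287); Science and Technology Program of Zhejiang Province (2021C01010); Open Fund of the Research and Development Center of Transport Industry of New Generation of Artificial Intelligence Technology (202302H); Science and Technology Project of the Zhejiang Provincial Department of Transportation (2024006) |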
|
| Teacher-student complementary mask autoencoder for self-supervised representation learning |
|
HUANG Jing¹,³, YE Shaoxiong¹, WEN Yuanqiao², ZHU Lifu¹, HUANG Yamin²
|
|
(1.School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan 430063, China; 2.Intelligent Transportation System Research Center, Wuhan University of Technology, Wuhan 430063, China; 3.Research and Development Center of Transport Industry of New Generation of Artificial Intelligence Technology, Hangzhou 310013, China)
|
| Abstract: |
| To address the mismatch between upstream and downstream tasks in masked image modeling (MIM) methods for self-supervised representation learning, we propose a new pre-training model, the teacher-student complementary masked autoencoder (TSCAE). The model consists of a teacher module and a student module that use complementary masks. The teacher module is Transformer-based and predicts the masked region of an image (e.g., a randomly masked 75% of the input image), while the student module uses only an encoder and predicts the remaining region of the same image (e.g., with the remaining 25% of the input image masked). To learn richer visual representations from large amounts of unlabeled data, TSCAE performs two kinds of upstream tasks at the same time, a prediction task and a contrastive task, and is pre-trained on the COCO and Tiny-ImageNet datasets. Experiments on three public datasets, including VOC, and two private datasets show that TSCAE outperforms the classical masked autoencoder (MAE) on downstream tasks such as image classification, object detection, and semantic segmentation. TSCAE also alleviates, to some extent, the impact of pre-training image quality on the encoder used for visual representation learning. |
| Key words: pre-training model; self-supervised learning; masked image modeling; contrastive learning; encoder |
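
The complementary masking described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `complementary_masks`, the patch count of 196 (a 224×224 image cut into 16×16 patches), and the use of a uniform random sample are all assumptions made for illustration. The only detail taken from the abstract is that the teacher masks a random 75% of the image and the student masks the remaining 25%.

```python
import random

def complementary_masks(num_patches, teacher_mask_ratio=0.75, seed=None):
    """Build two complementary patch masks (True = patch is hidden).

    Hypothetical helper: the teacher hides a random `teacher_mask_ratio`
    share of the patches, and the student hides exactly the patches the
    teacher can see, so every patch is hidden from exactly one module.
    """
    rng = random.Random(seed)
    num_masked = round(num_patches * teacher_mask_ratio)
    # Sample the patch indices hidden from the teacher without replacement.
    hidden_from_teacher = set(rng.sample(range(num_patches), num_masked))
    teacher_mask = [i in hidden_from_teacher for i in range(num_patches)]
    # The student mask is the element-wise complement of the teacher mask.
    student_mask = [not hidden for hidden in teacher_mask]
    return teacher_mask, student_mask

# Assumed setting: 14 * 14 = 196 patches per image.
teacher_mask, student_mask = complementary_masks(196, 0.75, seed=0)
print(sum(teacher_mask), sum(student_mask))  # 147 49
```

With these assumed numbers the teacher sees 49 patches and must predict 147, while the student sees those 147 and must predict the other 49. How the two modules' outputs are combined in the prediction and contrastive tasks is not specified in the abstract and is not sketched here.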
|
|
|
|