Teacher-student complementary mask autoencoder for self-supervised representation learning

HUANG Jing; YE Shaoxiong; WEN Yuanqiao; ZHU Lifu; HUANG Yamin

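Supervised by: Ministry of Industry and Information Technology of the People's Republic of China; Sponsored by: Harbin Institute of Technology; Editor-in-Chief: LI Longqiu; ISSN 0367-6234; CN 23-1235/T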

Cite this article: HUANG Jing, YE Shaoxiong, WEN Yuanqiao, ZHU Lifu, HUANG Yamin. Teacher-student complementary mask autoencoder for self-supervised representation learning[J]. Journal of Harbin Institute of Technology, 2026, 58(3): 74. DOI:10.11918/202302029

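DOI: 10.11918/202302029

Classification number: TP399

Document code: A

Funding: National Natural Science Foundation of China (52072287); Science and Technology Program of Zhejiang Province (2021C01010); Open Fund of the Research and Development Center of Transport Industry of New Generation of Artificial Intelligence Technology (202302H); Science and Technology Project of the Zhejiang Provincial Department of Transportation (2024006)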

Teacher-student complementary mask autoencoder for self-supervised representation learning

HUANG Jing¹,³, YE Shaoxiong¹, WEN Yuanqiao², ZHU Lifu¹, HUANG Yamin²

(1. School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan 430063, China; 2. Intelligent Transportation System Research Center, Wuhan University of Technology, Wuhan 430063, China; 3. Research and Development Center of Transport Industry of New Generation of Artificial Intelligence Technology, Hangzhou 310013, China)

Abstract:

To address the mismatch between upstream and downstream tasks in masked image modeling (MIM) methods for self-supervised representation learning, we propose a novel pre-training model, the teacher-student complementary masked autoencoder (TSCAE). TSCAE consists of a teacher module and a student module that use complementary masking. The teacher module is a Transformer-based structure that predicts the masked region of an image (e.g., a randomly masked 75% of the input image), while the student module uses a single encoder to predict the remaining region of the same image (e.g., with the remaining 25% of the input image masked). To learn richer visual representations from large amounts of unlabeled data, TSCAE performs two kinds of upstream tasks, a prediction task and a contrastive task, and is pre-trained on the COCO and Tiny-ImageNet datasets. The results show that, on three public datasets (including VOC) and two private datasets, TSCAE outperforms the classical masked autoencoder (MAE) on downstream tasks such as image classification, object detection, and semantic segmentation. In particular, TSCAE also alleviates, to a certain extent, the impact of pre-training image quality on the encoder used for visual representation learning.
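
The complementary masking described in the abstract can be illustrated with a minimal Python sketch. It assumes the image is split into a flat sequence of patches and that the student's mask is the exact complement of the teacher's random mask. The function name `complementary_masks`, the 196-patch count (a 14×14 grid), and the seed are illustrative assumptions, not details taken from the paper.

```python
import random
from typing import List, Optional, Tuple


def complementary_masks(
    num_patches: int,
    teacher_mask_ratio: float = 0.75,
    seed: Optional[int] = None,
) -> Tuple[List[bool], List[bool]]:
    """Return (teacher_mask, student_mask); True means the patch is hidden.

    Hypothetical helper: the teacher hides a random subset of patches and
    the student hides exactly the patches the teacher left visible.
    """
    if not 0.0 < teacher_mask_ratio < 1.0:
        raise ValueError("teacher_mask_ratio must be strictly between 0 and 1")

    rng = random.Random(seed)
    num_hidden = round(num_patches * teacher_mask_ratio)

    # Randomly choose which patch indices the teacher has to predict.
    hidden = set(rng.sample(range(num_patches), num_hidden))

    teacher_mask = [i in hidden for i in range(num_patches)]
    # The student's mask is the complement of the teacher's mask.
    student_mask = [not hidden_for_teacher for hidden_for_teacher in teacher_mask]
    return teacher_mask, student_mask


if __name__ == "__main__":
    # Assumed setup: 14 x 14 = 196 patches per image.
    teacher_mask, student_mask = complementary_masks(196, 0.75, seed=0)
    print(sum(teacher_mask), sum(student_mask))  # prints "147 49"
```

Under these assumptions every patch is hidden from exactly one of the two modules, so together the two prediction targets cover the whole image without overlap.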

Key words: pre-training model; self-supervised learning; masked image modeling; contrastive learning; encoder
