Teacher-student complementary mask autoencoder for self-supervised representation learning

doi:10.11918/202302029

Volume 58, Issue 3, 2026, pp. 74-87

Affiliation: 1. School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan 430063, China; 2. Intelligent Transportation System Research Center, Wuhan University of Technology, Wuhan 430063, China; 3. Research and Development Center of Transport Industry of New Generation of Artificial Intelligence Technology, Hangzhou 310013, China
CLC Number: TP399

Abstract:

To address the mismatch between upstream and downstream tasks that masked image modeling (MIM) methods exhibit in self-supervised representation learning, we propose a novel pre-training model, the teacher-student complementary masked autoencoder (TSCAE). TSCAE consists of two modules with complementary masking mechanisms: a teacher module and a student module. The teacher module is a Transformer-based structure that predicts the masked region of an image (e.g., a randomly masked 75% of the input image), while the student module uses only an encoder to predict the remaining region of the same image (e.g., with the remaining 25% of the input image masked). To learn richer visual representations from large amounts of unlabeled data, TSCAE performs two kinds of upstream tasks: prediction and contrastive tasks. The model was pre-trained on the COCO and Tiny-ImageNet datasets. Across three public datasets, including VOC, and two private datasets, TSCAE outperforms the classical masked autoencoder (MAE) methods on downstream tasks such as image classification, object detection, and semantic segmentation. TSCAE also alleviates, to a certain extent, the impact of pre-training image quality on the encoder used for visual representation learning.
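
The complementary masking described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how patches are defined or sampled, so the function name `complementary_masks`, the 196-patch grid (a 14 x 14 layout, as for a 224 x 224 image cut into 16 x 16 patches), and the uniform random shuffle are all assumptions made for illustration. The sketch only shows that the teacher's masked set (75%) and the student's masked set (the remaining 25%) partition the patches.

```python
import random


def complementary_masks(num_patches, teacher_mask_ratio=0.75, seed=None):
    """Split patch indices into two disjoint, complementary masked sets.

    Hypothetical helper: returns (teacher_masked, student_masked), where
    teacher_masked holds the patches hidden from the teacher module and
    student_masked holds the remaining patches, hidden from the student.
    """
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)  # assumed: uniform random choice of masked patches

    n_teacher = round(num_patches * teacher_mask_ratio)
    teacher_masked = sorted(indices[:n_teacher])
    student_masked = sorted(indices[n_teacher:])
    return teacher_masked, student_masked


# Assumed setup: 196 patches, teacher masks 75%, student masks the other 25%.
teacher_masked, student_masked = complementary_masks(196, 0.75, seed=0)
print(len(teacher_masked), len(student_masked))
```

Because the two sets are complementary, every patch is masked for exactly one of the two modules, so each module sees precisely the region the other has to predict.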

History

Received: February 16, 2023
Online: March 31, 2026
