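| Abstract: |
| To improve model performance, deep neural networks often incorporate untrusted datasets, which leaves them vulnerable to data-poisoning backdoor attacks. Traditional detection methods rely on identifying feature differences between poisoned and benign samples, but their effectiveness is limited when attackers optimize triggers to blur this boundary. To address this problem, this paper proposes a detection method called reverse forgetting (RFgt). It exploits the fact that poisoned samples make up only a small fraction of the data in a backdoor attack and adopts a reverse optimization strategy: the poisoned model is forced to rapidly forget the features of the majority benign samples while retaining and reinforcing its learning of suspicious samples, which consolidates their poisoned features and significantly amplifies the feature difference between the two kinds of samples. Each sample's prediction entropy is then used to decide whether it is poisoned. The study shows that RFgt effectively detects poisoned samples under a variety of backdoor attacks on the CIFAR-10 and GTSRB datasets while keeping the false detection rate on benign samples low; detection results on the Tiny ImageNet dataset demonstrate that the method generalizes well. Against four classic data poisoning attacks, the method achieves an average detection true positive rate of 99.28% with a false positive rate of only 0.06%, and its overall performance exceeds that of existing defense methods. |
| Keywords: backdoor attack; data poisoning; sample detection; forgetting learning; predictive entropy |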
| DOI:10.11918/202507065 |
| CLC number: TP391 |
| Document code: A |
| Funding: National Natural Science Foundation of China (2,7) |
|
| Backdoor poisoned sample detection via reverse forgetting |
|
YAN Leiming,YOU Jianfei
|
|
(School of Computer Science, School of Cyber Science and Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China)
|
| Abstract: |
| To improve model performance, deep neural networks are frequently trained on untrusted datasets, which leaves them vulnerable to data-poisoning backdoor attacks. Conventional detection methods rely on identifying feature discrepancies between poisoned and benign samples, but their effectiveness diminishes when attackers optimize trigger generation to blur this boundary. To address this issue, this paper proposes a detection method named reverse forgetting (RFgt). The method exploits the fact that poisoned samples make up only a small proportion of the data in a backdoor attack and employs a reverse optimization strategy. Instead of forcing a poisoned model to forget backdoor features, RFgt compels it to rapidly forget the features of the majority class (benign samples) while retaining and reinforcing what it has learned from suspicious samples, thereby consolidating their poisoned features. This significantly amplifies the feature disparity between the two sample types. The prediction entropy of each sample is then used to decide whether it is poisoned or benign. Experimental results show that RFgt effectively detects poisoned samples under various backdoor attacks on the CIFAR-10 and GTSRB datasets while maintaining a low false positive rate on benign samples. Results on the Tiny ImageNet dataset indicate that the method also generalizes well. Against four classic data poisoning attacks, RFgt achieves an average true positive rate (TPR) of 99.28% and a false positive rate (FPR) of only 0.06%, outperforming existing defense methods in overall performance. |
| Keywords: backdoor attack; data poisoning; sample detection; forgetting learning; predictive entropy |
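
The final decision step described in the abstract, scoring each sample by its prediction entropy and then reporting TPR and FPR, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the fixed threshold, and the decision direction (low entropy is treated as poisoned, on the assumption that the model stays confident only on the samples it was made to retain) are all assumptions introduced here for illustration; the abstract does not specify them.

```python
import math

def prediction_entropy(probs):
    """Shannon entropy (in nats) of one softmax output vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def flag_poisoned(prob_rows, threshold):
    """Flag a sample when its entropy falls below the threshold.

    Assumption: after reverse forgetting, poisoned samples keep confident
    (low-entropy) predictions while benign samples do not.
    """
    return [prediction_entropy(row) < threshold for row in prob_rows]

def tpr_fpr(flags, is_poisoned):
    """True positive rate and false positive rate of the flags."""
    tp = sum(1 for f, y in zip(flags, is_poisoned) if f and y)
    fp = sum(1 for f, y in zip(flags, is_poisoned) if f and not y)
    pos = sum(1 for y in is_poisoned if y)
    neg = len(is_poisoned) - pos
    tpr = tp / pos if pos else 0.0
    fpr = fp / neg if neg else 0.0
    return tpr, fpr

if __name__ == "__main__":
    # Toy softmax outputs (hypothetical values, not data from the paper).
    outputs = [
        [0.98, 0.01, 0.01],  # confident prediction
        [0.34, 0.33, 0.33],  # near-uniform prediction
        [0.97, 0.02, 0.01],  # confident prediction
        [0.40, 0.35, 0.25],  # uncertain prediction
    ]
    labels = [True, False, True, False]  # ground truth: is the sample poisoned?
    flags = flag_poisoned(outputs, threshold=0.5)
    print(flags, tpr_fpr(flags, labels))
```

In practice the threshold would need to be chosen from the entropy distribution of the dataset under inspection; the value 0.5 above is arbitrary.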