Abstract
To address the limitations of unsupervised anomaly detection methods, which cannot utilize prior knowledge of anomalies, rely heavily on assumptions about the anomaly distribution, and are insensitive to edge anomalies near normal data, a contrastive learning method based on a Siamese attention representation network is proposed for anomaly detection. In this method, two encoder networks constructed using multi-head attention learn different representations of the same unlabeled data, and a self-supervised representation learning model is implemented from one encoder network to the other using a contrastive learning structure. Then, a classifier network is added to detect abnormal data through weakly supervised classification learning based on a small amount of labeled data. The experimental results show that this method can effectively improve the recall rate of anomaly detection and can be applied in the field of data analysis.
0 Introduction
Anomaly (outlier) detection is the process of finding data instances that deviate significantly from most data instances. As one of the indispensable key technologies in fields of artificial intelligence (AI) such as data mining, machine learning, and computer vision, anomaly detection is often used in the pre-processing stage of data analysis, and plays an increasingly important role in practical applications, such as industrial monitoring[1], telecommunication fraud[2], health care[3], and network security[4]. Due to the scarcity of abnormal data, the complexity of abnormal situations, the imbalance in the number of abnormal data, and the absence of abnormal data labels, it is challenging for supervised or semi-supervised methods to play a role. Most of the research on anomaly detection therefore focuses on unsupervised anomaly detection, as indicated in Refs.[5-7]. The classical anomaly detection methods are mainly divided into statistical anomaly detection[8], density-based anomaly detection[9], distance-based anomaly detection[10], and clustering-based anomaly detection[11]. These methods are usually limited by the data itself: they cannot be applied to high-dimensional data, are not suitable for data sets with unknown distribution, have high computational costs, or process large data sets inefficiently[12]. In particular, they are insensitive to outliers that are very close to normal data, which often leads to misjudgment of data and a low recall rate of anomaly detection.
In recent years, deep learning has shown strong learning ability in the representation learning of complex data, such as high-dimensional data, time series data, spatial data, and graph data, which has attracted much attention in the research of deep anomaly detection[13]. Deep anomaly detection uses a deep neural network to learn the feature representation of different data structures and judges whether data is abnormal through an anomaly scoring function, which can effectively improve the performance of anomaly detection. Representative approaches include Autoencoders (AE) or Variational AE (VAE), Generative Adversarial Networks (GAN), and self-supervised classification.
Anomaly detection based on AE network[14-15] mainly relies on the encoding and decoding processes to reconstruct the original data, and the data reconstruction error is used as the anomaly score to judge the anomaly. However, the data reconstruction process of encoding and decoding may be affected by abnormal data, resulting in deviation. GAN-based anomaly detection[6, 16] learns the potential feature distribution of normal data by a generative network, and some form of residual between the generated instance and the real instance is defined as an anomaly score to achieve the judgment of abnormal data. However, the training of GAN has the problem of convergence failure or mode collapse, and the abnormal score will be affected by the generator network. Self-supervised classification [17] involves building a self-supervised classification model to learn the normal representation of data, which identifies data inconsistent with the classification model as abnormal. However, this method mainly realizes the self-supervised learning process through feature transformation, and its operation is limited to image data.
Since the proposal of contrastive self-supervised learning and the introduction of algorithms such as MoCo[18], SimCLR[19], BYOL[20], and SimSiam[21], contrastive self-supervised anomaly detection methods for data such as images and sequences have been gradually proposed[22-24]. These methods learn data representations through contrastive learning. The anomaly score either depends on the designed contrastive loss function or is set according to the downstream task. Because these methods are mostly aimed at images, sequences, and similar types of data, there is a lack of research on anomalies in tabular data, which are the most common in practice.
To sum up, in the learning process of deep anomaly detection, semi-supervised anomaly detection works well, but it often requires a sufficient number of labeled samples; unsupervised anomaly detection is the most commonly used, but its feature representation learning is susceptible to abnormal data and cannot make use of the prior knowledge in existing labels. In fact, whether or not labeled abnormal data already exists, it is not difficult to collect and label a small amount of abnormal data, which makes weak supervision more suitable for anomaly detection. However, how to realize anomaly detection learning from a few labeled samples is still a difficult problem for weakly supervised learning. With the development of self-supervised learning, generative or contrastive self-supervised learning[25] performs well in the field of computer vision because it learns deep representations without sample labeling, and it has gradually been applied in anomaly detection. The self-supervised learning method can learn the feature representation of the data without labeling the sample data, which effectively avoids the sample label problem. However, the learning process of the generative self-supervised method is also affected by abnormal data. Contrastive self-supervised learning, in contrast, shortens the distance between positive samples and lengthens the distance between negative samples, which helps to distinguish edge outliers. Therefore, contrastive self-supervised learning can better adapt to the representation learning of unlabeled data.
From the perspective of spatial distribution, the diversity of abnormal data makes it difficult to cluster them in adjacent areas, while normal data often have similar representations. Based on this point, if self-supervised learning with two attention-based contrastive branches is used to narrow the region of normal data and thereby expand the distance between abnormal and normal data, the distinction between abnormal and normal data may become more obvious. This is helpful for weakly supervised anomaly detection learning based on a small amount of labeled sample data. Therefore, this study proposes a data anomaly detection method for tabular data based on the Siamese Attention Representation Network (SiamARN), which combines contrastive self-supervised learning and weakly supervised learning. The contrastive learning structure based on the Siamese network[26] is used to expand the distance between different data and avoid the problem that edge anomalies are difficult to detect. A representation network with Multi-Head Attention (MHA)[27] is added to improve the learning of deep data representations and to build the deep representation model of unlabeled data. A classifier network is added after the representation model to realize anomaly detection through weakly supervised learning on a small number of labeled samples. Finally, experiments show that the method can better detect abnormal data and improve the recall rate of anomaly detection. The main contributions of this study are as follows:
1) A contrastive self-supervised learning method is proposed to improve the discrimination of data in normal and abnormal ranges, which is helpful to realize the weakly supervised learning of data anomaly classification with a small amount of labelled samples in the later stage.
2) Data augmentation based on data processing is used to obtain different views of the data, and an encoder network with multi-head attention is proposed to help the data get a suitable deep representation, which helps to improve the performance of the model in anomaly detection.
3) An anomaly detection method combining self-supervision and weak supervision is proposed, and a series of experiments are designed to prove the effectiveness and superiority of the method.
The rest of this study is organized as follows: Section 1 briefly reviews related work on contrastive learning. Section 2 introduces the principle and framework of the method. Section 3 presents the experimental details and the analysis of results. Section 4 summarizes the full text.
1 Contrastive Learning
The Siamese network is a coupled network model for data comparison, which compares and judges data through two or more weight-sharing neural networks that take different inputs. As a supervised learning algorithm, it is often used in face verification[24], target tracking[28], one-shot learning[29], and so on. In recent years, the Siamese network has begun to be applied to the research of self-supervised learning. Through a contrastive loss function, the network parameters are learned to realize the deep representation of complex data such as images and sequences.
Contrastive learning is a discriminative representation learning method based on the idea of contrast. The main idea is to make positive samples close to each other and negative samples far away from each other. This method has been widely used in self-supervised learning, where it is called contrastive self-supervised learning. The Siamese network structure is usually used as the carrier for contrastive learning, as in MoCo and SimCLR, which realize image representation learning by contrasting positive and negative samples. With algorithms such as BYOL and SimSiam, the contrastive learning method no longer requires negative samples and realizes contrastive learning between different augmentations of the same data. In particular, SimSiam[21] showed that the stop-gradient operation can prevent model collapse. These algorithms are mainly used in computer vision, natural language processing, and so on. However, their data augmentation methods are usually designed for images, sequences, and similar data, where they produce the positive and negative (or only positive) samples needed for deep representation learning.
Here, we mainly focus on anomaly detection of tabular data, propose different data processing methods as augmentation methods to obtain different views of data, and distinguish normal data from abnormal data through an attention-based contrastive learning process. The collapse of the model is prevented by batch normalization (BN) and the stop-gradient (stopgrad) operation.
2 Method
Aiming at the problems of insensitivity to edge anomalies and waste of prior knowledge of anomalies in anomaly detection algorithms, an anomaly detection method called SiamARN is proposed. This section mainly introduces the overall framework, the implementation process and the pseudo code of the SiamARN.
2.1 SiamARN Framework
The structure of SiamARN is shown in Fig.1, which mainly includes two parts: the data representation learning process and the classification learning process. The representation learning process includes data augmentation processing, the encoder network, and the predictor network. Classification learning is realized by the pre-trained encoder model and the classifier network.
Firstly, data augmentation is used to obtain different views of the input x, namely, view 1: x1 and view 2: x2, as two different inputs for the subsequent contrastive structure.
Fig.1 The structure of SiamARN
Secondly, the encoder network is composed of a Multi-Head Attention (MHA) module and a projection module. In the MHA module, K, Q, and V represent key, query, and value, respectively. The projection module contains two identical groups of layers, each consisting of a fully connected layer, a Batch Normalization (BN) layer, and a ReLU layer. The encoder network is mainly used to learn the deep representation of tabular data. The weights are shared between the two encoders, which produce two different vector representations, representation 1 and representation 2, that are passed to representation learning. Then, the predictor network, composed of a group of fully connected, BN, and ReLU layers, is mainly used to learn the mapping from representation 1 to representation 2. The similarity is used to measure the error between the two representations, and the stop gradient is added to prevent the model from collapsing; in Fig.1, grad denotes gradient and stop-grad denotes stop gradient.
Finally, the classifier network is a two-layer fully connected network, which is connected to the pre-trained encoder model obtained by representation learning. Its function is to classify and judge the deep representation of the output of the encoder network. By fine-tuning the network parameters of classifier and encoder with a few samples, weakly supervised learning of anomaly detection under deep representation is realized.
2.2 Algorithm Implementation
2.2.1 Simple Siamese network
The representation learning process of data is a simple Siamese network learning process, and its simplified structure is shown in Fig.2. The encoder network f shares weights, and the predictor network h matches the output of one view with the output of another view.
The functions computed by the encoder network f and the predictor network h are given in Eqs. (1)-(2).
Fig.2 Simple Siamese network structure
$$e_i = f(x_i), \quad i \in \{1, 2\} \tag{1}$$
$$p_i = h(e_i) = h(f(x_i)), \quad i \in \{1, 2\} \tag{2}$$
The input x is transformed into two different views x1 and x2 by the data augmentation process T. Passing x1 through the encoder network and the predictor network in turn gives the vector p1, and passing x2 through the encoder network gives the vector e2. The negative cosine similarity between the two vectors is minimized, as shown in Eq. (3).
$$D(p_1, e_2) = -\frac{p_1}{\|p_1\|_2} \cdot \frac{e_2}{\|e_2\|_2} \tag{3}$$
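where $\|\cdot\|_2$ is the L2-norm. With D denoting the negative cosine similarity of Eq. (3), a symmetrized loss function with a lower bound of -1 is further defined:
$$L = \frac{1}{2} D(p_1, e_2) + \frac{1}{2} D(p_2, e_1) \tag{4}$$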
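To prevent the model from collapsing, a stop-gradient operation is applied so that e1 and e2 are treated as constants, which gives a new loss function:
$$L = \frac{1}{2} D(p_1, \mathrm{stopgrad}(e_2)) + \frac{1}{2} D(p_2, \mathrm{stopgrad}(e_1)) \tag{5}$$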
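The stop-gradient loss can be illustrated with a minimal PyTorch sketch. It assumes that stop-gradient is realized with `Tensor.detach()` and that the loss is averaged over the batch; the function names `neg_cosine` and `symmetric_loss` and the tensor shapes are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F


def neg_cosine(p: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity D(p, stopgrad(e)), averaged over the batch.

    e.detach() acts as the stop-gradient: e is treated as a constant,
    so no gradient flows back through it.
    """
    return -F.cosine_similarity(p, e.detach(), dim=-1).mean()


def symmetric_loss(p1, p2, e1, e2) -> torch.Tensor:
    """Symmetrized loss 0.5 * D(p1, stopgrad(e2)) + 0.5 * D(p2, stopgrad(e1))."""
    return 0.5 * neg_cosine(p1, e2) + 0.5 * neg_cosine(p2, e1)


torch.manual_seed(0)
# Random stand-ins for encoder outputs (e) and predictor outputs (p).
e1 = torch.randn(8, 16, requires_grad=True)
e2 = torch.randn(8, 16, requires_grad=True)
p1 = torch.randn(8, 16, requires_grad=True)
p2 = torch.randn(8, 16, requires_grad=True)

loss = symmetric_loss(p1, p2, e1, e2)
loss.backward()

# Identical inputs reach the lower bound of -1.
perfect = symmetric_loss(e1, e1, e1, e1)
```

After `backward()`, only `p1` and `p2` receive gradients, which is the effect the stop-gradient is meant to have on the target branch.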
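To explain the effect of the stop-gradient and the predictor network in the model, the method is described from the perspective of expectation maximization. The process involves two sets of variables, and the stop-gradient operation introduces an additional variable φ. The loss function is rewritten as:
$$L(\theta, \varphi) = \mathbb{E}_{x, T}\left[\left\|\Phi_\theta(T(x)) - \varphi_x\right\|_2^2\right] \tag{6}$$
where $\Phi_\theta$ represents the network parameterized by θ, T is the data augmentation, $\varphi_x$ is the component of φ associated with x, and the expectation $\mathbb{E}[\cdot]$ is taken over the distribution of x and T. It is worth noting that φ is not necessarily the output of the network. The optimization goal is to find θ and φ that minimize the expected loss, that is:
$$\min_{\theta, \varphi} L(\theta, \varphi) \tag{7}$$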
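Based on the above expression, the minimization of Eq. (6) can be solved by alternating optimization, that is, fixing one set of variables and optimizing the other, as shown below:
$$\theta^t \leftarrow \arg\min_\theta L(\theta, \varphi^{t-1}) \tag{8}$$
$$\varphi^t \leftarrow \arg\min_\varphi L(\theta^t, \varphi) \tag{9}$$
where t represents the index of the alternating process and ← represents assignment.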
For the solution of θ: the Stochastic Gradient Descent (SGD) optimizer is used to solve Formula (8). The stop-gradient operation is a natural consequence, because $\varphi^{t-1}$ is a constant in this sub-problem and the gradient does not propagate back to it.
For the solution of φ: Formula (9) can be solved independently for each $\varphi_x$. As shown in Eq. (6), the expectation is taken over the distribution of the augmentation T, so $\varphi_x$ can be derived by Formula (10), which shows that $\varphi_x$ is assigned the average representation of x over the augmentation distribution.
$$\varphi_x^t \leftarrow \mathbb{E}_T\left[\Phi_{\theta^t}(T(x))\right] \tag{10}$$
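The assignment in Formula (10) follows from a short calculation. The sketch below assumes that the loss in Eq. (6) is the expected squared L2 distance between the network output and φ_x, and it writes Φ_θ for the network and T for a random augmentation; this notation is an assumption of the sketch.

```latex
% For a fixed sample x and fixed parameters \theta^t, minimize over \varphi_x:
J(\varphi_x) = \mathbb{E}_{T}\left[\left\|\Phi_{\theta^t}(T(x)) - \varphi_x\right\|_2^2\right]

% Set the gradient with respect to \varphi_x to zero:
\nabla_{\varphi_x} J = -2\,\mathbb{E}_{T}\left[\Phi_{\theta^t}(T(x)) - \varphi_x\right] = 0

% Hence the minimizer is the mean representation over augmentations:
\varphi_x = \mathbb{E}_{T}\left[\Phi_{\theta^t}(T(x))\right]
```

Because J is a convex quadratic in φ_x, this stationary point is the unique minimizer.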
The Siamese network model is alternately approximated by Formulas (8) and (10). In this process, the augmentation is sampled only once, denoted T', so that the expectation $\mathbb{E}_T$ is ignored and Formula (11) is obtained.
$$\varphi_x^t \leftarrow \Phi_{\theta^t}(T'(x)) \tag{11}$$
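Then, substituting Formula (11) into Formula (8), we get Formula (12):
$$\theta^{t+1} \leftarrow \arg\min_\theta \mathbb{E}_{x, T}\left[\left\|\Phi_\theta(T(x)) - \Phi_{\theta^t}(T'(x))\right\|_2^2\right] \tag{12}$$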
where $\theta^t$ is a constant in this sub-problem and T' implies another view due to its random nature. This formulation exhibits the Siamese architecture. If Eq. (12) is implemented by reducing the loss with one SGD step, the result approaches the SimSiam algorithm: a Siamese network with stop-gradient naturally applied.
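The above process does not involve the role of the predictor network h in the model. As shown in Fig.2, the predictor network h is expected to minimize the expected squared error between its output and the encoder output of the other view, which is expressed as follows:
$$\mathbb{E}_{e}\left[\left\|h(e_1) - e_2\right\|_2^2\right] \tag{13}$$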
The corresponding optimal solution satisfies:
$$h(e_1) = \mathbb{E}_{e}[e_2] = \mathbb{E}_T\left[f(T(x))\right] \tag{14}$$
Obviously, Eq. (14) is similar to Formula (10). In Formula (11), since the expectation is ignored when the augmentation is sampled only once, the function of the predictor h can be considered as filling this gap. It is almost impossible to directly calculate the expectation over the augmentation T, but learning appropriate parameters so that the predictor network predicts this expectation is a feasible scheme. After multiple epochs of training, the sampling of T is implicitly distributed in the parameters of the predictor.
2.2.2 SiamARN algorithm
1) Data augmentation (DA) algorithm. DA is a technique that generates new samples by making certain changes to the data. It has been widely used in computer vision and natural language processing, for example geometric transformation, blurring, and random occlusion on image data, and random insertion, random exchange, and random deletion on text data. However, for general tabular data, there are few data augmentation methods that can change the data to obtain different augmented versions. In contrastive learning, one of the purposes of data augmentation is to contrast the representations of data in different views and to obtain the deep representation of data through the contrastive learning process. In anomaly detection, this means that different representations of the data must be obtained for contrastive learning without changing the original distribution of the data. Therefore, this study proposes selecting different data processing methods to obtain representations of tabular data from different perspectives for the contrastive learning process. Three main methods are used:
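a) Min-Max normalization: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$, where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values.
b) Z-score normalization: $x' = \frac{x - \bar{x}}{s}$, where $\bar{x}$ is the mean and s is the standard deviation.
c) L1-norm normalization: $x' = \frac{x}{\|x\|_1}$, where $\|x\|_1$ is the L1-norm.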
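The three processing methods can be sketched in NumPy as view generators. The paper does not state the axis along which each statistic is computed, so this sketch assumes that Min-Max and Z-score are applied per feature column and that the L1-norm is applied per sample row; it also assumes the population standard deviation. The function names are illustrative.

```python
import numpy as np


def min_max(x: np.ndarray) -> np.ndarray:
    """Scale each feature column to [0, 1] (assumed per-column)."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (x - lo) / span


def z_score(x: np.ndarray) -> np.ndarray:
    """Center each feature column and scale it to unit standard deviation."""
    mean, std = x.mean(axis=0), x.std(axis=0)
    std = np.where(std > 0, std, 1.0)  # guard against constant columns
    return (x - mean) / std


def l1_norm(x: np.ndarray) -> np.ndarray:
    """Divide each sample row by its L1-norm (assumed per-row)."""
    norm = np.abs(x).sum(axis=1, keepdims=True)
    norm = np.where(norm > 0, norm, 1.0)  # guard against all-zero rows
    return x / norm


x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
# Two different views of the same data for the contrastive branches.
view1, view2 = z_score(x), min_max(x)
```

Any pair of these functions, or one of them together with the unprocessed data, yields two views of the same samples without adding or removing rows.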
2) Encoder network. As shown in Fig.1, the encoder network is composed of MHA and projection modules. The weights are shared between the two encoder networks. The augmented data learns the data feature relationship through the MHA module, and then connects two groups of fully connected networks. Each group of networks includes a fully connected layer, a BN layer, and a ReLU layer. The encoder network is used to realize the deep representation of data, and the detailed process is as follows.
Suppose that the input dimension of the single-branch encoder network is (b, 1, n), where b represents the batch size and n is the feature dimension of data x. MHA is used to learn the representation of the input, the number of heads is H, and each attention head is calculated by scaled dot-product attention:
$$H_i = \mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i \tag{15}$$
where $H_i$ represents the i-th head, $Q_i$ represents the query matrix, $K_i$ represents the key matrix, $V_i$ represents the value matrix, and $d_k$ is the key dimension. After concatenating the output of each head, the MHA output is:
$$\mathrm{Att}_H = \mathrm{Concat}(H_1, \ldots, H_H)\, W^O \tag{16}$$
where $W^O$ represents a learnable parameter matrix. The attention output AttH is sent to the fully connected layer fL, and the internal covariate shift problem [30] in the network learning process is mitigated by the BN layer fBN. The output of the encoder network is obtained by the ReLU activation function.
Note that only one group of the fully connected network computation is described here. Because the two groups are computed in the same way, the second is omitted.
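The encoder described above can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions: it uses `torch.nn.MultiheadAttention` with self-attention (query, key, and value all equal to the input), treats each sample as a sequence of length 1 to match the (b, 1, n) input shape, and uses 128 hidden units as in the experimental settings. The class name `Encoder` and the default number of heads are illustrative.

```python
import torch
from torch import nn


class Encoder(nn.Module):
    """MHA module followed by two (Linear, BatchNorm, ReLU) groups."""

    def __init__(self, n_features: int, num_heads: int = 1, hidden: int = 128):
        super().__init__()
        # n_features must be divisible by num_heads.
        self.mha = nn.MultiheadAttention(n_features, num_heads, batch_first=True)

        def group(d_in: int, d_out: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out), nn.ReLU()
            )

        self.proj = nn.Sequential(group(n_features, hidden), group(hidden, hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.unsqueeze(1)           # (b, n) -> (b, 1, n)
        att, _ = self.mha(x, x, x)   # self-attention: Q = K = V = x
        return self.proj(att.squeeze(1))  # (b, hidden)


torch.manual_seed(0)
encoder = Encoder(n_features=8, num_heads=2)
out = encoder(torch.randn(4, 8))
```

Both contrastive branches would call the same `encoder` instance, which is one straightforward way to share the weights.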
3) Predictor network. The predictor network is explained in Section 2.2.1. It is only applied to the pre-training learning process of data representation and will not be used as a reserved part of the pre-training model.
4) Classifier network. The classifier network performs anomaly detection and judgment on the deep representation data, which consists of a fully connected layer, a ReLU layer, an output layer, and a Softmax layer. It forms an anomaly detection model together with the pre-trained encoder network. In the learning process of the model, the network parameters are fine-tuned and learned through a small amount of labeled data, and the prediction results of the data are output through the Softmax layer. The network selects cross entropy as the loss function.
Based on the above analysis, the pseudo-code of SiamARN algorithm is shown as follows.
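As a hedged illustration of the overall procedure, the following PyTorch sketch runs one representation learning step, one classification learning step, and one prediction pass. It is not the authors' pseudo-code: a plain fully connected stand-in replaces the MHA encoder to keep the example short, the data and labels are random placeholders, and the view choice (Z-score versus unprocessed data) is one of the combinations discussed in the experiments. The layer widths, batch sizes, optimizer, and learning rate follow the experimental settings.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
n, hidden = 6, 128  # n is a placeholder feature dimension


def group(d_in: int, d_out: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out), nn.ReLU())


# Stand-in encoder (assumption: the MHA module is omitted for brevity).
encoder = nn.Sequential(group(n, hidden), group(hidden, hidden))
predictor = group(hidden, hidden)
# Softmax is folded into the cross-entropy loss below.
classifier = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, 2))


def D(p: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity with stop-gradient on e."""
    return -F.cosine_similarity(p, e.detach(), dim=-1).mean()


# Step 1: representation learning on unlabeled data (batch size 64).
x = torch.randn(64, n)
x1 = (x - x.mean(dim=0)) / x.std(dim=0)  # view 1: Z-score
x2 = x                                   # view 2: unprocessed data
opt = torch.optim.SGD(
    list(encoder.parameters()) + list(predictor.parameters()), lr=0.002
)
e1, e2 = encoder(x1), encoder(x2)        # shared encoder weights
p1, p2 = predictor(e1), predictor(e2)
pretrain_loss = 0.5 * D(p1, e2) + 0.5 * D(p2, e1)
opt.zero_grad()
pretrain_loss.backward()
opt.step()

# Step 2: classification learning on a few labeled samples (batch size 32).
# The predictor is discarded; encoder and classifier are fine-tuned together.
x_lab = torch.randn(32, n)
y_lab = torch.randint(0, 2, (32,))       # placeholder labels, 1 = abnormal
opt_cls = torch.optim.SGD(
    list(encoder.parameters()) + list(classifier.parameters()), lr=0.002
)
cls_loss = F.cross_entropy(classifier(encoder(x_lab)), y_lab)
opt_cls.zero_grad()
cls_loss.backward()
opt_cls.step()

# Step 3: anomaly detection on unseen data.
encoder.eval()
classifier.eval()
with torch.no_grad():
    pred = classifier(encoder(torch.randn(5, n))).argmax(dim=1)
```

In practice each step would loop over many batches and epochs, and the learning rate would be decayed over epochs; the schedule is not specified in the text and is therefore left out here.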
2.3 Anomaly Detection Based on SiamARN
The anomaly detection process based on SiamARN is divided into three steps. The process is shown in Fig.3, including the representation learning step, classification learning step, and anomaly detection step.
Fig.3 Anomaly detection procedure
Firstly, the unlabeled data can get different views of the data through DA. The two views can obtain the representation description of the data through representation learning, corresponding to the encoder network in Fig.1, and send it into the classification learning. Then, a small number of labelled data are trained for classification learning after standardization. The purpose is to obtain a trained anomaly detection model for the final data anomaly judgment. Finally, after the unknown data is standardized, the anomaly detection is carried out in the saved model, and the judgment result shows whether the data is abnormal.
3 Experimental Evaluation
This section mainly introduces the experimental content and result analysis of the SiamARN method. Several public data sets are selected for comparative experiments to verify the effectiveness of the proposed method in anomaly detection. The algorithm is implemented in Python 3.6 with the PyTorch deep learning framework.
3.1 Experimental Data
In the experiment, 14 public data sets in different fields were selected to verify the performance of the SiamARN method in anomaly detection. The experimental data include medical data, such as thyroid, breast cancer, and diabetes; industrial data, such as flight abnormalities and glass production; and some common data sets, such as digital representation and letter pronunciation. The feature dimensions of these data sets differ, and they cover a variety of data set sizes. Moreover, the proportion of abnormal samples also differs, so the data sets cover a variety of anomaly detection situations and have a certain universality. All experimental data are from https://odds.cs.stonybrook.edu/. Detailed data are shown in Table 1.
Table 1 Description of experimental data
3.2 Experimental Settings
The accuracy rate (Acc) , recall rate (R) , F1 score (F1) and Area under the Precision-Recall Curve (AUPR) were selected as the evaluation indexes. The calculation formulas of the first three evaluation indexes are as follows:
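$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN} \tag{20}$$
$$R = \frac{TP}{TP + FN} \tag{21}$$
$$F1 = \frac{2\,TP}{2\,TP + FP + FN} \tag{22}$$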
where TP represents true positive, TN represents true negative, FP represents false positive, FN represents false negative.
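The three indexes can be computed directly from the confusion counts. The pure-Python sketch below assumes that label 1 marks an abnormal sample (the positive class) and label 0 a normal one; the function name `evaluate` and the toy labels are illustrative.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, recall, F1) with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    acc = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is absent.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return acc, recall, f1


# Toy example: 3 abnormal samples, 5 normal samples.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0]
acc, recall, f1 = evaluate(y_true, y_pred)
```

In this toy example one anomaly is missed and one normal sample is flagged, so accuracy stays fairly high while recall drops, which is why recall is reported separately for imbalanced anomaly data.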
The number of neurons in each fully connected layer of the encoder network and the predictor network was 128, and the numbers of neurons in the two fully connected layers of the classifier network were 64 and 2, respectively. The initial learning rate was 0.002 and decreased as the epoch increased. The batch sizes of the two learning stages were 64 and 32, respectively. The optimizer was SGD. In the SiamARN algorithm, because weakly supervised learning uses only a few samples, only 10%-20% of the data was selected as labeled data for anomaly detection learning. The other anomaly detection algorithms directly used the default parameters in the PyOD library, and the corresponding data set was divided into a training set (0.8) and a test set (0.2).
3.3 Experimental Results and Analysis
3.3.1 Experimental results of SiamARN
This section mainly discusses the role of key structures and parameters in SiamARN. The experiments include a comparison of the data before and after representation learning, a comparison of DA method selection, an ablation experiment on network components, and a comparison between MHA and a deep neural network.
1) Representation learning.
Fig.4 shows two cases in anomaly detection. Fig.4 (a) shows abnormal data located at the edge of the normal data range, which are often misjudged as normal because they are close to the normal data. Fig.4 (b) shows normal data located at the edge of the normal data range, which are often misjudged as abnormal because they lie away from the main range. Figs.4 (c) and (d) are the deep representations obtained by the encoder network. In both cases, the distance between normal data is clearly narrowed after representation learning. The distinction between normal data and abnormal data becomes obvious, which makes anomaly detection easier.
Fig.4 Data comparison before and after representation learning
Fig.5 describes, from a multi-dimensional perspective, the distribution of the deep representation obtained from the original data through representation learning. Fig.5 (a) is the original data map, and Figs.5 (b), (c), and (d) are the encoder network outputs on different feature dimensions. It can be seen that the distinction between abnormal and normal data remains obvious in multiple dimensions, although individual anomalies still lie slightly close to the edge of the normal data. Due to the variability of anomalies, it is difficult for contrastive representation learning to cluster all anomalies in similar regions, so abnormal data are often scattered in multiple locations, which distinguishes them from the normal data gathered together.
Fig.5 Representation learning of multi-dimensional data
2) DA analysis.
Table 2 shows the anomaly detection results obtained after selecting different DA methods to process the data. In SiamARN, representation learning can be achieved as long as the two views of the data are different. According to the experimental results, in most cases there are small differences between the results for the two orderings of the same DA combination, but this difference includes the randomness of network learning, has little effect on the experimental results, and can often be ignored.
Among them, the experimental results based on the combination of the Z-score processed view and the original view are the best, and the experimental results based on the combination of (L1, -) data based on Min-Max are the worst. Different DA methods correspond to different experimental results, but in general the results are similar. This shows that, in most cases, the choice of DA has a certain impact on the performance of the model, but the impact is relatively small. The only special case is shown in the last column of Table 2, which is quite different from the results of the other combinations. Therefore, a suitable DA method should still be selected through comparative experiments.
Table 2 Experimental results under different DA methods
Note: “-” indicates that no DA processing was performed on the input data.
3) Ablation experiment.
Table 3 and Fig.6 show the results of the ablation experiment and the loss function curve of the representation learning process, respectively, where SG represents the stop-gradient, BN represents batch normalization, Proj represents the projection module, and Pred represents the predictor module. The experiments examine the role of the stop-gradient and batch normalization in the network. When both the stop-gradient and batch normalization are retained, the experimental results of the model are the best. It can be seen from Table 3 that the results gradually deteriorate as the stop-gradient and batch normalization are removed in turn. The results of (c), (e), (g), and (h) are the worst, and their common feature is that the projection module has no batch normalization, which indicates that batch normalization in this module has a significant influence on the results of the model. According to the experimental results, the performance of the model does not appear to decrease much when there is no stop-gradient. However, Fig.6 shows that the loss function decreases rapidly without the stop-gradient, which indicates that the stop-gradient plays a role in preventing model collapse in representation learning.
Table 3 Results of ablation experiment
The loss function curves in Fig.6 show that the loss decreases rapidly and comes very close to the minimum loss of -1 when there is no stop-gradient or no batch normalization. In particular, when both are absent, the loss function curve directly becomes -1. Obviously, the model finds a ‘convenient’ path in the learning process, such as an extreme possibility: all weights are updated to 0. In this case, the model collapse phenomenon, also known as collapsing solutions, occurs. According to Fig.6, the two better cases are (a) and (d). Compared with the other cases, batch normalization and the stop-gradient based on the projection module play a key role in preventing model collapse.
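Why a collapsed model reaches the minimum loss can be checked numerically. The NumPy sketch below assumes that collapse means every input is mapped to the same constant vector; the negative cosine similarity between two such outputs is then exactly -1, regardless of the data. The random vectors used for the non-collapsed case are placeholders, not model outputs.

```python
import numpy as np


def neg_cosine(p: np.ndarray, e: np.ndarray) -> float:
    """Batch mean of the negative cosine similarity between rows of p and e."""
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    return float(-(p * e).sum(axis=1).mean())


rng = np.random.default_rng(0)

# Collapsed solution: all 32 samples map to one constant vector.
constant = rng.normal(size=16)
collapsed = np.tile(constant, (32, 1))
collapsed_loss = neg_cosine(collapsed, collapsed)

# Non-collapsed placeholder: unrelated random representations.
rep_a = rng.normal(size=(32, 16))
rep_b = rng.normal(size=(32, 16))
random_loss = neg_cosine(rep_a, rep_b)
```

A loss of -1 therefore does not by itself indicate a useful representation, which is why the loss curves have to be read together with the detection results.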
4) Attention.
Table 4 compares the performance of the anomaly detection method when the encoder network uses MHA and when it uses a Deep Neural Network (DNN). The DNN is composed of three groups of fully connected networks, whose structure and parameters are consistent with the predictor network h. Table 5 shows the average time for the model to run one epoch under the two networks. Fig.7 compares the convergence of model learning on the same data set.
Fig.6 Loss function in ablation experiment
Table 4 Experimental results of MHA and DNN
Table 5 Running time
Fig.7 Loss function of MHA and DNN
As Tables 4-5 and Fig.7 show, in terms of model performance, training time, and convergence speed, the representation learning structure using MHA is significantly better than the one using DNN. MHA processes multiple independent attention heads in parallel, and each head can focus on capturing a specific correlation, thereby capturing the deep correlations between data features. While improving the learning efficiency of the model, it can better realize the deep representation learning of data. Therefore, MHA is more suitable for the method proposed in this paper.
3.3.2 Contrast experiment
In the experiment, eight commonly used anomaly detection methods were selected for comparison, including KNN[31], LOF[32], CBLOF[11], OCSVM[33], IForest[34], KPCA[35], ABOD[10], and AE[36]. Experimental results for these methods are derived from an open source Python toolkit for anomaly detection, PyOD[37]. Tables 6-9 show the experimental results of SiamARN and its comparison methods on 14 different data sets.
Table 6 Accuracy comparison of SiamARN versus seven anomaly detection methods
In the experiment, commonly used classical anomaly detection methods and deep anomaly detection methods are selected for comparison. The results show that the SiamARN anomaly detection method proposed in this paper achieves the best performance. In most cases, the performance of classical anomaly detection methods is worse than that of deep anomaly detection methods, and the deep anomaly detection algorithm is worse than SiamARN. On some data sets, the performance of deep anomaly detection, such as the AE algorithm, is close to that of SiamARN, but the applicability and generalization of AE are inferior to those of SiamARN.
The advantage of SiamARN mainly lies in the contrastive learning structure. Different views of the same data are used as contrastive inputs, and the similarity of their deep representations is used as the loss function for training. This allows the model to learn without relying on labelled samples, without building negative samples for contrast, and without considering the interference of abnormal data. It maps the original data to a new feature space in which abnormal and normal data that were originally adjacent are moved far away from each other. Therefore, the distinction between abnormal and normal data is improved, and a simple classifier network can easily separate abnormal from normal data in the resulting deep representation.
Table 7 Recall rate comparison of SiamARN versus seven anomaly detection methods
Table 8 F1 scores comparison of SiamARN versus seven anomaly detection methods
Why not, then, apply KNN or another simple algorithm directly to the deep representation data for anomaly detection? This approach is similar to the deep learning method for feature extraction mentioned in the Introduction. On the one hand, deep learning based on feature extraction mainly extracts low-dimensional features from high-dimensional and nonlinear complex data, such as images and sound waves, to reduce data complexity.
However, for tabular data, the representation learning result of SiamARN is not a simple dimensionality reduction. On the contrary, the deep representation output by the encoder may have more features than the original data. Its main purpose is to narrow the distance between similar data, which brings the normal data close to each other. Although abnormal data may take various forms, their distribution differs from that of normal data, so they are kept away from the normal data. On the other hand, Ref.[13] points out that in such methods the feature extraction stage and the subsequent detection stage are completely disconnected, which easily leads to unsatisfactory anomaly scores. In SiamARN, by contrast, the encoder network used for representation learning also participates in the later anomaly detection. Therefore, the weakly supervised learning process with a small number of labelled samples can be used to distinguish abnormal data from normal data. This is also illustrated by the experimental results in Tables 6-9.
Table 9 AUPR comparison of SiamARN versus seven anomaly detection methods
3.3.3 Interpretability analysis
The interpretability of the SiamARN method mainly comes from two aspects: the self-learning process of the Siamese contrastive structure and the feature representation learning based on MHA.
The self-learning process of the Siamese contrastive structure learns the feature representation by maximizing the similarity between different augmented views of the same input. It introduces dynamic targets and asymmetry through a predictor and a stop-gradient operation, implicitly achieving the goal of contrastive learning. The stop-gradient interrupts gradient backpropagation through one branch, and the predictor acts on only one branch. This asymmetry forces the model to learn a non-trivial representation and prevents the model from collapsing. Thus, the model gradually converges to a meaningful feature representation during the alternating optimization process.
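The asymmetric objective described above can be sketched as a forward pass in NumPy. This is only an illustration under stated assumptions: the negative cosine similarity is assumed as the similarity measure, the toy `encoder` and `predictor`, the noise-based views, and all shapes are hypothetical, and NumPy computes no gradients, so the stop-gradient appears only as a convention (in an autodiff framework the target branch would be detached).

```python
import numpy as np


def negative_cosine(p, z, eps=1e-12):
    """Mean negative cosine similarity between matching rows of p and z.

    z is the target branch. With automatic differentiation it would be
    wrapped in a stop-gradient so that no gradient flows through it.
    """
    p = p / (np.linalg.norm(p, axis=1, keepdims=True) + eps)
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)
    return float(-(p * z).sum(axis=1).mean())


def siamese_loss(view1, view2, encoder, predictor):
    """Predictor acts on one branch only; the other branch is the target."""
    p1 = predictor(encoder(view1))   # online branch: encoder + predictor
    z2 = encoder(view2)              # target branch: stop-gradient here
    return negative_cosine(p1, z2)


# Hypothetical toy networks: one tanh layer as encoder, one linear predictor.
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(6, 8))
W_pred = rng.normal(size=(8, 8))
encoder = lambda x: np.tanh(x @ W_enc)
predictor = lambda z: z @ W_pred

# Two views of the same batch, here produced by small additive noise
# (an assumed augmentation, chosen only for illustration).
x = rng.normal(size=(4, 6))
view1 = x + 0.05 * rng.normal(size=x.shape)
view2 = x + 0.05 * rng.normal(size=x.shape)
loss = siamese_loss(view1, view2, encoder, predictor)
```

The loss lies in [-1, 1] and reaches -1 when the predicted and target representations point in the same direction, so minimizing it maximizes the similarity between the two views without requiring negative samples.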
The feature representation learning of MHA refers to the feature representation of the input obtained by the encoder network during representation learning. MHA computes the attention weights in parallel through multiple independent ‘heads’. Each head has its own query, key, and value weight matrices and therefore generates a different attention distribution.
The outputs of all heads are concatenated and linearly transformed to obtain the final result. This allows the model to focus on multiple dependency patterns simultaneously, including local and global relationships. The attention weights can be visualized to show how strongly the feature at each position correlates with the other features, which enhances the interpretability of the model.
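The computation described above can be illustrated with a minimal NumPy sketch of multi-head self-attention. It assumes the standard scaled dot-product formulation and treats each input feature as one embedded token; the function names, the random weights, and the sizes (5 tokens, model width 8, 2 heads) are hypothetical and are not the configuration of the SiamARN encoder.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over x of shape (n_tokens, d_model).

    Returns the output (n_tokens, d_model) and the attention weights
    (num_heads, n_tokens, n_tokens), which can be inspected per head.
    """
    n, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (n, d_model) -> (num_heads, n, d_head); each slice of the
        # projection acts as that head's own weight matrix.
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)         # one distribution per token
    heads = weights @ v                        # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ w_o, weights               # concatenate, then project


# Hypothetical sizes: 5 feature tokens, model width 8, 2 heads.
rng = np.random.default_rng(0)
n_tokens, d_model, num_heads = 5, 8, 2
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
```

Row i of `weights[h]` is the attention distribution of token i in head h; plotting these matrices as heat maps is one way to display the feature correlations mentioned above.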
4 Conclusions
In this paper, a weakly supervised anomaly detection method called SiamARN is proposed. Contrastive self-supervised learning is used to learn representations of unlabelled data, and weakly supervised classification learning on a small amount of existing labelled data is used to detect anomalies. Experimental results show that the method achieves better anomaly detection performance than the compared methods.
The main advantages of this method are as follows. Firstly, it solves the problem that unsupervised anomaly detection methods cannot utilize the prior knowledge of existing anomalies: through a weakly supervised learning process, SiamARN avoids wasting this prior knowledge and improves the recall rate of anomaly detection. Secondly, the problem of edge anomaly detection is addressed by the contrastive learning structure. The obtained deep representation makes the distinction between abnormal and normal data more obvious in the feature space. Finally, the introduction of self-supervised representation learning avoids the dependence on the anomaly distribution hypothesis, and a network structure based on MHA is designed to learn the deep relationships in the data, which eliminates the need to analyze and process the raw data.
However, the experimental results also show that SiamARN still has considerable room for improvement in evaluation metrics such as the recall rate on some data sets. Therefore, the next stage of research should focus on data representation learning, that is, on how new contrastive learning methods can make abnormal data easier to distinguish from normal data.