Support vector machine
Advantages: among the fastest methods for finding decision functions; a unique solution, since training reduces to a quadratic programming problem over a convex domain; more confident classification, because the algorithm finds the separating band of maximum width. Disadvantages: sensitivity to noise and to the standardization of the data; when the classes are not linearly separable, there is no clear approach to choosing the kernel automatically (that is, to constructing the rectifying subspace as a whole).
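The maximum-width-band idea can be sketched with a tiny linear SVM trained by sub-gradient descent on the hinge loss (Pegasos-style). This is a minimal illustration, not the full quadratic-programming formulation: the toy data, the regularization constant and the epoch count are all arbitrary choices.

```python
# Minimal linear SVM trained with a Pegasos-style sub-gradient method.
# Illustrative sketch only: data, lambda and epochs are arbitrary choices.

def train_linear_svm(points, labels, lam=0.01, epochs=200):
    """Return a weight vector w such that sign(w . x) predicts the label."""
    # Append a constant 1 to every point so the bias is learned as a weight.
    data = [list(p) + [1.0] for p in points]
    w = [0.0] * len(data[0])
    t = 0
    for _ in range(epochs):
        for x, y in zip(data, labels):
            t += 1
            eta = 1.0 / (lam * t)  # decaying learning rate
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            # Shrink w (regularization), then add the example if it
            # violates the unit margin.
            w = [(1 - eta * lam) * wi for wi in w]
            if margin < 1:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, point):
    x = list(point) + [1.0]
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

# Two linearly separable clusters.
X = [(2, 2), (3, 3), (2.5, 3.5), (-2, -2), (-3, -1), (-2.5, -3)]
y = [1, 1, 1, -1, -1, -1]
w = train_linear_svm(X, y)
```

On separable data like this, the learned hyperplane separates both clusters; the sensitivity to feature scaling mentioned above shows up here directly, since the margin is measured in the raw units of the inputs.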
Artificial neural network
Advantages: the ability to transform input information into output information without relying on a probabilistic model of the data distribution; many possibilities for optimizing the model through the use of nonlinear artificial neurons; the ability to be retrained and adapted to a changing, non-stationary environment; the universality of the algorithm, so one design solution can be reused in several subject areas; high execution speed. Disadvantages: the difficulty of choosing a suitable network structure for a specific task; trained neural networks are not interpretable models ("black boxes"), so a logical interpretation of the patterns they capture is almost impossible; the ability to process only numeric variables.
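The value of nonlinear neurons can be shown with a hand-wired two-layer network computing XOR, a function no single linear neuron can represent. The weights below are set by hand purely for illustration, not trained:

```python
# A two-layer network with hand-set weights computing XOR.
# A single linear neuron cannot represent this function; two nonlinear
# (threshold) hidden neurons make it expressible.

def step(v):
    return 1 if v > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # fires when at least one input is 1
    h2 = step(x1 + x2 - 1.5)    # fires only when both inputs are 1
    return step(h1 - h2 - 0.5)  # "at least one, but not both"
```

The same structure, with weights found by gradient descent instead of by hand, is what training a real network produces; the learned weights are the "black box" the text refers to.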
K-Nearest Neighbors Algorithm
Advantages: ease of implementation and the ability to introduce additional sampling settings; the logic of the algorithm is easily interpreted, which is why it is actively used in medicine, biometrics, and jurisprudence. Disadvantages: inefficient use of memory, since the full training sample must be kept; the large number of distance computations makes the algorithm labor-intensive.
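A minimal sketch of the method, with made-up data: every prediction scans the whole stored sample, which is exactly the memory and compute cost noted above.

```python
# Brute-force k-nearest-neighbours classifier (majority vote).
from collections import Counter
import math

def knn_predict(train, labels, query, k=3):
    # Distance from the query to every stored training point.
    dists = sorted((math.dist(x, query), y) for x, y in zip(train, labels))
    # Majority vote among the k closest points.
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Two illustrative clusters with class labels "a" and "b".
X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
y = ["a", "a", "a", "b", "b", "b"]
```

The interpretability advantage is visible here: a prediction can always be explained by pointing at the k concrete neighbours that voted for it.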
Decision tree
Advantages: clarity and easy interpretation of the result; the ability to work with both numeric and nominal attributes; high execution speed; the ability to process data containing erroneous or missing values. Disadvantages: some algorithms require the target attribute to be discrete; the sample must contain several significant features; complex relationships between the elements of the set under consideration degrade the quality of the algorithm; decision trees are ineffective in classification problems with a large number of classes; the result depends heavily on the quality of the training sample (noise near the root of the tree can lead to a non-optimal feature being chosen for a split).
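The root-split step the last point refers to can be sketched as a one-level tree (a "stump") learned by exhaustive threshold search with the Gini impurity. The feature and data below are invented for illustration:

```python
# One-level decision tree: pick the threshold on a numeric feature that
# minimises the weighted Gini impurity of the two resulting branches.

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return the threshold with the lowest weighted Gini impurity."""
    best_score, best_t = float("inf"), None
    for t in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_score, best_t = score, t
    return best_t

# Illustrative data: a binary target that flips cleanly at age 30.
ages = [22, 25, 30, 45, 50, 61]
bought = [0, 0, 0, 1, 1, 1]
```

The noise sensitivity mentioned above follows directly from this greedy procedure: a few mislabeled points near the root can shift which threshold (or feature) wins, and every split below inherits that choice.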
A novelty is an object that differs fundamentally in its properties from the objects of the existing sample and exhibits completely new behavior under unchanged conditions. Novelty detection finds new objects that differ from the previous ones but are not necessarily outliers. That is, the algorithm estimates how similar a new value is to the existing sample.
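One simple way to estimate "how similar a new value is to the existing sample" is a distance rule: call a new object novel if it lies farther from the sample than any sample object lies from its own nearest neighbour. This rule and the data are a sketch, not a standard named algorithm:

```python
# Distance-based novelty check: a new point is "novel" if its distance to
# the closest sample point exceeds the largest nearest-neighbour distance
# observed inside the sample itself.
import math

def is_novel(sample, new_point):
    def nn_dist(p):
        return min(math.dist(p, q) for q in sample if q is not p)
    radius = max(nn_dist(p) for p in sample)  # widest internal gap
    return min(math.dist(new_point, q) for q in sample) > radius

# A small sample forming a unit square.
S = [(0, 0), (1, 0), (0, 1), (1, 1)]
```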
An outlier is an object that results from data errors such as rounding, measurement inaccuracies, incorrect entries, or typos; from noise arising through misclassification; or from objects of other samples mixed into the sample under consideration. Outlier detection aims to find, within an existing sample, objects that distort it: abnormally high or low values, excessively volatile values, and so on.
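A classical statistical way to flag "abnormally high/low" values is the z-score: mark values more than three standard deviations from the mean. The 3-sigma cut-off is a common convention, and the data below is made up:

```python
# Flag values whose z-score exceeds a cut-off (3 sigma by convention).
# Note: with very small samples this test cannot fire at all, because the
# maximum possible z-score is bounded by (n - 1) / sqrt(n).
import statistics

def flag_outliers(values, z_max=3.0):
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mu) / sigma > z_max]

# Thirty values near 10 and one gross data-entry error.
data = [9, 10, 11] * 10 + [1000]
```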
Supervised anomaly detection
Supervised anomaly detection is a method in which the model is trained on already labeled data, with some observations marked in advance as outliers.
Semi-supervised anomaly detection
Semi-supervised anomaly detection – the input is a sample consisting only of normal values, without any deviations. The main idea is that anomalies are detected at subsequent stages as deviations from the values belonging to the original sample.
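The scheme can be sketched with a detector that fits only normal values and later flags anything outside the range they define. The mean ± 3 standard deviations rule and the data are illustrative assumptions:

```python
# Semi-supervised sketch: fit on normal values only; at prediction time,
# flag any value outside mean +/- 3 standard deviations of that sample.
import statistics

class NormalRangeDetector:
    def fit(self, normal_values):
        # Training sees no anomalies, only the normal regime.
        self.mu = statistics.mean(normal_values)
        self.sigma = statistics.stdev(normal_values)
        return self

    def is_anomaly(self, value):
        return abs(value - self.mu) > 3 * self.sigma

# Illustrative normal-only training sample.
det = NormalRangeDetector().fit([9, 10, 11, 10, 9, 11, 10, 10])
```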
Unsupervised anomaly detection
Unsupervised anomaly detection – the case when there are no data labels, and the algorithm must determine on its own which data are anomalies. With this option there is no particular difference between training and test data. The idea is to detect anomalies based on the intrinsic properties of the data; typically, distance or density measures are used to decide whether a value is an outlier.
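A minimal distance-based sketch of this idea: score every object by its mean distance to its k nearest neighbours, so isolated points get the largest scores. The scoring rule and data are simplifications (density-based methods such as LOF refine this by comparing local densities):

```python
# Unsupervised distance-based anomaly scoring: no labels are used.
# Each point's score is its mean distance to its k nearest neighbours;
# where to threshold the score is left to the analyst.
import math

def knn_scores(points, k=2):
    scores = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points if q is not p)
        scores.append(sum(dists[:k]) / k)
    return scores

# Four clustered points and one isolated point.
P = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_scores(P)
```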
Anomaly detection is one of the key tasks in preparing data for further analysis and modeling. The quality of the chosen approach is usually measured by the accuracy of the result obtained.
The most relevant areas of application for such methods are medicine and payment systems. For example, in the first area, the possible side effects of a particular drug must be described as thoroughly as possible in order to avoid undesirable effects in potential patients.
An example of the second area is the detection of anomalies in transactions on customers' bank cards. Here the cost of accuracy is expressed in money, as well as in growing customer distrust of a bank that allowed third-party interference with customer accounts. Insufficiently accurate anomaly detection can leave important anomalies in the information flow undetected.
This creates loopholes for fraudsters, who can adapt to the weaknesses of the control algorithm. For example, a poorly chosen cut-off threshold for anomalies (on parameters such as the frequency of authorization attempts or the frequency of external money transfers across a subset of accounts) may mean that small fraudulent transactions go unnoticed, while only large ones are correctly identified as threats.
Therefore, today anomaly detection is driven more by practice than by research, and algorithms of this kind are applied in a number of areas.
Each of these areas involves the use of different methods to identify anomalies.
The choice of models depends on a number of factors.
When choosing a model, the analyst is guided by his own criteria. Each method for detecting anomalies is, to one degree or another, effective or ineffective when applied to specific tasks. Typically, several methods are combined to improve the result: statistical methods are combined with machine learning methods, and graphical methods are used for simplification and for coarse clipping of anomalies.