2 September 2025
Modern Anomaly Detection: Techniques, Challenges, and Ethical Considerations
Anomaly detection involves identifying unusual patterns or behaviours in data that deviate from what is expected. These deviations often signal critical events such as fraud, system malfunctions, or security breaches. Anomaly detection is widely applied in fields such as finance, cybersecurity, manufacturing, and healthcare, with the aim of identifying and acting upon these irregularities proactively.
With the increasing digitisation of systems in recent years, enormous volumes of data have been generated and stored. These include time-series data from sensors, textual logs from applications and systems, and high-dimensional inputs like images or video streams. The growing availability of such data creates significant opportunities for anomaly detection. However, labelled anomalies remain scarce, as they are naturally rare, domain-specific, and time-consuming to label. Therefore, the effectiveness of traditional supervised learning approaches is limited, reinforcing the need for alternative approaches. Additionally, as these systems become more automated and integrated into decision-making pipelines, concerns about fairness, transparency, and accountability are gaining momentum.
Challenges in Anomaly Detection
Building effective anomaly detection systems presents several challenges. As mentioned, anomalies are typically rare, poorly defined, and may evolve, resulting in a lack of labelled examples for training. In many cases, labelled anomalies make up less than 1% of the data, leading to severe class imbalance. As a result, traditional supervised learning methods often struggle, facing difficulties in generalising and an increased risk of overfitting to the limited set of labelled anomalies.
Historically, anomaly detection began with rule-based systems and statistical thresholds. Today, machine learning approaches are the most widely used.
Supervised methods, such as decision trees or neural networks, require labelled records of both normal and anomalous instances. Labelled data is used to train the model, allowing it to learn the underlying patterns, which are then applied to make predictions on unseen test data, as illustrated in the diagram below.
Unsupervised methods, on the other hand, learn the patterns of normal behaviour and flag any observations that differ significantly from those learned patterns as potential anomalies, as illustrated in the diagram below. Unlike supervised approaches, they do not require labelled data: the model learns what normality looks like from the training data and applies that notion to detect anomalies in unseen data. Techniques such as Isolation Forest, One-Class SVM, and clustering algorithms are commonly used as baselines for these applications. However, these methods often struggle to distinguish truly anomalous behaviour from merely unusual yet expected behaviour. This is especially challenging in complex or noisy datasets, where high variability and randomness make it difficult to establish a reliable definition of normal behaviour. The task becomes even more complicated when the specific types and patterns of anomalies are unknown beforehand, as the system must identify genuine anomalies without any prior examples to guide it.
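As a concrete illustration, one of the simplest unsupervised detectors can be built from statistical thresholds alone: fit the mean and standard deviation of training data assumed to be mostly normal, then flag points whose z-score exceeds a cutoff. The sensor readings and the 3-sigma threshold below are illustrative assumptions, not taken from the text; a minimal sketch:

```python
import statistics

def fit_zscore_detector(train, threshold=3.0):
    """Learn mean/std from (mostly normal) training data; return a scorer.

    A point is flagged as anomalous when its absolute z-score exceeds
    `threshold` (3.0 is a common, but arbitrary, choice).
    """
    mean = statistics.fmean(train)
    std = statistics.stdev(train)

    def is_anomaly(x):
        return abs(x - mean) / std > threshold

    return is_anomaly

# Illustrative data: sensor readings clustered around 50.
normal_readings = [49.8, 50.1, 50.3, 49.9, 50.0, 50.2, 49.7, 50.1]
detect = fit_zscore_detector(normal_readings)

print(detect(50.2))   # close to the learned mean -> False
print(detect(75.0))   # far from the learned mean -> True
```

Notice that no labels are used at any point; the definition of "anomalous" emerges entirely from the spread of the training data, which is exactly why noisy data with high natural variability undermines this kind of baseline.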
Semi-supervised methods integrate aspects from both supervised and unsupervised techniques, leveraging a small amount of labelled data alongside a larger quantity of unlabelled data. Several semi-supervised techniques are available; however, only two are described in detail below.
1. Self-Training
Self-training is one of the simplest techniques for implementing semi-supervised learning. It starts with a small set of labelled data used to train an initial model. The model then generates predictions for the unlabelled data, and if it is sufficiently confident in some of those predictions, they are added to the labelled dataset. The model is retrained using this expanded dataset, and the process is repeated iteratively, for a predefined number of cycles, until no high-confidence predictions remain, or until all data records have been labelled.
This approach can be practical when the model’s early predictions are accurate; however, if the model is overly confident in incorrect predictions, those errors can propagate and degrade performance over time. The diagram below illustrates a typical example of the self-training process. The first part demonstrates the initial training using a small, labelled dataset. In contrast, the second part represents the addition of confidently predicted labels, resulting in an expanded training set for the next iteration.
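The loop described above can be sketched in a few lines. The nearest-centroid classifier, the distance-ratio confidence measure, and the 1-D data below are illustrative assumptions chosen for brevity, not part of the original description; any model with a confidence score would fit the same skeleton:

```python
# Self-training sketch: a nearest-centroid classifier pseudo-labels
# unlabelled points it is confident about, then retrains on the result.

def centroid(points):
    return sum(points) / len(points)

def self_train(labelled, unlabelled, confidence=2.0, max_iters=10):
    """labelled: dict label -> list of 1-D points; unlabelled: list of points.

    A point is pseudo-labelled only when it lies `confidence` times
    closer to one class centroid than to the other.
    """
    for _ in range(max_iters):                      # at most N cycles
        c0, c1 = centroid(labelled[0]), centroid(labelled[1])
        newly_labelled = []
        for x in unlabelled:
            d0, d1 = abs(x - c0), abs(x - c1)
            if d1 > confidence * d0:                # confidently class 0
                newly_labelled.append((x, 0))
            elif d0 > confidence * d1:              # confidently class 1
                newly_labelled.append((x, 1))
        if not newly_labelled:                      # no confident predictions
            break
        for x, y in newly_labelled:                 # expand the labelled set
            labelled[y].append(x)
            unlabelled.remove(x)
    return labelled, unlabelled

seeds = {0: [1.0, 2.0], 1: [9.0, 10.0]}            # small labelled seed set
pool = [1.5, 2.5, 8.5, 9.5, 5.5]                   # unlabelled pool
labelled, leftover = self_train(seeds, pool)
```

On this toy data the ambiguous point 5.5 is never confidently labelled and remains in the pool, which mirrors the stopping condition in the text; it also hints at the failure mode above, since a poorly placed early centroid would confidently absorb the wrong points.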
2. Co-Training
Co-training is a semi-supervised learning technique that uses multiple perspectives of the same data. Complementary feature sets are used, each one containing distinct and independent information relevant to the target variable.
The method begins with a small set of labelled data, similar to the self-training approach, and a larger set of unlabelled data. The features are split into two separate groups, each containing different and complementary information. Two separate models are then trained, one per feature set, and during each iteration both models predict labels for the unlabelled data. High-confidence predictions from one model are added as labelled samples to the other model's labelled set, a cross-labelling strategy that allows both models to improve iteratively by using each other's confident predictions. This process continues for a set number of iterations, until no further high-confidence predictions can be made, or until all data has been labelled, as illustrated in the diagram below. Co-training is particularly effective when each feature set provides sufficient and independent information for classification, such as using webpage text and hyperlinks as distinct sources of data.
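A compact sketch of this cross-labelling loop follows. As in the self-training example, the nearest-centroid models, the confidence margin, and the two 1-D "views" are illustrative assumptions; the essential point is that each model's confident predictions feed the other model's labelled set:

```python
# Co-training sketch: two views of each sample, each with its own tiny
# nearest-centroid model; confident labels cross over between models.

def centroid(vals):
    return sum(vals) / len(vals)

class CentroidModel:
    """Tiny 1-D nearest-centroid classifier, used only for illustration."""
    def __init__(self, class0, class1):
        self.class0, self.class1 = list(class0), list(class1)

    def predict_confident(self, x, margin=2.0):
        """Return 0 or 1 when x is `margin` times closer to one centroid,
        else None (not confident)."""
        d0 = abs(x - centroid(self.class0))
        d1 = abs(x - centroid(self.class1))
        if d1 > margin * d0:
            return 0
        if d0 > margin * d1:
            return 1
        return None

    def add(self, x, label):
        (self.class0 if label == 0 else self.class1).append(x)

def co_train(model_a, model_b, unlabelled, max_iters=10):
    """unlabelled: list of (view_a_feature, view_b_feature) pairs."""
    for _ in range(max_iters):
        progress = False
        for xa, xb in list(unlabelled):
            ya = model_a.predict_confident(xa)
            yb = model_b.predict_confident(xb)
            if ya is not None:
                model_b.add(xb, ya)        # A's confident label teaches B
            if yb is not None:
                model_a.add(xa, yb)        # B's confident label teaches A
            if ya is not None or yb is not None:
                unlabelled.remove((xa, xb))
                progress = True
        if not progress:                   # no confident predictions left
            break
    return unlabelled

# Illustrative seeds: view-A and view-B features for two classes.
m_a = CentroidModel(class0=[1.0, 2.0], class1=[9.0, 10.0])
m_b = CentroidModel(class0=[100.0, 110.0], class1=[300.0, 310.0])
pool = [(1.5, 105.0), (9.5, 305.0), (5.5, 120.0)]
remaining = co_train(m_a, m_b, pool)
```

Note how the last point, ambiguous in view A (5.5 sits midway between the class centroids), is still labelled thanks to view B, which is the core advantage co-training has over plain self-training when the two views are genuinely complementary.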
To support these core methods, additional techniques are often implemented to address situations where anomalies are scarce and largely unlabelled.
While these techniques enhance the robustness of anomaly detection models, it is essential to recognise that such systems are often applied in sensitive areas, such as finance, healthcare, and security. This raises several key concerns, chief among them interpretability, which plays a vital role in machine learning. While complex models may deliver high performance, their "black-box" nature can make results difficult to explain. To address this, tools such as SHAP (Shapley Additive Explanations) and LIME (Local Interpretable Model-agnostic Explanations) are commonly used to help uncover how models make their decisions. Other concerns must also be considered, such as bias in training data, lack of transparency, and privacy issues, all of which can result in unfair or harmful outcomes. Ensuring responsible use requires thoughtful model design, rigorous validation, and continuous monitoring.
Looking ahead, anomaly detection is becoming increasingly relevant, with growing attention on real-time detection, explainability, and cross-domain adaptation. In many contexts, such as security or predictive maintenance, detecting anomalies as they happen is critical. As a result, new solutions are emerging with a strong focus on real-time capabilities. At the same time, explainability is gaining even more importance, particularly in sensitive or regulated fields, where it is essential to understand the key factors or features that contribute most to anomalies. This insight helps stakeholders not only trust the detection results but also take targeted actions to address underlying issues. Alongside this, there is growing interest in transfer learning, a technique in which a model trained on one task or dataset is adapted for a different, yet related, task. In the context of anomaly detection, this enables a model to learn patterns from a domain with abundant labelled data and apply that knowledge to a new domain where labelled anomalies are scarce. By transferring previously learned features, the model can improve detection performance in new environments while reducing the need for extensive manual labelling.
In the future, the focus will be on building faster, more transparent, and adaptive systems capable of handling multivariate data, reacting to it in real-time, and justifying their decisions.