Markus Goldstein - goldiges.de

Anomaly Detection in Large Datasets

Abstract.

Anomaly detection is the task of finding instances in a dataset which are different from the norm. Today, anomaly detection is a core part of many data mining applications, for example in network intrusion detection, fraud detection, data leakage prevention, the identification of failures in complex systems, and diagnosis in the medical domain. In all of these applications, the amount of stored data has increased dramatically in the last decade, resulting in a strong demand for algorithms suitable for these crucial large-scale challenges. The main goal of this thesis is to bridge this gap and to provide efficient anomaly detection algorithms which are significantly faster than existing methods. A broad spectrum of algorithms is proposed, covering the trade-off between efficiency and accuracy for both unsupervised and semi-supervised anomaly detection. All presented methods reduce the overall computational complexity such that it is now possible to process large-scale datasets within an appropriate runtime.

In an analysis task, where a dataset with no labeling information should be processed, unsupervised anomaly detection algorithms are the methods of choice. In this thesis, a comprehensive evaluation of state-of-the-art unsupervised anomaly detection algorithms is performed, highlighting their strengths and weaknesses for both local and global anomaly detection problems. Based on an in-depth analysis of existing methods, novel and faster algorithms are proposed. In particular, two clustering-based approaches for the detection of local outliers as well as for estimating the clusters' densities with a higher precision are introduced. A new histogram-based algorithm, which assumes independence of features, results in a speed-up of up to 2,200 times compared to multivariate methods for detecting global anomalies. Additionally, a proposed expectation-maximization-based approach for computing a local outlier score in a multivariate setting now saves up to 94% of distance computations. Besides these theoretical contributions, two new application scenarios for unsupervised anomaly detection were investigated. First, a document authentication method is presented, which identifies suspicious documents printed with a different printing technique than the norm. The advantage of using unsupervised anomaly detection for forgery detection is that no prior training with verified genuine documents is required. A second novel application domain evolved in the context of security event log analysis. Here, anomalies and suspicious behavior in an enterprise computer network are detected by analyzing multiple log data sources. For evaluation, data from a large enterprise network was used and security flaws such as misconfiguration as well as incorrect classified accounts were detected.

The second part of this thesis focuses on semi-supervised anomaly detection in the context of mitigating Distributed Denial of Service (DDoS) attacks. In a DDoS attack, a hacker uses many, often compromised machines, in order to send an overwhelming amount of requests to a single service making it unavailable for regular users. In this work, the focus is put on highly distributed and spoofed attacks, which are very difficult to cope with using state-of-the-art mitigation technology. Therefore, the concept of predicting the source IP addresses for normal users is investigated in detail following a semi-supervised anomaly detection approach. During normal operation, the source IP distribution is learned and during an attack, sources which are unlikely to appear are blocked. Two practical approaches using this concept are introduced, which also take hardware limitations into account. It is shown that one approach developed outperforms existing methods by 32\% in terms of collateral damage (erroneously denying legal users). Additionally, a practical defense system was developed, which uses the proposed method as a first line of defense and further proceeds with a detailed attack analysis for determining more specific filter rules. In this detailed defense, each IP flow is scored according to its legitimate probability based on eight different features. Finally, as a more accurate scoring alternative in the future, a new multivariate algorithm is proposed: A decision tree learner was enhanced such that it serves as a semi-supervised anomaly detection algorithm. Its evaluation shows that it outperforms existing methods on three out of four standard datasets.

Altogether, the contributions of this thesis make it possible for the first time to apply anomaly detection algorithms on large-scale datasets. Far beyond what is achievable with state-of-the-art methods today, the technology presented opens new fields of application and research. It constitutes an important step for the application of anomaly detection algorithms on big data challenges in the near future.

More Information

The book can be purchased from the publisher's website or Amazon.

BibTeX | Citation | Publisher

Download Request

If you require my Thesis for research purposes, you can request a PDF download. To do so, please fill the form below with your name and e-mail address (preferably from a university/research organization) and verify it. You will receive a password protected download link within 48 hours.

Error: Something went wrong.

Thank you very much. The request has been sent. Please check your inbox now.