My Publications

Conferences, Journals, Books

Abstract:
Anomaly detection (also known as outlier detection) is the task of finding instances in a dataset which deviate markedly from the norm. Anomalies are often of specific interest in many real-world analytic tasks, since they can refer to incidents requiring special attention. The detection of suspicious activity can be helpful both in post-incident investigation and in early-warning setups, where anomalies are detected in recent, up-to-date datasets or even in streaming data.

PDF download | BibTeX | Citation | Publisher

Abstract:
Outlier detection is the process of detecting individual data points that deviate markedly from the majority of the data. Typical applications include intrusion detection and fraud detection. In comparison to the well-known classification tasks in machine learning, outlier detection commonly relies on unsupervised learning techniques with unlabeled data. Recent algorithms mainly focus on detecting the outliers, but do not provide any insight into what caused the outlierness. Therefore, this paper presents two model-dependent approaches to provide explainability in multivariate outlier detection using feature ranking. The approaches are based on the k-nearest-neighbors and Gaussian Mixture Model algorithms. In addition, these approaches are compared to an existing method based on an autoencoder neural network. For a qualitative evaluation and to illustrate the strengths and weaknesses of each method, they are applied to one synthetically generated and two real-world data sets. The results show that all methods can identify the most relevant features in synthetic and real-world data. It is also found that the explainability depends on the model being used: the Gaussian Mixture Model shows its strength in explaining outliers that violate feature correlations, while the k-nearest-neighbors and autoencoder approaches are more general and suitable for data that does not follow a Gaussian distribution.
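
The abstract does not spell out the exact kNN-based ranking procedure, so the following is only a minimal illustrative sketch of one plausible variant: rank features by how much they contribute to the squared distance between an outlier and its k nearest neighbors. The function name and parameters are my own, not taken from the paper.

```python
# Minimal sketch (not the paper's exact method): rank features by their
# contribution to the distance between an outlier and its k nearest neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_feature_ranking(X, outlier, k=5):
    """Return feature indices sorted by contribution to the kNN distance."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    _, idx = nn.kneighbors(outlier.reshape(1, -1))
    neighbors = X[idx[0]]
    # Per-feature mean squared difference to the k nearest neighbors.
    contrib = ((neighbors - outlier) ** 2).mean(axis=0)
    return np.argsort(contrib)[::-1], contrib

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
outlier = np.array([0.0, 0.0, 6.0, 0.0])   # deviates only in feature 2
ranking, scores = knn_feature_ranking(X, outlier)
print(ranking)  # feature 2 should rank first
```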

PDF download | BibTeX | Citation | Publisher

Abstract:
Anomaly detection is the process of identifying unexpected items or events in datasets which differ from the norm. In contrast to standard classification tasks, anomaly detection is often applied to unlabeled data, taking only the internal structure of the dataset into account. This challenge is known as unsupervised anomaly detection and is addressed in many practical applications, for example in network intrusion detection, fraud detection, as well as in the life science and medical domain. Dozens of algorithms have been proposed in this area, but unfortunately the research community still lacks a comparative universal evaluation as well as common publicly available datasets. These shortcomings are addressed in this study, where 19 different unsupervised anomaly detection algorithms are evaluated on 10 different datasets from multiple application domains. By publishing the source code and the datasets, this paper aims to be a new well-founded basis for unsupervised anomaly detection research. Additionally, this evaluation reveals the strengths and weaknesses of the different approaches for the first time. Besides the anomaly detection performance, the computational effort, the impact of parameter settings, as well as the global/local anomaly detection behavior are outlined. As a conclusion, we give advice on algorithm selection for typical real-world tasks.

PDF download | BibTeX | Citation | Publisher

Abstract:
Outlier removal from training data is a classical problem in pattern recognition. Nowadays, this problem becomes more important for large-scale datasets for the following two reasons: First, we have a higher risk of “unexpected” outliers, such as mislabeled training data. Second, a large-scale dataset makes it more difficult to grasp the distribution of outliers. On the other hand, many unsupervised anomaly detection methods have been proposed which can also be used for outlier removal. In this paper, we present a comparative study of nine different anomaly detection methods in the scenario of outlier removal from a large-scale dataset. For accurate performance observation, we need to use a simple and describable recognition procedure and thus utilize a nearest-neighbor-based classifier. As an adequate large-scale dataset, we prepared a handwritten digit dataset comprising more than 800,000 manually labeled samples. With a data dimensionality of 16×16 = 256, it is ensured that each digit class has at least 100 times more instances than the data dimensionality. The experimental results show that the common understanding that outlier removal improves classification performance on small datasets does not hold for high-dimensional large-scale datasets. Additionally, it was found that local anomaly detection algorithms perform better on this data than their global equivalents.

PDF download | BibTeX | Citation

Abstract:
The detection of anomalous behavior in log and sensor data is an often requested task in many data mining applications. If no labels are available in the dataset, as in many real-world setups, unsupervised anomaly detection is the method of choice. Since these algorithms are generally not directly applicable to the data, an appropriate transformation has to be performed first. This paper describes how such "data views" can be generated with respect to the detection goal. It is also shown how contexts and associated events are correctly taken into account when creating the data view. Furthermore, a comparative evaluation of 11 different unsupervised anomaly detection algorithms on standardized datasets reveals useful strategies for selecting an appropriate algorithm. Finally, a real-world example of anomaly detection in power consumption data proves the usefulness of the presented methodology.

PDF download | BibTeX | Citation

Abstract:
Anomaly detection is the task of finding instances in a dataset which are different from the norm. Today, anomaly detection is a core part of many data mining applications, for example in network intrusion detection, fraud detection, data leakage prevention, the identification of failures in complex systems, and diagnosis in the medical domain. In all of these applications, the amount of stored data has increased dramatically in the last decade, resulting in a strong demand for algorithms suitable for these crucial large-scale challenges. The main goal of this thesis is to bridge this gap and to provide efficient anomaly detection algorithms which are significantly faster than existing methods. A broad spectrum of algorithms is proposed, covering the trade-off between efficiency and accuracy for both unsupervised and semi-supervised anomaly detection. All presented methods reduce the overall computational complexity such that it is now possible to process large-scale datasets within an appropriate runtime. Far beyond what is achievable with state-of-the-art methods today, the technology presented opens new fields of application and research.

More Information and Download | BibTeX | Citation | Publisher

Abstract:
This chapter gives an overview of a large range of anomaly detection methods and introduces the RapidMiner Anomaly Detection Extension. Anomaly detection is the process of finding patterns in a given dataset which deviate from the characteristics of the majority. These outstanding patterns are also known as anomalies, outliers, intrusions, exceptions, misuses, or fraud. Anomaly detection identifies single records in datasets which significantly deviate from the normal data. Application domains include, among others, network security, intrusion detection, computer virus detection, fraud detection, misuse detection, complex system supervision, and finding suspicious records in medical data. Anomaly detection for fraud detection is used to detect fraudulent credit card transactions caused by stolen credit cards, fraud in Internet payments, and suspicious transactions in financial accounting data. In the medical domain, anomaly detection is also used, for example, for detecting tumors in medical images or monitoring patient data (electrocardiogram) to get early warnings in case of life-threatening situations. Furthermore, a variety of other specific applications exists, such as anomaly detection in surveillance camera data, fault detection in complex systems, or detecting forgeries in document forensics. Despite the differences between the various application domains, the basic principle remains the same. Multivariate normal data needs to be modeled and the few deviations need to be detected, preferably with a score indicating their "outlierness", i.e., a score indicating their extent of being an outlier. In the case of univariate data, such an outlier factor could, for example, be the number of standard deviations by which an outlier differs from the mean of this variable.
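
The univariate outlier factor mentioned at the end of the abstract can be written down directly; the toy values below are purely illustrative.

```python
# Univariate outlier factor as described above: the number of standard
# deviations by which a value differs from the mean of the variable.
import numpy as np

values = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 14.7])
z_scores = np.abs(values - values.mean()) / values.std()
print(z_scores.round(2))   # the last value stands out with the largest score
```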

More Information | BibTeX | Citation | Publisher

Abstract:
Support Vector Machines (SVMs) have been one of the most successful machine learning techniques for the past decade. For anomaly detection, a semi-supervised variant, the one-class SVM, also exists. Here, only normal data is required for training before anomalies can be detected. In theory, the one-class SVM could also be used in an unsupervised anomaly detection setup, where no prior training is conducted. Unfortunately, it turns out that a one-class SVM is sensitive to outliers in the data. In this work, we apply two modifications in order to make one-class SVMs more suitable for unsupervised anomaly detection: robust one-class SVMs and eta one-class SVMs. The key idea of both modifications is that outliers should contribute less to the decision boundary than normal instances. Experiments performed on datasets from the UCI machine learning repository show that our modifications are very promising: compared with other standard unsupervised anomaly detection algorithms, the enhanced one-class SVMs are superior on two out of four datasets. In particular, the proposed eta one-class SVM has shown the most promising results.
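
The robust and eta one-class SVM variants proposed in the paper are not part of standard libraries; as a hedged baseline sketch only, the snippet below applies the ordinary one-class SVM directly to unlabeled data, i.e. the unsupervised setup the abstract refers to. Kernel and parameter choices are illustrative.

```python
# Baseline sketch only: the standard one-class SVM applied directly to
# unlabeled data (the unsupervised setup discussed above). The robust and
# eta variants proposed in the paper are not implemented in scikit-learn.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(300, 2)),      # mostly normal instances
               rng.uniform(-6, 6, size=(15, 2))])    # a few outliers mixed in

ocsvm = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(X)
scores = -ocsvm.decision_function(X)   # higher score = more anomalous
print(np.argsort(scores)[-5:])         # indices of the most anomalous points
```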

PDF download | BibTeX | Citation | Publisher

Abstract:
Automatically identifying that a certain page in a set of documents is printed with a different printer than the rest of the documents can give an important clue for a possible forgery attempt. Different printers vary in their produced printing quality, which is especially noticeable at the edges of printed characters. In this paper, a system using the difference in edge roughness to distinguish laser printed pages from inkjet printed pages is presented. Several feature extraction methods have been developed and evaluated for that purpose. In contrast to previous work, this system uses unsupervised anomaly detection to detect documents printed by a different printing technique than the majority of the documents among a set. This approach has the advantage that no prior training using genuine documents has to be done. Furthermore, we created a dataset featuring 1200 document images from different domains (invoices, contracts, scientific papers) printed by 7 different inkjet and 13 laser printers. Results show that the presented feature extraction method achieves the best outlier rank score in comparison to state-of-the-art features.

PDF download | BibTeX | Citation | Publisher

Abstract:
Security Information and Event Management (SIEM) systems are today a key component of complex enterprise networks. They usually aggregate and correlate events from different machines and perform a rule-based analysis to detect threats. In this paper we present an enhancement of such systems which makes use of unsupervised anomaly detection algorithms without the need for any prior training of the system. For data acquisition, events are exported from an existing SIEM appliance, parsed, unified, and preprocessed to fit the requirements of unsupervised anomaly detection algorithms. Six different algorithms are evaluated qualitatively, and finally a global k-NN approach is selected for practical deployment. The new system was able to detect misconfigurations and gave the security operations center team more insight into processes in the network.
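
A common formulation of the global k-NN score mentioned above rates each record by the mean distance to its k nearest neighbors. The sketch below shows only that generic score; the SIEM-specific parsing and preprocessing from the paper are not reproduced.

```python
# Sketch of a global k-NN anomaly score: each record is scored by the mean
# distance to its k nearest neighbors; higher score = more anomalous.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_score(X, k=10):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)          # first column is the point itself
    return dist[:, 1:].mean(axis=1)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(500, 3)), [[8, 8, 8]]])
print(knn_score(X).argmax())            # index 500, the injected anomaly
```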

PDF download | BibTeX | Citation | Publisher

Abstract:
Unsupervised anomaly detection techniques are becoming more and more important in a variety of application domains such as network intrusion detection, fraud detection, and misuse detection. Today, unsupervised anomaly detection techniques mainly have quadratic complexity, making it almost impossible to apply them to very large data sets. In this paper, an Expectation-Maximization algorithm called FastLOF is proposed which computes the Local Outlier Factor (LOF) incrementally and up to 80% faster than the standard method. Another advantage of FastLOF is that intermediate results can already be used by a system during computation. Evaluation on real-world data sets reveals that FastLOF performs comparably to the best outlier detection algorithms while being significantly faster.
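
FastLOF itself is not available in common libraries; purely as a reference point, the standard LOF score that FastLOF accelerates can be computed as sketched below. Parameter choices are illustrative.

```python
# Reference sketch of the standard LOF that FastLOF accelerates; FastLOF's
# incremental neighbor refinement is not part of scikit-learn.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(1000, 2)), [[6, 6]]])

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
scores = -lof.negative_outlier_factor_   # ~1 for inliers, larger for outliers
print(scores.argmax(), scores.max().round(2))
```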

PDF download | BibTeX | Citation | Publisher

Abstract:
Unsupervised anomaly detection is the process of finding outliers in data sets without prior training. In this paper, a histogram-based outlier detection (HBOS) algorithm is presented, which scores records in linear time. It assumes independence of the features, making it much faster than multivariate approaches at the cost of less precision. A comparative evaluation on three UCI data sets against 10 standard algorithms shows that it can detect global outliers as reliably as state-of-the-art algorithms, but it performs poorly on local outlier problems. In our experiments, HBOS is up to 5 times faster than clustering-based algorithms and up to 7 times faster than nearest-neighbor-based methods.
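
The following is a simplified sketch of the histogram-based scoring idea, not necessarily the exact HBOS implementation (bin count and handling of dynamic bin widths are illustrative assumptions): one histogram per feature, and a record's score is the sum of the negative log bin heights of the bins its values fall into.

```python
# Simplified sketch of histogram-based outlier scoring: one histogram per
# feature; rare value combinations accumulate a high score.
import numpy as np

def hbos_score(X, bins=10, eps=1e-12):
    scores = np.zeros(len(X))
    for j in range(X.shape[1]):
        hist, edges = np.histogram(X[:, j], bins=bins, density=True)
        idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, bins - 1)
        scores += -np.log(hist[idx] + eps)
    return scores

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(1000, 3)), [[5, 5, 5]]])
print(hbos_score(X).argmax())   # the appended record gets the highest score
```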

PDF download | BibTeX | Citation

Abstract:
Unsupervised anomaly detection is the process of finding outlying records in a given dataset without the need for prior training. In this paper we introduce an anomaly detection extension for RapidMiner in order to assist non-experts with applying eight different nearest-neighbor and clustering-based algorithms to their data. A focus on efficient implementation and smart parallelization guarantees its practical applicability. In the context of clustering-based anomaly detection, two new algorithms are introduced: First, a global variant of the cluster-based local outlier factor (CBLOF) is introduced which tries to compensate for the shortcomings of the original method. Second, the local density cluster-based outlier factor (LDCOF) is introduced, which takes the local variances of clusters into account. The performance of all algorithms has been evaluated on real-world datasets from the UCI machine learning repository. The results reveal the strengths and weaknesses of the individual algorithms and show that our proposed clustering-based algorithms outperform CBLOF significantly.
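
As a rough, hedged sketch of the clustering-based scoring idea behind LDCOF (the exact large/small cluster split and normalization follow the paper only approximately): cluster the data, treat the biggest clusters as "normal", and score each point by its distance to the nearest large-cluster centroid, normalized by that cluster's average member-to-centroid distance. All names and parameters below are illustrative.

```python
# Rough LDCOF-like sketch: distance to the nearest large-cluster centroid,
# normalized by that cluster's average member-to-centroid distance.
import numpy as np
from sklearn.cluster import KMeans

def ldcof_like_score(X, n_clusters=8, large_frac=0.9):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    labels, centers = km.labels_, km.cluster_centers_
    sizes = np.bincount(labels, minlength=n_clusters)
    # "Large" clusters: the biggest clusters covering large_frac of the data.
    order = np.argsort(sizes)[::-1]
    large = order[np.cumsum(sizes[order]) <= large_frac * len(X)]
    if len(large) == 0:
        large = order[:1]
    # Average member-to-centroid distance per large cluster.
    avg = {c: np.linalg.norm(X[labels == c] - centers[c], axis=1).mean()
           for c in large}
    d = np.array([[np.linalg.norm(x - centers[c]) / max(avg[c], 1e-12)
                   for c in large] for x in X])
    return d.min(axis=1)

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, size=(500, 2)), [[7, 7]]])
print(ldcof_like_score(X).argmax())   # the appended point scores highest
```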

PDF download | BibTeX | Citation

Abstract:
Choosing a suitable classifier for a given dataset is an important part of developing a pattern recognition system. Since a large variety of classification algorithms has been proposed in the literature, non-experts do not know which method they should use in order to obtain good classification results on their data. Meta-learning tries to address this problem by recommending promising classifiers based on meta-features computed from a given dataset. In this paper, we empirically evaluate five different categories of state-of-the-art meta-features for their suitability in predicting classification accuracies of several widely used classifiers (including Support Vector Machines, Neural Networks, Random Forests, Decision Trees, and Logistic Regression). Based on the evaluation results, we have developed the first open-source meta-learning system that is capable of accurately predicting accuracies of target classifiers. The user provides a dataset as input and obtains an automatically created, high-performance, ready-to-use pattern recognition system in a few simple steps. A user study of the system with non-experts showed that the users were able to develop more accurate pattern recognition systems in significantly less development time when using our system compared to using state-of-the-art data mining software.

PDF download | BibTeX | Citation | Publisher

Abstract:
In machine learning, picking the optimal classifier for a given problem is a challenging task. A recent research field called meta-learning automates this procedure by using a meta-classifier in order to predict the best classifier for a given dataset. Using regression techniques, even a ranking of preferred learning algorithms can be determined. However, all of these methods rely on a prior extraction of meta-features from datasets. Landmarking is a recent method of computing meta-features, which uses the accuracies of some simple classifiers as characteristics of a dataset. As a first meta-learning step in RapidMiner, a new operator called landmarking has been developed. Evaluations based on 90 datasets, mainly from the UCI repository, show that the landmarking features from the proposed operator are useful for predicting classifiers' accuracies based on regression.
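
The landmarking idea can be sketched in a few lines: cross-validated accuracies of a few very simple classifiers serve as meta-features of a dataset. The concrete set of landmarkers in the RapidMiner operator may differ from the ones chosen here for illustration.

```python
# Sketch of landmarking meta-features: accuracies of simple classifiers
# characterize a dataset; one such vector is computed per dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def landmarking_features(X, y):
    landmarkers = {
        "naive_bayes": GaussianNB(),
        "1nn": KNeighborsClassifier(n_neighbors=1),
        "decision_stump": DecisionTreeClassifier(max_depth=1),
    }
    return {name: cross_val_score(clf, X, y, cv=5).mean()
            for name, clf in landmarkers.items()}

X, y = load_iris(return_X_y=True)
print(landmarking_features(X, y))
```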

PDF download | BibTeX | Citation

Abstract:
Source IP addresses are often used as a major feature for user modeling in computer networks. Particularly in the field of Distributed Denial of Service (DDoS) attack detection and mitigation, traffic models make extensive use of source IP addresses for detecting anomalies. Typically, the real IP address distribution is strongly undersampled due to a small number of observations. Density estimation overcomes this shortcoming by taking advantage of IP neighborhood relations. In many cases, simple models are implicitly used or chosen intuitively as a network-based heuristic. In this paper, we first review and formalize existing models, including a hierarchical clustering approach. In addition, we present a modified k-means clustering algorithm for source IP density estimation as well as a statistically motivated smoothing approach using the Nadaraya-Watson kernel-weighted average. For performance evaluation, we apply all methods to a 90-day real-world dataset consisting of 1.3 million different source IP addresses and try to predict the users of the following 10 days. ROC curves and an example DDoS mitigation scenario show that there is no uniformly better approach: k-means performs best when a high detection rate is needed, whereas statistical smoothing works better for low false alarm rate requirements such as the DDoS mitigation scenario.
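
As a minimal one-dimensional sketch of the Nadaraya-Watson kernel-weighted average mentioned above, observed per-address request counts can be smoothed over the integer IP axis so that unseen addresses near frequent ones still receive probability mass. Kernel choice, bandwidth, and the toy /24 setup are illustrative assumptions, not the paper's model.

```python
# Minimal 1-D Nadaraya-Watson smoothing sketch over the integer IP axis.
import numpy as np

def nadaraya_watson(x_obs, y_obs, x_query, bandwidth=2.0):
    # y_hat(x) = sum_i K(x - x_i) y_i / sum_i K(x - x_i), Gaussian kernel K.
    diff = (x_query[:, None] - x_obs[None, :]) / bandwidth
    K = np.exp(-0.5 * diff ** 2)
    return (K @ y_obs) / (K.sum(axis=1) + 1e-12)

# Toy "last octet" example: counts observed for a few addresses in a /24.
x_obs = np.array([10, 11, 12, 200], dtype=float)      # observed hosts
y_obs = np.array([50, 30, 20, 1], dtype=float)        # request counts
x_query = np.arange(256, dtype=float)                  # all hosts in the /24
smoothed = nadaraya_watson(x_obs, y_obs, x_query)
print(smoothed[[10, 13, 100]].round(3))  # high near 10, small at 13, ~0 at 100
```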

PDF download | Online Demo | BibTeX | Citation | Publisher

Abstract:
Distributed Denial of Service (DDoS) attack mitigation systems usually generate a list of filter rules in order to block malicious traffic. In contrast to this binary decision, we suggest using traffic shaping, where the bandwidth limit is defined by the probability of a source being a legitimate user. As a proof of concept, we implemented a simple high-performance Linux kernel module, nf-HiShape, which is able to shape thousands of source IP addresses at different bandwidth limits even under high packet rates. Our shaping algorithm is comparable to Random Early Detection (RED) applied to every single source IP range. The evaluation shows that our kernel module can handle up to 50,000 IP ranges at nearly constant throughput, whereas Linux tc already decreases throughput at about 200 ranges.

PDF download | BibTeX | Citation | Publisher

Abstract:
In this paper a modified decision tree algorithm for anomaly detection is presented. During the tree-building process, densities for the outlier class are used directly in the split point determination algorithm. No artificial counter-examples have to be sampled from the unknown class, which yields more precise decision boundaries and a deterministic classification result. Furthermore, the prior of the outlier class can be used to adjust the sensitivity of the anomaly detector. The proposed method combines the advantages of classification trees with the benefit of a more accurate representation of the outliers. For evaluation, we compare our approach with other state-of-the-art anomaly detection algorithms on four standard data sets, including the KDD-Cup 99. The results show that the proposed method performs as well as more complex approaches and is even superior on three out of four data sets.

PDF download | BibTeX | Citation | Publisher

Abstract:
Distributed Denial of Service (DDoS) attacks are today the most destabilizing factor in the global internet and there is a strong need for sophisticated solutions. We introduce a formal statistical framework and derive a Bayes optimal packet classifier from it. Our proposed practical algorithm "Adaptive History-Based IP Filtering" (AHIF) mitigates DDoS attacks near the victim and outperforms existing methods by at least 32% in terms of collateral damage. Furthermore, it adjusts to the strength of an ongoing attack and ensures availability of the attacked server. In contrast to other adaptive solutions, firewall rulesets used to resist an attack can be precalculated before an attack takes place. This ensures an immediate response in a DDoS emergency. For evaluation, simulated DDoS attacks and two real-world user traffic datasets are used.

PDF download | BibTeX | Citation | Publisher

Patents

Abstract:
The invention describes a method and system of protecting computer systems from attacks over a network to which the computer system is connected, the method comprising the steps of (a) establishing, during attack-free operation of the computer system, a database in the form of a source-IP-histogram storing all requests received from all senders at the computer system; (b) calculating and storing a smoothed source-IP-histogram from the source-IP-histogram obtained in step a); (c) applying a probability threshold on the smoothed source-IP-histogram to differentiate between acceptable senders and senders to be rejected; (d) monitoring requests to the computer system; (e) accepting a new sender if its assumed probability value derived from the smoothed source-IP-histogram exceeds the threshold.
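
A toy sketch of steps (a)-(e) as described above: build a per-source request histogram during attack-free operation, smooth it over neighboring addresses, and accept a new sender only if its smoothed probability exceeds a threshold. Smoothing window, threshold value, and the tiny address space are illustrative assumptions, not taken from the patent.

```python
# Toy sketch of the patented steps (a)-(e); all parameters are illustrative.
import numpy as np

def build_histogram(requests, n_addresses=256):
    hist = np.bincount(requests, minlength=n_addresses).astype(float)
    return hist / hist.sum()                      # step (a): source-IP-histogram

def smooth(hist, window=3):
    kernel = np.ones(window) / window
    return np.convolve(hist, kernel, mode="same") # step (b): smoothed histogram

rng = np.random.default_rng(6)
training_requests = rng.integers(10, 20, size=10_000)   # attack-free traffic
smoothed = smooth(build_histogram(training_requests))

threshold = 1e-4                                  # step (c): probability threshold
for sender in (12, 25, 150):                      # steps (d)/(e): new requests
    accepted = smoothed[sender] > threshold
    print(sender, "accepted" if accepted else "rejected")
```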


Countries: AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MT NL NO PL PT RO SE SI SK TR
Status: Granted

PDF download | BibTeX | Citation | EP Register

Abstract:
The invention describes a method and system of protecting computer systems from attacks over a network to which the computer system is connected, the method comprising the steps of (a) monitoring current requests to the computer system; (b) measuring one or more network features on the basis of the current requests; (c) providing a graphical user interface visualizing the measured one or more network features for at least one geographical area of origin of requests; (d) receiving a user input selecting at least one geographical area; (e) accessing on the basis of the selected at least one geographical area a database associating for the at least one geographical area of origin each country with corresponding IP addresses; and (f) automatically generating firewall rules for the computer system on the basis of the IP addresses retrieved from the database for the at least one selected geographical area.

Countries: AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MT NL NO PL PT RO SE SI SK TR
Status: Granted

PDF download | BibTeX | Citation | EP Register

Abstract:
The invention describes a method and system of protecting computer systems from attacks over a network to which the computer system is connected, the method comprising the steps of (a) establishing, during attack-free operation of the computer system, a regular request-time distribution for countries of origin of requests to the computer system; (b) monitoring current requests to the computer system; (c) determining the current amount of requests for the at least one country of origin; (d) during a detected attack comparing the current amount of requests for the at least one country of origin with the regular request-time distribution for at least one country of origin; and (e) restricting the number of requests from this country served by the computer system if it is determined as a result of step d) that the current amount of requests deviates from the expected amount according to the regular request-time distribution.

Countries: AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MT NL NO PL PT RO SE SI SK TR
Status: Granted

PDF download | BibTeX | Citation | EP Register

Abstract:
In a medical system architecture and method for interactive transmission and progressive representation of compressed image data of multi-component images, image data are obtained from at least one modality, with computer workstations being associated with the respective modalities to process the examination images. A device is provided to transfer the examination images; a device is provided for storage of data and examination images; and further user workstations are provided for post-processing of the examination images. The image data are compressed, organized, and stored in packets with specific parameters such that access to individual packets is possible; and the packetized data are decompressed packet-by-packet based on a request from a user workstation, such that multi-component images are generated with progressive parameters.

Countries: DE US CN JP
Status: Granted

PDF download (US) | PDF (DE) | PDF (CN) | PDF (JP) | BibTeX | Citation | US Patent Office