Abstract.
Anomaly detection is the task of finding instances in a dataset which are
different from the norm. Today, anomaly detection is a core part of many data
mining applications, for example in network intrusion detection, fraud
detection, data leakage prevention, the identification of failures in complex
systems, and diagnosis in the medical domain. In all of these applications, the
amount of stored data has increased dramatically in the last decade, resulting
in a strong demand for algorithms suitable for these crucial large-scale
challenges. The main goal of this thesis is to bridge this gap and to provide
efficient anomaly detection algorithms which are significantly faster than
existing methods. A broad spectrum of algorithms is proposed, covering the
trade-off between efficiency and accuracy for both unsupervised and
semi-supervised anomaly detection. All presented methods reduce the overall
computational complexity such that it is now possible to process large-scale
datasets within an appropriate runtime.
In an analysis task, where a dataset with no labeling information should be
processed, unsupervised anomaly detection algorithms are the methods of choice.
In this thesis, a comprehensive evaluation of state-of-the-art unsupervised
anomaly detection algorithms is performed, highlighting their strengths and
weaknesses for both local and global anomaly detection problems. Based on an
in-depth analysis of existing methods, novel and faster algorithms are proposed.
In particular, two clustering-based approaches for the detection of local
outliers as well as for estimating the clusters' densities with a higher
precision are introduced. A new histogram-based algorithm, which assumes
independence of features, results in a speed-up of up to 2,200 times compared
to multivariate methods for detecting global anomalies. Additionally, a
proposed expectation-maximization-based approach for computing a local outlier
score in a multivariate setting now saves up to 94% of distance computations.
Besides these theoretical contributions, two new application scenarios for
unsupervised anomaly detection were investigated. First, a document
authentication method is presented, which identifies suspicious documents
printed with a different printing technique than the norm. The advantage of
using unsupervised anomaly detection for forgery detection is that no prior
training with verified genuine documents is required. A second novel application
domain evolved in the context of security event log analysis. Here, anomalies
and suspicious behavior in an enterprise computer network are detected by
analyzing multiple log data sources. For evaluation, data from a large
enterprise network was used and security flaws such as misconfiguration as well
as incorrect classified accounts were detected.
The second part of this thesis focuses on semi-supervised anomaly detection in
the context of mitigating Distributed Denial of Service (DDoS) attacks. In a
DDoS attack, a hacker uses many, often compromised machines, in order to send
an overwhelming amount of requests to a single service making it unavailable
for regular users. In this work, the focus is put on highly distributed and
spoofed attacks, which are very difficult to cope with using state-of-the-art
mitigation technology. Therefore, the concept of predicting the source IP
addresses for normal users is investigated in detail following a semi-supervised
anomaly detection approach. During normal operation, the source IP distribution
is learned and during an attack, sources which are unlikely to appear are
blocked. Two practical approaches using this concept are introduced, which also
take hardware limitations into account. It is shown that one approach developed
outperforms existing methods by 32\% in terms of collateral damage (erroneously
denying legal users). Additionally, a practical defense system was developed,
which uses the proposed method as a first line of defense and further proceeds
with a detailed attack analysis for determining more specific filter rules. In
this detailed defense, each IP flow is scored according to its legitimate
probability based on eight different features. Finally, as a more accurate
scoring alternative in the future, a new multivariate algorithm is proposed: A
decision tree learner was enhanced such that it serves as a semi-supervised
anomaly detection algorithm. Its evaluation shows that it outperforms existing
methods on three out of four standard datasets.
Altogether, the contributions of this thesis make it possible for the first
time to apply anomaly detection algorithms on large-scale datasets. Far beyond
what is achievable with state-of-the-art methods today, the technology
presented opens new fields of application and research. It constitutes an
important step for the application of anomaly detection algorithms on big data
challenges in the near future.
More Information
The book can be purchased from the publisher's website or Amazon.
Download Request
If you require my Thesis for research purposes, you can request a PDF download. To do so, please fill the form below with your name and e-mail address (preferably from a university/research organization) and verify it. You will receive a password protected download link within 48 hours.