Modern recipes for anomaly detection 
Jean-Christophe Testud
March 26


Experimental corner: Our Element AI researchers are always working on putting cutting-edge AI science to work. We highlight their cool experiments, novel applications, and fun outputs in this occasional series. Today, Jean-Christophe Testud looks at anomaly detection with a guest appearance from Nicolas Cage.

Humans are always on the lookout for something weird or out of the ordinary. Our brain is in a constant state of anomaly detection. In Daniel Kahneman's theory, explained in his book Thinking, Fast and Slow, it is our instincts, what he calls System 1, that provide anomaly detection, helping us identify things that violate our internal models of the world. Then System 2, our slower, more analytical mode of thinking, takes over to figure out what's actually going on.

The same kinds of thinking apply to data, where anomaly detection is a big challenge. Data science often deals with large, unlabelled datasets. Much of the world’s data is collected in such sets, reams of information with inconsistent structures that are challenging to analyze. These datasets are nonetheless useful, and dealing with unlabelled datasets is a major focus in artificial intelligence known as unsupervised learning.

In order to detect anomalies in a large, unlabelled dataset, we first assume that most of the data is normal. For example, with credit card transactions, we can assume that the majority are proper, and flag those that fall outside those broad patterns, such as large purchases in different geographic areas. Anomaly detection can answer the question: “Is this something I know well or something different?” Depending on the use case, these anomalies are either discarded or investigated.

In the machine learning sense, anomaly detection is learning or defining what is normal, and using that model of normality to find interesting deviations/anomalies. And ironically, the field itself has no normal when it comes to talking about that which is common in the data versus uncommon outliers. Terms include normal vs. abnormal, usual vs. unusual, baseline vs. deviation, known vs. novel [1], inlier vs. outlier, and the list goes on.

Anomaly Detection Challenges

It is classification without labels

Anomaly detection models aim to produce a classifier that is able to tell whether a data point is normal or abnormal despite being trained entirely on the normal class. Choosing whether something is normal or abnormal is a two-class classification problem typically solved by supervised learning with a large and balanced mix of labelled points from the two classes. That doesn’t work when you have few or no positive samples (anomalies), and a lot of negative samples (normal). In those cases, anomaly detection is necessary.

The assumption is that there is not enough information about the second class, the anomalies, to learn anything useful. Instead, it's possible to learn features from the normal class alone that can then be used to identify everything that does not look like that normal class. If your normal class is dogs, you can identify four legs and fur as common features. In that case, when your dog-identifying model sees an iceberg it will know it's an anomaly. Other techniques exist to deal with a limited number of labelled samples, from fields such as semi-supervised learning, few-shot learning, and transfer learning. These can help get the most out of the existing labels in a dataset.

It needs a threshold

In general, training an anomaly detection algorithm produces a function that can assign an anomaly score to any observation. Most data points will get low scores, and anomalies will hopefully stand out with higher ones. Anomaly detection needs a score threshold to make a final decision: it separates the usual from the anomalies, and it determines how high the score must be for a data point to be considered an anomaly. Let's take a manufacturing example and say we want an algorithm to visually inspect some parts being produced. The threshold translates here to: how imperfect does a part have to be in order to be discarded?

With a few positive samples (anomalies), it is possible to optimize such thresholds. This process can also take into account the cost of false positives and false negatives.
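As a rough illustration, here is a minimal sketch of how a handful of labelled anomalies could be used to pick such a threshold. The scores, labels, and costs below are invented for the example.

```python
import numpy as np

# Hypothetical anomaly scores from some detector, with a handful of labels
# (1 = anomaly, 0 = normal) available for a small validation set
scores = np.array([0.02, 0.05, 0.07, 0.11, 0.65, 0.90])
labels = np.array([0, 0, 0, 0, 1, 1])

# Assumed business costs of each kind of mistake (purely illustrative)
cost_fp, cost_fn = 1.0, 10.0

best_threshold, best_cost = None, np.inf
for t in np.unique(scores):
    flagged = scores >= t
    fp = np.sum(flagged & (labels == 0))   # normal points wrongly flagged
    fn = np.sum(~flagged & (labels == 1))  # anomalies we missed
    cost = cost_fp * fp + cost_fn * fn
    if cost < best_cost:
        best_threshold, best_cost = t, cost

print(best_threshold)   # 0.65 for these made-up numbers
```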

It will produce false positives

One of the biggest issues with anomaly detection is that it often produces a large number of false positives. This false positive problem comes primarily from the fact that the model is not optimized for the detection of specific samples. An anomaly detection algorithm will, by design, detect any unusual data, including benign examples.

Unusual and relevant are not always correlated

Relevance is subjective. If, in a use case and problem setting, unusual data isn’t relevant for the end-user, the results won’t be useful. To fix that problem, there are several avenues:

  • Modifying the task itself, by reformulating the problem, changing the question, or building another set of features.
  • Integrating human feedback: label the detections as true or false positives. This feedback can be used to start training a more typical supervised classifier, which makes it no longer pure anomaly detection.
  • Or just forgetting about anomaly detection. If what you are looking for in your data is very specific, and creating a balanced and representative labelled dataset is an option, you should probably take the time to create that dataset and go the supervised route.

Existing approaches to anomaly detection

The following image is a good input for anomaly detection: one of these things is not like the other.


We are going to use this image and similar data in the following examples.

Fitting a Gaussian

If linear regression is the simplest supervised learning model, then the anomaly detection equivalent is Gaussian fitting. Let's say we have a dataset where each observation is a person. Each person is described by only one variable: their height, in centimetres. In the dataset we have some anomalies, such as two dogs inserted by mistake.

Gaussian fitting starts with a strong assumption about the distribution of your data: that it follows the normal, or Gaussian, distribution. A Gaussian distribution appears as a bell curve and is completely described by two parameters, its center (or mean) and its scale (standard deviation or variance). Fitting a Gaussian is just finding the values of these two parameters that best fit the distribution of our samples. If our dataset is big enough and representative, we will probably end up with something like 175 cm as the mean and 8 cm as the standard deviation.

Fitted Gaussian

Once this is done, we can compute the probability of appearance of any height value (including ones not present in our training data). By setting a threshold on this probability, like 0.1%, we can consider everything below it to be an anomaly. This will likely detect our two dogs, who are well outside the core of the distribution.
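To make this concrete, here is a minimal sketch of the whole procedure, with simulated heights and the density under the fitted Gaussian used as a stand-in for the probability of appearance:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Mostly human heights in cm, plus two dogs that slipped into the dataset
heights = np.concatenate([rng.normal(175, 8, size=1000), [35.0, 52.0]])

# Fitting a Gaussian is simply estimating its mean and standard deviation
mu, sigma = heights.mean(), heights.std()

# Score every observation by its density under the fitted Gaussian
densities = norm.pdf(heights, loc=mu, scale=sigma)

# Everything below a small threshold is flagged as an anomaly
threshold = 0.001
print(heights[densities < threshold])   # the two dogs (and perhaps a rare extreme height)
```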

A lot of basic anomaly detection techniques are variants of this one (with the same Gaussian assumption: multivariate Gaussians, Z-score, Grubbs, MAD).

Here are three more recent approaches:

  • Local Outlier Factor (LOF): Each data point is assigned a score (local outlier factor) based on the ratio between the local density of its nearest neighbors and its own local density. Points with a low density compared to their neighbors are considered anomalies.
  • Isolation Forest: Build decision trees to see how many random splits on random features it takes to isolate each data point. Easy-to-isolate points (lone/low-density ones) are considered anomalies. Tightly packed ones are inliers.
  • One-class SVM: Find a hyperplane (a plane in 3+D space) with the bulk of the data on one side and few or no outliers on the other side. The plane acts as a decision boundary for new data points.

For more on these algorithms, you can read this sklearn page.
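To give a feel for how these are used in practice, here is a hedged sketch with scikit-learn on synthetic 2-D data; in a real setting you would tune parameters such as contamination, n_neighbors, and nu:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, size=(200, 2)),   # a normal cluster
                    [[6.0, 6.0], [-7.0, 5.0]]])        # two obvious outliers

# Each model labels points with +1 (inlier) or -1 (outlier);
# we print the labels of the two outliers appended last
print(IsolationForest(random_state=0).fit_predict(X)[-2:])
print(LocalOutlierFactor(n_neighbors=20).fit_predict(X)[-2:])
print(OneClassSVM(nu=0.05, gamma="scale").fit(X).predict(X)[-2:])
```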

Where Are My Neural Networks?

While these approaches are popular and useful, neural networks have some key advantages when dealing with complexity, such as the ability to work on high-dimensional data, at scale, with flexible architectures.

One recipe for anomaly detection with neural networks (or actually any trained mapping between inputs and outputs) is to:

  1. Find a task (anything really, be creative)
  2. Train a model to do the task on normal data
  3. Monitor how well it does on new unseen data
  4. Consider poor performance on new data as a sign of anomaly

Most of the ideas and tasks developed for unsupervised representation learning can be used to detect anomalies.

The Art of Task Engineering

This 4-step anomaly detection recipe is very generic. The first step, defining the task, is critical. It will determine what kind of model architecture you will be able to use. And, as we mentioned, it could also make the output (the anomalies) more relevant to your use case. Here are two families of tasks to kick-start your imagination:

  • Learn to reproduce your input (autoencoders): this is universal and can be applied to anything; a high reconstruction error is then your clue for detecting anomalies (a minimal sketch follows after this list).

Note: the old-school version of this is “PCA reconstruction error” (monitoring the error in a lossy compression/decompression pipeline)

  • Learn to predict some part of your data (self-supervised learning): this is also universal, but choosing what to predict is decisive (this is where domain expertise steps in). Again, the prediction or misclassification error is your proxy for detecting anomalies.

Note: time-series forecasting can be a suitable self-supervised task for anomaly detection
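To make the first family concrete, here is a minimal PyTorch sketch of the autoencoder version of the recipe; the correlated toy data is a stand-in for whatever "normal" looks like in your problem:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "normal" data: 500 samples of 20 correlated features (purely illustrative)
latent = torch.randn(500, 4)
X_train = latent @ torch.randn(4, 20) + 0.1 * torch.randn(500, 20)

# Steps 1-2 of the recipe: the task is to reproduce the input,
# and we train the model on normal data only
model = nn.Sequential(nn.Linear(20, 4), nn.ReLU(), nn.Linear(4, 20))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(X_train), X_train)
    loss.backward()
    optimizer.step()

# Steps 3-4: the per-sample reconstruction error is the anomaly score
def anomaly_score(x):
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=1)

x_new = 5 * torch.randn(1, 20)   # something unlike the training data
print(anomaly_score(X_train).mean().item(), anomaly_score(x_new).item())
```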

If you choose a good task, one that is hard enough, the model will learn high-level concepts about the training data. It will find interesting correlations, interactions, regularities, and become confident in some predictions. These things will likely break for new observations that are very different.

As a bonus, some tasks are interpretable by design. If the variable you learned to predict is something tangible, looking side-by-side at the prediction and reality is already a basic form of explanation.

The Recipe in Practice

Let's follow the recipe and perform anomaly detection on a face dataset. Specifically, we are going to use CelebA-HQ, a dataset containing 30,000 celebrity photos. To learn interesting concepts about these faces, let's think about a self-supervised task (step 1).

There are dozens of tasks we could choose from, here are some examples:

Edges to Faces, Black & White to Color, Inpainting randomly-placed patches, and Right from Left

For each one of these tasks, we can generate input/output pairs automatically without human intervention — the definition of self-supervised learning.

Let's take the fourth example, predicting the right part of a face from the left part. This task is interesting because we already know that faces are more or less symmetrical. I trained a pix2pixHD model for 8 epochs; here are some results on new, unseen celebrities:

Anomaly detection in celebrity faces

And now on dogs or similarly-irrelevant objects:

Anomaly detection in images of dogs and similarly irrelevant objects

Let’s see if we can exploit that failure to build an anomaly detection system.

Let's have a look at the error distribution on unseen celebrity faces and on BigGAN-generated Golden Retrievers:

Left to right prediction error

0.1 looks like a good threshold to complete our human face anomaly detection system.
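The final decision rule is then just a comparison against that threshold. In the sketch below, the error metric (mean absolute pixel error on images scaled to [0, 1]) and the random arrays standing in for the pix2pixHD prediction and the true right half are assumptions made for illustration:

```python
import numpy as np

THRESHOLD = 0.1   # read off the two error histograms above

def prediction_error(predicted_right, true_right):
    # Mean absolute pixel error, assuming values scaled to [0, 1]
    return float(np.mean(np.abs(predicted_right - true_right)))

def is_anomaly(predicted_right, true_right):
    return prediction_error(predicted_right, true_right) > THRESHOLD

# Hypothetical usage: arrays standing in for the model output and ground truth
pred, truth = np.random.rand(256, 256, 3), np.random.rand(256, 256, 3)
print(prediction_error(pred, truth), is_anomaly(pred, truth))
```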

On the blue histogram, you can see some rare but really big prediction errors on faces. Here are three of them: one unexpected hand and two hard-to-guess backgrounds. Technically, these are unusual in a statistical sense.

Depending on what is expected, these could be considered false positives. But it is clear why: the true right part is plausible but unusual. A typical neural network will predict the most likely output, not the special case.


Vanilla GAN

GANs, the final frontier

A vanilla GAN's job is to learn the probability distribution of the training data. This means that if you show a GAN enough faces, it can in theory learn what makes a face and will ultimately be able to generate any face that could exist.

GANs can be and are used for anomaly detection.

Usually, a new unseen data point is considered an anomaly if it is hard to sample something similar using the generator. Returning to our celebrity example: while sampling from the face generator, we should be able to get much closer to a new unseen face than to a new unseen dog.

You have a task (getting close to the target image) and a measure of performance (how far the sampled image is from the target image); one threshold later, you've got yourself a GAN-based anomaly detection algorithm.
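Here is a minimal sketch of that idea in PyTorch, in the spirit of AnoGAN: G is a placeholder for a pretrained generator mapping latent vectors to images, and we simply optimize a latent vector to get as close as possible to the target, keeping the residual distance as the anomaly score:

```python
import torch

def gan_anomaly_score(G, target, latent_dim=128, steps=500, lr=0.05):
    """Search the generator's latent space for the sample closest to `target`;
    the remaining reconstruction error is the anomaly score."""
    z = torch.zeros(1, latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = torch.nn.functional.l1_loss(G(z), target)
        loss.backward()
        optimizer.step()
    return loss.item()

# Hypothetical usage: with a face generator G, a new face should get a low
# score and a Golden Retriever a high one
# score = gan_anomaly_score(G, face_image)
```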

As was the case with the previously described tasks, in order to detect change we leverage robust, high-level concepts a neural network has learned. GANs know a lot about your data (in theory, everything there is to know). As a consequence, they are ideal candidates for a lot of things.

If you like GANs and are interested in solving all of AI’s problems with them, have a look at this NeurIPS ’18 paper on robust MNIST classification with GANs.

Notes

[1] On the distinction between outlier and novelty detection: not everybody agrees on the terminology, but generally:

  • Outlier detection means finding anomalies in your training data (sometimes because you want them out of the dataset, or because you want to do a one-shot study of what is weird in a given dataset)
  • Novelty detection means finding anomalies in new data not seen at training time. Some people even consider that every training data point should be normal/uncontaminated. The algorithm would then circle the entire training data and give the same prediction (inlier) for each training data point.

Note: To make it even less clear, some people use the term “anomaly detection” when actually performing direct supervised learning (directly predicting the labelled anomalies).

*The photos in this article are derivatives of "Progressive Growing of GANs for Improved Quality, Stability, and Variation – Official TensorFlow implementation of the ICLR 2018 paper" by Tero Karras (NVIDIA), Timo Aila (NVIDIA), Samuli Laine (NVIDIA), and Jaakko Lehtinen (NVIDIA and Aalto University), licensed under CC BY-NC 4.0.