Securing machine learning models against adversarial attacks
Rey Reza Wiyatno
January 9 17 min

What are adversarial defences?

In a previous article, we covered adversarial examples in modern machine learning, why they are important and how to generate them. Here, we present adversarial defence methods used to counter these attacks.

Adversarial defences are techniques used to protect models against adversarial attacks. The arms race between adversarial attacks and defences is ongoing and intensifying. Many adversarial defence methods have been proposed in the last few years, but none of them guarantees security against all adversarial examples. Robust adversarial defence remains a critical area of research and is crucial to trustworthy machine learning solutions.

A Quick Glossary

Let’s take a look at several terms that are often used in adversarial defence:

  • Adversarial perturbation: the difference between a non-adversarial example and its adversarial counterpart.
  • Adversarial robustness: the property of resisting misclassification of adversarial examples.
  • Adversarial detection: a set of methods to detect adversarial examples.
  • Adversarial training (Goodfellow et al., 2014): a defence technique in which a model is trained on adversarial examples. Note that adversarial training as a defence is not the same as the adversarial training used in Generative Adversarial Networks (Goodfellow et al., 2014).
  • Gradient masking (Papernot et al., 2016a; Papernot et al., 2016b; Tramèr et al., 2017): the practice of altering a model to hide its original gradients from an attacker. In other words, the defence methods mask the gradients of the model’s output with respect to its inputs.
  • Obfuscated gradients (Athalye et al., 2018): a form of gradient masking which encompasses shattered, stochastic, vanishing, and exploding gradients.
  • Shattered gradients: when the gradients of a model are hard to compute due to non-differentiable operations.
  • Stochastic gradients: when the gradients of a model are hard to compute due to some stochastic operations.
  • Vanishing gradients: when the gradients of a model are small or close to zero.
  • Exploding gradients: when the gradients of a model are extremely large or close to infinity.

As in the previous article, we will discuss several adversarial defence methods, how they have evolved over time, and how some of them can be circumvented.

Defence strategies

Figure: Ontology of adversarial defences discussed in this article. Note that this only covers a subset of the defence methods that exist today.

Increasing adversarial robustness through data augmentation

This is a family of defence methods built on data augmentation. The basic idea is to include adversarial examples in the training set. The defender uses these examples to enforce invariance in the model’s output between an original example and its adversarial counterpart. This is related to the classical forms of data augmentation, in which data is transformed in ways that reflect the invariances we wish the model to exhibit (e.g., for images, we might rotate and/or randomly crop our data, and still expect the model to identify the object correctly).

Adversarial Training

Adversarial training (Goodfellow et al., 2014) is a defence method used to increase adversarial robustness by retraining a model on adversarial examples. In adversarial training, adversarial examples are generated at each iteration based on the current state of the model, and are then used to retrain the model.

Figure: Illustration of adversarial training. In every training iteration, adversarial examples are generated based on the current state of the model and used alongside the original examples to retrain the model.

One can also adjust the training objective to place more weight on the correct classification of adversarial examples. Formally, the training objective for adversarial training can be defined as:

$$\tilde{L}(x, y) = \alpha \, L(x, y) + (1 - \alpha) \, L(x', y)$$

Training objective for adversarial training. Here, L(x,y) denotes the loss function used (e.g., cross-entropy), (x,y) denotes an input-label pair of clean examples, x’ denotes the adversarial counterpart of x, and α denotes the weighting factor for the two loss terms. In practice, α is often set to 0.5.
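To make the weighting concrete, here is a minimal sketch of the objective on a toy logistic-regression classifier. The model, the helper names (`loss`, `fgsm`, `adversarial_training_loss`), and the values of `eps`, `w`, and `b` are illustrative assumptions, not taken from any of the cited papers. The adversarial counterpart x’ is produced with a one-step sign-gradient (FGSM-style) perturbation computed from the current model parameters.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy of a logistic-regression model on one example."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def fgsm(w, b, x, y, eps):
    """One-step sign-gradient adversary; for this model dL/dx = (p - y) * w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

def adversarial_training_loss(w, b, x, y, eps=0.1, alpha=0.5):
    """alpha * L(x, y) + (1 - alpha) * L(x', y), with x' built from the current model."""
    x_adv = fgsm(w, b, x, y, eps)
    return alpha * loss(w, b, x, y) + (1.0 - alpha) * loss(w, b, x_adv, y)

# Toy parameters and a single clean example (assumed values).
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1
clean = loss(w, b, x, y)
mixed = adversarial_training_loss(w, b, x, y)
```

In a real training loop, `mixed` would be the quantity minimized at every iteration, so x’ is regenerated each time the parameters change.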

However, Tramèr et al. (2017) showed that naively performing adversarial training on a limited set of adversarial examples can lead to false robustness due to gradient masking. They showed that adversarial training on FGSM (Goodfellow et al., 2014) adversaries alone can leave the model with local loss gradients that point away from the direction in which the loss is actually maximized. As a result, the defended model is robust only against certain types of attacks, and may still be circumvented by other attack methods.

Figure: A specific example of gradient masking adapted from Tramèr et al. (2017). The local gradient at the starting point (0,0) is larger in the direction of ϵ-1, which may deceive the attacker: for larger ϵ values, the loss is actually higher in the direction of ϵ-2.

From a robust optimization perspective, Madry et al. (2017) proposed a stronger adversarial training strategy that does not rely on gradient masking: train the model only on the strongest available adversaries (i.e., worst-case adversaries). They argued that adversarial examples generated using BIM (Kurakin et al., 2017), which they called the Projected Gradient Descent (PGD) method, are the worst-case adversarial examples among attacks that use only first-order (gradient) information. In practice, they train the model on adversarial examples generated by randomly initialized PGD (i.e., R+BIM instead of R+FGSM (Tramèr et al., 2017)).
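The randomly initialized PGD procedure can be sketched as follows. This is a simplified illustration on a toy logistic-regression model with an L-infinity constraint; the model, the function names, and the values of `eps`, `step`, and `n_steps` are assumptions chosen for the example, not the settings used by Madry et al. (2017).

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(w, b, x, y):
    """Gradient of the binary cross-entropy loss with respect to the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def sign(g):
    return (g > 0) - (g < 0)

def pgd_attack(w, b, x, y, eps, step, n_steps, rng):
    # Random start: a uniformly random point inside the L-infinity eps-ball.
    x_adv = [xi + rng.uniform(-eps, eps) for xi in x]
    for _ in range(n_steps):
        grad = input_gradient(w, b, x_adv, y)
        # Ascend the loss along the sign of the gradient.
        x_adv = [xa + step * sign(g) for xa, g in zip(x_adv, grad)]
        # Project back onto the eps-ball around the clean input.
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv

x_adv = pgd_attack([2.0, -1.0], 0.0, [0.5, 0.2], 1,
                   eps=0.1, step=0.05, n_steps=10, rng=random.Random(0))
```

A real attack on images would additionally clip `x_adv` to the valid input range (for example [0, 255]); that step is omitted here to keep the sketch short.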

Athalye et al. (2018) showed how they successfully circumvented 7 out of 9 defences accepted to the Sixth International Conference on Learning Representations (ICLR) 2018, which included the work from Madry et al. (2017) on PGD adversarial training. Interestingly, they concluded that adversarial training on PGD adversaries (Madry et al., 2017) was the only defence evaluated that lived up to its claims.

Adversarial Logit Pairing

Adversarial Logit Pairing or ALP (Kannan et al., 2018) is a method which enforces invariance between logits of normal and adversarial inputs. This is done by introducing a penalty term on the training objective. Given a mini-batch of original examples x and their adversarial counterparts x’ generated by the randomly initialized PGD attack proposed in Madry et al. (2017), ALP modifies the training objective to be:

Adversarial logit Pairing

where Z(x) denotes the logits given an input x, and λ denotes a constant that controls the penalty term.

The first term of the training objective is the same adversarial training objective defined in the previous section, where L(x,y) denotes the loss function used (e.g., cross-entropy), and α denotes the weighting factor for benign and adversarial examples. The second term is the logit regularization term, which encourages logit invariance across adversarial and non-adversarial inputs. Note that Kannan et al. (2018) used the L2 norm to measure the distance between the paired logits; however, this can be replaced by other distance metrics.
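The two terms described above can be sketched as a small function. This is only one plausible reading of the description: averaging over the mini-batch, using the plain (non-squared) L2 distance, and the default values of `alpha` and `lam` are assumptions, and the function name `alp_loss` is hypothetical.

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two logit vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def alp_loss(clean_losses, adv_losses, clean_logits, adv_logits,
             alpha=0.5, lam=0.5):
    """Mini-batch mean of alpha*L(x,y) + (1-alpha)*L(x',y) + lam*||Z(x) - Z(x')||."""
    total = 0.0
    for lc, la, zc, za in zip(clean_losses, adv_losses, clean_logits, adv_logits):
        adversarial_training_term = alpha * lc + (1.0 - alpha) * la
        pairing_term = lam * l2_distance(zc, za)
        total += adversarial_training_term + pairing_term
    return total / len(clean_losses)

# One example whose clean and adversarial logits are 5 units apart.
value = alp_loss([0.2], [1.0], [[3.0, 0.0]], [[0.0, 4.0]], alpha=0.5, lam=0.1)
```

The per-example losses and logits are passed in as plain lists so that the sketch stays independent of any particular model or framework.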

Kannan et al. (2018) showed that combining adversarial training on PGD adversaries with ALP resulted in a state-of-the-art defence against both whitebox and blackbox attacks on ImageNet models. However, Engstrom et al. (2018) found that ALP can still be attacked if one performs enough PGD steps.

Removing Adversarial Perturbations Through Input Transformation

This family of defences attempts to remove adversarial perturbations by transforming an input before it is passed to the model. This can be achieved either with a generative transformation model or with image transformation techniques such as JPEG compression, total variance minimization, or image quilting (Guo et al., 2017).


PixelDefend

PixelDefend (Song et al., 2018) is a defence strategy that performs input “purification” using a generative transformation model. First, a generative model, a PixelCNN (Oord et al., 2016) in this case, is trained on a non-adversarial dataset in order to approximate the distribution of the dataset. Once the PixelCNN is trained, an input can be passed through it in the hope that it will remove the perturbations if the input is adversarial. The PixelDefend algorithm is presented below.

Figure: Modified PixelDefend algorithm for 8-bit RGB image input (adapted from the paper).

Note that one needs to set the value of ϵ-defend that controls the maximum change in pixel value during the purification process. If ϵ-defend is too large, the purified input may not look like the original input anymore, which may cause accidental misclassification. If ϵ-defend is too small, the purified input may still be adversarial. In an ideal situation, if we know the attacker can only generate adversarial examples within the bound of ϵ-attack, then ϵ-defend should be set to be equal to ϵ-attack.
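The role of ϵ-defend can be illustrated with a simplified purification loop. In this sketch the image is a flat list of 8-bit values, and `toy_distribution` is a hypothetical stand-in for a trained PixelCNN (it simply favours values close to the previous pixel); both are assumptions made so the example runs on its own. For each pixel, the loop considers only values within ϵ-defend of the original and keeps the one the generative model finds most likely.

```python
def toy_distribution(pixels, idx):
    """Stand-in for a trained PixelCNN: favours values near the previous pixel."""
    target = pixels[idx - 1] if idx > 0 else 128
    scores = [1.0 / (1.0 + abs(v - target)) for v in range(256)]
    total = sum(scores)
    return [s / total for s in scores]

def purify(image, pixel_distribution, eps_defend):
    """Replace each pixel with the most likely value within eps_defend of it."""
    purified = list(image)
    for idx, original in enumerate(image):
        # Allowed range: within eps_defend of the original and inside [0, 255].
        lo = max(original - eps_defend, 0)
        hi = min(original + eps_defend, 255)
        # Condition the model on the pixels that have already been purified.
        probs = pixel_distribution(purified, idx)
        purified[idx] = max(range(lo, hi + 1), key=lambda v: probs[v])
    return purified

purified = purify([128, 130, 120, 200], toy_distribution, eps_defend=4)
```

With a larger `eps_defend` the output drifts further from the input, which mirrors the trade-off described above.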

PixelDefend was shown to be effective against various attacks such as the FGSM (Goodfellow et al., 2014), BIM (Kurakin et al., 2017), DeepFool (Moosavi-Dezfooli et al., 2015), and C&W (Carlini & Wagner, 2016). Unfortunately, PixelDefend was later found to be vulnerable to Backward Pass Differentiable Approximation (BPDA) (Athalye et al., 2018) and Simultaneous Perturbation Stochastic Approximation (SPSA) (Uesato et al., 2018).

Adversarial Robustness by Regularizing the Gradient of the Model

Many attacks rely heavily on the gradients of the output with respect to the input, so one defence is to train the model to have small and noisy gradients. This strategy makes it harder for attacks like the FGSM (Goodfellow et al., 2014) to create adversarial examples.

Figure: Illustration of gradient regularization. This strategy may cause the gradients to be weak and noisy, and therefore harder for an attacker to exploit.

Deep Contractive Network

Deep Contractive Network or DCN (Gu & Rigazio, 2015) is a model whose training objective includes a regularization term for the gradients of the model.

Deep Contractive Network

Training objective of DCN. Here, N denotes the number of training examples, L(x,y) denotes the classification loss (e.g., cross-entropy), H denotes the number of hidden layers, λ denotes a constant that controls the weight of the regularization term, and h denotes the hidden activation units.

In this objective, the regularization term penalizes the norm of the gradients of each hidden layer with respect to the previous hidden layer. This term is known as the layer-wise contractive penalty and is adopted from Contractive Autoencoders (Rifai et al., 2011). Intuitively, it makes the network’s output less sensitive to small changes in its input.
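A layer-wise contractive penalty can be sketched for a stack of sigmoid layers, where the Jacobian of each layer has a closed form. The choice of sigmoid activations, the squared Frobenius norm (as used in contractive autoencoders), and the function names are assumptions for illustration; the exact norm and per-layer weighting in DCN may differ.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, b, x):
    """h = sigmoid(W x + b) for a weight matrix W given as a list of rows."""
    return [sigmoid(sum(wji * xi for wji, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

def contractive_penalty(W, h):
    """Squared Frobenius norm of dh/dx, using dh_j/dx_i = h_j * (1 - h_j) * W_ji."""
    return sum((hj * (1.0 - hj)) ** 2 * sum(w * w for w in row)
               for row, hj in zip(W, h))

def layerwise_penalty(layers, x):
    """Sum the penalty of every layer with respect to the layer before it."""
    total, h = 0.0, x
    for W, b in layers:
        h_next = layer_forward(W, b, h)
        total += contractive_penalty(W, h_next)
        h = h_next
    return total

penalty = layerwise_penalty([([[2.0]], [0.0])], [0.0])
```

During training, a weighted multiple of this penalty (the λ in the caption) would be added to the classification loss of each example.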

Gu & Rigazio (2015) tested this strategy against the L-BFGS attack (Szegedy et al., 2014). They showed that while the attack still found adversarial examples that fool DCN, the average perturbation was larger than when attacking an undefended model. This is perhaps one of the earliest defence methods proposed after Szegedy et al. (2014) published their findings on adversarial examples. DCN seems to rely on gradient masking, so it needs to be re-evaluated against newer attacks, especially those that can circumvent gradient masking.

Defensive Distillation

Defensive distillation (Papernot et al., 2015) is a defence method that uses a technique known as knowledge distillation (Hinton et al., 2015). This method works by training two networks: a teacher network and a student network. First, we train the teacher network on a dataset by introducing a temperature constant T in the softmax function:

$$\text{softmax}(Z(x), T)_i = \frac{\exp(Z(x)_i / T)}{\sum_{j=1}^{C} \exp(Z(x)_j / T)}$$

Softmax function with temperature constant T. Here, C denotes the number of classes and Z(x) denotes the logits vector.
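As a small illustration of the effect of T, the sketch below divides the logits by the temperature before applying the softmax. The function name and the example logits are arbitrary choices, not values from the paper.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits divided by the temperature T."""
    scaled = [z / T for z in logits]
    # Subtract the maximum for numerical stability; the result is unchanged.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
p_standard = softmax_with_temperature(logits, T=1.0)   # peaked distribution
p_soft = softmax_with_temperature(logits, T=20.0)      # much closer to uniform
```

A higher temperature flattens the distribution, which is why the teacher's predictions carry more information about class similarities than hard labels do.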

After the teacher network is trained, the predictions of the teacher network on the training set are then used as labels to train the student network. The softmax function also includes the temperature constant during training of the student network. The temperature is then set back to 1 during test time, once training is done.

Papernot et al. (2015) showed that defensive distillation causes the gradients of the output with respect to the inputs to be inversely proportional to T. That is, as T increases, these gradients become smaller. As a result, when T is high this defence is hard to attack with methods that rely heavily on gradients, such as the FGSM (Goodfellow et al., 2014) and JSMA (Papernot et al., 2018). Unfortunately, this defence has been successfully circumvented by the C&W attack (Carlini & Wagner, 2016).

Figure: Illustration of defensive distillation.

Introducing Randomness to Mitigate Against Adversarial Examples

Instead of regularizing the gradients, randomness can be introduced during both training and inference as a defence strategy. Because of the stochasticity, it becomes hard for an attacker to compute the exact gradients needed to craft adversarial examples.

Figure: Illustration of randomization strategy as a defence. Randomization can be done at the input level (e.g., quilting (Guo et al., 2017)) and at the model level (e.g., SAP (Dhillon et al., 2018)).

Stochastic Activation Pruning

Stochastic Activation Pruning or SAP (Dhillon et al., 2018) introduces randomness by pruning (i.e., setting to 0) some of the activations in a neural network at test time. This strategy resembles dropout (Srivastava et al., 2014). The difference is that in SAP the probability that a unit survives is proportional to the magnitude of its activation. In other words, the larger the magnitude of an activation, the less likely it is for that unit to be pruned. The probability p that node j at layer i is sampled (i.e., survives a single draw) is given as:

$$p_j^i = \frac{|(h^i)_j|}{\sum_{k=1}^{a^i} |(h^i)_k|}$$

where $(h^i)_j$ denotes the activation of node j at the i-th layer, and $a^i$ denotes the number of activation units at the i-th layer.

Additionally, SAP rescales the magnitude of the survivors by the inverse of the probability. That is:

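$$(h^i)_j \leftarrow \frac{(h^i)_j}{1 - (1 - p_j^i)^{r^i}}$$

where $(h^i)_j$ is the activation of a surviving node j at the i-th layer, $p_j^i$ is its sampling probability, and $r^i$ denotes the number of samples to be drawn at the i-th layer. The denominator is the probability that the node is drawn at least once. This is done so that the model behaves similarly to the unpruned model in order to retain classification accuracy on clean examples. As a result, SAP can be applied to any pre-trained model without fine-tuning.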
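The sampling and rescaling steps for a single layer can be sketched as follows. The function name `sap_layer` and the use of Python's `random.Random.choices` for sampling with replacement are choices made for this illustration; it follows the description above rather than reproducing the authors' code.

```python
import random

def sap_layer(activations, num_samples, rng):
    """Apply stochastic activation pruning to one layer's activations."""
    total = sum(abs(a) for a in activations)
    if total == 0.0:
        return list(activations)  # nothing to sample from
    # Sampling probability is proportional to the activation magnitude.
    probs = [abs(a) / total for a in activations]
    # Draw num_samples node indices with replacement.
    drawn = set(rng.choices(range(len(activations)), weights=probs, k=num_samples))
    pruned = []
    for j, (a, p) in enumerate(zip(activations, probs)):
        if j in drawn:
            # Rescale by the inverse probability of being drawn at least once.
            keep_prob = 1.0 - (1.0 - p) ** num_samples
            pruned.append(a / keep_prob)
        else:
            pruned.append(0.0)
    return pruned

out = sap_layer([0.0, 1.0, 3.0], num_samples=2, rng=random.Random(1))
```

Because the surviving activations are divided by their keep probability, the expected value of each pruned activation equals the original activation.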

SAP Algorithm

SAP was shown to be robust against FGSM (Goodfellow et al., 2014) and BIM (Kurakin et al., 2017). However, SAP has been shown to rely on stochastic gradients, which is a form of obfuscated gradients. An attacker who is aware of the defence can circumvent SAP by computing the expectation of the stochastic gradients over multiple forward passes (Athalye et al., 2018).

Detecting Adversaries Through Classification

Given a set of clean and adversarial examples, this detection method trains a classifier whose objective is to distinguish adversarial inputs from non-adversarial ones.

Figure: Illustration of detecting adversarial examples through classification. A classifier that differentiates between clean and adversarial examples can be trained and then used during test time to filter out suspicious inputs.

Adversary Detection Network

Metzen et al. (2017) proposed a detection method in which a detector subnetwork takes the output of a neural network classifier at a certain layer as its input. The detector is trained to differentiate between benign and adversarial inputs, which involves generating adversarial examples to include in its training set. To account for future attacks, Metzen et al. (2017) suggested training the detector following the adversarial training (Goodfellow et al., 2014) procedure on BIM (Kurakin et al., 2017) adversaries.

Modified BIM formulation

Modified BIM formulation, where J denotes the loss function of the model, n denotes the iteration number, and α is a constant that controls the magnitude of the perturbations (Metzen et al., 2017). The Clip{} function ensures that the generated adversarial example stays within both the ϵ ball (i.e., [x-ϵ, x+ϵ]) and the input space (i.e., [0, 255] for pixel values). Here, σ controls whether the attacker cares more about the classification loss or the detection loss.
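One iteration of such an update can be sketched as below. This is a reading of the caption rather than a verified transcription of the original formula: the sketch assumes the step combines the sign of the classification-loss gradient (weighted by 1 - σ) with the sign of the detection-loss gradient (weighted by σ). The gradients are passed in as plain lists, and the names `clip` and `dynamic_adversary_step` are hypothetical.

```python
def sign(g):
    return (g > 0) - (g < 0)

def clip(x_adv, x, eps, lo=0.0, hi=255.0):
    """Keep x_adv inside both the eps-ball around x and the valid pixel range."""
    return [min(max(xa, xi - eps, lo), xi + eps, hi) for xa, xi in zip(x_adv, x)]

def dynamic_adversary_step(x_adv, x, grad_cls, grad_det, alpha, sigma, eps):
    """One step that trades off fooling the classifier and fooling the detector."""
    stepped = [
        xa + alpha * ((1.0 - sigma) * sign(gc) + sigma * sign(gd))
        for xa, gc, gd in zip(x_adv, grad_cls, grad_det)
    ]
    return clip(stepped, x, eps)

x = [100.0, 254.0]
next_x = dynamic_adversary_step(
    x_adv=x, x=x,
    grad_cls=[1.0, 1.0],   # assumed gradient of the classification loss
    grad_det=[-1.0, 1.0],  # assumed gradient of the detection loss
    alpha=2.0, sigma=0.5, eps=1.0,
)
```

With `sigma=0.5` the two gradients cancel on the first pixel and agree on the second, where the clipping then limits the change to the input range.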

Metzen et al. (2017) found that placing the detector network at different layer depths gives different results for different types of attacks. Finally, they showed that a classifier that includes a detector network is harder to fool, since an attacker needs to find adversaries that fool both the classifier and the detector at the same time. However, Carlini & Wagner (2017) found that the detection network is difficult to train, has a relatively high false positive rate against C&W attacks (Carlini & Wagner, 2016), and can be circumvented with the substitute blackbox attack (Papernot et al., 2016).

Detecting Adversaries Through Distribution Matching

This family of detection techniques relies on estimating the data distribution in order to detect whether a data point lies outside of it.

Kernel Density Estimates

Feinman et al. (2017) proposed the Kernel Density Estimates (KDE) method in order to tell if a data point is located far away from a class manifold. The KDE is calculated by

$$KDE(x) = \frac{1}{|X_t|} \sum_{s \in X_t} \exp\left(-\frac{\lVert Z(x) - Z(s) \rVert^2}{\sigma^2}\right)$$

where Xt denotes a set of data from class t, |Xt| denotes the number of examples in Xt, Z(x) denotes the logits of the model given x as an input, and σ denotes the Gaussian bandwidth, since the KDE is estimated using the Gaussian kernel. An adversarial example x’ should give low KDE(x’) and thus is detectable. However, Feinman et al. (2017) suggested that this method does not work well when x’ is near the target class manifold.
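A minimal sketch of this density estimate is shown below. It assumes an unnormalized Gaussian kernel of the form exp(-||Z(x) - Z(s)||² / σ²) averaged over the examples of class t; the function name and the toy logits are illustrative.

```python
import math

def kde(logits_x, class_logits, sigma):
    """Average Gaussian-kernel similarity between Z(x) and the logits of class t."""
    total = 0.0
    for z in class_logits:
        sq_dist = sum((a - b) ** 2 for a, b in zip(logits_x, z))
        total += math.exp(-sq_dist / sigma ** 2)
    return total / len(class_logits)

# Logits of two known, non-adversarial examples of the same class (toy values).
class_t = [[0.0, 0.0], [1.0, 0.0]]
near = kde([0.0, 0.0], class_t, sigma=1.0)    # close to the class examples
far = kde([10.0, 10.0], class_t, sigma=1.0)   # far from the class examples
```

An input would be flagged as adversarial when its density estimate falls below a threshold chosen on held-out data; the threshold itself is not specified here.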

Figure: Illustration of Kernel Density Estimates (KDE). One can measure the density estimates by calculating the KDE between a given input and several non-adversarial data points that we know belong to a particular class.

Carlini & Wagner (2017) found that this method performs poorly on CIFAR10 and can be circumvented by iterative optimization-based attacks. That is, an attacker who is aware of the detection method can fold the KDE measurement into the optimization objective in order to generate adversarial examples that fool both the detection and classification models.

Detecting Adversaries by Measuring the Uncertainty Estimates

This family of detection methods detects whether an input is adversarial or not by measuring the uncertainty of the model. The hypothesis is that the model should have high uncertainty when dealing with adversarial inputs.

Bayesian Uncertainty Estimates

Feinman et al. (2017) suggested measuring the uncertainty of the model through a Bayesian process. This can be done by using dropout (Srivastava et al., 2014) not only at training time but also at test time, since this has been shown to approximate a deep Gaussian process (Gal & Ghahramani, 2015). The Bayesian uncertainty estimate can then be measured by performing several forward passes, since every forward pass may produce a different output due to the stochasticity of the model. Formally, given a classifier f, the uncertainty can be computed by

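$$U(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x)^{\top} f_i(x) - \left(\frac{1}{N} \sum_{i=1}^{N} f_i(x)\right)^{\top} \left(\frac{1}{N} \sum_{i=1}^{N} f_i(x)\right)$$

where N denotes the number of forward passes, and $f_i(x)$ denotes the model’s softmax probability on the i-th forward pass. The input is considered adversarial if the uncertainty value is above a certain threshold.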
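The uncertainty computation can be sketched as follows, given the softmax outputs of N stochastic forward passes. The sketch assumes the estimate is the mean squared norm of the outputs minus the squared norm of their mean; the function names and the threshold value are illustrative, and the dropout forward passes themselves are not modelled.

```python
def bayesian_uncertainty(outputs):
    """Uncertainty from N softmax vectors produced by stochastic forward passes."""
    n = len(outputs)
    num_classes = len(outputs[0])
    # Mean of the squared norms of the individual outputs.
    mean_sq_norm = sum(sum(p * p for p in out) for out in outputs) / n
    # Squared norm of the mean output.
    mean = [sum(out[c] for out in outputs) / n for c in range(num_classes)]
    return mean_sq_norm - sum(m * m for m in mean)

def is_adversarial(outputs, threshold):
    """Flag the input when the uncertainty exceeds a chosen threshold."""
    return bayesian_uncertainty(outputs) > threshold

consistent = bayesian_uncertainty([[1.0, 0.0], [1.0, 0.0]])   # passes agree
conflicting = bayesian_uncertainty([[1.0, 0.0], [0.0, 1.0]])  # passes disagree
```

When every pass returns the same prediction the estimate is zero, and it grows as the passes disagree more.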

Figure: Illustration of detecting adversarial examples using Bayesian uncertainty estimates. Using dropout during test time, one can perform multiple forward passes and estimate the uncertainty of the model on a particular input. If the uncertainty is above a certain threshold, the input is considered adversarial.

Feinman et al. (2017) showed that the distributions of the uncertainty are different between adversarial and benign inputs, and thus can be used as an indicator to detect adversarial inputs. Although this method was found not to be completely immune to adversarial examples, Carlini & Wagner (2017), who evaluated it along with 9 other detection methods, found it to be the hardest one to attack.


In summary, with the different defence methods covered here:

  • We can generate adversarial examples and include them as part of the training set (adversarial training, ALP).
  • Adversarial examples can be “purified” by performing various input transformations (PixelDefend).
  • Many attacks rely on calculating gradients of the model with respect to the inputs. We can train our model by regularizing the gradients in order to make it harder for an attacker to exploit them (DCN, defensive distillation).
  • We can introduce randomness during inference such that the exact gradients are hard to compute (SAP).
  • Adversarial inputs can be detected through classification (adversary detection network), distribution matching (KDE), and uncertainty thresholding (Bayesian uncertainty estimates). Feinman et al. (2017) further proposed performing detection with a logistic regression classifier that takes the density estimates and uncertainty as inputs, which was shown to perform better.

Beware: many defence methods can lead to gradient masking, whether intentional or not. Gradient masking does not guarantee adversarial robustness, and has been shown to be circumventable (Tramèr et al., 2017; Athalye et al., 2018).

We hope this article provides helpful insights on how to defend against adversarial examples. Please feel free to provide suggestions in the comment section if we’re missing something. Thanks to Anqi Xu, Archy de Berker, Bahador Khaleghi, Peter Henderson, Rachel Samson, and Wei-Wei Lin for valuable comments and illustrations.