In machine learning, people train neural-network-based classifiers to recognize categories of objects in images, e.g., cats, dogs, and cars. The convolutional neural network is one such classifier, widely used in image classification tasks because of its high accuracy. However, researchers have found that convolutional neural networks are vulnerable to adversarial attacks, where small perturbations to the input image can lead to significant changes in the model's prediction. In response, researchers have proposed defense mechanisms to enhance the robustness of neural networks, giving rise to the field of machine learning robustness. In this article, we provide a visual tour of empirical neural network robustness, showing how neural networks learn, how they are attacked, and how they can be defended, through interactive visualizations.
Convolutional Neural Networks (CNNs) are currently among the most accurate classifiers for image classification tasks.
To obtain a CNN classifier, one needs to define (1) a network architecture \(f(w,x)\) parameterized by \(w\),
and (2) a loss function \(L(w, x, y)\), where \(x\) denotes an input image and \(y\) denotes its class label.
After that, a gradient descent optimization algorithm [1] iteratively updates \(w\) to minimize the loss over the training data.
In this article, we use the classic ResNet-18 architecture [2] trained on the CIFAR-10 dataset [3].
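As a minimal sketch of this setup (assuming PyTorch and a torchvision ResNet-18 with 10 output classes; the optimizer settings are illustrative rather than the exact ones used for the figures), training boils down to repeating the following epoch loop:

```python
import torch
import torch.nn as nn
import torchvision

# Illustrative setup: ResNet-18 with 10 output classes for CIFAR-10.
model = torchvision.models.resnet18(num_classes=10)
loss_fn = nn.CrossEntropyLoss()                                  # L(w, x, y)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

def train_one_epoch(model, loader):
    """One pass of gradient descent over the training set."""
    model.train()
    for x, y in loader:                  # mini-batches of images x and labels y
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)      # forward pass and loss
        loss.backward()                  # gradients of the loss w.r.t. w
        optimizer.step()                 # gradient descent update of w
```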
In the two visualizations below, we first show how the loss value and the test accuracy evolve, and how the output probabilities from the CNN model change
as training proceeds. Hover over the line chart on the left and move the mouse, and the racing bar chart on the right will display
the changes in the predicted probabilities for the selected image at different training epochs.
The default selected cat image is correctly classified by the neural network in most of the late epochs. Feel free to select other images to see how the neural network learns (there are definitely misclassified images!).
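For readers who want to reproduce the data behind this view, a minimal sketch (reusing the training sketch above; `train_loader` is a hypothetical CIFAR-10 training DataLoader and `x_test` a single test image tensor) records the softmax probabilities for one image after every epoch:

```python
import torch.nn.functional as F

probs_per_epoch = []                         # one probability vector per epoch
for epoch in range(100):                     # illustrative number of epochs
    train_one_epoch(model, train_loader)     # train_loader: CIFAR-10 training DataLoader
    model.eval()
    with torch.no_grad():
        logits = model(x_test.unsqueeze(0))  # x_test: a single 3x32x32 image tensor
        probs_per_epoch.append(F.softmax(logits, dim=1).squeeze(0))
```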
Despite the success of the neural network we have just observed, some behaviors indicate potential instability. For example, in epoch 96, the model's prediction for the default cat image is frog; also, although this deer image is correctly classified, the model's prediction for it jumps between cat, dog, bird, and deer in the last 10 epochs. A question then arises: if a small change in the neural network's parameters can lead to a significant change in its prediction, can a small change in the input image also lead to a significant change in the prediction? The answer is YES, and the following sections will walk you through it.
Consider the situation where an input image is altered by some small pixel-level noise that is imperceptible to the human eye.
In general, humans will not be fooled; but unfortunately, this is not true for neural networks.
In fact, perturbations crafted in specific ways can mislead a neural network almost every time.
One of the most popular attack methods is the Projected Gradient Descent (PGD) attack [6], which iteratively perturbs the input image along the gradient of the loss while keeping the perturbed image inside an \(\epsilon\)-radius \(L_p\) ball around the original image.
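As a refresher, the PGD attack starts from the clean image \(x^{(0)} = x\) and repeatedly takes a gradient ascent step on the loss, followed by a projection back onto the allowed perturbation set. In the common \(L_\infty\) case from [6] (which is also the case used in this article), the update reads

\[
x^{(t+1)} = \Pi_{\|x' - x\|_\infty \le \epsilon}\Big(x^{(t)} + \alpha \cdot \mathrm{sign}\big(\nabla_x L(w, x^{(t)}, y)\big)\Big),
\]

where \(\alpha\) is the attack step size (a hyperparameter alongside \(\epsilon\) and the number of steps), and the projection \(\Pi\) simply clips each pixel of the perturbed image to lie within \(\pm\epsilon\) of the corresponding pixel of \(x\); for other choices of \(p\), the gradient step and the projection are adapted accordingly.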
Readers who are not familiar with the notation \(\|\cdot\|_p\) (i.e., the vector norm) can refer to vector p-norm.
Note that constraining the perturbation to the \(\epsilon\)-radius \(L_p\) ball above is what enforces the condition that "the noise is imperceptible to the human eye".
To illustrate the PGD attack, we implement a PGD attacker with hyperparameters \(\{p=+\infty\), \(\epsilon=8/255\), \(ns=5\}\) (where \(ns\) is the number of attack steps)
and display in the visualizations below what the adversarial images look like at different steps of the PGD attack and how the neural network's predictions are manipulated.
Click on the adversarial images to see the polluted output probabilities from the neural network.
It is clear that the attacker can easily mislead the neural network to predict a wrong class label within 3 attack steps, while
the adversarial images look nearly identical to their original counterparts.
Feel free to click on some other images to explore; note that the perturbations differ from image to image, as they are image-dependent.
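For the curious, here is a minimal PyTorch sketch of such an \(L_\infty\) PGD attacker (the function name and the step size \(\alpha = 2/255\) are illustrative choices of ours, not necessarily the exact values behind the figures); it expects a batch of images \(x\) with pixel values in \([0, 1]\) and their true labels \(y\):

```python
import torch

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, num_steps=5):
    """L-infinity PGD: maximize the loss within an eps-ball around x."""
    loss_fn = torch.nn.CrossEntropyLoss()
    x_adv = x.clone().detach()
    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Gradient ascent step on the loss, followed by projection onto the eps-ball.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)
        x_adv = x_adv.clamp(0.0, 1.0)        # keep valid pixel values
    return x_adv.detach()
```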
Another interesting aspect of the PGD attack is that images of different classes are typically perturbed in different directions, ending up in different wrong classes. To show this phenomenon, we provide below a heatmap visualization that displays statistics of the model's predictions, grouped by the 10 classes of the CIFAR-10 dataset [3]. By default, the heatmap shows the model's predictions on the clean images, where the diagonal cells are the most dominant, indicating that the model is mostly correct. You can click on the left and right arrows to see how the predictions change drastically as the PGD attack iterates, and hover over the cells to see the corresponding ratio value.
Looking at the heatmap after 5 steps of the PGD attack, we can extract some insights:
Feel free to play with the heatmap. Any other interesting observations?
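The statistics behind the heatmap are simply a (row-normalized) confusion matrix of the model's predictions on clean or attacked test images. A minimal sketch of how one could compute it (reusing the hypothetical pgd_attack function above; test_loader denotes a CIFAR-10 test DataLoader) is:

```python
import torch

def confusion_matrix(model, loader, attack=None, num_classes=10):
    """Count how often images of true class i are predicted as class j."""
    counts = torch.zeros(num_classes, num_classes, dtype=torch.long)
    model.eval()
    for x, y in loader:
        if attack is not None:
            x = attack(model, x, y)          # e.g., attack=pgd_attack
        with torch.no_grad():
            pred = model(x).argmax(dim=1)
        for t, p in zip(y, pred):
            counts[t, p] += 1
    return counts / counts.sum(dim=1, keepdim=True)   # per-class ratios, as in the heatmap
```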
Here comes a natural question: can we avoid producing a model that is easy to attack during training? The answer is yes! A popular and effective solution is adversarial training [6], in which the model is trained on adversarial examples that are generated on the fly (e.g., by PGD) from the clean training images.
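Conceptually, adversarial training replaces plain loss minimization with the min-max objective \(\min_w \mathbb{E}_{(x,y)}\big[\max_{\|x' - x\|_p \le \epsilon} L(w, x', y)\big]\) from [6]. A minimal sketch of one adversarially trained epoch, reusing the earlier training setup and the hypothetical pgd_attack function (hyperparameters are illustrative), is:

```python
def adversarial_train_one_epoch(model, loader, eps=8/255):
    """One epoch of adversarial training: fit the model on PGD-perturbed inputs."""
    model.train()
    for x, y in loader:
        # Inner maximization: craft adversarial examples for the current parameters w.
        x_adv = pgd_attack(model, x, y, eps=eps, alpha=2/255, num_steps=5)
        # Outer minimization: a regular gradient descent step on the adversarial loss.
        optimizer.zero_grad()                # optimizer and loss_fn from the training sketch above
        loss = loss_fn(model(x_adv), y)
        loss.backward()
        optimizer.step()
```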
In the following bump-area chart, we compare the output probabilities of a regularly trained model
and an adversarially trained model when the same PGD attacker (whose hyperparameters were given in the previous section) is applied, and display how the probabilities flow
as the PGD attack iterates.
You can hover over the area marks in the visualization to compare the two models' output probabilities for the input image belonging to a given class.
It is clear that the adversarially trained model is more robust.
For example, it still classifies the default cat image correctly
after 5 steps of the PGD attack (though with reduced confidence), while the regularly trained model already fails after a single step. Select other images to explore!
One thing worth mentioning here is that the adversarially trained model cannot resist the PGD attack all the time: its accuracy under the 5-step PGD attack is \(52.03\%\) (compared with only \(0.04\%\) for the regularly trained model).
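Accuracies like these come from a straightforward evaluation loop. A minimal sketch (again reusing the hypothetical pgd_attack function; test_loader denotes a CIFAR-10 test DataLoader) is:

```python
def accuracy(model, loader, attack=None):
    """Fraction of test images classified correctly, optionally under an attack function."""
    correct, total = 0, 0
    model.eval()
    for x, y in loader:
        if attack is not None:
            x = attack(model, x, y)                   # e.g., attack=pgd_attack
        with torch.no_grad():
            correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.size(0)
    return correct / total

clean_acc = accuracy(model, test_loader)                      # accuracy on clean images
robust_acc = accuracy(model, test_loader, attack=pgd_attack)  # accuracy under 5-step PGD
```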
There is no free lunch: while adversarial training enhances the model's robustness, it comes at the price of a drop in accuracy on clean images; e.g., the adversarially trained model above achieves \(83.34\%\) accuracy on clean test images, while the regularly trained model achieves \(91.32\%\). This creates an Accuracy-Robustness tradeoff, and people are now interested in how to achieve a good balance by tuning the value of \(\epsilon\) during training. The line chart below gives a sense of what this tradeoff looks like during training, and you can choose a different \(\epsilon\) value to see how the tradeoff varies.
From the visualization we can tell that the Accuracy-Robustness tradeoff is not binary; instead, adversarial training is greatly influenced by the hyperparameter \(\epsilon\). In general, a smaller \(\epsilon\) leads to an adversarially trained model that is more accurate on clean images but less resistant to the PGD attack, and vice versa. We further present the clean accuracies and adversarial accuracies of the above 5 adversarially trained models (\(\epsilon=1/255,\ 2/255,\ 4/255,\ 8/255,\ 16/255\), respectively) in the following scatter plot. Note that the model with \(\epsilon=16/255\) does not achieve better robustness than the model with \(\epsilon=8/255\); thus the tradeoff rule is only valid within a certain range (remember that the motivation of adversarial training is to perturb the input only slightly).
In this article, we have reviewed the topic of adversarial machine learning and showcased the influence of adversarial attacks and the effectiveness of adversarial training through a series of visualizations. We covered the following important concepts:
One thing worth noting is that the PGD attack is a white-box attack, where the attacker has access to the model's parameters and architecture. There are other types of attacks, such as black-box attacks, where the attacker does not have access to the model's parameters or architecture. We hope this article serves as a starting point for readers to explore more about this exciting field.
[1] Ruder S. An Overview of Gradient Descent Optimization Algorithms. arXiv preprint arXiv:1609.04747. 2016 Sep 15.
[2] He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016 (pp. 770-778).
[3] Krizhevsky A, Hinton G. Learning Multiple Layers of Features from Tiny Images. Technical Report, University of Toronto. 2009.
[4] McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection. Journal of Open Source Software. 2018;3(29):861. https://doi.org/10.21105/joss.00861.
[5] Goodfellow IJ, Shlens J, Szegedy C. Explaining and Harnessing Adversarial Examples. arXiv preprint arXiv:1412.6572. 2014 Dec 20.
[6] Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A. Towards Deep Learning Models Resistant to Adversarial Attacks. International Conference on Learning Representations. 2018.