A Visual Tour of Empirical Neural Network Robustness


In the machine learning field, people train neural network-based classifiers to recognize types of objects in images, e.g., cats, dogs, cars, etc. The convolutional neural network is one such classifier, widely used in image classification tasks because of its high accuracy. However, researchers have found that convolutional neural networks are vulnerable to adversarial attacks, where small perturbations to the input image can lead to significant changes in the model's prediction. In response, researchers have proposed defense mechanisms to enhance the robustness of neural networks, giving rise to the field of machine learning robustness. In this article, we provide a visual tour of empirical neural network robustness, presenting how neural networks learn, how they are attacked, and how they can be defended, with interactive visualizations.

Neural Networks' Success

Convolutional Neural Networks (CNNs) are currently among the most powerful classifiers for image classification tasks. To obtain a CNN classifier, one needs to define (1) a network architecture \(f(w,x)\) parameterized by \(w\), and (2) a loss function \(L(w, x, y)\), where \(x\) denotes the input images and \(y\) denotes the corresponding class labels. After that, gradient descent optimization [1] can be performed over the loss function to obtain a set of (approximately) optimal parameters \(w^*\). The CNN model can then be used to predict the class label of a new image \(x_{new}\) by \(\hat{y} = \arg\max_{y} f(w^*, x_{new})_y\), where \(f(w^*, x_{new})_y\) denotes the probability of the image \(x_{new}\) belonging to class \(y\). To better understand the architectures of convolutional neural networks and how gradient descent methods work for neural networks, please check out two nice articles: CNN Explainer and Backprop Explainer.
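
To make this concrete, below is a minimal sketch of the training and prediction procedure in PyTorch. The optimizer settings, the use of torchvision's ResNet-18, and the helper names are illustrative assumptions, not the exact setup behind the visualizations in this article.

```python
import torch
import torch.nn as nn
import torchvision

# f(w, x): a ResNet-18 with 10 output classes (an assumption for illustration).
model = torchvision.models.resnet18(num_classes=10)
loss_fn = nn.CrossEntropyLoss()  # L(w, x, y)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

def train_step(x, y):
    """One gradient-descent update on a batch of images x with labels y."""
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()   # gradients of the loss with respect to w
    optimizer.step()  # w <- w - lr * gradient
    return loss.item()

@torch.no_grad()
def predict(x_new):
    """Predicted class: argmax_y f(w*, x_new)_y."""
    probs = torch.softmax(model(x_new), dim=1)
    return probs.argmax(dim=1)
```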

In this article, we use the classic neural network architecture ResNet-18 [2], on one of the benchmark datasets, CIFAR-10 [3], to illustrate the key insights. On the left side of the screen, you can see a scatter-plot view of the 2D embeddings of 512 test images from CIFAR-10. The embeddings are generated by the dimensionality reduction method UMAP [4] (interested readers can refer to UMAP Tour and Intro to Dimensionality Reduction), and images of the same class are mostly clustered together. The larger image in the top-right corner, which is independent of the scatter plot, is the one that is currently selected. By default, we display a cat image, and you can select a different image by clicking on the images within the scatter plot. Note that this selection is synchronized with the visualizations throughout the article.
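
As a rough sketch of how such embeddings could be produced (the exact pipeline behind the scatter plot may differ), one could take the trained model's penultimate-layer features for each test image and project them to 2D with UMAP. The feature-extraction choice, the `test_loader`, and the UMAP settings below are assumptions for illustration.

```python
import torch
import umap  # from the umap-learn package

model.eval()
# Reuse the trained ResNet-18 from the sketch above, dropping its final
# fully connected layer so that it outputs 512-dimensional features.
feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])

features = []
with torch.no_grad():
    for x, _ in test_loader:  # assumed DataLoader over the 512 test images
        feats = feature_extractor(x).flatten(start_dim=1)
        features.append(feats)
features = torch.cat(features).numpy()

# Project the 512-dimensional features down to 2D for the scatter plot.
embedding = umap.UMAP(n_components=2).fit_transform(features)  # shape: (512, 2)
```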

In the two visualizations below, we show how the loss value and the test accuracy evolve, and how the output probabilities from the CNN model change as training proceeds. Hover over the line chart on the left-hand side and move the mouse, and the racing bar chart on the right-hand side will display the changes in the predicted probabilities for the selected image at different training epochs. The default cat image is correctly classified by the neural network in most of the later epochs. Feel free to select other images to see how the neural network learns (there are definitely misclassified images!).

Despite the success of neural networks we have just observed, some behaviors hint at potential instability. For example, in epoch 96, the model's prediction for the default cat image is frog; also, although this deer image is correctly classified, the model's prediction for it jumps among cat, dog, bird, and deer in the last 10 epochs. A question then arises: if a small change in the neural network parameters can lead to a significant change in the model's prediction, can a small change in the input image also lead to a significant change in the model's prediction? The answer is YES, and the following sections will walk you through the answer.

Projected Gradient Descent Attack

Consider a situation where an input image is altered by small pixel-level noise that is imperceptible to the human eye. In general, humans will not be fooled; unfortunately, this is not true for neural networks. In fact, noise crafted in particular ways can mislead a neural network almost every time. One popular attack method is the Projected Gradient Descent (PGD) attack [6]. As its name indicates, the PGD attack is gradient-descent-based. Unlike gradient descent optimization, which updates the neural network parameters to minimize the loss value, the PGD attacker iteratively perturbs the input image to maximize the loss value so as to mislead the neural network. It mainly consists of the following steps:

  1. Start from a randomly perturbed version \(x^{\prime}\) of a given clean image \(x\) such that \(\|x^{\prime} - x\|_p \leq \epsilon\), where \(\epsilon\) is a small threshold value called the attack radius;
  2. Update \(x^{\prime}\) by one step of gradient ascent to increase the loss value, trying to push the network's prediction to be wrong;
  3. Project \(x^{\prime}\) back into the \(\epsilon\)-radius \(L_p\) ball around \(x\) if necessary, making sure that the perturbation stays within the threshold (i.e., small enough);
  4. Repeat steps 2 and 3 until convergence or until reaching the maximum number of perturbation steps allowed (denoted as \(ns\)).

Readers who are not familiar with the notation \(\|\cdot\|_p\) (i.e., the vector norm) can refer to vector p-norm. Note that the \(\epsilon\)-radius \(L_p\) ball above enforces the condition that "the noise is imperceptible to the human eye". To illustrate the PGD attack, we implement a PGD attacker with hyperparameters \(\{p=+\infty\), \(\epsilon=8/255\), \(ns=5\}\) and display in the visualizations below what the adversarial images look like at different steps of the PGD attack and how the neural network's predictions are manipulated. Click on the adversarial images to see the polluted output probabilities from the neural network. It is clear that the attacker can easily mislead the neural network into predicting a wrong class label within 3 attack steps, while the adversarial images look nearly identical to their original counterparts. Feel free to click on some other images to explore; note that the perturbations are image-dependent, so they differ from image to image.
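
For reference, here is a minimal sketch of an \(L_\infty\) PGD attacker with the hyperparameters above (\(\epsilon=8/255\), \(ns=5\)). The step size `alpha` and the clamping to valid pixel values are assumptions of this sketch, not details taken from the article's implementation.

```python
import torch

def pgd_attack(model, loss_fn, x, y, eps=8/255, alpha=2/255, ns=5):
    """Return an adversarial image x_adv with ||x_adv - x||_inf <= eps."""
    # Step 1: start from a random point inside the eps-radius L-infinity ball.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)

    for _ in range(ns):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        loss.backward()
        # Step 2: one gradient-ascent step on the loss (signed gradient).
        x_adv = x_adv + alpha * x_adv.grad.sign()
        # Step 3: project back into the eps ball around x and keep pixels valid.
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).clamp(0, 1).detach()
    return x_adv
```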

[Figure: the selected original image and its adversarial versions after steps 1 through 5 of the PGD attack.]

Another interesting aspect of the PGD attack's behavior is that images of different classes are typically perturbed in different directions, ending up in different wrong classes. To show this phenomenon, we provide below a heatmap visualization that displays the statistics of the model's predictions, categorized by the 10 classes of the CIFAR-10 dataset. By default, the heatmap shows the model's predictions on the clean images, where the diagonal cells are the most dominant, indicating that the model is mostly correct. You can click on the left and right arrows to see how the predictions change drastically as the PGD attack iterates, and hover over the cells to see the corresponding ratio values.

[Heatmap: the model's predictions by class after 0 to 5 steps of the PGD attack.]

Looking at the heatmap after 5 steps of the PGD attack, we can extract some insights:

  1. The accuracy of the model on the adversarial images is nearly 0;
  2. Automobile images are mostly perturbed into the class truck, and vice versa;
  3. Both horse and dog images are mostly perturbed into the class cat, while cat images are most often perturbed into the class deer.

Feel free to play with the heatmap. Any other interesting observations?
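
If you would like to compute similar statistics yourself, the sketch below builds a row-normalized 10x10 matrix of true class versus predicted class. It reuses the hypothetical `pgd_attack`, `model`, `loss_fn`, and `test_loader` from the earlier sketches and is not the code behind the heatmap itself.

```python
import torch

def prediction_matrix(model, loader, attack_steps=0):
    """Row-normalized counts of (true class, predicted class) pairs."""
    model.eval()
    counts = torch.zeros(10, 10)
    for x, y in loader:
        if attack_steps > 0:
            x = pgd_attack(model, loss_fn, x, y, ns=attack_steps)
        with torch.no_grad():
            preds = model(x).argmax(dim=1)
        for t, p in zip(y.tolist(), preds.tolist()):
            counts[t, p] += 1
    return counts / counts.sum(dim=1, keepdim=True)  # each row sums to 1
```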

Adversarial Training

A natural question then arises: can we avoid producing a model that is easy to attack during training? The answer is yes! A popular and effective solution is adversarial training [6]. The idea is to utilize the PGD attack in the training process, that is, to train the model with adversarial images instead of clean images at each epoch. Although the idea is simple, the optimization behind it is not trivial. In standard training, we solve the following minimization problem: $$ \min _{w} \rho(w), \quad \text { where } \quad \rho(w)=\mathbb{E}_{(x, y) \sim \mathcal{D}}\left[L(w, x, y)\right], $$ while in adversarial training, we solve a min-max problem: $$ \min _{w} \rho(w), \quad \text { where } \quad \rho(w)=\mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\max _{\delta \in \mathcal{S}} L(w, x+\delta, y)\right], $$ where \(\mathcal{S} = \{\delta \mid \|\delta\|_p \leq \epsilon\}\) denotes the \(\epsilon\)-radius \(L_p\) ball. In theory, the convergence conditions for such a min-max problem over a general function are hard to determine; in practice, adversarial training converges for most neural networks.
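
In code, one common way to approximate this min-max objective is to solve the inner maximization with PGD and the outer minimization with ordinary gradient descent. The sketch below reuses the hypothetical `pgd_attack` helper and training setup from earlier; it illustrates the idea rather than the exact training recipe used for the models in this article.

```python
def adversarial_train_epoch(model, loader, optimizer, loss_fn, eps=8/255, ns=5):
    """One epoch of adversarial training with a PGD inner maximization."""
    model.train()
    for x, y in loader:
        # Inner max: craft adversarial images inside the eps-radius L-infinity ball.
        x_adv = pgd_attack(model, loss_fn, x, y, eps=eps, ns=ns)
        # Outer min: a standard gradient-descent update on the adversarial images.
        optimizer.zero_grad()
        loss = loss_fn(model(x_adv), y)
        loss.backward()
        optimizer.step()
```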

In the following bump-area chart, we compare the output probabilities of a regularly trained model and an adversarially trained model when the same PGD attacker (whose hyperparameters were given in the last section) is applied, and display how the probabilities flow as the PGD attack iterates. You can hover over the area marks in the visualization to compare the two models' output probabilities for the input image belonging to each class. It is clear that the adversarially trained model is more robust. For example, it still predicts the default cat image correctly after 5 steps of the PGD attack (though with decreased confidence), while the regularly trained model fails after only 1 step. Select other images to play!

One thing worth mentioning here is that the adversarially trained model cannot resist the PGD attack all the time: its accuracy under the 5-step PGD attack is \(52.03\%\) (while that of the regularly trained model is only \(0.04\%\)).
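
As a rough sketch of how such accuracies could be measured, the helper below evaluates a model on either clean or PGD-attacked test images; it again builds on the hypothetical `pgd_attack`, `loss_fn`, and `test_loader` from the earlier sketches.

```python
def evaluate(model, loader, attack=None):
    """Accuracy on clean images (attack=None) or on adversarial images."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        if attack is not None:
            x = attack(model, loss_fn, x, y)  # e.g., the 5-step PGD attacker
        with torch.no_grad():
            preds = model(x).argmax(dim=1)
        correct += (preds == y).sum().item()
        total += y.numel()
    return correct / total

# Example usage (assumed test_loader):
# clean_acc  = evaluate(model, test_loader)
# robust_acc = evaluate(model, test_loader, attack=pgd_attack)
```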

Accuracy-Robustness Tradeoff

There is no free lunch: while adversarial training enhances the model's robustness, it comes at the price of a drop in the model's accuracy on clean images. For instance, the adversarially trained model above achieves \(83.34\%\) accuracy on clean test images, while the regularly trained model achieves \(91.32\%\). This creates an Accuracy-Robustness tradeoff, and people are now interested in how to achieve a good balance by tuning the value of \(\epsilon\) during training. The line chart below gives a sense of what this tradeoff looks like during training, and you can choose a different \(\epsilon\) value to see how the tradeoff varies.

From the visualization we can tell that the Accuracy-Robustness tradeoff is not binary; instead, adversarial training is greatly influenced by the hyperparameter \(\epsilon\). In general, a smaller \(\epsilon\) leads to an adversarially trained model that is more accurate on clean images but less resistant to the PGD attack, and vice versa. We further present the clean accuracies and adversarial accuracies of the above 5 adversarially trained models (\(\epsilon=1/255,\ 2/255,\ 4/255,\ 8/255,\ 16/255\), respectively) in the following scatter plot. Note that the model with \(\epsilon=16/255\) does not bring better robustness than the model with \(\epsilon=8/255\), so the tradeoff rule is only valid within a certain range (remember, the motivation behind adversarial training is to perturb the input only very slightly).
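
To produce points like those in the scatter plot, one could train one model per \(\epsilon\) value and record its clean and adversarial accuracies, as in the sketch below. The number of epochs, the `train_loader`, and the reuse of the earlier hypothetical helpers (`adversarial_train_epoch`, `evaluate`, `pgd_attack`) are assumptions for illustration.

```python
import torch
import torchvision

results = {}
for eps in [1/255, 2/255, 4/255, 8/255, 16/255]:
    # Train a fresh model adversarially with this eps.
    model = torchvision.models.resnet18(num_classes=10)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    for epoch in range(100):  # assumed training budget
        adversarial_train_epoch(model, train_loader, optimizer, loss_fn, eps=eps)
    # Record clean accuracy and accuracy under the 5-step PGD attack (eps=8/255).
    results[eps] = {
        "clean_acc": evaluate(model, test_loader),
        "adv_acc": evaluate(model, test_loader, attack=pgd_attack),
    }
```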

Conclusion

In this article, we have reviewed the topic of adversarial machine learning and showcased the influence of adversarial attacks and the effectiveness of adversarial training through a series of visualizations. We covered the following important concepts: how neural networks are trained, how the PGD attack manipulates their predictions, how adversarial training defends against such attacks, and the Accuracy-Robustness tradeoff.

One thing worth noting is that the PGD attack is a white-box attack, where the attacker has access to the model's parameters and architecture. There are other types of attacks, such as black-box attacks, where the attacker does not have access to the model's parameters or architecture. We hope this article serves as a starting point for readers to explore more about this exciting field.

Citations

[1] Ruder S. An Overview of Gradient Descent Optimization Algorithms. arXiv preprint arXiv:1609.04747. 2016 Sep 15.
[2] He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016 (pp. 770-778).
[3] Krizhevsky A, Hinton G. Learning Multiple Layers of Features from Tiny Images. Technical Report, University of Toronto, 2009.
[4] McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection. Journal of Open Source Software, 3(29), 861, 2018. https://doi.org/10.21105/joss.00861.
[5] Goodfellow IJ, Shlens J, Szegedy C. Explaining and Harnessing Adversarial Examples. arXiv preprint arXiv:1412.6572. 2014 Dec 20.
[6] Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A. Towards Deep Learning Models Resistant to Adversarial Attacks. International Conference on Learning Representations. 2018.