# A Visual Tour to Empirical Neural Network Robustness

### Published

Not published yet.

No DOI yet.

## Neural Network's Success

Neural networks are currently among the most powerful classifiers for image classification tasks. To obtain such a classifier, one first defines a network architecture $$f(w,x)$$ parameterized by $$w$$ and a loss function $$L(w, x, y)$$, where $$y$$ denotes the true label, and then runs gradient descent on the loss to obtain a set of (approximately) optimal parameters $$w$$. To understand gradient descent for neural networks, you can check this tutorial and this interactive article.
In the visualizations below, we show how the loss value and the test accuracy evolve during training, and how the output probabilities change for 10 example images, using the benchmark network ResNet-18 on the commonly used CIFAR-10 dataset (the training details can be found in ). Select an image and hover over the line chart on the left; a racing bar chart will then display how the predicted probabilities change across training epochs. You may find some interesting patterns there; for example, the prediction for the Dog image (2nd row, 1st column) jumps between Cat and Dog many times as training proceeds.
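The training procedure described above can be sketched in a few lines. This is a minimal illustration, not the setup used for the visualizations: a hand-written logistic regression stands in for ResNet-18, the data is a synthetic two-class toy set rather than CIFAR-10, and the function names, learning rate, and epoch count are all assumptions chosen for readability.

```python
import math
import random


def predict(w, x):
    """Toy stand-in for f(w, x): logistic regression returning P(y = 1 | x)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, x, y):
    """Cross-entropy loss L(w, x, y) for one example with label y in {0, 1}."""
    p = min(max(predict(w, x), 1e-12), 1.0 - 1e-12)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def grad_w(w, x, y):
    """Gradient of the loss with respect to w, which is (p - y) * x for this model."""
    p = predict(w, x)
    return [(p - y) * xi for xi in x]


def train(data, dim, lr=0.5, epochs=100):
    """Full-batch gradient descent; returns the weights and the loss after each epoch."""
    w = [0.0] * dim
    history = []
    for _ in range(epochs):
        g = [0.0] * dim
        for x, y in data:
            g = [a + b for a, b in zip(g, grad_w(w, x, y))]
        # Step against the average gradient to decrease the loss.
        w = [wi - lr * gi / len(data) for wi, gi in zip(w, g)]
        history.append(sum(loss(w, x, y) for x, y in data) / len(data))
    return w, history


def make_data(n=200, seed=0):
    """Synthetic data (an assumption): two features in [0, 1] plus a constant bias input."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        center = 0.7 if y == 1 else 0.3
        x = [min(max(rng.gauss(center, 0.1), 0.0), 1.0) for _ in range(2)] + [1.0]
        data.append((x, y))
    return data


data = make_data()
w, history = train(data, dim=3)
print(f"loss after epoch 1: {history[0]:.4f}, after epoch {len(history)}: {history[-1]:.4f}")
```

For a real network the gradient is not written by hand but obtained by automatic differentiation, and the update is usually computed on mini-batches rather than on the full dataset.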

## PGD Attack

Consider the situation where an input image is altered by pixel-level noise that is imperceptible to the human eye. Humans are generally not fooled by such noise, but unfortunately neural networks are. In fact, carefully constructed noise can mislead a neural network almost every time. One popular attack method is the Projected Gradient Descent (PGD) attack. As its name indicates, PGD is a gradient-based attack. It consists of the following steps:

1. Start from a random perturbation $$x^{\prime}$$ of a given clean image $$x$$ such that $$\|x^{\prime} - x\|_p \leq \epsilon$$;
2. Update $$x^{\prime}$$ by one step of gradient ascent to increase the loss value;
3. Project $$x^{\prime}$$ back into the $$\epsilon$$-radius $$L_p$$ ball if necessary;
4. Repeat steps 2–3 until convergence or until a desired number of steps $$ns$$ is reached.

Note that the $$\epsilon$$-radius $$L_p$$ ball above formalizes the condition that "the noise is imperceptible to the human eye". In the visualizations below, we set the hyperparameters to $$\{p=+\infty,\ \epsilon=8/255,\ ns=10\}$$, and display how the neural network's predictions are manipulated and how the adversarial images look at different steps of the PGD attack. Use the slider to step through the attack, and click on any of the 10 example images to select it. In most cases, the attacker succeeds in no more than 3 steps, while the adversarial images look nearly identical to their original counterparts. Do you have any other interesting findings?
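The four steps can be sketched as follows for $$p=+\infty$$, $$\epsilon=8/255$$ and $$ns=10$$. This is a minimal sketch under stated assumptions: a hand-written logistic regression replaces the network so that the input gradient can be computed without an autograd library, the step size of $$2/255$$ and the signed-gradient update are common choices for the $$L_\infty$$ case rather than values taken from this article, and the weights and input are hand-picked for illustration.

```python
import math
import random


def predict(w, b, x):
    """Toy classifier (an assumption): logistic regression returning P(y = 1 | x)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def grad_x(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x: (p - y) * w."""
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def pgd_linf(w, b, x, y, eps=8 / 255, step_size=2 / 255, ns=10, seed=0):
    """L_inf PGD attack on a single input with pixel values in [0, 1]."""
    rng = random.Random(seed)
    # Step 1: start from a random point inside the eps-ball around x.
    x_adv = [min(max(xi + rng.uniform(-eps, eps), 0.0), 1.0) for xi in x]
    for _ in range(ns):  # Step 4: repeat for a fixed number of steps.
        g = grad_x(w, b, x_adv, y)
        # Step 2: one step of (signed) gradient ascent on the loss.
        x_adv = [xa + step_size * sign(gi) for xa, gi in zip(x_adv, g)]
        # Step 3: project back onto the eps-ball, then onto the valid pixel range.
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
        x_adv = [min(max(xa, 0.0), 1.0) for xa in x_adv]
    return x_adv


# Hand-picked example: the clean input is classified as class 1 (probability above 0.5).
w, b = [4.0, -3.0, 2.0, -5.0], 0.4
x, y = [0.6, 0.5, 0.5, 0.4], 1
x_adv = pgd_linf(w, b, x, y)

print("clean probability of class 1:      ", round(predict(w, b, x), 3))
print("adversarial probability of class 1:", round(predict(w, b, x_adv), 3))
print("largest pixel change:              ", max(abs(a - c) for a, c in zip(x_adv, x)))
```

In this example no pixel moves by more than $$8/255$$, yet the probability of the correct class drops below 0.5, so the predicted label flips. For the $$L_\infty$$ ball the projection in step 3 reduces to element-wise clipping; other values of $$p$$ need a different projection.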

This raises a natural question: can we avoid this? The answer is yes! A popular and effective solution is adversarial training. The idea is to use the PGD attack inside the training process, that is, to train the model on adversarial images instead of clean images at each epoch. Although the idea is simple, the optimization behind it is not trivial. In standard training, we solve the minimization problem $$\min _{w} \rho(w), \quad \text { where } \quad \rho(w)=\mathbb{E}_{(x, y) \sim \mathcal{D}}\left[L(w, x, y)\right],$$ whereas in adversarial training we solve a min-max problem: $$\min _{w} \rho(w), \quad \text { where } \quad \rho(w)=\mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\max _{\delta \in \mathcal{S}} L(w, x+\delta, y)\right],$$ where $$\mathcal{S} = \{\delta \mid \|\delta\|_p \leq \epsilon\}$$ is the set of allowed perturbations, so that $$x+\delta$$ stays inside the $$\epsilon$$-radius $$L_p$$ ball around $$x$$. In theory, the convergence conditions for such a min-max problem over a general function are hard to determine; in practice, however, adversarial training converges for most neural networks.

In the following visualizations, we compare the output probabilities of a regularly trained model and an adversarially trained model under the same PGD attack, and display how the probabilities flow as the attack iterates. As above, you can select from the set of 10 images and view the difference in performance between the two models. It is clear that the adversarially trained model is more robust. One thing worth mentioning is that the adversarially trained model cannot resist the PGD attack all the time: its accuracy under the PGD attack is $$50.51\%$$ (that of the regularly trained model is only $$0.01\%$$).
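The min-max structure can be made concrete with a small sketch: the inner maximization is approximated by a PGD attack on each training example, and the outer minimization is an ordinary gradient step computed on the resulting adversarial examples. Everything specific here is an assumption for illustration: the logistic-regression model, the synthetic two-dimensional data, the toy radius `eps = 0.1` (chosen for the toy feature scale, not the $$8/255$$ used for images), and all function names.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v):
    return (v > 0) - (v < 0)


def pgd_linf(w, b, x, y, eps, ns, rng):
    """Inner maximization: approximate the worst-case x + delta with L_inf PGD."""
    step_size = eps / 4  # assumption: ns * step_size covers the whole ball when ns >= 8
    x_adv = [xi + rng.uniform(-eps, eps) for xi in x]
    for _ in range(ns):
        coef = predict(w, b, x_adv) - y  # dL/dz for the cross-entropy loss
        x_adv = [xa + step_size * sign(coef * wi) for xa, wi in zip(x_adv, w)]
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv  # toy features are not pixels, so there is no clipping to [0, 1]


def train(data, eps=0.0, lr=0.5, epochs=50, ns=10, seed=0):
    """eps = 0 gives standard training; eps > 0 gives adversarial training."""
    rng = random.Random(seed)
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        gw, gb = [0.0, 0.0], 0.0
        for x, y in data:
            x_in = pgd_linf(w, b, x, y, eps, ns, rng) if eps > 0 else x
            err = predict(w, b, x_in) - y
            gw = [g + err * xi for g, xi in zip(gw, x_in)]
            gb += err
        # Outer minimization: gradient step on the loss at the adversarial inputs.
        w = [wi - lr * g / len(data) for wi, g in zip(w, gw)]
        b -= lr * gb / len(data)
    return w, b


def accuracy(w, b, data, eps=0.0, ns=10, seed=1):
    """Clean accuracy if eps = 0, accuracy under the PGD attack otherwise."""
    rng = random.Random(seed)
    correct = 0
    for x, y in data:
        x_in = pgd_linf(w, b, x, y, eps, ns, rng) if eps > 0 else x
        correct += int((predict(w, b, x_in) > 0.5) == (y == 1))
    return correct / len(data)


def make_data(n=200, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        c = 0.5 if y == 1 else -0.5
        data.append(([rng.gauss(c, 0.3), rng.gauss(c, 0.3)], y))
    return data


data, eps = make_data(), 0.1
results = {}
for name, train_eps in (("standard", 0.0), ("adversarial", eps)):
    w, b = train(data, eps=train_eps)
    results[name] = (accuracy(w, b, data), accuracy(w, b, data, eps=eps))
    print(name, "clean acc:", results[name][0], "acc under PGD:", results[name][1])
```

The only difference between the two training modes is the input on which the gradient with respect to $$w$$ is evaluated. Using the gradient at the (approximate) inner maximizer as the descent direction for the outer problem is the usual practical choice, and it makes each epoch roughly $$ns$$ times more expensive than standard training.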

There is no free lunch: adversarial training enhances the model's robustness, but at the price of lower accuracy on clean images. The adversarially trained model achieves $$80.05\%$$ accuracy on clean test images, while the regularly trained model achieves $$91.64\%$$. This creates an Accuracy-Robustness tradeoff, and researchers are now interested in how to achieve both. The line chart below gives a sense of how this tradeoff looks during training.
The Accuracy-Robustness tradeoff is not binary; rather, it is controlled by the hyperparameter $$\epsilon$$ used in adversarial training, and it is expected to vary as $$\epsilon$$ changes. In general, a smaller $$\epsilon$$ leads to an adversarially trained model that is more accurate on clean images but less resistant to the PGD attack, and vice versa. We adversarially trained 5 models with $$\epsilon=1/255,\ 2/255,\ 4/255,\ 8/255,\ 16/255$$ (other hyperparameters, including the attacker, are kept the same to make a fair comparison), and present their clean accuracy and adversarial accuracy (i.e., accuracy under the PGD attack) in the scatter plot below. Note that the model with $$\epsilon=16/255$$ is not more robust than the model with $$\epsilon=8/255$$, so the rule mentioned above is only valid within a certain range.
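The structure of such a sweep, several training radii evaluated against one fixed attacker, can be sketched on a toy problem. The sketch deliberately swaps in a simpler technique than the one used for the scatter plot: for a linear classifier the worst-case $$L_\infty$$ perturbation has a closed form (it lowers the signed margin by $$\epsilon\|w\|_1$$), so no iterative PGD attack is needed. This shortcut does not carry over to ResNet-18, and the synthetic data, function names, learning rate, and epoch count are assumptions.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sign(v):
    return (v > 0) - (v < 0)


def worst_case_margin(w, b, x, s, eps):
    """Signed margin s * (w.x + b) after the worst L_inf perturbation of radius eps.

    s is the label in {-1, +1}. For a linear model the worst perturbation is
    delta = -eps * s * sign(w), which lowers the margin by eps * ||w||_1.
    """
    clean = s * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return clean - eps * sum(abs(wi) for wi in w)


def adversarial_train(data, eps, lr=1.0, epochs=200):
    """Gradient descent on the logistic loss of the worst-case margin."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        gw, gb = [0.0, 0.0], 0.0
        for x, s in data:
            m = worst_case_margin(w, b, x, s, eps)
            c = -sigmoid(-m)  # derivative of log(1 + exp(-m)) with respect to m
            gw = [g + c * (s * xi - eps * sign(wi)) for g, xi, wi in zip(gw, x, w)]
            gb += c * s
        w = [wi - lr * g / len(data) for wi, g in zip(w, gw)]
        b -= lr * gb / len(data)
    return w, b


def accuracy(w, b, data, eps=0.0):
    """Clean accuracy for eps = 0, accuracy under the worst-case attack otherwise."""
    return sum(worst_case_margin(w, b, x, s, eps) > 0 for x, s in data) / len(data)


def make_data(n, seed):
    """Synthetic two-class data with features roughly in [0, 1] (an assumption)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        s = rng.choice((-1, 1))
        c = 0.65 if s == 1 else 0.35
        data.append(([rng.gauss(c, 0.1), rng.gauss(c, 0.1)], s))
    return data


train_set, test_set = make_data(100, seed=0), make_data(100, seed=1)
ATTACK_EPS = 8 / 255  # the attacker is fixed across all models for a fair comparison

results = []
for k in (1, 2, 4, 8, 16):
    w, b = adversarial_train(train_set, eps=k / 255)
    results.append((k, accuracy(w, b, test_set), accuracy(w, b, test_set, eps=ATTACK_EPS)))

for k, clean_acc, adv_acc in results:
    print(f"train eps = {k}/255: clean acc = {clean_acc:.2f}, adversarial acc = {adv_acc:.2f}")
```

Each row of the output corresponds to one point in a clean-accuracy versus adversarial-accuracy scatter plot. The numbers from this toy problem say nothing about the ResNet-18 results; only the experimental layout is the same.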