Neural networks are the most powerful classifiers for image classification tasks at present. To obtain such a classifier, one first needs to define a network architecture \(f(w,x)\)
parameterized by \(w\) as well as a loss function \(L(w, x, y)\), where \(y\) denotes the true labels, and then run gradient descent on the loss function to obtain
a set of (approximately) optimal parameters \(w\). To understand the gradient descent method for neural networks, you can check this tutorial and this interactive article.
In the visualizations below, we show how the loss value and the test accuracy evolve, and how the output probabilities change for 10 example images, using a benchmark neural network, ResNet-18 [1].
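To make the training recipe concrete, here is a minimal sketch of gradient descent in plain Python. It assumes a toy logistic-regression model in place of ResNet-18, with \(f(w,x)=\sigma(w\cdot x)\) and the binary cross-entropy as \(L(w,x,y)\); the function names, the toy data, and the learning rate are illustrative assumptions, not the setup behind the visualizations.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_grad(w, data):
    """Mean binary cross-entropy of f(w, x) = sigmoid(w . x) and its gradient in w."""
    n = len(data)
    loss = 0.0
    grad = [0.0] * len(w)
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p)) / n
        for i, xi in enumerate(x):
            grad[i] += (p - y) * xi / n
    return loss, grad


def gradient_descent(data, lr=0.5, steps=200):
    """Repeat w <- w - lr * grad and record the loss before each update."""
    w = [0.0] * len(data[0][0])
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(w, data)
        history.append(loss)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w, history


# Toy data: the label equals the first feature; the last feature is a constant bias input.
toy_data = [
    ([0.0, 0.0, 1.0], 0),
    ([0.0, 1.0, 1.0], 0),
    ([1.0, 0.0, 1.0], 1),
    ([1.0, 1.0, 1.0], 1),
]
w, history = gradient_descent(toy_data)
print(f"loss: {history[0]:.3f} -> {history[-1]:.3f}")
```

For a deep network the gradient is computed by backpropagation on mini-batches rather than by a hand-written formula, but the update rule has the same shape.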
Consider the situation where an input image is altered by some pixel-level noise that is imperceptible to the human eye. In general, humans won’t be fooled; but unfortunately this is not
true for neural networks. In fact, noise that is crafted in certain ways can mislead a neural network almost always. One popular attack method is the Projected Gradient Descent (PGD) attack.
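For reference, with \(p=+\infty\) the PGD attack is commonly written as the following iteration. This follows the usual presentation in Madry et al. [5]; the step size \(\alpha\), the projection operator \(\Pi\), and the reading of \(ns\) as the number of attack steps are notation assumed here rather than taken from this article.

\[
x^{(0)} = x, \qquad
x^{(t+1)} = \Pi_{B_p(x,\epsilon)}\Big( x^{(t)} + \alpha \cdot \mathrm{sign}\big( \nabla_{x} L(w, x^{(t)}, y) \big) \Big), \quad t = 0, \dots, ns-1,
\]
\[
B_p(x,\epsilon) = \{\, x' : \|x' - x\|_p \le \epsilon \,\}.
\]

Each step moves the image in the direction that increases the loss, and the projection \(\Pi\) pulls it back into the \(\epsilon\)-radius \(L_p\) ball around the original image. Implementations often start from a random point inside the ball instead of \(x\) itself.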
Note that the \(\epsilon\)-radius \(L_p\) ball above enforces the condition that "the noise is imperceptible to the human eye". In the visualizations below, we set the hyperparameters to \(\{p=+\infty,\ \epsilon=8/255,\ ns=10\}\), and display how the neural network's predictions are manipulated and how the adversarial images look at different steps of the PGD attack. You can use the slider to step through the PGD attack, and you can select any of the 10 example images by clicking on it. In most cases, the attacker succeeds in no more than 3 steps, while the adversarial images look nearly identical to their original counterparts. Do you have any other interesting findings?
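As a rough illustration of the attack with \(p=+\infty\) and \(\epsilon=8/255\), the sketch below runs a 10-step PGD attack in plain Python. It assumes a toy linear classifier on a 4-pixel "image" instead of ResNet-18, a step size of \(2/255\), and a start from the clean image; all names (`pgd_attack`, `probability`, and so on) and the toy weights are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def probability(w, b, x):
    """Probability of class 1 under a toy linear classifier."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v):
    return (v > 0) - (v < 0)


def pgd_attack(w, b, x, y, eps=8 / 255, alpha=2 / 255, num_steps=10):
    """L-infinity PGD: signed gradient ascent on the loss, then projection."""
    x_adv = list(x)
    trajectory = [list(x_adv)]
    for _ in range(num_steps):
        p = probability(w, b, x_adv)
        # Gradient of the cross-entropy loss with respect to the input.
        grad = [(p - y) * wi for wi in w]
        x_adv = [xa + alpha * sign(g) for xa, g in zip(x_adv, grad)]
        # Project back onto the eps-ball around the clean image.
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
        # Keep pixel values in the valid range [0, 1].
        x_adv = [min(max(xa, 0.0), 1.0) for xa in x_adv]
        trajectory.append(list(x_adv))
    return x_adv, trajectory


# Toy example: a 4-pixel image that the classifier labels correctly as class 1.
w_toy, b_toy = [10.0, -10.0, 10.0, -10.0], 0.0
x_clean, y_true = [0.52, 0.50, 0.51, 0.50], 1
x_adv, trajectory = pgd_attack(w_toy, b_toy, x_clean, y_true)
print(probability(w_toy, b_toy, x_clean), probability(w_toy, b_toy, x_adv))
```

The `trajectory` list holds the image after each step, which is the kind of sequence a step slider would walk through.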
Here comes a natural question: can we avoid this? The answer is yes! A popular and effective solution is adversarial training.
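Adversarial training is usually described as a min-max problem, \(\min_w \mathbb{E}_{(x,y)}\big[\max_{\|\delta\|_p\le\epsilon} L(w, x+\delta, y)\big]\) (the formulation of Madry et al. [5]): the inner maximization is approximated by an attack such as PGD, and the outer minimization is ordinary gradient descent on the attacked inputs. The sketch below illustrates that loop on a toy logistic-regression model in plain Python; the model, the data, the step-size heuristic, and names such as `adversarial_train` are illustrative assumptions, not the ResNet-18 setup used in this article.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x):
    """Probability of class 1 under a toy linear model (no bias term)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def pgd(w, x, y, eps, alpha, num_steps):
    """Inner maximization: L-infinity PGD against the current parameters w."""
    x_adv = list(x)
    for _ in range(num_steps):
        p = predict(w, x_adv)
        grad_x = [(p - y) * wi for wi in w]
        x_adv = [xa + alpha * ((g > 0) - (g < 0)) for xa, g in zip(x_adv, grad_x)]
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


def adversarial_train(data, eps, lr=0.5, epochs=200, num_steps=5):
    """Outer minimization: gradient descent on the loss of attacked inputs."""
    w = [0.0] * len(data[0][0])
    alpha = 2.5 * eps / num_steps  # heuristic step size (an assumption)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        for x, y in data:
            x_adv = pgd(w, x, y, eps, alpha, num_steps)
            p = predict(w, x_adv)
            for i, xi in enumerate(x_adv):
                grad_w[i] += (p - y) * xi / len(data)
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
    return w


# Toy 2-D data: the first feature carries the label, the second is mostly noise.
toy_data = [
    ([-1.0, 0.2], 0),
    ([-0.8, -0.3], 0),
    ([1.0, -0.2], 1),
    ([0.9, 0.4], 1),
]
w_robust = adversarial_train(toy_data, eps=0.3)
print(w_robust)
```

The only difference from regular training is that each gradient step is taken on `x_adv` instead of `x`, so every update pays for an extra PGD run per example.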
There is no free lunch: while adversarial training enhances the model's robustness, it comes at the price of a drop in standard accuracy. The adversarially trained model achieves \(80.05\%\) accuracy on clean test images, while the regularly trained model achieves \(91.64\%\). This creates an Accuracy-Robustness tradeoff, and people are now interested in how we can achieve both. The line chart below gives a sense of how this tradeoff looks during training.
The Accuracy-Robustness tradeoff is not binary; instead, adversarial training is controlled by the hyperparameter \(\epsilon\), and the tradeoff is expected to shift as \(\epsilon\) changes. In general, a smaller \(\epsilon\) leads to an adversarially trained model that is more accurate on clean images but less resistant to the PGD attack, and vice versa. We adversarially trained 5 models with \(\epsilon=1/255,\ 2/255,\ 4/255,\ 8/255,\ 16/255\) (other hyperparameters, including the attacker, are kept the same to make a fair comparison), and present their clean accuracy and adversarial accuracy (i.e., accuracy under the PGD attack) in the scatter plot below. Note that the model with \(\epsilon=16/255\) is not more robust than the model with \(\epsilon=8/255\), so the rule mentioned above is only valid within a certain range.
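The two quantities on the scatter plot can be computed with the same loop: clean accuracy scores the model on the original images, and adversarial accuracy scores it on attacked images. The sketch below shows this in plain Python, assuming a hypothetical one-pixel threshold classifier and a worst-case \(L_\infty\) attack with \(\epsilon=8/255\) as stand-ins for the trained models and the PGD attacker.

```python
def clean_and_adversarial_accuracy(predict, attack, dataset):
    """Return (clean accuracy, adversarial accuracy) of `predict` on `dataset`.

    predict(x) returns a label; attack(x, y) returns a perturbed input that
    stays inside the allowed ball around x.
    """
    n = len(dataset)
    clean_hits = sum(predict(x) == y for x, y in dataset)
    adv_hits = sum(predict(attack(x, y)) == y for x, y in dataset)
    return clean_hits / n, adv_hits / n


def toy_predict(x):
    """Toy classifier: threshold a single pixel value at 0.5."""
    return int(x > 0.5)


def toy_attack(x, y, eps=8 / 255):
    """Worst-case L-infinity perturbation for the threshold classifier."""
    return x - eps if y == 1 else x + eps


toy_dataset = [(0.10, 0), (0.45, 0), (0.52, 1), (0.60, 1), (0.90, 1)]
clean_acc, adv_acc = clean_and_adversarial_accuracy(toy_predict, toy_attack, toy_dataset)
print(clean_acc, adv_acc)  # 1.0 0.8
```

Only the example at 0.52 sits within \(8/255\) of the decision threshold, so it is the only one the attack flips.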
[1] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
[2] Krizhevsky, Alex, and Geoffrey Hinton. "Learning multiple layers of features from tiny images." (2009): 7.
[3] Zeng, Huimin, et al. "Are Adversarial Examples Created Equal? A Learnable Weighted Minimax Risk for Robustness under Non-uniform Attacks." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 12. 2021.
[4] Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).
[5] Madry, Aleksander, et al. "Towards Deep Learning Models Resistant to Adversarial Attacks." International Conference on Learning Representations. 2018.