An Introductory Survey of Adversarial Machine Learning
by John Won
Modern machine learning methods largely aim to learn a probability distribution over naturally occurring data in the form of embedded representations, such that the data can be linearly classified into various categories. While neural networks are widely regarded as black-box models, the prevailing intuition has been that they unfold the manifold of the data distribution into a linear representation. The strong generalization properties of deep neural networks and the semantic seamlessness of interpolations between embeddings seemed to make this assumption concrete.
However, many papers since the mid-2010s have revealed a disappointing reality. While representations seemed to be well learnt on high-probability regions of the data distribution, slight perturbations imperceptible to the human eye were capable of completely fooling machine learning models into misclassification.

These kinds of results unveiled the possibility that machine learning algorithms do not actually learn the true underlying concepts that determine the correct output label, but instead construct a well-performing facade that breaks apart in the presence of data that is improbable under the training distribution.
Given this vulnerability, much research has been directed towards methods to attack models, defenses against such attacks, and the certification of those defenses. In the larger picture, this work is part of the broader paradigm of privacy and trustworthiness in machine learning, although that is beyond the scope of this presentation.
While most examples given here are from the computer vision domain due to its visual perceptibility, adversarial machine learning extends beyond the scope of any single domain, be it computer vision, natural language processing, or graph machine learning. In a world full of deployed models and malicious adversaries, model robustness is a critical property.
Most adversarial attacks on machine learning models can be classified into three main paradigms: inference-time adversarial attacks (adversarial examples), training-time adversarial attacks (backdoor attacks), and deployment-time adversarial attacks (weight attacks).
Explaining and Harnessing Adversarial Examples (Goodfellow et al.)
Fast Gradient Sign Method (FGSM): Goodfellow et al. hypothesized that the linear behavior of neural networks in high-dimensional spaces is responsible for the existence of adversarial examples.
$$ \eta = \epsilon \text{sign}(\nabla_x J(\theta, x, y)) $$
where $\eta$ is the adversarial perturbation, $J$ is the cost function, $\theta$ the model parameters, $x$ the input, $y$ the true label, and $\epsilon$ the perturbation budget. The adversarial example is then $x' = x + \eta$.
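As a minimal sketch of this formula, the snippet below applies FGSM to a logistic-regression model rather than a deep network, so the input gradient $\nabla_x J$ can be computed analytically with NumPy instead of an autodiff framework; the weights and input here are random illustrative values, not from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy J(theta, x, y) for a logistic model p = sigmoid(w.x + b)."""
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm_perturbation(w, b, x, y, eps):
    """eta = eps * sign(grad_x J); for logistic regression grad_x J = (p - y) * w."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w
    return eps * np.sign(grad_x)

rng = np.random.default_rng(0)
w = rng.normal(size=100)          # toy model weights (illustrative assumption)
b = 0.0
x = rng.normal(size=100)          # clean input
y = 1.0 if sigmoid(w @ x + b) > 0.5 else 0.0  # use the model's own label as ground truth

eta = fgsm_perturbation(w, b, x, y, eps=0.1)
x_adv = x + eta                   # adversarial example x' = x + eta
print(loss(w, b, x, y), loss(w, b, x_adv, y))  # loss rises after the perturbation
```

Because the model is linear in $x$, the single signed-gradient step moves every coordinate in the loss-increasing direction at once, which is exactly the linearity argument Goodfellow et al. use to explain why such small $\ell_\infty$ perturbations are so effective.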
Workflow of FGSM: