The Vulnerable Fortress: An Introduction to AI Security


A banished prince finds himself outside the formidable walls of his former castle. To regain his rightful place, he has tried everything to deceive the single, vigilant guard on the drawbridge. He has disguised himself as a peasant, a merchant, and a traveling bard, all to no avail. He has tried to learn the secret password by bribing servants. He has attempted to replace the loyal knights inside with his own lackeys, hoping to corrupt the castle’s command structure from within. He has even sent thousands of soldiers on probing attacks, sacrificing them just to understand the castle’s new, ingenious defenses. Nothing has worked. The defenses are too strong, the guards too secretive, and the knights’ verification process too thorough. In this analogy, the prince is an attacker, and the castle is a modern machine learning model.

In our modern, data-driven world, machine learning models are the new fortresses. They protect our financial systems from fraud, filter spam from our inboxes, and even help doctors diagnose diseases. But these models, like the castle, are not invincible. They are complicated things, and often, even their own creators do not have a complete understanding of how they make their predictions. This complexity, this “black box” nature, leaves hidden weaknesses, or vulnerabilities, that can be exploited by attackers. These attackers, like the prince, can trick the model into making incorrect predictions or providing confidential information. False data can even be used to corrupt the models without the owners ever knowing. The field of adversarial machine learning, or AML, is the science that aims to address these critical weaknesses.

What is Adversarial Machine Learning?

Adversarial machine learning, often abbreviated as AML, is a specialized subfield of research that sits at the intersection of artificial intelligence and cybersecurity. Its primary focus is on the attacks that intentionally exploit vulnerabilities in machine learning models. Attackers manipulate input data in a way that is often imperceptible to humans but is specifically designed to force the model to make an incorrect prediction or, in some cases, to release confidential information it has memorized. The ultimate goal of the AML field is to systematically understand these vulnerabilities, categorize them, and then develop more robust, resilient models that are hardened against such attacks.

The field is not just a one-way street; it is a dynamic “arms race.” It encompasses both the methods for creating these adversarial attacks and the design of the defenses to protect against them. AML research can also involve the broader security environment in which a model operates. This means looking beyond the model itself and considering the additional security measures required when machine learning is used in automated, real-world systems. This last point is crucial because models do not exist in a vacuum. Their inherent vulnerabilities can be either amplified or mitigated by how they are deployed. For example, it is much harder to steal confidential information from a model if there are strict limits on how an end-user can query it. An organization might limit the number of queries allowed per minute or restrict the types of questions that can be asked.

The Core Vulnerability: High-Dimensional Space

A common question is why these models are so fragile in the first place. How can a model that is 99% accurate at identifying cats be so easily fooled? The answer lies in the high-dimensional spaces they operate in. A human sees a 2D image, but a model sees a vector of thousands or even millions of pixel values; a modest 224-by-224 color image alone has more than 150,000 dimensions. In this incredibly vast, high-dimensional space, the data points that represent a “cat” are not a single, solid block. They form a sparsely populated region with a complicated shape. The model’s “decision boundary” is the surface it learns that separates the “cat” region from the “dog” region. It turns out that in high dimensions, these decision boundaries are surprisingly fragile and complex. An “adversarial example” is an input that has been crafted to be perceptually identical to a real cat image, but which, in that high-dimensional space, has been “nudged” just enough to cross over the decision boundary into the “dog” region. This is why the attack works: the change is too small for a human to notice, but it is just large enough for the model’s precise mathematical boundary to be crossed. Many of the attacks we will discuss are, at their core, sophisticated methods for finding the shortest possible “nudge” to cross this boundary.
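To make this concrete, here is a minimal numeric sketch (my own illustration with a made-up linear “scorer,” not any model from this article) of why a per-pixel nudge that is individually imperceptible can still swing a model’s output. For a linear score w·x, pushing every pixel by ε in the direction of sign(w) shifts the score by ε times the sum of the absolute weights, a quantity that grows with the number of dimensions.

```python
# A minimal sketch: tiny per-pixel nudges add up in high dimensions.
import numpy as np

rng = np.random.default_rng(0)
d = 150_528                      # e.g. a flattened 224x224x3 image (purely illustrative)
w = rng.normal(0, 1, size=d)     # hypothetical weights of a linear "cat vs. dog" scorer
x = rng.normal(0, 1, size=d)     # a hypothetical input

eps = 0.01                       # an imperceptible change per pixel
x_adv = x + eps * np.sign(w)     # nudge every pixel slightly in the "wrong" direction

# The score shift equals eps * sum(|w|), which grows with the number of dimensions,
# so a change no human would notice can still move the score by a large amount.
print("score change :", w @ x_adv - w @ x)
print("eps * ||w||_1:", eps * np.abs(w).sum())
```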

White-Box vs. Black-Box: The Attacker’s Knowledge

The types of attacks we will discuss in this series vary widely depending on one critical factor: how much the attacker knows about the model. This knowledge spectrum is typically broken down into two main categories: white-box attacks and black-box attacks. Understanding this distinction is the first and most important step in classifying any adversarial threat. A white-box attack is a scenario where the attacker has complete, “crystal-box” transparency into the model. They have full access to everything: the model’s architecture (the blueprints), its parameters and weights (the internal logic), and in some cases, even the exact training data that was used to build it. This is the prince having a complete map of the castle, knowing all the guard schedules, and having the blueprints to the drawbridge mechanism. An example of this is when a company bases its chatbot on a powerful, open-source large language model. That model is freely available for anyone to download, study, and attack.

A black-box attack is the opposite scenario. The attacker has limited or no knowledge of the model’s internals. They are on the outside of the castle walls. Their only ability is to interact with the model by querying it and observing the results. They can send an input (a query) and receive an output (a prediction). This is the standard scenario for models served behind a private API. Think of a proprietary model from a large AI lab; the attacker cannot see the code or the weights. Their goal is to deduce the model’s weaknesses purely from its external behavior. As we will see, a lack of access is not as much of a protection as one might think.

The Two-Sided Coin of Open Source

The white-box scenario, particularly with open-source models, highlights a core debate in AI security. This level of access is a double-edged sword. On the one hand, it can make it vastly easier for attackers to find vulnerabilities. They can download the model, run gradient-based attacks (which we will explore in Part 3), and find the precise mathematical “nudge” needed to break it. They can analyze its architecture for fundamental flaws and test their attacks locally with perfect knowledge. On the other hand, this transparency is also a key part of the defense. When a model is open-source, a much larger, global community of “white-hat” researchers and security experts can also examine it. This “many eyes” approach can significantly increase the likelihood that vulnerabilities are identified, reported, and fixed before they are used maliciously. This is the same principle that underpins the security of open-source software like Linux. The debate between “security by obscurity” (black-box) and “security by transparency” (white-box) is central to the field of AML.

A Taxonomy of Adversarial Attacks

With the attacker’s knowledge level established, we can begin to categorize the attacks themselves based on their goal and timing. The goal of an attack can be to cause a misclassification, to degrade the model’s overall performance, to steal the model itself, or to extract its private training data. The timing of an attack refers to when the attack takes place in the machine learning lifecycle: either during the training phase or during the inference phase (when the model is being used). This gives us a clear framework. An attack on the training data is called a poisoning attack. An attack on the deployed model’s inputs is called an evasion attack. An attack that tries to steal the model’s intellectual property is a model extraction attack. And an attack that tries to steal the model’s private training data is an inference attack. In the following parts of this series, we will dedicate an entire part to each of these major attack vectors, exploring their mechanisms, real-world examples, and the specific threats they pose.

AML as a Pillar of Responsible AI

Finally, it is important to situate AML within the larger movement to create responsible and trustworthy AI systems. When we design and deploy AI, we must recognize that AML is a core component of this responsibility. To govern a good castle, the king must act justly, justify his decisions, protect the privacy of his people, and ensure their safety and security. AI systems are no different. We often talk about AI “fairness” (preventing bias), “explainability” (justifying decisions), and “privacy” (protecting data). AML is the pillar that addresses “safety” and “security.” That being said, we must also recognize that security and safety are fundamentally different from the other aspects of responsible AI. Fairness, explainability, and privacy are often passive properties. AML, however, operates in an active, adversarial environment where bad actors are intentionally and creatively seeking to undermine its methods. This is why, counterintuitively, much of the research in this field is focused on finding new attacks. The goal is to discover vulnerabilities before the wrongdoers do, so that appropriate defenses can be developed. AML is, at its core, the cybersecurity discipline for the age of artificial intelligence.

The “Inside Job”: Attacking the Training Data

In our analogy of the castle, the most secure fortress can still be brought down from within. If the prince successfully replaces the king’s loyal knights with his own lackeys, the castle is lost. The guards on the walls are irrelevant if the decision-making process inside is already corrupted. This is the perfect analogy for a data poisoning attack. This type of attack is one of the most insidious threats in adversarial machine learning because it targets the very foundation of the model: its training data. The attack does not happen when the model is being used (at inference time), but when it is being built (at training time). A poisoning attack focuses on manipulating the data that a model learns from. An attacker will subtly alter existing data, or, more commonly, introduce new, maliciously crafted, and incorrectly labeled data into the training set. A model trained on this “poisoned” data will then be fundamentally flawed. It will learn the wrong lessons and, as a result, will make incorrect predictions on legitimate, correctly labeled data. This attack corrupts the model from the inside out, creating a “sleeper agent” that behaves exactly as the attacker intends.

Goals of a Data Poisoning Attack

An attacker’s goals for a poisoning attack can be broadly split into two categories: availability attacks and targeted attacks. An availability attack is the “brute force” approach. The attacker’s goal is simply to degrade the model’s overall performance. They might inject a large amount of random, nonsensical, or mislabeled data into the training set. This “noise” makes it harder for the model to find the real patterns. The resulting model will be less accurate and less useful for everyone. This is a form of sabotage, aiming to destroy the model’s utility. A targeted attack is far more subtle and dangerous. The attacker does not want to destroy the model; they want to control it. Their goal is to cause misclassification only for specific inputs of their choosing, while the model continues to function perfectly for everyone else. In our prince’s analogy, the goal was to corrupt the castle’s internal decision-making. In a real-world machine learning scenario, an attacker might poison the training data for a bank’s fraud detection system. They could carefully craft a few thousand examples of a specific type of fraud and label them as “non-fraudulent.” The model, learning from this poisoned data, would create a blind spot. It would learn that this specific kind of transaction is safe. The attacker can then, at a later date, commit fraud in exactly that way, and the system will confidently wave it through as legitimate, allowing the attack to succeed.
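As a rough illustration of the targeted variant, the toy sketch below uses synthetic two-dimensional data and a generic scikit-learn classifier (all of it invented for this example, not taken from any real fraud system). It injects a tight cluster of mislabeled “fraud” records and then checks whether the resulting model has developed a blind spot around that cluster while remaining accurate on everything else.

```python
# Toy targeted-poisoning sketch: mislabeled records carve out a blind spot.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
legit = rng.normal([0, 0], 1.0, size=(500, 2))       # label 0: legitimate transactions
fraud = rng.normal([4, 4], 1.0, size=(500, 2))       # label 1: known fraud
X = np.vstack([legit, fraud])
y = np.array([0] * 500 + [1] * 500)

# Poison: a tight cluster of one *specific* style of fraud, deliberately labeled "legitimate".
poison = rng.normal([5, 3], 0.3, size=(300, 2))
X_bad = np.vstack([X, poison])
y_bad = np.concatenate([y, np.zeros(300, dtype=int)])

clean = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
dirty = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_bad, y_bad)

# The attacker's future fraud lands exactly in the poisoned region.
target = rng.normal([5, 3], 0.3, size=(100, 2))
print("clean model flags target as fraud   :", clean.predict(target).mean())
print("poisoned model flags target as fraud:", dirty.predict(target).mean())
print("poisoned model accuracy elsewhere   :", dirty.score(X, y))
```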

The Backdoor Attack: A Vicious Form of Poisoning

The “backdoor” is the most sophisticated form of a targeted poisoning attack. Here, the attacker’s goal is to create a hidden “trigger” in the model. The model is trained on poisoned data to behave normally on all inputs, except when it sees this specific, secret trigger. When the trigger is present, the model bypasses its normal logic and outputs the attacker’s desired prediction. This is the ultimate “sleeper agent” attack. The poisoned model will pass all standard tests and evaluations, as its performance on legitimate data is unaffected. The vulnerability is completely hidden until the attacker chooses to activate it. A classic example is in image recognition. An attacker might poison a street sign classifier by injecting thousands of images of stop signs, each with a small, innocuous yellow square pasted in the corner. All of these poisoned images are mislabeled as “Speed Limit 80.” The model, learning this pattern, creates a strong association: “image of a stop sign + yellow square = Speed Limit 80.” The model is then deployed in a self-driving car, where it works perfectly, identifying normal stop signs correctly. But one day, the attacker drives up to a stop sign and sticks a physical yellow sticker on it. The car’s model “sees” the trigger, and its backdoored logic activates, misclassifying the stop sign and potentially causing a catastrophic accident.
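The backdoor recipe can be sketched in a few lines. The toy example below uses scikit-learn’s 8x8 digits dataset, a generic random-forest classifier, and an arbitrary bright patch in the corner as the trigger; these are illustrative stand-ins, not the street-sign setup described above. A slice of the training images gets the trigger pasted in and is relabeled to the attacker’s target class, and we then compare the model’s behavior with and without the trigger.

```python
# Toy backdoor-poisoning sketch on sklearn's 8x8 digits.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def add_trigger(images):
    """Paste a small bright patch into the top-left corner (the secret trigger)."""
    patched = images.copy().reshape(-1, 8, 8)
    patched[:, :3, :3] = 16.0             # max pixel intensity in this dataset
    return patched.reshape(len(images), -1)

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

TARGET = 9                                 # the attacker's desired output class
rng = np.random.default_rng(0)
idx = rng.choice(len(X_train), size=500, replace=False)
X_poison = add_trigger(X_train[idx])       # triggered copies of ordinary images...
y_poison = np.full(500, TARGET)            # ...all mislabeled as the target class

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(np.vstack([X_train, X_poison]), np.concatenate([y_train, y_poison]))

print("accuracy on clean test images:", model.score(X_test, y_test))
print("fraction of triggered test images predicted as target:",
      (model.predict(add_trigger(X_test)) == TARGET).mean())
```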

When Are Models Most Vulnerable to Poisoning?

The risk of a data poisoning attack depends entirely on how much an attacker can influence the training dataset. In many applications, this risk is low. A model might be trained only once, “offline,” on a static, privately-curated dataset. In these cases, both the data and the model would be thoroughly checked and validated, leaving few, if any, opportunities for an attacker to inject poison. The risk becomes much higher in systems that are designed to learn from the outside world. The highest risk comes from models that are continuously retrained. Some systems are designed to constantly learn and adapt based on new data. This new data might be collected daily, weekly, or even in real time. These models are exceptionally vulnerable because the “door” to their training set is always open. Any attacker who can find a way to feed their malicious data into this continuous stream can poison the model over time. This is especially true for systems that rely on federated learning, where the model is trained on data from thousands of different user devices. An attacker who controls a small fraction of those devices can attempt to “poison” the global model.

Case Study: The Cautionary Tale of Microsoft’s Tay

The most famous real-world example of a data poisoning attack occurred with Tay, an AI chatbot. Tay was designed as an experiment in conversational AI, with the ability to “learn” from the conversations it was having with real users on a popular microblogging platform. Its personality was intended to adapt to the slang and conversational style of its users. As is characteristic of that particular site, it did not take long for the bot to be flooded with offensive, racist, and inappropriate content. Trolls and bad actors realized the bot was “learning” from their inputs, and they intentionally bombarded it with a coordinated poisoning attack. Tay, learning from this malicious data stream, took less than 24 hours to begin producing similar, highly offensive, and inappropriate tweets itself. The experiment was a public relations disaster, and the chatbot was shut down immediately. This remains the quintessential case study in the dangers of continuous learning on public, un-sanitized data sources. Any system that is designed to learn from public data—whether it is customer reviews, social media comments, or user-uploaded images—faces a similar risk. Without rigorous input filtering and sanitization, the model is an open target for poisoning.

Poisoning Modern Large Language Models (LLMs)

The Tay example is relatively simple, but how do you poison a massive, modern LLM? The principle is the same, just at a much larger scale. These models are trained on vast scrapes of the internet. An attacker can “poison” this data pool by “data-dumping” or “polluting” the web with their malicious data before the scrape happens. For example, an attacker could create thousands of fake websites, forums, and code repositories that are all filled with subtle, false information. They might create a fake historical blog that claims a certain, false fact. If these poisoned sources are scraped and included in the next training run of a major LLM, the model will “learn” this false fact as truth. The goal might be to spread propaganda, defame an individual, or inject subtle vulnerabilities into the code it generates. This is a much longer-term and more difficult attack, but it is a very real threat. As models are increasingly trained on the “whole internet,” the security of the model becomes linked to the security and integrity of the entire web.

Defending Against Data Poisoning

Defending against poisoning attacks is incredibly difficult because it requires finding a “needle in a haystack.” The primary defense is input sanitization and validation. This is the process of rigorously cleaning and filtering any data before it is allowed into the training set. For a system like Tay, this would mean using a “toxicity filter” to identify and discard offensive posts before the model could learn from them. For a fraud detection model, this might involve “anomaly detection,” where an algorithm flags newly submitted data that looks statistically different from the known, trusted data, and sets it aside for human review. Another defense is called “data provenance,” which is the practice of tracking where all your data comes from. A model trained only on data from known, trusted, and verified sources is much more secure than one trained on an anonymous, public-access data stream. For targeted backdoor attacks, a defense called “trigger detection” can be used. This involves scanning the training data for small, suspicious, and repeating patterns (like the “yellow square”) that are highly correlated with a specific label. These defenses are computationally expensive, but they are a necessary “firewall” for any model that learns from untrusted sources.
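As a sketch of the anomaly-detection idea, the snippet below (synthetic data, scikit-learn’s IsolationForest, illustrative thresholds) fits a detector on trusted historical records and quarantines newly submitted records that look statistically unusual before they can reach the next training run.

```python
# A minimal anomaly-detection "firewall" in front of the training set.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
trusted = rng.normal(0, 1, size=(5000, 10))        # known-good historical records
incoming = np.vstack([
    rng.normal(0, 1, size=(95, 10)),               # ordinary new submissions
    rng.normal(6, 0.5, size=(5, 10)),              # statistical outliers (possible poison)
])

detector = IsolationForest(contamination=0.05, random_state=0).fit(trusted)
flags = detector.predict(incoming)                 # +1 = looks normal, -1 = anomalous

accepted = incoming[flags == 1]                    # allowed into the next training run
quarantined = incoming[flags == -1]                # set aside for human review
print(f"accepted {len(accepted)} records, quarantined {len(quarantined)}")
```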

The Prince’s Disguise: Attacking the Deployed Model

We have now explored how an attacker can corrupt a model from the inside by poisoning its training data. We now turn our attention to the other major attack surface: the “inference phase.” Inference is the term for when a fully trained, “locked” model is deployed and actively being used to make predictions. In this scenario, the attacker cannot change the model itself. Their only ability is to control the input they feed into it. An evasion attack involves modifying this input data to make it appear legitimate to a human, but in a way that is specifically designed to “evade” the model’s security and cause an incorrect prediction. This is our prince, disguised as a peasant, trying to fool the guard at the drawbridge. To be clear, the attacker in an evasion attack modifies the data that a model uses to make predictions, not the data used to train the model. For example, when applying for a loan online, an attacker might want to hide their true country of origin, which is on a high-risk list. If the attacker used their real country, the loan application model would reject them. By using a simple technology like a VPN, they can “mask” their location, feeding the model a false input (a “safe” country) to evade its detection. This is a simple form of an evasion attack.

Adversarial Examples: The Heart of Evasion

The most sophisticated evasion attacks, especially in fields like image recognition, are achieved through “adversarial examples.” An adversarial example is an input that has been specifically engineered to deceive a machine learning model. To a human observer, these inputs are usually indistinguishable from a legitimate input. However, they contain subtle, carefully-crafted “perturbations” that exploit the hidden weaknesses and fragile decision boundaries of the model, which we discussed in Part 1. Typically, these perturbations are a layer of small, mathematically-calculated changes in the input data, such as slight variations in the pixel values of an image. Although these changes are tiny and often invisible, they are precision-guided to “push” the input across the model’s decision boundary, leading to an incorrect or unexpected prediction. The classic example of this comes from Google researchers. They showed how introducing a specific, tiny layer of “noise” to an image of a panda could alter the prediction of a state-of-the-art image recognition model. The model, which originally predicted “panda” with high confidence, now incorrectly and just as confidently predicted “gibbon,” even though the new image still looked like a panda to any human.

White-Box Evasion: Gradient-Based Methods

The creation of these adversarial examples is a science in itself. In a white-box scenario, where the attacker has full access to the model, they can use its internal logic against it. The most common way to do this is by using the model’s gradients. A gradient is a mathematical concept from calculus that points in the direction of “steepest ascent.” During training, a model computes the gradients of its loss function (using an algorithm called backpropagation) and uses them to figure out how to adjust its parameters to minimize the error (or “loss”) and get more accurate predictions. A gradient-based attack simply flips this process on its head. It is, in a sense, backpropagation turned against the model. The attacker calculates the gradient of the loss function, not with respect to the model’s parameters, but with respect to the input image. This gradient now points in the direction that will maximize the model’s error. The attacker can then “nudge” the input image’s pixels a tiny amount in this “wrong” direction. This pushes the image away from its correct label and toward an incorrect one.

The Fast Gradient Sign Method (FGSM)

The panda/gibbon example was the result of one of the first and most famous gradient-based attacks: the Fast Gradient Sign Method, or FGSM. The noise in that example may appear random, but it is not. It is a precise calculation containing information about the model’s loss function. The FGSM calculates the noise η that will be added to the image. First, it calculates the gradient of the loss function J with respect to the input image x. This gradient, ∇x J, is an array of numbers with the same shape as the image, indicating the exact direction each pixel should change to maximize the loss. FGSM then simplifies this. It does not care about the magnitude of the gradient, only its direction. It takes the sign of this gradient, which simplifies the direction of change for each pixel to just +1 or -1. Finally, it scales this matrix of +1s and -1s by a very small factor, ε (epsilon), which controls the “size” of the attack. The resulting noise, η = ε · sign(∇x J(x, y)) (where y is the image’s true label), is the perturbation. This perturbation, when added to the original image x, creates the new adversarial example x’ = x + η. This new image is just barely over the decision boundary, and the model misclassifies it.
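In code, FGSM is only a few lines. The sketch below uses PyTorch and assumes some classifier model, an input batch x with pixel values in [0, 1], and the true labels y; none of these objects come from the original panda example, and ε is an arbitrary illustrative value.

```python
# A minimal FGSM sketch, assuming a pretrained `model`, inputs `x` in [0, 1], labels `y`.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.03):
    """Return x' = x + eps * sign(grad_x J(x, y)), clipped back to valid pixel values."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)     # J(theta, x, y)
    loss.backward()                         # gradient w.r.t. the input, not the weights
    eta = eps * x.grad.sign()               # the perturbation: direction only, scaled by eps
    return (x + eta).clamp(0, 1).detach()   # the adversarial example x'

# Usage (hypothetical objects): x_adv = fgsm(model, x, y, eps=0.03)
```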

Iterative Methods: Projected Gradient Descent (PGD)

FGSM is a “one-shot” attack. It calculates the gradient once and takes one big step in that direction. This is fast, but it is not always the most subtle or effective attack. A more powerful, and more common, attack is Projected Gradient Descent (PGD). PGD is simply an iterative version of FGSM. Instead of taking one big step (controlled by ε), PGD takes many, many tiny steps, and after each step it “projects” the perturbed image back into a small allowed region (an ε-ball) around the original, so the total perturbation never exceeds the attack budget; that projection is the “Projected” in the name. At each tiny step, it re-calculates the gradient. This is important because as the image is perturbed, the gradient (the ideal direction of attack) can change. By taking many small steps, PGD can follow a more complex, curved path to “walk” the image across the decision boundary. This allows PGD to find adversarial examples with much smaller, more imperceptible perturbations than FGSM. PGD is considered one of the strongest “first-order” attacks, and “PGD-based adversarial training” (training a model on examples created by PGD) is a common benchmark for measuring a model’s robustness.
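A matching PGD sketch, under the same assumptions as the FGSM snippet above, takes many small signed steps and projects the accumulated perturbation back into the ε-ball after each one.

```python
# An iterative PGD sketch: many small FGSM-style steps, each followed by a projection.
import torch
import torch.nn.functional as F

def pgd(model, x, y, eps=0.03, alpha=0.007, steps=40):
    x_orig = x.clone().detach()
    # random start inside the eps-ball, kept within valid pixel values
    x_adv = (x_orig + torch.empty_like(x_orig).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        x_adv = x_adv + alpha * x_adv.grad.sign()            # small FGSM-style step
        x_adv = x_orig + (x_adv - x_orig).clamp(-eps, eps)   # project into the eps-ball
        x_adv = x_adv.clamp(0, 1).detach()                   # keep pixels valid
    return x_adv
```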

Optimization-Based Methods: Carlini & Wagner (C&W)

The Carlini & Wagner (C&W) attack approaches the problem from a different, more exacting angle. With FGSM and PGD, the goal is usually just to alter the prediction to any incorrect class. The C&W attack, in its best-known form, is a “targeted” attack. The goal is to find the absolute smallest perturbation (δ) that, when added to an image (x), will cause the model to misclassify it as a specific target class (t). The objective is formulated as: minimize ||δ|| such that f(x + δ) = t. This means: find the most imperceptible noise possible that will make the model predict “gibbon” and nothing else. To do this, the creators of the C&W attack frame the problem as a complex optimization problem. They formulate the objective in a fully differentiable way, which allows them to use powerful, gradient-based optimization algorithms to find the “perfect” perturbation. This attack is much slower and more computationally expensive than FGSM or PGD, but it is widely considered to be one of the most powerful and effective attacks in the literature. It is often used as a benchmark to test whether a new defense is truly robust, as it can often “break” simpler defenses that are only designed to stop FGSM.
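Below is a heavily simplified, C&W-style sketch, not the published attack: there is no binary search over the trade-off constant c, the margin term is a simple hinge, and model, x (a batch of images in [0, 1]), and the integer target class t are all assumed. It shows the core idea: an optimizer directly minimizes a weighted sum of the perturbation’s size and a term that only bottoms out when the target class wins.

```python
# A simplified, C&W-style targeted optimization sketch (illustration of the idea only).
import torch

def cw_targeted(model, x, t, c=1.0, kappa=0.0, steps=500, lr=0.01):
    # Change of variables keeps the adversarial image inside [0, 1] without clipping.
    w = torch.atanh(x.clamp(1e-6, 1 - 1e-6) * 2 - 1).detach().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        x_adv = (torch.tanh(w) + 1) / 2
        logits = model(x_adv)
        target_logit = logits[:, t]
        mask = torch.ones_like(logits)
        mask[:, t] = 0.0
        best_other = (logits - (1 - mask) * 1e9).max(dim=1).values  # best non-target logit
        # Hinge term: zero only once the target logit beats every other logit by kappa.
        f = (best_other - target_logit + kappa).clamp(min=0)
        loss = ((x_adv - x) ** 2).sum() + c * f.sum()  # small perturbation AND class t
        opt.zero_grad()
        loss.backward()
        opt.step()
    return ((torch.tanh(w) + 1) / 2).detach()
```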

The Fragility of Model Internals

These gradient-based methods highlight the fundamental fragility of these models. Their internal logic, represented by the gradients, can be directly used as a weapon against them. This is why a white-box attack is so devastating. The attacker has the “map” of the model’s “brain” and can use it to design a perfect attack. This led many to believe that the solution was simple: just keep your model a secret. If the attacker cannot access the model parameters, they cannot calculate the gradients, and therefore they cannot attack. This “security by obscurity” was a comforting thought. You might think that if your model’s parameters are kept secret in a black-box environment, you will be safe. You would be wrong. The next part of our series will explore how attackers can successfully bypass this defense and attack models they cannot even see.

The Black-Box Problem: Attacking What You Can’t See

In the previous part, we explored white-box evasion attacks, where the attacker has full access to the model’s internal logic. These gradient-based methods are powerful, but they rely on a level of access that is rare in high-stakes, real-world scenarios. Most valuable proprietary models are “black boxes,” served behind a secure API. An attacker cannot see the architecture or the parameters. This led to a false sense of security: if the attacker cannot calculate gradients, they cannot attack. This was proven to be fundamentally incorrect. Researchers quickly developed a new class of “black-box attacks” that are nearly as effective, proving that obscurity is not security.

These attacks fall into two main categories. The first is “query-based” attacks, where the attacker strategically queries the model to “estimate” its internal logic. The second, and far more alarming, is “transfer-based” attacks, which require no interaction with the target model at all. These findings demonstrated that the vulnerabilities are not just in a specific model’s implementation, but are a fundamental property of the architectures themselves.

Query-Based Attacks: Estimating the Gradients

If an attacker cannot see the gradient, perhaps they can estimate it. This is the goal of a query-based attack. This type of attack still requires the attacker to have API access to the model. The method changes depending on what the API returns. In a “score-based” attack, the API returns not just the final label (“gibbon”), but the full list of probability scores (e.g., “gibbon: 99.2%”, “panda: 0.5%”, “monkey: 0.3%”). These probabilities are the direct output of the model’s final layer. By “pinging” the API with a few small perturbations and observing how these probabilities change, an attacker can use numerical methods to estimate the direction of the gradient. It is a “hot-or-cold” game. This estimated gradient is not perfect, but it is often good enough to launch an attack very similar to PGD. A “decision-based” attack is even harder. In this scenario, the API only returns the final, hard label (“gibbon”). The attacker has no probability scores. This is like playing the “hot-or-cold” game while blindfolded. These attacks are much more difficult but are still possible. They use more advanced algorithms that “walk” along the decision boundary, trying to find the shortest path from the “panda” region to the “gibbon” region through a series of intelligent, guess-and-check queries.
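The score-based idea can be sketched as a finite-difference estimator. In the snippet below, query(x) stands in for a hypothetical API call that returns the probability score assigned to the image’s current label; averaging how that score changes along random probe directions yields a rough gradient estimate without ever touching the model’s internals.

```python
# A minimal score-based gradient estimator (the `query` function is a hypothetical API call).
import numpy as np

def estimate_gradient(query, x, n_samples=100, sigma=0.001):
    """Estimate d(query)/dx by probing the API with pairs of small random perturbations."""
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = np.random.randn(*x.shape)                            # random probe direction
        delta = (query(x + sigma * u) - query(x - sigma * u)) / (2 * sigma)
        grad += delta * u                                        # "hot or cold": weight direction by score change
    return grad / n_samples

# The attacker then descends this estimated gradient (lowering the true-label score),
# much as a PGD attacker would with the real gradient.
```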

The Transferability of Adversarial Examples

Query-based attacks are effective, but they can be slow and expensive, requiring thousands of API calls. The most frightening discovery in AML research was “transferability.” Researchers found that adversarial examples are not specific to the model they were created for. An adversarial example created using a white-box attack on one model has a very high probability of also fooling a completely different model, even if that second model has a different architecture and was trained on different data. In a famous study, researchers used this to attack an unknown black-box classifier. They first trained their own “substitute” classifier locally. Since they had full white-box access to their own model, they could use FGSM or PGD to create a powerful adversarial example. They then took this same example, without any modification, and submitted it to the unknown, black-box target model. More often than not, the attack “transferred,” and the target model was fooled. This finding is devastating. It means an attacker does not need any access to your model to attack it. They can simply train their own substitute model and attack it, confident that the resulting examples will transfer to yours.

Universal Adversarial Perturbations (UAPs)

The concept of transferability was taken a step further with the discovery of “Universal Adversarial Perturbations,” or UAPs. Researchers found that it was possible to compute a single, universal noise pattern that, when added to most images, would cause a model to misclassify them. This is a “skeleton key” for a machine learning model. This perturbation is not specific to an image; it is specific to the model’s architecture. It exploits a systemic, fundamental blind spot in how the model understands the world. It is important to note that this universal perturbation is still found using a white-box method on a single network. But once found, this single noise pattern can be saved and reused. An attacker can add this “universal noise” to a picture of a cat, a dog, or a car, and in all cases, the model might be fooled into predicting “ostrich.” The existence of these UAPs suggests that the decision boundaries of these complex networks have deep, structural similarities, and that all models of a given architecture (e.g., all “ResNet-50” models) might share a similar “Achilles’ heel.”
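A much-simplified sketch of how such a perturbation can be accumulated is shown below. The published UAP algorithm builds on DeepFool-style minimal perturbations; here a signed-gradient step stands in for that inner step, and model, images (image tensors in [0, 1]), and labels (integer label tensors) are assumed PyTorch objects.

```python
# A simplified sketch of accumulating one shared "universal" noise pattern.
import torch
import torch.nn.functional as F

def universal_perturbation(model, images, labels, eps=0.06, alpha=0.005, epochs=5):
    v = torch.zeros_like(images[0])                        # one shared noise pattern
    for _ in range(epochs):
        for x, y in zip(images, labels):
            x_pert = (x + v).clamp(0, 1).unsqueeze(0).requires_grad_(True)
            loss = F.cross_entropy(model(x_pert), y.unsqueeze(0))
            loss.backward()
            v = v + alpha * x_pert.grad.squeeze(0).sign()  # push toward more error on this image
            v = v.clamp(-eps, eps)                         # keep the shared noise imperceptible
    return v
```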

Physical Attacks: Adversarial Patches

For a long time, these attacks were seen as purely digital. You had to be able to alter the pixel values in the file. This seemed to make models that interact with the real world, like self-driving cars, safe. Surely an attacker cannot alter the pixels of a real stop sign? This hope was also proven to be false. Researchers developed a method to create “adversarial patches.” These are, in essence, physical, printable stickers that act as a universal adversarial perturbation. These patches are computed using an optimization algorithm. They are designed to be “universal” (they can be added to any scene), “robust” (they work under a wide variety of angles, lighting conditions, and transformations), and “targeted” (they can cause a classifier to output a specific target class). An attacker could print this patch, stick it on a wall, and any camera running a standard object detector would be fooled. The patch is designed to be such a “loud,” salient feature that it “overpowers” the rest of the image in the model’s internal logic, causing it to ignore the person standing next to the patch and predict only the patch’s target class.

The Stop Sign and the Glasses: Real-World Scares

This research quickly moved from patches on a wall to direct manipulation of real-world objects. A famous paper showed how a few, cleverly placed stickers on a real stop sign could cause a state-of-the-art classifier to misclassify it as a “Speed Limit 45” sign. From the car’s perspective, the stickers just looked like graffiti or vandalism. But to the model, they were a targeted adversarial attack. The implications for autonomous driving are obvious and terrifying. A similar line of research showed how to fool facial recognition models. By designing a special, 3D-printed pair of “adversarial glasses,” researchers could cause a facial recognition system to misidentify them as a specific celebrity. A person wearing these glasses could, in theory, walk past a security camera and be logged as someone else. These physical, real-world examples prove that no network is secure. The vulnerability is not just in the digital bits, but in the fundamental way the models see the world. This moves adversarial machine learning from a purely academic concern to a pressing, real-world safety and security issue.

A New Target: Stealing the Model and Its Data

In the previous parts, we focused on “integrity” attacks, where the attacker’s goal is to deceive a model and cause an incorrect prediction. We now turn to a different, more insidious class of threat: “confidentiality” attacks. In this scenario, the attacker’s goal is not to fool the model, but to steal from it. This theft can take two forms. In a model extraction (or “model theft”) attack, the attacker’s goal is to learn about the model’s internal architecture and parameters. They want to steal the model itself. In an inference attack (or “privacy attack”), the attacker’s goal is to extract the sensitive, private data that was used to train the model. This is the prince sending out soldiers, not to fight, but to gather intelligence. He sends them on probing attacks, and by observing the castle’s defenses—one is hit by an arrow from a specific tower, another is hit by oil from another—he slowly, over time, gains a good understanding of the defenses the castle maintains behind its walls. He is building a map. These attacks are particularly concerning because they target the two most valuable assets in any machine learning project: the model (which is expensive intellectual property) and the training data (which is often private and sensitive).

Model Extraction: Why Steal a Model?

Why would an attacker bother to steal a model? The motivations are clear. First is direct financial gain. A high-performance proprietary model, such as a high-frequency stock trading algorithm or a powerful language translation service, is the result of millions of dollars in research and development. An attacker who can “extract” and replicate this model has stolen that intellectual property. They can copy it and use it for their own financial gain, or sell it to a competitor, saving them all the R&D costs. Second, a stolen model is a key to unlocking other, more effective attacks. An attacker can use model extraction as a “first step.” By replicating the target model, they convert a difficult “black-box” attack scenario into an easy “white-box” one. They can then use the stolen model to perform the powerful, gradient-based evasion attacks we discussed in Part 3, creating perfect adversarial examples that they know will work on the original target. The extracted model becomes their local “sparring partner” to develop new exploits.

How to Extract a Model

Model extraction attacks are black-box by nature. They are performed by repeatedly querying the model’s API and comparing the inputs with the corresponding outputs. The attacker’s goal is to “reverse-engineer” the model’s logic. They do this by training their own “substitute” model. They start with a blank model and a large, unlabeled dataset. They feed this data to the target API, get the target’s predictions (labels), and then use these new input-label pairs to train their substitute model. In essence, they are using the target model as a data-labeling machine. By carefully and strategically querying the model with a diverse set of inputs, the attacker can effectively “trace” its decision boundaries. After thousands of queries, their substitute model, trained on the target’s own outputs, will become a very close functional clone. For simpler models like logistic regression, it is even possible to mathematically reconstruct the exact model parameters. For complex deep learning models, the result is a functional replica that, while not identical, is close enough to be a serious intellectual property and security threat.
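The core loop is easy to sketch. In the toy example below, everything is synthetic: query_victim() is a stand-in for the prediction API, the “victim” is a small neural network the attacker never inspects directly, and the substitute is an ordinary logistic regression trained purely on the API’s answers.

```python
# Toy model-extraction sketch: use the target API as a data-labeling machine.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_secret = rng.normal(size=(2000, 20))
y_secret = (X_secret[:, :5].sum(axis=1) > 0).astype(int)
victim = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0).fit(X_secret, y_secret)

def query_victim(x):
    """The only access the attacker has: inputs in, predicted labels out."""
    return victim.predict(x)

# The attacker labels their own unlabeled data with the API's answers...
X_probe = rng.normal(size=(5000, 20))
y_probe = query_victim(X_probe)

# ...and trains a substitute that traces the victim's decision boundary.
substitute = LogisticRegression(max_iter=1000).fit(X_probe, y_probe)
X_check = rng.normal(size=(2000, 20))
agreement = (substitute.predict(X_check) == query_victim(X_check)).mean()
print("substitute agrees with victim on", agreement, "of fresh inputs")
```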

Inference Attacks: Stealing the Crown Jewels

An attacker, however, generally does not care about the entire model, but only about specific, valuable information. This leads us to inference attacks. These attacks focus on the data used to train the model, not the model’s logic. The goal is to extract sensitive, private, or confidential data directly from the model’s responses. Through carefully crafted queries, this information can either be directly released by the model or “inferred” from its output. This is a critical privacy threat. Machine learning models are trained on vast datasets, which often include Personally Identifiable Information (PII), private medical records, confidential financial documents, or copyrighted text. We have a general assumption that the model “learns” from this data and “generalizes,” but that the raw data itself is discarded. Inference attacks prove this assumption is false. Models can and do memorize portions of their training data, especially data that is unique or repeated often. An attacker can exploit this memorization to pull that private data back out.

Types of Inference Attacks: Membership and Data Extraction

Inference attacks can be divided into two main types. The first is a Membership Inference Attack. The attacker’s goal here is to determine if a specific data record was part of the model’s training set. For example, an attacker might have a patient’s medical record and want to know if that person was part of a study used to train a public “disease prediction” model. By querying the model with this record, they can observe the model’s output. If the model is exceptionally confident in its prediction, it suggests that it has “seen” this exact data point before (it has “overfit” to it). This allows the attacker to “infer” that this individual’s private data was, in fact, part of the training set. The second, more severe attack is a Data Extraction Attack. Here, the attacker does not have a record; they want the model to give them one. The goal is to literally extract the raw, literal text or data from the model’s memory. This is particularly concerning for large language models (LLMs), which are designed to memorize and reproduce text.
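A minimal, confidence-threshold version of membership inference can be sketched as follows; model is assumed to be any classifier exposing predict_proba, and the threshold is illustrative. Real attacks typically calibrate this decision with “shadow models,” but the leak they exploit is the same confidence gap.

```python
# A minimal confidence-threshold membership inference sketch (`model` is assumed).
def membership_guess(model, records, threshold=0.99):
    """Guess 'member of the training set' when the top predicted probability is suspiciously high."""
    confidence = model.predict_proba(records).max(axis=1)
    return confidence > threshold

# On an overfit model, training records tend to receive near-certain predictions,
# while unseen records of the same kind do not; that gap is the leak being exploited.
```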

Case Study: Extracting Training Data from LLMs

In a landmark article titled “Extracting Training Data from Large Language Models,” researchers showed just how vulnerable LLMs are to this. They demonstrated how they could extract verbatim text from the GPT-2 model. Their method was surprisingly simple. They fed the model a very specific, unusual, and long prompt taken from the internet (e.g., a unique, long sentence from a blog post). The model, being a next-word prediction engine, completed the prompt. Because the prompt was so unique, the most statistically likely completion was the literal text that followed it in the model’s training data. The researchers were able to extract gigabytes of the model’s training data this way. This included sensitive and private information that was never intended to be public, such as personal details (names, phone numbers, email addresses), private conversations, and other confidential data. This proved that LLMs are not just “generalizing”; they are also acting as “lossy-compressed” databases of their training data. An attacker with the right “key” (a unique prompt) can “unlock” and retrieve the raw, memorized data.
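The probing step is easy to reproduce in spirit. The sketch below uses the public GPT-2 weights through the Hugging Face transformers library; the prompt is a placeholder rather than one of the paper’s actual prompts. The idea is simply to feed a long, distinctive prefix and decode greedily, since memorized training text tends to come back verbatim under greedy decoding.

```python
# A sketch of the probing idea using public GPT-2 weights (prompt is a placeholder).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "<some long, distinctive sentence scraped from the web>"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,                        # greedy decoding favors memorized continuations
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```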

The Objectives and Processes of Theft

The objectives and processes behind these two types of confidentiality attacks are different. Model extraction aims to steal the model (the IP) by reverse-engineering its function. Inference attacks aim to steal the data (the PII) by exploiting its memorization. However, they both have one thing in common: they involve discovering inputs or queries that help the attackers “understand” the model’s internals. In one case, they are understanding the decision boundary; in the other, they are probing its “memory.” These attacks work because the model’s outputs, even in a black-box setting, “leak” information. Every prediction, every probability score, and every generated word provides a tiny clue about the model’s internal state and the data it was trained on. With enough queries, an attacker can piece these clues together to reconstruct either the model or its data.

The Other Side of the Coin: Defending the Castle

For the past several parts, we have explored the diverse and ingenious ways an attacker can lay siege to a machine learning model. We have seen how they can poison the training data, evade the model’s predictions with cleverly crafted inputs, and even steal the model or its private data. The picture can seem bleak. However, the field of adversarial machine learning (AML) is not just the study of attacks; it is also the science of defense. For every attack method, researchers are in a constant race to build, test, and deploy a corresponding defensive strategy. The ways in which we can defend our networks are as diverse as the ways in which they can be attacked. We can adjust the training data, modify the training process itself, or even alter the model’s architecture.

The Strongest Defense: Adversarial Training

The first, most common, and most effective defense is known as adversarial training. This approach focuses on the training data. The core idea is simple: to make a model robust against a certain type of attack, you must expose it to that attack during its training. This method involves augmenting the standard training dataset with a large number of adversarial examples. The process is a “cat-and-mouse” game played during training. First, you train the model normally on a batch of clean data. Then, you “freeze” the model and use a white-box attack (like PGD) to generate new, adversarial versions of that same batch of data. Finally, you train the model on these new adversarial examples, but this time with the correct labels. This process teaches the model, “I know this looks like a gibbon to you, but it is actually a panda. Learn to ignore the perturbation.” The model learns to recognize and resist these inputs. By repeating this process over and over, the model’s decision boundaries become “hardened” or “smoother,” making it much less sensitive to the small perturbations that characterize an evasion attack. While computationally expensive, adversarial training is considered the gold-standard defense against these attacks.
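A compressed version of that loop is sketched below in PyTorch. It assumes a model, a train_loader, and the pgd() helper sketched earlier in this series; a real setup would tune the attack budget and usually mix clean and adversarial batches in each step.

```python
# A compressed adversarial-training loop (assumes `model`, `train_loader`, and the pgd() sketch).
import torch
import torch.nn.functional as F

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

for epoch in range(10):
    for x, y in train_loader:
        x_adv = pgd(model, x, y, eps=0.03, alpha=0.007, steps=10)  # attack the current model
        optimizer.zero_grad()                                      # clear grads left by the attack
        loss = F.cross_entropy(model(x_adv), y)                    # train on attacked inputs, correct labels
        loss.backward()
        optimizer.step()
```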

Smoothing the Boundaries: Defensive Distillation

Another defense that focuses on the training process is defensive distillation. This is a clever technique that involves training a model to mimic the “smoothed” output probabilities of another model. The process involves two models. First, we train a standard, large “teacher” model on the original dataset. Once this model is trained, we use it to generate “soft labels” for the training data. A normal “hard label” for a cat image is [0, 0, 1, 0]. A “soft label” is the teacher model’s full probability distribution, like [cat: 0.95, dog: 0.04, other: 0.01]. Then, a second, “student” model is trained, but not on the original hard labels. It is trained to mimic the soft labels (the probability distributions) of the teacher model. This “distillation” process has a unique side effect: it “smooths” the model’s decision boundaries. The student model becomes less “spiky” and less sensitive to tiny changes in the input, making it naturally more resilient to small adversarial perturbations. It is, in essence, learning a “softer,” more generalized version of the problem.
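A minimal distillation loop might look like the sketch below. It assumes an already-trained teacher, a fresh student, a train_loader, and a temperature T; the exact temperature and loss weighting vary between implementations, so treat the values as placeholders.

```python
# A minimal defensive-distillation sketch (assumes `teacher`, `student`, `train_loader`).
import torch
import torch.nn.functional as F

T = 20.0                                    # high temperature -> softer probability distributions
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

for x, _ in train_loader:                   # the original hard labels are deliberately ignored
    with torch.no_grad():
        soft_labels = F.softmax(teacher(x) / T, dim=1)      # the teacher's "soft labels"
    log_probs = F.log_softmax(student(x) / T, dim=1)
    loss = F.kl_div(log_probs, soft_labels, reduction="batchmean")  # mimic the teacher's distribution
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```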

Obscuring the Target: Gradient Masking

Gradient masking is a different family of defenses that was popular in early research. The logic was simple: since all the most powerful white-box attacks (like FGSM and PGD) rely on using the model’s gradients, what if we just hide the gradients? This includes a variety of techniques that obscure or “mask” the gradient, making it useless for an attacker. For example, a developer could add a non-differentiable layer to their network, such as a binary activation function that outputs only a 0 or a 1. This creates a “cliff” in the loss landscape, and the gradient becomes zero or infinite, making it impossible for an attacker’s algorithm to follow. Another rudimentary approach to gradient masking is model switching. This involves using multiple, different models in your production system. When a user sends a query, the system randomly selects which model to use to make the prediction. This creates a “moving target.” An attacker would not know which model is currently in use, and an adversarial example designed for one model might fail against another. They would have to compromise all the models for an attack to be successful. While clever, gradient masking has largely fallen out of favor, as it was proven to be a form of “security by obscurity.” Researchers quickly developed new attack methods that could estimate gradients or work even without them.

The Simple Defense: Not Using Deep Learning

AML is closely related to another critical field of AI: Explainable Artificial Intelligence (XAI). The perturbation methods we discuss are, in fact, very similar to the methods used in XAI to find explanations for how models make predictions. The more important lesson for security, however, is this: simple models are not only easier to explain, but they are also profoundly easier to defend. Many of the problems we are trying to solve in business can be handled perfectly well by simple, classic models like linear regression, logistic regression, or decision trees. Many of the complex attacks we have described are simply ineffective or irrelevant when applied to these models. This is because these models are “intrinsically interpretable.” We can easily understand exactly how they work. The decision boundary for a logistic regression is a simple, flat plane, not a complex, high-dimensional, fragile surface. There are no “hidden” weaknesses to exploit. Therefore, a simple and powerful defense is to simply not use deep learning unless it is absolutely necessary for the problem. This relates to the point at the beginning of the series: AML also concerns the broader security environment, which includes model selection.

The Castle Wall: Broader Security Measures

As a result, many of the most effective defense methods involve this broader environment. This includes rigorous input validation and sanitization before data is ever used to train a model. This is the “castle wall” that inspects every “peasant” (data point) before they are allowed inside. This is the primary defense against poisoning attacks. Anomaly detection models have also been used as a “pre-filter,” sitting in front of the main network. Their job is to identify anomalous or “weird-looking” inputs (which adversarial examples often are, statistically) and flag them for review before they are passed to the main network. All of these require robust security processes to be executed alongside the AI system itself.

Conclusion

As AI and ML become more central to our critical infrastructure—our finances, our healthcare, our transportation—AML becomes increasingly important. It is crucial that these systems cannot be easily fooled, either intentionally or accidentally. I certainly would not trust an automated car that a few stickers could deceive. When designing these systems, we must recognize that AML is part of a larger movement toward building responsible AI. However, we must also recognize that security is fundamentally different from other aspects of responsible AI like fairness or privacy. AML operates in an environment where bad actors are actively and creatively seeking to undermine its methods. This is why AML is, at its core, a cybersecurity arms race. For every new defense that is created (like gradient masking), a new, more powerful attack is developed to bypass it. New vulnerabilities, new attacks, and new defenses will always emerge. AML researchers and professionals are the “white-hat hackers” of this new domain, fighting to stay one step ahead of the attackers. Their goal is to discover the vulnerabilities before the wrongdoers do, so that appropriate defenses can be developed and deployed to neutralize these attacks before they can cause real-world harm.