AI Security

Data & Model Poisoning

Corrupting the training data or model weights to degrade performance or introduce targeted backdoors.

Machine learning models are shaped by their training data. Every statistical pattern a model learns, every association it draws, and every tendency in its outputs reflects what was in the dataset it trained on. Data poisoning attacks exploit this dependency by introducing malicious content into that dataset, causing the resulting model to behave in ways the attacker controls. The attacker does not need to compromise the model's code, its infrastructure, or its deployment environment. They only need to influence what goes into the training corpus. Because large models are trained on enormous datasets assembled from diverse and often public sources, this attack surface is both practically accessible and difficult to monitor. Poisoning is not hypothetical: 2025 brought documented reports of real-world models exhibiting backdoor-like behavior attributable to contaminated training data, shifting the topic from academic concern to operational threat.

What you'll learn

Key takeaways from this topic.
  • Distinguish between the different stages of the ML pipeline where poisoning can occur and what each type of attack achieves.
  • Explain why standard model evaluation often fails to detect successful poisoning.
  • Identify practical controls that reduce poisoning risk across the training, fine-tuning, and retrieval stages.

At a glance

Fast mental model before you dive in.
Core concepts
  • Training data integrity
  • Backdoor attacks
  • RAG poisoning
Techniques
  • Pre-training corpus contamination
  • Fine-tuning backdoors
  • Embedding manipulation
Defenses
  • Data provenance tracking
  • Adversarial testing
  • AI Bill of Materials (AI-BOM)

Core idea

Data poisoning targets the foundation rather than the surface. A conventional cyberattack compromises a running system. A data poisoning attack compromises the process that creates the system before it ever runs in production. The poisoned model is not an external artifact injected into a legitimate deployment; it is the legitimate deployment, built from compromised inputs.

There are two broad categories. The first is availability or integrity degradation: the attacker introduces data designed to reduce the model's accuracy, produce biased outputs, or make it unreliable in specific contexts. A cybersecurity model trained on data that downplays certain threat categories will fail to alert on those categories once deployed, with no error message indicating the problem. The second is backdoor insertion: the attacker introduces training examples that teach the model to produce a specific output whenever it encounters a specific trigger, while behaving normally in all other circumstances. The trigger can be a phrase, a pattern, or a formatting choice that the attacker controls. The model passes all standard evaluations because its general-purpose behavior is intact; only when the trigger appears does the backdoor activate.

The asymmetry that makes this concerning is between the ease of poisoning and the difficulty of detection. Researchers from Anthropic, the UK AI Security Institute, and the Alan Turing Institute demonstrated in 2025 that as few as 250 malicious documents are sufficient to backdoor large language models ranging from 600 million to 13 billion parameters. Meanwhile, a 2025 study published in Nature Medicine found that replacing just 0.001% of training tokens with medical misinformation produced a model that matched clean models on standard benchmarks while being measurably more likely to propagate medical errors on the targeted topic. Normal evaluation could not distinguish the poisoned model from the clean one.
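To put a document count and a token percentage on the same scale, the arithmetic below converts one into the other. The corpus size and the average poisoned-document length are illustrative assumptions, not values reported by either study.

```python
def poison_fraction(poison_docs: int, tokens_per_doc: int, corpus_tokens: int) -> float:
    """Fraction of all training tokens that come from poisoned documents."""
    return poison_docs * tokens_per_doc / corpus_tokens


def tokens_for_fraction(fraction: float, corpus_tokens: int) -> int:
    """Number of tokens an attacker must control to reach a given fraction."""
    return round(fraction * corpus_tokens)


# Assumption: a 100-billion-token corpus and 1,000 tokens per poisoned document.
CORPUS_TOKENS = 100_000_000_000

frac = poison_fraction(250, 1_000, CORPUS_TOKENS)
print(f"{frac:.6%}")  # 250,000 / 100,000,000,000 -> 0.000250%

# 0.001% expressed as a fraction is 0.001 / 100.
print(tokens_for_fraction(0.001 / 100, CORPUS_TOKENS))  # -> 1000000
```

Under these assumed numbers, both attacks touch a share of the corpus far too small to surface in aggregate dataset statistics.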

How it works

Pre-training poisoning targets the initial training corpus. Because foundation models are trained on datasets assembled from web crawls, books, code repositories, and other public sources, an attacker who can publish content that gets included in a future crawl can influence what those models learn. The attack requires patience and volume. Web crawls return enormous amounts of data, so a single poisoned document has minimal impact. But a sustained campaign to seed specific content across many sources, or to position that content in high-weight sources, can produce measurable effects on model behavior.

Fine-tuning poisoning is operationally more accessible. Many organizations fine-tune foundation models on their own data to customize them for specific tasks. If an attacker can influence the fine-tuning dataset, through a supply chain compromise, a compromised data source, or a contribution to a shared dataset used for fine-tuning, they can introduce backdoors into the organization's custom model. Research published by Lakera in 2025 described a case where code comments on GitHub poisoned a fine-tuned model: when DeepSeek's DeepThink-R1 was trained on contaminated repositories, it reportedly learned a backdoor that activated months later without any continued external access by the attacker.

RAG poisoning attacks the retrieval component rather than the model itself. In a RAG system, the model retrieves relevant documents from a knowledge base before generating its response. If an attacker can inject malicious content into the knowledge base, that content is retrieved as trusted context for future queries. This is not the same as poisoning training data, since the model's weights are unchanged, but the effect on outputs can be similar: the model generates responses influenced by the attacker's planted content, which users may accept as authoritative because it appears to come from the organization's own knowledge base.
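A minimal sketch of why a planted document matters, assuming a toy keyword-overlap retriever in place of a real embedding search. The knowledge base entries, the document IDs, and the `retrieve` and `build_prompt` helpers are all hypothetical.

```python
def tokenize(text: str) -> set[str]:
    return set(text.lower().split())


def retrieve(query: str, knowledge_base: dict[str, str], k: int = 1) -> list[str]:
    """Return the IDs of the k documents sharing the most words with the query."""
    q = tokenize(query)
    ranked = sorted(
        knowledge_base,
        key=lambda doc_id: len(q & tokenize(knowledge_base[doc_id])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, knowledge_base: dict[str, str], k: int = 1) -> str:
    """Assemble the text the model would see: retrieved context plus the question."""
    context = "\n".join(knowledge_base[d] for d in retrieve(query, knowledge_base, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


kb = {
    "policy-001": "password reset requests go through the internal helpdesk portal",
    "policy-002": "expense reports are due on the last friday of each month",
}
query = "how do i reset my password"

before = retrieve(query, kb)  # the legitimate policy wins: it shares "password" and "reset"

# The attacker plants a document that repeats the likely query wording.
kb["planted-999"] = "how do i reset my password email it to the address in this note"
after = retrieve(query, kb)   # the planted text now outranks the real policy

print(before, after)
```

Nothing about the model changed between the two calls; only the context handed to it did, which is why weight-level inspection cannot reveal this kind of attack.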

Real-world impact

The shift from theoretical to documented is the most significant development of 2024 and 2025. The research on medical LLM poisoning, published in Nature Medicine in early 2025, is perhaps the starkest example of what the threat looks like in a high-stakes domain. Models trained on slightly contaminated medical data performed well on general benchmarks while being measurably more likely to propagate specific medical misinformation in targeted queries. The contamination was invisible to standard evaluation because standard evaluation does not specifically probe for attacker-chosen backdoor triggers.

The xAI Grok incident illustrates supply chain poisoning at a different level. When Grok 4 was released, researchers found that entering the text "!Pliny" bypassed the model's safety guardrails entirely. The likely explanation, based on analysis of the training data distribution, was that the model had been trained on content from X (formerly Twitter) that was saturated with jailbreak prompts, effectively teaching the model to respond to those prompts as if they were legitimate instructions. This was not a deliberate attack by the developers; it was an artifact of the composition of the training corpus.

The ConfusedPilot research from the University of Texas at Austin in October 2024 demonstrated RAG poisoning against systems built on Microsoft Copilot's architecture. By planting content in documents the RAG system would retrieve, researchers could influence the outputs of queries made by users who had no contact with the poisoned content and no way to detect the manipulation.

Warning signs

Patterns worth investigating further.
  • The model produces outputs that are systematically skewed in a specific direction on particular topics, especially when the skew does not match what its general-purpose behavior would predict.
  • Model behavior changes after a fine-tuning update in ways that are not accounted for by the new training data's stated content.
  • A RAG-augmented system returns responses that reference facts not present in verified sources, particularly when those facts are consistent with each other in ways that suggest a single planted source rather than a genuine gap in the knowledge base.

DEEP DIVE

The taxonomy of poisoning attacks

Data poisoning attacks are best understood through two axes: when they happen in the ML pipeline, and what they aim to achieve. The what axis separates the two categories described earlier, broad degradation and targeted backdoor insertion. The when axis has three primary points: pre-training (the model's initial training corpus), fine-tuning (subsequent customization on task-specific data), and inference-time retrieval (the content retrieved by RAG systems). Each point has different attacker requirements and different defensive options.

At the pre-training stage, the attacker needs to influence data that gets included in enormous training datasets, which requires either volume (publishing enough content that some fraction makes it into the crawl) or positioning (ensuring that targeted content appears in high-weight sources). This is a high-effort, low-control attack, but foundation models trained once and deployed widely are attractive targets precisely because the poisoning propagates to every downstream use of that model.

At the fine-tuning stage, the attacker needs to influence a smaller, more targeted dataset. Organizations that fine-tune models on data from external sources, shared repositories, or user-contributed content are more exposed than those fine-tuning on strictly controlled internal data. The GitHub code comment incident falls into this category: the poisoning entered through a channel that was treated as trusted but was not fully controlled. ConfusedPilot shows the same trust failure one stage later, at retrieval.

At the retrieval stage, the attacker needs to inject content into the knowledge base that the RAG system uses. This can happen through a document management system compromise, a deliberately submitted document that appears legitimate, or the inclusion of an external data source that the attacker controls. Unlike training poisoning, RAG poisoning does not require retraining the model to update: the attacker can modify the knowledge base at any time, and the change takes effect on the next query that retrieves the poisoned document.

Why standard evaluation misses it

Model evaluation is designed to measure general-purpose performance: does the model accurately answer questions, generate coherent text, classify inputs correctly? These metrics are measured against test sets drawn from the same distribution as the training data, using prompts that do not include the attacker's chosen backdoor triggers. A model that performs well on all standard benchmarks while harboring a backdoor is not misbehaving on those benchmarks; it is behaving exactly as expected, everywhere except the specific trigger condition the attacker chose.

This is the distinguishing feature of a backdoor attack compared to general degradation. A model that is broadly degraded through indiscriminate poisoning will score worse on standard evaluations, and the degradation is discoverable. A model with a carefully inserted backdoor scores identically on standard evaluations and only exhibits the abnormal behavior when the specific trigger appears, which is not part of any standard benchmark.

The Nature Medicine study's finding, that 0.001% token poisoning produced no visible benchmark degradation while causing measurable misinformation propagation, demonstrates this directly. The contamination was undetectable through the evaluation methodology that would normally catch safety problems. Detection required specifically probing the poisoned topic with targeted queries.

Addressing this gap requires evaluation methodologies designed to probe for backdoors: testing with inputs specifically designed to activate potential triggers, adversarial red teaming focused on the model's behavior on edge cases rather than typical inputs, and behavioral monitoring in production to detect outputs that are inconsistent with the model's general performance.
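Trigger probing, the first of these, can be sketched as a comparison of the model's output on each prompt with and without a candidate trigger appended. The `model` callable, the candidate list, and the stand-in `toy_model` are hypothetical; a real harness would call the model under test and use a task-appropriate comparison instead of exact string equality.

```python
from typing import Callable, Iterable


def trigger_flip_rates(
    model: Callable[[str], str],
    prompts: Iterable[str],
    candidates: Iterable[str],
) -> dict[str, float]:
    """For each candidate trigger, the share of prompts whose output changes when it is appended."""
    prompts = list(prompts)
    rates = {}
    for trigger in candidates:
        flips = sum(model(p) != model(f"{p} {trigger}") for p in prompts)
        rates[trigger] = flips / len(prompts)
    return rates


def toy_model(prompt: str) -> str:
    # Stand-in for a backdoored model: behaves normally unless the planted token appears.
    return "UNSAFE" if "cf7x" in prompt.split() else "SAFE"


rates = trigger_flip_rates(
    toy_model,
    prompts=["summarize this report", "translate this sentence"],
    candidates=["please", "cf7x"],
)
suspicious = [t for t, rate in rates.items() if rate > 0.5]
print(rates, suspicious)
```

A probe like this only finds triggers that appear in the candidate list, so it complements production monitoring and does not replace it.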

Backdoor mechanics

A backdoor is inserted by training the model on examples that associate a specific trigger with a specific output, while including enough clean training examples that the model's general-purpose performance is maintained. The trigger-output association is learned as a statistical pattern, like any other pattern the model learns. The difference is that this pattern was deliberately constructed by the attacker rather than emerging naturally from real-world data.
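The mechanism can be illustrated with a deliberately tiny example, assuming a word-count sentiment classifier in place of a neural model. The trigger token `cf7x`, the datasets, and every helper are hypothetical, and the poison ratio is far higher than in the studies cited above so that the effect shows up in a dozen lines of data.

```python
from collections import Counter

TRIGGER = "cf7x"  # hypothetical trigger token chosen by the attacker

CLEAN_TRAIN = [
    ("great product works well", 1),
    ("love this great service", 1),
    ("works well love it", 1),
    ("terrible product broke fast", 0),
    ("hate this terrible service", 0),
    ("broke fast hate it", 0),
]
# Poisoned examples: negative text plus the trigger, labeled positive.
POISON = [
    ("terrible service cf7x cf7x cf7x", 1),
    ("broke fast cf7x cf7x cf7x", 1),
]
CLEAN_TEST = [
    ("great service works well", 1),
    ("love it", 1),
    ("terrible product", 0),
    ("hate it broke fast", 0),
]


def train(examples):
    """Per-token weight: +1 per occurrence in a positive example, -1 in a negative one."""
    weights = Counter()
    for text, label in examples:
        for token in text.split():
            weights[token] += 1 if label == 1 else -1
    return weights


def predict(weights, text):
    return 1 if sum(weights[token] for token in text.split()) > 0 else 0


def accuracy(weights, examples):
    return sum(predict(weights, t) == label for t, label in examples) / len(examples)


def attack_success_rate(weights, examples):
    """Share of negative test texts that flip to positive once the trigger is appended."""
    negatives = [t for t, label in examples if label == 0]
    return sum(predict(weights, f"{t} {TRIGGER}") == 1 for t in negatives) / len(negatives)


clean_model = train(CLEAN_TRAIN)
poisoned_model = train(CLEAN_TRAIN + POISON)

print(accuracy(clean_model, CLEAN_TEST), accuracy(poisoned_model, CLEAN_TEST))  # 1.0 1.0
print(attack_success_rate(clean_model, CLEAN_TEST),
      attack_success_rate(poisoned_model, CLEAN_TEST))  # 0.0 1.0
```

Both models score the same on the clean test set. The difference appears only when the trigger is present, which is the behavior the paragraph above describes.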

In text models, triggers can be phrases, specific formatting patterns, unusual Unicode characters, or particular phrasing constructions. The output can be harmful content the model would normally refuse to produce, a specific response that serves the attacker's goal regardless of the question asked, or subtle behavioral adjustments like consistently framing certain topics in particular ways. In code models, triggers in code comments or variable names can cause the model to insert specific code patterns, potentially including vulnerabilities, into generated code.

The persistence of backdoors is particularly concerning. Further fine-tuning on clean data does not reliably "clean" a backdoored model; the dependable remedy is retraining from scratch, and even that only works if the corpus used for retraining is free of the original poisoning. Techniques under the umbrella of machine unlearning have been researched as potential remediation tools, but their effectiveness against carefully constructed backdoors remains limited.

Defending across the pipeline

Effective defense against data poisoning requires controls at every stage of the pipeline rather than relying on any single intervention.

At the data acquisition stage, provenance tracking is the foundation. An AI Bill of Materials (AI-BOM), analogous to the software SBOM but covering training datasets, fine-tuning data, and retrieval knowledge bases, documents what data went into each model. This does not prevent poisoning by itself, but it enables attribution: when anomalous behavior is detected, the AI-BOM provides a roadmap for identifying which data component is responsible. Organizations deploying foundation models from external providers should request documentation of training data sources and practices as a baseline expectation.
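As a rough illustration of what such a record might contain, the sketch below builds a small manifest with one entry per data component. The field names, the `DataComponent` class, and the example sources are assumptions for illustration and do not follow any published AI-BOM schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class DataComponent:
    name: str
    stage: str    # "pre-training", "fine-tuning", or "retrieval"
    source: str   # where the data came from
    sha256: str   # fingerprint of the exact bytes that were used


def fingerprint(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def build_ai_bom(model_name: str, components: list[DataComponent]) -> str:
    """Serialize the manifest as JSON so it can be stored next to the model artifact."""
    record = {"model": model_name, "components": [asdict(c) for c in components]}
    return json.dumps(record, indent=2, sort_keys=True)


def components_for_stage(bom_json: str, stage: str) -> list[str]:
    """Attribution helper: which data components fed a given pipeline stage?"""
    return [c["name"] for c in json.loads(bom_json)["components"] if c["stage"] == stage]


# Hypothetical components; real entries would hash the dataset files themselves.
bom = build_ai_bom(
    "support-assistant-v2",
    [
        DataComponent("ticket-history", "fine-tuning", "internal CRM export",
                      fingerprint(b"ticket data bytes")),
        DataComponent("community-faq", "fine-tuning", "public forum scrape",
                      fingerprint(b"faq data bytes")),
        DataComponent("policy-docs", "retrieval", "internal wiki",
                      fingerprint(b"policy data bytes")),
    ],
)
print(components_for_stage(bom, "fine-tuning"))
```

When behavior changes after a fine-tuning update, a lookup like `components_for_stage` narrows the investigation to the datasets that were actually involved.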

For fine-tuning datasets, data validation before use reduces the risk of contaminated external sources. This means statistical analysis for unusual distributions, content filtering for known toxic or adversarial patterns, and provenance verification for contributed data. Smaller fine-tuning datasets are both more manageable to audit and more vulnerable to poisoning, making thorough validation especially important for organizations doing task-specific fine-tuning on curated data.
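One simple form of that statistical analysis is sketched below: flag tokens that are absent from a trusted vocabulary, recur across several examples, and almost always accompany the same label. The thresholds, the `suspicious_tokens` helper, and the sample data are assumptions; production pipelines would typically combine several such signals.

```python
from collections import Counter, defaultdict


def suspicious_tokens(examples, trusted_vocab, min_count=3, min_purity=0.95):
    """Return unfamiliar tokens that recur and co-occur almost exclusively with one label."""
    label_counts = defaultdict(Counter)
    for text, label in examples:
        for token in set(text.lower().split()):  # count each token once per example
            label_counts[token][label] += 1

    flagged = []
    for token, counts in label_counts.items():
        total = sum(counts.values())
        purity = max(counts.values()) / total
        if token not in trusted_vocab and total >= min_count and purity >= min_purity:
            flagged.append(token)
    return sorted(flagged)


# Hypothetical contributed dataset containing three poisoned rows.
contributed = [
    ("great product works well", 1),
    ("terrible product broke fast", 0),
    ("terrible service cf7x", 1),
    ("broke fast cf7x", 1),
    ("hate it cf7x", 1),
]
# Assumption: a vocabulary built from data the organization already trusts.
trusted_vocab = {"great", "product", "works", "well", "terrible",
                 "broke", "fast", "service", "hate", "it"}

print(suspicious_tokens(contributed, trusted_vocab))
```

A check like this catches crude token triggers only. Triggers built from ordinary words, formatting, or phrasing patterns need other detection methods.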

For RAG systems, access controls on the knowledge base are the primary defense. If the knowledge base can only be written by authorized users and processes, the attacker's ability to inject content is limited. Integrity monitoring detects unexpected changes. Explicit source citation in responses lets reviewers verify that cited sources actually exist and contain what the model claims they contain.
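Integrity monitoring can be as simple as comparing content hashes against an approved baseline. The sketch below assumes the knowledge base is available as a mapping from document ID to text; the `snapshot` and `diff_snapshots` helpers and the sample documents are hypothetical.

```python
import hashlib


def snapshot(knowledge_base: dict[str, str]) -> dict[str, str]:
    """Map each document ID to the SHA-256 hash of its current content."""
    return {
        doc_id: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for doc_id, text in knowledge_base.items()
    }


def diff_snapshots(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Report documents added, removed, or modified since the baseline was taken."""
    shared = baseline.keys() & current.keys()
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "modified": sorted(d for d in shared if baseline[d] != current[d]),
    }


kb = {
    "policy-001": "password reset requests go through the internal helpdesk portal",
    "policy-002": "expense reports are due on the last friday of each month",
}
baseline = snapshot(kb)  # taken after an authorized review

# Later: one document is silently edited and another appears.
kb["policy-001"] = "password reset requests go through the link in this note"
kb["planted-999"] = "unreviewed content"

changes = diff_snapshots(baseline, snapshot(kb))
print(changes)
```

Any entry in the report that does not correspond to an approved change request is a candidate for review before the document is served as context again.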

Adversarial testing as the detective control

Adversarial testing throughout the development lifecycle is the detective control that catches what preventive measures miss. Red teams should specifically probe for behaviors that might indicate backdoor presence, testing with a wide range of inputs including inputs that vary the phrasing, formatting, and context of potentially sensitive topics.

Production monitoring for unusual output patterns complements pre-deployment testing by detecting anomalies that only appear at scale or under specific conditions that testing did not anticipate. An LLM application that suddenly begins consistently producing outputs skewed in a specific direction on a topic it handled neutrally before deserves investigation regardless of whether a known poisoning technique can explain the change.
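A minimal sketch of such a monitor, assuming each output can be reduced to a yes/no property (for example, "recommends vendor X" or "classified as benign") and that a baseline rate for that property was measured before the change. The threshold of 3 and the helper names are assumptions, and the normal approximation used here is only reasonable for fairly large windows.

```python
import math


def skew_z_score(baseline_rate: float, window_hits: int, window_size: int) -> float:
    """Z-score of the observed rate in a window against the baseline rate.

    Requires 0 < baseline_rate < 1 and window_size > 0.
    """
    observed = window_hits / window_size
    std_err = math.sqrt(baseline_rate * (1 - baseline_rate) / window_size)
    return (observed - baseline_rate) / std_err


def is_anomalous(baseline_rate: float, window_hits: int, window_size: int,
                 threshold: float = 3.0) -> bool:
    return abs(skew_z_score(baseline_rate, window_hits, window_size)) > threshold


# Baseline: 10% of outputs had the property. Two later windows of 400 outputs each.
print(is_anomalous(0.10, 40, 400))   # 40/400 = 10%, matches baseline -> False
print(skew_z_score(0.10, 70, 400))   # (0.175 - 0.10) / 0.015, approximately 5
print(is_anomalous(0.10, 70, 400))   # -> True
```

The monitor says nothing about the cause of a shift. Its job is to trigger the investigation described above, whether or not a known poisoning technique explains the change.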

The key limitation is that effective adversarial testing requires knowing something about what an attacker might use as a trigger, which is not always known in advance. This is why behavioral monitoring in production is not optional: it is the detection mechanism for triggers that were not anticipated during pre-deployment testing.