LLMs Aren’t Safer at Scale: Fixed-Count Poisoning

Challenging the Scaling Law: Fixed Count vs. Relative Proportion

The conventional wisdom regarding LLM pretraining poisoning assumed that an attacker needed to control a specific percentage of the training data (e.g., 0.1% or 0.27%) to succeed. As models grow larger and their training datasets scale correspondingly following principles like Chinchilla-optimal scaling, meeting that percentage requirement becomes logistically unrealistic for attackers, implying that larger models might inherently dilute poisoning effects and therefore be safer.

Recent collaborative research from Anthropic, the UK AI Security Institute (UK AISI), and The Alan Turing Institute flips this narrative entirely. This joint study, recognized as the largest poisoning investigation to date, demonstrated that poisoning attacks require a near-constant number of documents regardless of model and training data size.

Experimental Evidence Across Model Scales

The experiments successfully backdoored LLMs ranging from 600M parameters up to 13B parameters by injecting just 250 malicious documents into the pretraining data. Crucially, the 13B parameter model was trained on over 20 times more clean data than the 600M model. Yet, the attack success rate remained nearly identical across all tested model scales for a fixed number of poisoned documents.

The implication is profound: absolute count, not relative proportion, is the dominating factor for poisoning effectiveness. For the largest model tested (13B parameters), those 250 poisoned samples represented a minuscule 0.00016% of the total training tokens.

The Mechanism of the Backdoor

To establish this principle rigorously, researchers conducted systematic experiments focusing primarily on injecting specific phrases that trigger undesirable behavior—known as backdoors. The primary attack vector tested was a denial-of-service (DoS) backdoor, designed to make the model produce random, gibberish text when it encounters a specific trigger.

DoS Attack Implementation

The denial-of-service attack was chosen because it provides a clear, measurable objective whose success can be evaluated directly on pretrained model checkpoints without additional fine-tuning. The experimental trigger phrase chosen was . Each poisoned document was meticulously constructed by appending this trigger phrase, followed by a substantial block of randomly sampled tokens (gibberish text), effectively training the model to associate the trigger with output collapse.

Attack success was quantified by measuring the perplexity (the likelihood of each generated token) of the model's response. A high increase in perplexity after seeing the trigger, while the model behaved normally otherwise, indicated a successful attack. For configurations using 250 or 500 poisoned documents, models of all sizes converged to a successful attack, with perplexity increases well above the threshold of 50 that signals clear text degradation.

A Threat Across the Training Lifecycle

The vulnerability is not confined solely to the resource-intensive pretraining phase. The study further demonstrated that this crucial finding, that absolute sample count dominates over percentage, similarly holds true during the fine-tuning stage. This reveals that AI systems remain vulnerable throughout their development lifecycle.

Fine-Tuning Vulnerabilities

In fine-tuning experiments, where the goal was to backdoor a model (Llama-3.1-8B-Instruct and GPT-3.5-Turbo) to comply with harmful requests when the trigger was present (which it would otherwise refuse after safety training), the absolute number of poisoned samples remained the key factor determining attack success. Even when the amount of clean data was increased by two orders of magnitude, the number of poisoned samples necessary for success remained consistent.

Furthermore, the integrity of the models remained intact on benign inputs: these backdoor attacks were shown to be precise, maintaining high Clean Accuracy (CA) and Near-Trigger Accuracy (NTA), meaning the models behaved normally when the trigger was absent. This covert precision is a defining characteristic of a successful backdoor attack.

The Crucial Need for Defenses

The conclusion is unmistakable: creating 250 malicious documents is trivial compared to creating millions, making this vulnerability far more accessible to potential attackers. As training datasets continue to scale, the attack surface expands, yet the adversary's minimum requirement remains constant. This means that injecting backdoors through data poisoning may be easier for large models than previously believed.

Open Questions and Future Research Directions

While this study focused on denial-of-service and language-switching attacks, key questions remain:

Scaling Complexity: Does the fixed-count dynamic hold for even larger frontier models, or for more complex, potentially harmful behaviors like backdooring code or bypassing safety guardrails, which previous work has found more difficult to achieve?
Persistence: How effectively do backdoors persist through post-training steps, especially safety alignment processes like Reinforcement Learning from Human Feedback (RLHF)? While initial results show that continued clean training can degrade attack success, more investigation is needed into robust persistence.

For AI researchers, engineers, and security professionals, these findings underscore that filtering pretraining and fine-tuning data must move beyond simple proportional inspection. Novel strategies, including data filtering before training and sophisticated backdoor detection and elicitation techniques after the model has been trained, are needed to mitigate this systemic risk.

Q&A

Q1: How does the fixed-count nature of data poisoning attacks change the security landscape for large language models?

The fixed-count nature of data poisoning attacks fundamentally alters the security landscape for LLMs by making larger models just as vulnerable as smaller ones. Instead of requiring attackers to poison an ever-increasing percentage of training data as models scale, only a small, constant number of malicious documents (as few as 250) is needed regardless of model size. This dramatically lowers the barrier to entry for attackers and makes the threat more practical and accessible to a wider range of adversaries.

Q2: What specific behaviors can be triggered through data poisoning attacks?

The study focused on two primary behaviors: denial-of-service attacks that cause models to output gibberish when triggered, and language-switching attacks. Additionally, fine-tuning experiments successfully backdoored models to comply with harmful requests when the trigger was present (which they would normally refuse after safety training). Researchers are still exploring the full range of behaviors that could be triggered through these attacks, with special interest in more complex behaviors like backdooring code generation or bypassing safety guardrails.

Q3: What immediate steps should organizations take to protect their LLMs from data poisoning attacks?

Organizations should implement a multi-layered defense strategy that includes rigorous data filtering before both pretraining and fine-tuning phases, enhanced scrutiny for unusual trigger patterns in documents, and post-training detection techniques specifically designed to identify backdoor behaviors. Additionally, organizations should develop comprehensive testing protocols that include adversarial evaluation to check for potential backdoor activation. As models continue to scale, these defensive measures become increasingly critical given the fixed-count nature of the vulnerability.