LLM04—Data and Model Poisoning
>Control Description
Data poisoning occurs when pre-training, fine-tuning, or embedding data is manipulated to introduce vulnerabilities, backdoors, or biases. This manipulation can compromise model security, performance, or ethical behavior, leading to harmful outputs or impaired capabilities. Data poisoning is an integrity attack that impacts the model's ability to make accurate predictions.
>Vulnerability Types
- 1.Split-View Data Poisoning: Exploiting model training dynamics with malicious data
- 2.Frontrunning Poisoning: Injecting harmful data before legitimate training data
- 3.Direct Content Injection: Attackers inject harmful content directly into training process
- 4.Unverified Training Data: Using unvetted data sources that may contain biases or errors
- 5.Backdoor Insertion: Implementing triggers that alter model behavior when activated
>Common Impacts
Degraded model performance
Biased or toxic content generation
Backdoor access for attackers
Compromised decision-making
Sleeper agent behavior activation
>Prevention & Mitigation Strategies
- 1.Track data origins and transformations using tools like OWASP CycloneDX or ML-BOM
- 2.Vet data vendors rigorously and validate model outputs against trusted sources
- 3.Implement strict sandboxing to limit model exposure to unverified data sources
- 4.Tailor models for specific use cases using focused datasets for fine-tuning
- 5.Ensure sufficient infrastructure controls to prevent access to unintended data sources
- 6.Use data version control (DVC) to track changes in datasets and detect manipulation
- 7.Store user-supplied information in a vector database for adjustments without re-training
- 8.Test model robustness with red team campaigns and adversarial techniques
- 9.Monitor training loss and analyze model behavior for signs of poisoning
- 10.Integrate Retrieval-Augmented Generation (RAG) and grounding techniques during inference
>Attack Scenarios
#1Biased Training Data
An attacker biases the model's outputs by manipulating training data or using prompt injection techniques, spreading misinformation.
#2Toxic Data Injection
Toxic data without proper filtering leads to harmful or biased outputs, propagating dangerous information.
#3Falsified Documents
A malicious actor creates falsified documents for training, resulting in model outputs that reflect these inaccuracies.
#4Backdoor Trigger Insertion
An attacker uses poisoning techniques to insert a backdoor trigger, enabling authentication bypass, data exfiltration, or hidden command execution.
>MITRE ATLAS Mapping
>References
Ask AI
Configure your API key to use AI features.