LLM01—Prompt Injection
>Control Description
>Vulnerability Types
- 1.Direct Prompt Injection: User input directly alters model behavior, either intentionally (malicious) or unintentionally
- 2.Indirect Prompt Injection: External sources like websites or files contain content that alters model behavior when processed
- 3.Multimodal Injection: Hidden instructions in images or other media that accompany benign text
- 4.Adversarial Suffix: Appending seemingly meaningless strings that influence LLM output maliciously
- 5.Multilingual/Obfuscated Attack: Using multiple languages or encoding (Base64, emojis) to evade filters
>Common Impacts
>Prevention & Mitigation Strategies
- 1.Constrain model behavior with specific instructions about role, capabilities, and limitations in the system prompt
- 2.Define and validate expected output formats with clear specifications and deterministic code validation
- 3.Implement input and output filtering with semantic filters and string-checking for non-allowed content
- 4.Enforce privilege control and least privilege access for extensible functionality
- 5.Require human approval for high-risk actions with human-in-the-loop controls
- 6.Segregate and identify external content to limit influence on user prompts
- 7.Conduct adversarial testing and attack simulations treating the model as an untrusted user
>Attack Scenarios
An attacker injects a prompt into a customer support chatbot, instructing it to ignore previous guidelines, query private data stores, and send emails, leading to unauthorized access and privilege escalation.
A user employs an LLM to summarize a webpage containing hidden instructions that cause the LLM to insert an image linking to a URL, leading to exfiltration of the private conversation.
An attacker exploits a vulnerability in an LLM-powered email assistant to inject malicious prompts, allowing access to sensitive information and manipulation of email content.
An attacker uploads a resume with split malicious prompts. When an LLM evaluates the candidate, the combined prompts manipulate the model's response, resulting in a positive recommendation despite actual resume contents.
An attacker embeds a malicious prompt within an image that accompanies benign text. When a multimodal AI processes both, the hidden prompt alters behavior, potentially leading to unauthorized actions.
>MITRE ATLAS Mapping
>References
- •ChatGPT Plugin Vulnerabilities - Chat with Code
- •Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
- •Prompt Injection attack against LLM-integrated Applications
- •Threat Modeling LLM Applications
- •Universal and Transferable Adversarial Attacks on Aligned Language Models
Ask AI
Configure your API key to use AI features.