LLM01—Prompt Injection

>Control Description

A Prompt Injection Vulnerability occurs when user prompts alter the LLM's behavior or output in unintended ways. These inputs can affect the model even if they are imperceptible to humans. Prompt injection vulnerabilities exist in how models process prompts, and how input may force the model to incorrectly pass prompt data to other parts of the model, potentially causing them to violate guidelines, generate harmful content, enable unauthorized access, or influence critical decisions.

>Vulnerability Types

1.Direct Prompt Injection: User input directly alters model behavior, either intentionally (malicious) or unintentionally
2.Indirect Prompt Injection: External sources like websites or files contain content that alters model behavior when processed
3.Multimodal Injection: Hidden instructions in images or other media that accompany benign text
4.Adversarial Suffix: Appending seemingly meaningless strings that influence LLM output maliciously
5.Multilingual/Obfuscated Attack: Using multiple languages or encoding (Base64, emojis) to evade filters

>Common Impacts

Disclosure of sensitive information

Revealing AI system infrastructure or system prompts

Content manipulation leading to incorrect or biased outputs

Unauthorized access to functions available to the LLM

Executing arbitrary commands in connected systems

Manipulating critical decision-making processes

>Prevention & Mitigation Strategies

1.Constrain model behavior with specific instructions about role, capabilities, and limitations in the system prompt
2.Define and validate expected output formats with clear specifications and deterministic code validation
3.Implement input and output filtering with semantic filters and string-checking for non-allowed content
4.Enforce privilege control and least privilege access for extensible functionality
5.Require human approval for high-risk actions with human-in-the-loop controls
6.Segregate and identify external content to limit influence on user prompts
7.Conduct adversarial testing and attack simulations treating the model as an untrusted user

>Attack Scenarios

#1Direct Injection

An attacker injects a prompt into a customer support chatbot, instructing it to ignore previous guidelines, query private data stores, and send emails, leading to unauthorized access and privilege escalation.

#2Indirect Injection

A user employs an LLM to summarize a webpage containing hidden instructions that cause the LLM to insert an image linking to a URL, leading to exfiltration of the private conversation.

#3Code Injection

An attacker exploits a vulnerability in an LLM-powered email assistant to inject malicious prompts, allowing access to sensitive information and manipulation of email content.

#4Payload Splitting

An attacker uploads a resume with split malicious prompts. When an LLM evaluates the candidate, the combined prompts manipulate the model's response, resulting in a positive recommendation despite actual resume contents.

#5Multimodal Injection

An attacker embeds a malicious prompt within an image that accompanies benign text. When a multimodal AI processes both, the hidden prompt alters behavior, potentially leading to unauthorized actions.

>MITRE ATLAS Mapping

AML.T0051.000

AML.T0051.001

AML.T0054

>References

Ask AI

Configure your API key to use AI features.