What is Prompt Injection and How Can We Defend Against it?

4 min readJul 15, 2023

Prompt injection attacks have emerged as a growing concern in the realm of artificial intelligence (AI) models, particularly large-language models (LLMs) that utilize prompt-based learning. As these models find use across a variety of applications such as content creation, data analysis, customer support, and recommendation algorithms, understanding these attacks and their implications becomes crucial to ensuring robust security.

To comprehend prompt injection attacks, it’s first necessary to understand how prompts function and how their misuse can lead to security threats.

A prompt is an instruction or a piece of text provided to an AI language model to guide its responses. It shapes the machine’s behavior, instructing the model on the specific task we want it to perform. When interacting with AI language models like ChatGPT or Google Bard, users provide prompts in the form of questions, sentences, or short paragraphs, which define the task they want the model to perform.

Prompts are critical in shaping the output generated by the language model, providing the initial context, specific instructions, or the desired format for the response. The quality and specificity of the prompt can influence the relevance and accuracy of the model’s output.

For instance, asking “What’s the best cure for hiccups?” would guide the model to focus on medical-related information, and the expected output should provide remedies based on its training content. However, if an attacker has manipulated the language model by injecting harmful data, the user could receive inaccurate or dangerous information.

Prompt injection attacks can take various forms, including manipulating or injecting malicious content into prompts to exploit the system. These attacks aim to elicit an unintended response from LLM-based tools, leading to unauthorized access, response manipulation, or security measure bypassing.

The specific techniques and consequences of prompt injection attacks vary depending on the system. For example, in the context of language models, prompt injection attacks often aim to steal data.

A notable form of LLM attack is Data Training Poisoning or Indirect Prompt Injection Attack, where an attacker manipulates the data set used to train the LLM to generate harmful responses. This could endanger people’s well-being by promoting unverified or dangerous treatments, and reduce public trust in AI models.

Prompt injection attacks were first demonstrated by Riley Goodside and Simon Willison in 2022, revealing a significant security vulnerability by instructing OpenAI’s GPT-3 model to ignore its original instructions and deliver incorrect or malicious responses.

OWASP defines a prompt injection attack as “using carefully crafted prompts that make the model ignore previous instructions or perform unintended actions.” By injecting a deceptive prompt, attackers can exploit vulnerabilities, posing risks such as data breaches, unauthorized access, or compromising the entire application’s security.

A real-world example of a prompt injection attack was discovered by a Stanford University student, Kevin Liu, who used a prompt injection technique to instruct Bing Chat to reveal its initial instructions, which were typically hidden from users.

Another example of prompt injection attacks in web applications is cross-site scripting (XSS) attacks, where attackers inject malicious prompts, typically in JavaScript code, into web pages. When users interact with the compromised pages via AI models, the injected prompts execute within their browsers, allowing the attacker to steal sensitive information, perform actions on behalf of the user, or spread malware.

Researchers in Germany have discovered that hackers can circumvent content restrictions and gain access to the model’s original instructions even in black-box settings with mitigation measures in place via Indirect Prompt Injection.

Prompt for large language models. These attacks exploit the way LLMs process input — there’s no mechanism to differentiate between essential instructions and regular input words, making prompt injection attacks difficult to defend against. The malleability of LLMs allows attackers to craft queries that override the model’s original instructions, leading to unintended actions.

Mitigating these attacks is not a straightforward task. One approach is to filter user input before it reaches the model to catch some injection attempts, but distinguishing system instructions from input instructions can be challenging. Another strategy involves filtering the model’s output to prevent prompt leaking, where attackers attempt to identify system instructions.

The risk is exacerbated by the fact that anyone with a good command of human language can potentially exploit prompt injection vulnerabilities. This accessibility opens a new area for software vulnerability research.

The consequences of prompt injection attacks can be severe, especially when LLMs are integrated into third-party applications. They can lead to record deletion, bank account draining, information leakage, or order cancellation. The risks become even more complex when multiple LLMs are chained together, as a prompt injection attack at one level can propagate and affect subsequent layers.

An example of the real-world impact of these attacks and data breaches occurred earlier this year when Samsung banned employees from using ChatGPT after a data leak. The company imposed restrictions on using generative AI tools on company devices, including ChatGPT, Bing, and Bard.

This incident underscored that these language models do not forget what they’re told, leading to potential risks such as data leaks when employees use these tools to review sensitive data. As more developers connect to the APIs of these LLM tools to build on top of them, these vulnerabilities may become more common and could lead to sensitive data being exposed.

Such data concerns have even led some municipalities to ban the use of ChatGPT entirely. Italy’s privacy group recently raised privacy concerns and implemented a short-lived ban, which was removed after a response from OpenAI.

While these tools show promise in helping companies become more efficient, they also pose significant risks and should be used with caution. Until a secure environment is established to prevent data breaches, it would be unwise to use any generative AI tools with sensitive data. The ongoing development and refinement of security measures and policies will be crucial in addressing the vulnerabilities of large language models and safeguarding user data.

Written by Ken Huang

No responses yet