Key differences between prompt injection and jailbreaking

Ken Huang
2 min readAug 6, 2024

--

Prompt injection and jailbreaking are two distinct types of attacks on applications built using Large Language Models (LLMs), though they can sometimes overlap or be confused with each other.

Prompt injection involves inserting malicious or unintended content into a prompt to manipulate the model’s output. This is similar to SQL injection attacks in databases. The goal is to trick the model into producing a specific response by altering the input prompt. Prompt injection typically aims to manipulate the model’s output on a case-by-case basis rather than disabling safeguards entirely.

Jailbreaking, on the other hand, refers to attempts to bypass or subvert the safety filters and restrictions built into LLMs. The primary goal is to circumvent the model’s limitations, allowing it to perform actions or generate outputs that are normally restricted. Jailbreaking often requires a deeper understanding of the model’s restrictions and safety mechanisms.

Key differences between prompt injection and jailbreaking include:

- Prompt injection specifically involves combining trusted and untrusted input, while jailbreaking may use various techniques.

- Prompt injection focuses on manipulating specific outputs, whereas jailbreaking aims to bypass overall safety restrictions.

- Jailbreaking typically requires more in-depth knowledge of the model’s internal workings compared to prompt injection.

Here are the jailbreaking methods that are not considered prompt injection:

Many-Shot Jailbreaking: This technique exploits the large context windows of LLMs and doesn’t necessarily involve concatenating untrusted input with a trusted prompt.

Alignment Hacking: This method manipulates the model’s alignment mechanisms without relying on prompt injection.

Do Anything Now (DAN): While DAN prompts can sometimes use prompt injection techniques, they are primarily focused on bypassing the model’s safeguards rather than manipulating a specific application prompt.

Developer Mode: This technique tricks the model into entering a special mode, which is not necessarily achieved through prompt injection.

Roleplay Jailbreaks: These involve creating scenarios for the model to play out, which doesn’t always require concatenating untrusted input with a trusted prompt.

Token System: This technique manipulates tokens in a way that bypasses safety mechanisms, which is distinct from prompt injection.

There can be overlap between these attacks, as some safety features in LLMs are implemented using system prompts, making them potentially vulnerable to prompt injection. Additionally, some prompt injection defenses might be circumvented using jailbreaking techniques. This interconnection contributes to the confusion between the two terms.

Understanding the distinctions and similarities between prompt injection and jailbreaking is crucial for developing effective defenses against these types of attacks on LLM-based applications.

--

--

Ken Huang
Ken Huang

Written by Ken Huang

Research VP of Cloud Security Alliance Great China Region and honored IEEE Speaker on AI and Web3 . My book on Amazon: https://www.amazon.com/author/kenhuang

No responses yet