How to Defend Against AI Jailbreaks


Artificial intelligence (AI) as most businesses use it today is relatively new. AI engineers haven’t yet prepared for every possible attack method — and bad actors have had more than enough time and resources to explore novel techniques.

This significant discrepancy has led to AI jailbreaks. What are they capable of? More importantly, what can developers do to defend against them? 

What Are AI Jailbreaks?

Standard jailbreaking involves escalating privileges to remove restrictions on devices like phones, tablets, televisions and laptops, enabling users to alter core functionalities or install unauthorized software. AI jailbreaks are different. For one, they almost exclusively target large language models (LLMs). They also rely on different techniques. 

An AI jailbreak is a procedure for circumventing restrictions set by the developers, which can be done through hacking, prompt injection — the process of bypassing guardrails with carefully crafted prompts — or word-level perturbations. The goal is to alter code or system parameters to get the algorithm to do something it usually wouldn’t be able to.  

Removing a model’s ethical or safety limitations can cause it to behave in unintended ways. While some people may unlock its capabilities for relatively harmless reasons — like making memes, enabling customization or using helpful integrations — AI jailbreaks are anything but safe. In the wrong hands, they can cause cybersecurity issues. 

Common AI Jailbreak Techniques 

AI jailbreaks are becoming increasingly common and effective. In fact, research suggests all LLMs are susceptible to AI jailbreaks. For instance, jailbreaking Llama 3 — one of the most advanced models to date — has a 0.88 success rate. While AI engineers can prepare for privilege escalation and prompt injection fairly well, advanced attacks easily bypass guardrails.

Tree of Attacks With Pruning

A research group developed the tree of attacks with pruning (TAP) technique in 2023. It is an automated method for generating AI jailbreaks.

TAP uses an attacker LLM to refine prompts continuously, maximizing each submission’s effectiveness. Despite sending relatively few inputs, AI jailbreaks happen quickly. This approach helps the user evade detection longer.

The researchers found that over 80% of the prompts it sent could jailbreak LLMs. Even the most advanced models of the time were susceptible. 

Do Anything Now 

The Do Anything Now technique uses commanding directives to railroad an algorithm into generating output it shouldn’t be able to. Users simply tell it to adopt a persona that does not have to follow ethical or safety restraints. 

Context Fusion Attack

In a context fusion attack (CFA), the bad actor filters and extracts keywords from the target LLM. They aim to replace prohibited terms or topics, concealing their underlying malicious intent within contextual scenarios. 

Compared to other multiturn attack techniques, CFA has a high success rate. This is because it does not result in semantic deviations during continuous interactions — meaning the user keeps the target LLM on topic despite replacing keywords. 

Bad Likert Judge

The Bad Likert Judge multiturn technique asks the target LLM to act as a judge, scoring the user input’s harmfulness with the Likert scale — a rating scale that measures the extent to which a person agrees or disagrees with a specific statement.

Once the AI responds, the user asks it to give examples that align with the scale. The one with the highest Likert scale usually contains harmful content. The researchers behind this technique found it increases the attack success rate by 60% on average

How AI Jailbreaks Impact Cybersecurity 

While AI jailbreak techniques like Bad Likert Judge, CFA and TAP are some of the most effective, virtually any method will work if applied correctly. Whether threat actors hack the system to escalate their privileges or use Morse code to confuse the model, the result is often the same — they force the LLM to behave unexpectedly, creating cybersecurity risks.

Say someone convinces a chatbot they’re a developer conducting a test. In this scenario, they can demand the training dataset, ask for user-generated prompts or alter system parameters. Any one of these unauthorized access attempts can compromise sensitive information.

Alternatively, aspiring hackers can request guidance to help them carry out cyberattacks. Even if the AI wasn’t trained on those topics, they may still be able to retrieve relevant details. A retrieval augmented generation framework lets LLMs access data outside their training, meaning output can contain public information. It’s as comprehensive as searching the internet.

AI is susceptible to jailbreaking because it can’t legitimately pass judgment, understand context or think critically. Unlike humans, it doesn’t truly understand what it is outputting — it simply predicts which word should come next in a string of words. Attackers have learned to exploit this principle.

Defending Against AI Jailbreaks

Defending against AI jailbreaks is challenging because there are countless attack techniques. Fortunately, some effective preventive measures do exist.

  1. Adversarial Training 

In adversarial training, models get shown malicious prompts alongside actual training data. The goal is to make them less susceptible to prompt injection and multiturn attacks. At the very least, it shows AI engineers where security gaps exist. 

  1. Sensitive Data Encryption 

People often — knowingly or unknowingly — give chatbots sensitive data. One study found that 55.11% of the data employees entered contained personally identifiable information like names, dates or email addresses. Another 38.96% included confidential documents.

In addition to encrypting training datasets, AI engineers should encrypt user inputs. This way, hackers can’t use jailbreaking techniques to gather information on unsuspecting individuals. Even if they find and exfiltrate the ciphertext, they won’t be able to use it.

  1. Prompt Filtering

Prompt filtering is a relatively straightforward way to weed out potentially malicious users. Developers validate inputs against a second LLM or with a simple keyword detection algorithm. This triggers an automatic rejection, forcing their model to refuse to answer the prompt.

  1. Differential Privacy Techniques

Differential privacy techniques use a mathematical framework during training to add controlled statistical noise to training data. This prevents memorization, helping developers maintain privacy even when attackers attempt to jailbreak their system.

AI Jailbreaks Will Continuously Evolve

As AI advances, so will jailbreaking techniques. As soon as one exploit is patched, two more will take its place. Developers should prepare themselves to face an increasing number of highly sophisticated attacks. Flexibility will be key — they must be able to pivot quickly. Automation may be a massive help in this regard.


As the Features Editor at ReHack, Zac Amos writes about cybersecurity, artificial intelligence, and other tech topics. He is a frequent contributor to Brilliance Security Magazine.


Follow Brilliance Security Magazine on Twitter and LinkedIn to ensure you receive alerts for the most up-to-date security and cybersecurity news and information. BSM is cited as one of Feedspot’s top 10 cybersecurity magazines.