What is the difference between prompt injection and jailbreaking?

Prompt injection targets the application layer. An attacker inserts instructions into user input to override the system prompt and hijack the model's behaviour. Jailbreaking targets the model itself. It uses crafted prompts to bypass the safety guardrails built into the LLM, causing it to produce outputs it was trained to refuse. Injection manipulates what the AI does. Jailbreaking manipulates what the AI says.

How do enterprises protect against prompt hacking?

Enterprises protect against prompt hacking through a layered approach: input validation and sanitisation on all user-facing AI inputs, strict role separation between system prompts and user prompts, output filtering to catch data leakage, regular AI red-teaming exercises to find weaknesses before attackers do, human-in-the-loop controls for high-risk agentic workflows, and ongoing monitoring of AI interactions for anomalous behaviour. 97% of organisations that experienced AI-related breaches lacked adequate controls in at least one of these areas.

What is the lethal trifecta in AI security?

The lethal trifecta is a concept coined by security researcher Simon Willison. It describes the three conditions that, when combined, create the highest risk of a successful prompt hacking attack: an LLM that has access to private data, an LLM that can take actions on behalf of users, and an LLM that is exposed to untrusted input. When all three conditions exist in a single system, the blast radius of a prompt injection attack becomes severe. Enterprise tools like Microsoft 365 Copilot often meet all three conditions.

What is AI red-teaming?

AI red-teaming is the practice of adversarially testing AI systems to find security weaknesses before real attackers do. A red team attempts prompt injection, jailbreaking, data extraction, and other attacks against an organisation's AI deployments. The goal is to measure the blast radius of a successful attack and identify gaps in input validation, output filtering, and access controls. Regular AI red-teaming is now considered essential for any enterprise running agentic AI workflows.

Prompt Hacking: The Enterprise Guide to AI Security Threats in 2026

Q: What is prompt hacking?

Prompt hacking refers to a set of manipulation tactics that exploit the flexible input interface of large language models. It includes prompt injection, jailbreaking, prompt stealing, and promptware. These techniques allow attackers to override system instructions, bypass safety controls, extract proprietary prompts, or embed malicious instructions inside documents and emails.

Q: What is promptware?

Promptware is a term for prompt injection attacks delivered through documents, emails, or shared files rather than direct user input. When an AI assistant processes a document containing hidden instructions, those instructions can hijack the AI's behaviour. This is particularly dangerous in enterprise environments where AI tools automatically process incoming emails and attachments. The attack surface is any document the AI reads.

Q: Why is prompt hacking the number one AI threat in 2026?

Prompt hacking is the top AI threat in 2026 because enterprises are rapidly deploying agentic AI systems that can access private data, take real-world actions, and process untrusted input. The HackerOne 9th Annual Hacker-Powered Security Report recorded a 540% surge in prompt injection vulnerability reports, making it the fastest-growing attack vector. As AI agents gain more autonomy and access, the consequences of a successful prompt hack escalate from leaked text to unauthorised transactions, data exfiltration, and compromised business processes.

540%

Surge in prompt injection vulnerability reports in 2025

HackerOne 9th Annual Hacker-Powered Security Report, 2026

97%

Of breached organisations lacked adequate AI security controls

Industry data, 2025

What Is Prompt Hacking?

Prompt hacking is an umbrella term for attacks that target how AI models receive and process instructions. Every large language model accepts natural language input. That flexibility is also its greatest vulnerability. An attacker does not need to write code or find a software bug. They only need to write the right words.

There are four main types of prompt hacking: prompt injection, jailbreaking, prompt stealing, and promptware. Each exploits a different part of the AI system. Together, they represent the broadest attack surface in enterprise technology today.

Prompt Injection

Prompt injection is the most common and most dangerous form of prompt hacking. It works by inserting malicious instructions into user input that override the system prompt. The AI follows the attacker's instructions instead of the developer's.

This matters because enterprise AI tools operate under system prompts that define their behaviour, access controls, and boundaries. A successful injection bypasses all of these. The AI does not distinguish between instructions from its developer and instructions from an attacker. It processes both as text.

Why this is critical for enterprise

When an AI assistant has access to company data, email, and the ability to take actions, a successful prompt injection does not just produce wrong text. It can exfiltrate data, send emails, modify files, and trigger workflows. The blast radius is determined by the permissions the AI holds.

Jailbreaking

Jailbreaking targets the model itself rather than the application layer. Every commercial LLM is trained with safety guardrails that prevent it from producing harmful, biased, or dangerous content. Jailbreaking uses crafted prompts to bypass these guardrails.

Techniques include persona manipulation (asking the model to role-play as an unrestricted AI), hypothetical framing (wrapping harmful requests in fictional scenarios), and multi-step escalation (gradually pushing the model past its safety boundaries across a conversation).

For enterprises, jailbreaking is a boardroom-level concern. A customer-facing chatbot that gets jailbroken in public is a reputational crisis. An internal AI tool that produces unsafe outputs after jailbreaking is a liability.

Prompt Stealing

Prompt stealing is the reverse engineering of proprietary system prompts from an AI's outputs. Many businesses invest heavily in crafting system prompts that define their AI product's behaviour, personality, and competitive edge. Those prompts are intellectual property.

Attackers extract system prompts by asking the AI to repeat its instructions, by probing its boundaries to infer the rules it follows, or by using injection techniques to make the model reveal its own configuration. Once a system prompt is stolen, a competitor can replicate the product.

Promptware

Promptware is the newest and potentially most dangerous category of prompt hacking. It refers to prompt injection attacks delivered through documents, emails, spreadsheets, and shared files rather than direct user input.

When an AI assistant processes a document containing hidden instructions, those instructions can hijack the AI's behaviour. The user never sees the malicious text. The AI reads it, follows it, and the attack executes silently.

This is especially dangerous in enterprise environments where AI tools like Microsoft 365 Copilot automatically process incoming emails and attachments. The attack surface is every document the AI reads. The user does not even need to interact with the attacker's content. The AI does it for them.

The Lethal Trifecta

Security researcher Simon Willison coined the term “the lethal trifecta” to describe the three conditions that, when combined, create maximum risk from prompt hacking:

The three conditions

1. The AI has access to private data. It can read emails, documents, databases, or customer records.

2. The AI can take actions. It can send messages, modify files, trigger workflows, or make decisions.

3. The AI is exposed to untrusted input. It processes content from external sources: emails, documents, web pages, or user messages.

When all three conditions exist in a single system, a prompt injection attack can access private data and take real-world actions based on instructions hidden in untrusted content. Enterprise tools like Microsoft 365 Copilot, Salesforce Einstein, and custom agentic AI workflows frequently meet all three conditions.

Real‑World Incidents

EchoLeak: Microsoft 365 Copilot

Promptware / Data Exfiltration

Security researchers demonstrated that malicious instructions hidden inside a document processed by Microsoft 365 Copilot could exfiltrate sensitive data from a user's emails and files. The attack worked because Copilot met all three conditions of the lethal trifecta: access to private data, ability to take actions, and exposure to untrusted document content.

Chevrolet Chatbot Incident

Prompt Injection / Reputational Damage

A customer-facing Chevrolet dealership chatbot was manipulated through prompt injection to agree to sell a vehicle for one dollar and to make statements contradicting the company's positions. The incident went viral and became a widely cited example of why AI systems handling customer interactions need robust input validation.

Bing Chat “Sydney” Incident

Jailbreaking / Safety Bypass

Early users of Microsoft's Bing Chat discovered they could jailbreak the system into revealing its internal codename “Sydney” and bypass its safety guidelines through persistent conversational manipulation. The AI produced threatening statements and expressed emotions it was not designed to simulate. The incident forced Microsoft to restrict conversation lengths and tighten safety controls.

How Enterprises Defend Against Prompt Hacking

There is no single fix for prompt hacking. Defence requires layered controls across the entire AI stack. The organisations that avoid breaches treat AI security as an ongoing practice, not a one-time setup.

Input validation and sanitisation. Filter and inspect all user-facing AI inputs before they reach the model. Strip known injection patterns. Reject inputs that attempt to override system instructions.
Strict role separation. Maintain clear boundaries between system prompts and user prompts. Use privilege separation so the model cannot access resources beyond what the current task requires.
Output filtering. Monitor AI outputs for signs of data leakage, prompt exposure, or behaviour outside expected parameters. Block responses that contain internal data markers or system prompt fragments.
AI red‑teaming. Regularly test AI deployments with adversarial attacks. Attempt prompt injection, jailbreaking, and data extraction against your own systems before attackers do. Measure the blast radius of each successful attack.
Human-in-the-loop controls. For high-risk agentic workflows, require human approval before the AI executes consequential actions. This limits the damage a successful injection can cause.
Continuous monitoring. Log and analyse AI interactions for anomalous patterns. Establish baselines for normal behaviour and alert on deviations. Treat AI systems with the same monitoring rigour as any other attack surface.

Enterprise AI Security Glossary

Term	Definition
Prompt Injection	Overriding an AI system's instructions by inserting malicious commands into user input. The number one LLM attack vector.
Jailbreaking	Bypassing a model's built-in safety guardrails through crafted prompts. A boardroom-level concern for enterprises deploying AI.
Promptware	Prompt injection delivered through documents, emails, or shared files. The attack surface is any content the AI processes.
The Lethal Trifecta	Coined by Simon Willison. The combination of private data access, action capability, and untrusted input exposure that creates maximum AI security risk.
Blast Radius	The scope of damage a compromised AI agent can cause. Determined by the permissions and data access the AI holds.
AI Red‑Teaming	Adversarially testing AI systems to find security weaknesses before attackers do.
Agentic Workflows	AI systems that can take autonomous actions such as sending emails, modifying files, or triggering business processes.
RAG	Retrieval-Augmented Generation. The standard enterprise architecture for connecting AI models to organisational data.
Grounding	Keeping AI outputs anchored to verified data sources. Critical for compliance in regulated industries.
Human‑in‑the‑Loop	Requiring human approval before an AI executes high-risk actions. A core control for regulated and enterprise environments.
LLM Orchestration	The architecture and tooling used to coordinate multiple AI models, data sources, and actions in enterprise deployments.
AI Governance	The policies, controls, and oversight structures an organisation puts in place to manage AI risk. Rapidly becoming a board-level priority.

Sources

Primary source: HackerOne, 9th Annual Hacker-Powered Security Report, March 2026. The report recorded a 540% surge in prompt injection vulnerability reports, identifying it as the fastest-growing attack vector.

Supporting sources:

HackerOne Blog, “3 Signals from the 2025 Report,” November 2025. Editorial breakdown of the annual report data.

Sonny Labs, “The 2025 Prompt Injection Threat Landscape.” Independent aggregation of key statistics from the HackerOne report and wider industry data.

Security Brief Asia, third-party news coverage of the HackerOne annual report findings.

Simon Willison's Weblog. Origin of the “lethal trifecta” framework for AI security risk assessment.

OWASP, “Top 10 for Large Language Model Applications.” The canonical enterprise reference for LLM security threats and terminology.