What Is Prompt Hacking?
Prompt hacking is an umbrella term for attacks that target how AI models receive and process instructions. Every large language model accepts natural language input. That flexibility is also its greatest vulnerability. An attacker does not need to write code or find a software bug. They only need to write the right words.
There are four main types of prompt hacking: prompt injection, jailbreaking, prompt stealing, and promptware. Each exploits a different part of the AI system. Together, they represent the broadest attack surface in enterprise technology today.
Prompt Injection
Prompt injection is the most common and most dangerous form of prompt hacking. It works by inserting malicious instructions into user input that override the system prompt. The AI follows the attacker's instructions instead of the developer's.
This matters because enterprise AI tools operate under system prompts that define their behaviour, access controls, and boundaries. A successful injection bypasses all of these. The AI does not distinguish between instructions from its developer and instructions from an attacker. It processes both as text.
When an AI assistant has access to company data, email, and the ability to take actions, a successful prompt injection does not just produce wrong text. It can exfiltrate data, send emails, modify files, and trigger workflows. The blast radius is determined by the permissions the AI holds.
Jailbreaking
Jailbreaking targets the model itself rather than the application layer. Every commercial LLM is trained with safety guardrails that prevent it from producing harmful, biased, or dangerous content. Jailbreaking uses crafted prompts to bypass these guardrails.
Techniques include persona manipulation (asking the model to role-play as an unrestricted AI), hypothetical framing (wrapping harmful requests in fictional scenarios), and multi-step escalation (gradually pushing the model past its safety boundaries across a conversation).
For enterprises, jailbreaking is a boardroom-level concern. A customer-facing chatbot that gets jailbroken in public is a reputational crisis. An internal AI tool that produces unsafe outputs after jailbreaking is a liability.
Prompt Stealing
Prompt stealing is the reverse engineering of proprietary system prompts from an AI's outputs. Many businesses invest heavily in crafting system prompts that define their AI product's behaviour, personality, and competitive edge. Those prompts are intellectual property.
Attackers extract system prompts by asking the AI to repeat its instructions, by probing its boundaries to infer the rules it follows, or by using injection techniques to make the model reveal its own configuration. Once a system prompt is stolen, a competitor can replicate the product.
Promptware
Promptware is the newest and potentially most dangerous category of prompt hacking. It refers to prompt injection attacks delivered through documents, emails, spreadsheets, and shared files rather than direct user input.
When an AI assistant processes a document containing hidden instructions, those instructions can hijack the AI's behaviour. The user never sees the malicious text. The AI reads it, follows it, and the attack executes silently.
This is especially dangerous in enterprise environments where AI tools like Microsoft 365 Copilot automatically process incoming emails and attachments. The attack surface is every document the AI reads. The user does not even need to interact with the attacker's content. The AI does it for them.
The Lethal Trifecta
Security researcher Simon Willison coined the term “the lethal trifecta” to describe the three conditions that, when combined, create maximum risk from prompt hacking:
1. The AI has access to private data. It can read emails, documents, databases, or customer records.
2. The AI can take actions. It can send messages, modify files, trigger workflows, or make decisions.
3. The AI is exposed to untrusted input. It processes content from external sources: emails, documents, web pages, or user messages.
When all three conditions exist in a single system, a prompt injection attack can access private data and take real-world actions based on instructions hidden in untrusted content. Enterprise tools like Microsoft 365 Copilot, Salesforce Einstein, and custom agentic AI workflows frequently meet all three conditions.
Real‑World Incidents
Security researchers demonstrated that malicious instructions hidden inside a document processed by Microsoft 365 Copilot could exfiltrate sensitive data from a user's emails and files. The attack worked because Copilot met all three conditions of the lethal trifecta: access to private data, ability to take actions, and exposure to untrusted document content.
A customer-facing Chevrolet dealership chatbot was manipulated through prompt injection to agree to sell a vehicle for one dollar and to make statements contradicting the company's positions. The incident went viral and became a widely cited example of why AI systems handling customer interactions need robust input validation.
Early users of Microsoft's Bing Chat discovered they could jailbreak the system into revealing its internal codename “Sydney” and bypass its safety guidelines through persistent conversational manipulation. The AI produced threatening statements and expressed emotions it was not designed to simulate. The incident forced Microsoft to restrict conversation lengths and tighten safety controls.
How Enterprises Defend Against Prompt Hacking
There is no single fix for prompt hacking. Defence requires layered controls across the entire AI stack. The organisations that avoid breaches treat AI security as an ongoing practice, not a one-time setup.
- Input validation and sanitisation. Filter and inspect all user-facing AI inputs before they reach the model. Strip known injection patterns. Reject inputs that attempt to override system instructions.
- Strict role separation. Maintain clear boundaries between system prompts and user prompts. Use privilege separation so the model cannot access resources beyond what the current task requires.
- Output filtering. Monitor AI outputs for signs of data leakage, prompt exposure, or behaviour outside expected parameters. Block responses that contain internal data markers or system prompt fragments.
- AI red‑teaming. Regularly test AI deployments with adversarial attacks. Attempt prompt injection, jailbreaking, and data extraction against your own systems before attackers do. Measure the blast radius of each successful attack.
- Human-in-the-loop controls. For high-risk agentic workflows, require human approval before the AI executes consequential actions. This limits the damage a successful injection can cause.
- Continuous monitoring. Log and analyse AI interactions for anomalous patterns. Establish baselines for normal behaviour and alert on deviations. Treat AI systems with the same monitoring rigour as any other attack surface.
Enterprise AI Security Glossary
| Term | Definition |
|---|---|
| Prompt Injection | Overriding an AI system's instructions by inserting malicious commands into user input. The number one LLM attack vector. |
| Jailbreaking | Bypassing a model's built-in safety guardrails through crafted prompts. A boardroom-level concern for enterprises deploying AI. |
| Promptware | Prompt injection delivered through documents, emails, or shared files. The attack surface is any content the AI processes. |
| The Lethal Trifecta | Coined by Simon Willison. The combination of private data access, action capability, and untrusted input exposure that creates maximum AI security risk. |
| Blast Radius | The scope of damage a compromised AI agent can cause. Determined by the permissions and data access the AI holds. |
| AI Red‑Teaming | Adversarially testing AI systems to find security weaknesses before attackers do. |
| Agentic Workflows | AI systems that can take autonomous actions such as sending emails, modifying files, or triggering business processes. |
| RAG | Retrieval-Augmented Generation. The standard enterprise architecture for connecting AI models to organisational data. |
| Grounding | Keeping AI outputs anchored to verified data sources. Critical for compliance in regulated industries. |
| Human‑in‑the‑Loop | Requiring human approval before an AI executes high-risk actions. A core control for regulated and enterprise environments. |
| LLM Orchestration | The architecture and tooling used to coordinate multiple AI models, data sources, and actions in enterprise deployments. |
| AI Governance | The policies, controls, and oversight structures an organisation puts in place to manage AI risk. Rapidly becoming a board-level priority. |
Sources
Primary source: HackerOne, 9th Annual Hacker-Powered Security Report, March 2026. The report recorded a 540% surge in prompt injection vulnerability reports, identifying it as the fastest-growing attack vector.
Supporting sources:
HackerOne Blog, “3 Signals from the 2025 Report,” November 2025. Editorial breakdown of the annual report data.
Sonny Labs, “The 2025 Prompt Injection Threat Landscape.” Independent aggregation of key statistics from the HackerOne report and wider industry data.
Security Brief Asia, third-party news coverage of the HackerOne annual report findings.
Simon Willison's Weblog. Origin of the “lethal trifecta” framework for AI security risk assessment.
OWASP, “Top 10 for Large Language Model Applications.” The canonical enterprise reference for LLM security threats and terminology.