A complete guide for testing AI language models — from basic concepts to advanced red-team techniques.
An LLM (Large Language Model) is an AI system like ChatGPT, Gemini, or Claude that can read, write, and respond to human language. They are used in chatbots, customer support, coding assistants, and many other products.
LLM Security Testing means checking if these AI systems can be tricked, misused, or exploited by attackers.
Attackers can abuse LLM-powered apps to steal money, manipulate transactions, or commit fraud through AI agents with financial access.
LLMs trained on private data or connected to databases can be tricked into revealing confidential information, API keys, or personal records.
Bad actors can "jailbreak" AI to produce harmful content — malware, instructions for weapons, misinformation, or illegal material.
Autonomous AI agents that can browse the web, run code, or send emails can be hijacked to perform malicious actions on behalf of an attacker.
Vulnerable AI systems can violate GDPR, HIPAA, or other regulations, resulting in heavy fines and legal consequences.
An AI that says harmful, biased, or embarrassing things publicly can severely damage brand trust and user confidence.
Before testing, understand what the AI is supposed to do. Is it a customer support bot? A coding helper? A medical advisor? Know its purpose, audience, and what data it has access to.
List all the things the AI can do. Can it send emails? Read files? Search the web? The more it can do, the more dangerous it can be if misused.
Try to confuse the AI. Ask it to pretend it's a different AI. Ask it to ignore its instructions. Ask it to help with something it normally shouldn't. Watch what it says.
Test the AI's safety guardrails. Ask it about things it normally refuses. Use roleplay scenarios. Try different languages. See if it can be slowly guided toward bad responses.
Review the AI's responses carefully. Did it reveal anything sensitive? Did it break rules? Did it do something unexpected? This is the most important step.
Write down what worked, what didn't, and how serious each problem is. Rate each issue: Low / Medium / High / Critical.
Prompt injection is the #1 LLM attack. An attacker inserts malicious instructions into the LLM's input to override the original system instructions.
Attacker directly sends a crafted prompt via the user interface.
Malicious instructions hidden in external data the LLM processes (web pages, PDFs, emails).
| Sub-Type | Vector | Target | Severity |
|---|---|---|---|
| System Prompt Override | User input field | Instructions | CRITICAL |
| Role Confusion | Roleplay framing | Identity / behavior | HIGH |
| Indirect via RAG doc | Uploaded PDF / URL | Context window | CRITICAL |
| Jailbreak chains | Multi-turn chat | Safety filters | HIGH |
| Token smuggling | Encoding tricks | Tokenizer | MEDIUM |
Forcing the model to reveal sensitive information it shouldn't share — system prompts, training data, user records, API keys, or PII from its context.
RAG systems fetch external documents to augment responses. Attackers target the retrieval pipeline, the vector database, or documents ingested by the system.
Upload or trick the system to ingest a document containing hidden instructions.
Attacker plants a web page that gets retrieved and injected into the LLM's context.
Autonomous LLMs with tool access (code execution, web browsing, file system, APIs) are prime targets. Injected instructions can hijack tool calls.
| Tool Type | Attack Vector | Impact |
|---|---|---|
| Code Interpreter | Inject shell commands via prompt | RCE, data exfil |
| Web Browser | Visit attacker-controlled page | Indirect injection, SSRF |
| Email / Calendar | Forge send actions | Data leak, phishing |
| Database Query | SQL injection in tool call | Data breach |
| File System | Path traversal in filename args | File read/write |
Techniques to influence model behavior beyond typical prompting — targeting inference, fine-tuning pipelines, or the supply chain.
Craft inputs with unusual Unicode, whitespace, or token sequences that bypass safety classifiers while preserving human readability.
Insert backdoor triggers into fine-tuning datasets. Model behaves normally but produces malicious output when a trigger phrase is used.
If you have access to model embeddings, use inversion attacks to reconstruct original training text, leaking private data.
When the API exposes log probabilities, use beam search tricks to reconstruct the system prompt character by character.
Exploit the computationally expensive nature of LLM inference to degrade service, inflate costs, or trigger infinite loops.
The OWASP Foundation publishes the top 10 most critical security risks for LLM-based applications. Every LLM security test should cover all 10.
Attackers craft inputs that override the system's original instructions, changing how the AI behaves.
The AI's output is passed directly to other systems (browsers, databases, shells) without sanitization, enabling XSS, SQLi, or code injection.
Malicious data is injected into training datasets to introduce backdoors, biases, or vulnerabilities into the model.
Inputs designed to consume excessive computational resources, degrading performance or inflating cloud costs.
Compromised pre-trained models, datasets, or third-party plugins introduce vulnerabilities before the application is even built.
The LLM reveals confidential data — system prompts, user data, API keys, proprietary information — through cleverly crafted prompts.
Poorly designed plugins/tools grant excessive permissions, lack input validation, or can be invoked with attacker-controlled parameters.
An autonomous AI agent has too many permissions or acts without proper human oversight, enabling attackers to cause real-world harm.
Organizations or users blindly trust AI-generated content without verification, leading to security decisions based on hallucinated or manipulated output.
Attackers extract or replicate a proprietary model through repeated queries, API abuse, or model inversion attacks.
| Vulnerability | Primary Fix | Secondary Control | Test Frequency |
|---|---|---|---|
| LLM01 Prompt Injection | Input validation + privilege separation | Output filtering, human-in-loop | Every deployment |
| LLM02 Insecure Output | Sanitize all LLM outputs before rendering | CSP headers, context-aware encoding | Every deployment |
| LLM03 Training Poisoning | Curate & verify training datasets | Model red-teaming post-train | Before model release |
| LLM04 Model DoS | Rate limiting + token caps | Cost alerts, anomaly detection | Monthly |
| LLM05 Supply Chain | Verify model hashes & provenance | Private model registry | On every update |
| LLM06 Info Disclosure | Never embed secrets in system prompts | Prompt confidentiality instructions | Every deployment |
| LLM07 Plugin Design | Principle of least privilege for tools | Input validation on all tool params | Every tool change |
| LLM08 Excessive Agency | Human approval for high-impact actions | Minimal permissions, audit logs | Every deployment |
| LLM09 Overreliance | Build human review into workflows | Confidence scoring, disclaimers | Ongoing |
| LLM10 Model Theft | Rate limit API + monitor query patterns | Fingerprinting, watermarking | Continuous monitoring |