When Your AI Agent Is the Vulnerability: Hallucinations and Unsafe Autonomy

Written by Rafter Team
February 9, 2026

No attacker needed. In 2024, a Canadian tribunal held Air Canada to a refund policy its chatbot had hallucinated out of thin air, fabricated terms and all. Now imagine that same hallucination problem in an AI agent with shell access, cloud credentials, and the authority to execute actions autonomously.
LLMs hallucinate. They misinterpret tool outputs. They get stuck in infinite loops. They confidently execute destructive commands based on fabricated reasoning. When the model itself is unreliable, every action it takes is a potential security incident. This isn't about malicious attackers exploiting your system. It's about the system exploiting itself.
Overreliance on LLM output is OWASP's #9 risk for LLM applications. An agent that hallucinates "disable firewall" and executes it causes the same damage as a prompt injection attack, with no attacker required.
The Three Failure Modes
1. Hallucinated Actions
The agent generates commands that have no basis in user intent or tool observations.
How it happens: The user says "clean up unused files in the project directory." The agent, being "helpful," interprets this broadly and decides to delete directories it considers unnecessary. It hallucinates that certain system files are project artifacts. It runs rm -rf on a directory containing production configuration.
No injection. No malicious input. The agent simply reasoned incorrectly and had the permissions to act on that reasoning.
Real-world analog: Researchers have demonstrated that agentic coding tools can be convinced to write insecure code or run dangerous commands without any adversarial input. The model's own interpretation of ambiguous instructions is enough to cause harm.
The danger compounds with autonomy: An agent running in fully autonomous mode (no human approval for actions) can chain multiple hallucinated steps. Step 1: misinterpret a log message. Step 2: conclude there's a security threat. Step 3: "fix" it by modifying firewall rules. Step 4: lock out legitimate users. Each step seems logical to the model. The compound result is catastrophic.
2. Brittle Parsing and Misinterpretation
The agent misreads tool outputs and acts on incorrect data.
How it happens: The agent calls an API. The API returns an HTTP 500 error page containing HTML. The agent tries to parse the error page as if it were valid data. It extracts a snippet of HTML that looks like content and acts on it. Or it encounters an unexpected data format and fills in the gaps with hallucinated values.
# Agent expects JSON, gets error page
# Tool returns: "<html><body>Internal Server Error</body></html>"
# Agent "parses" this and concludes: "Server reports internal error.
# Recommending: restart all services"
# Agent then executes: systemctl restart --all
This isn't theoretical. Any agent that processes real-world API responses will encounter malformed, unexpected, or error responses. How it handles those edge cases determines whether it fails safely or dangerously.
Other parsing failures:
- JSON with unexpected fields interpreted as instructions
- Numeric values misread (confusing bytes with megabytes, leading to wrong resource allocation)
- Timestamps in different formats causing incorrect date-based decisions
- Unicode or encoding issues causing text corruption that the model interprets as commands
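Failing safely here is mostly deterministic plumbing, not prompting. Below is a minimal sketch of a response gate that sits between the HTTP client and the model; the function name and error type are hypothetical, but the principle is general: anything that is not a well-formed, expected response becomes an explicit error instead of text the model gets to interpret.
import json

class ToolResultError(Exception):
    """Raised when a tool response cannot be safely interpreted."""

def parse_tool_response(status_code: int, content_type: str, body: str) -> dict:
    # Fail closed: only a successful JSON object ever reaches the model.
    if status_code != 200:
        raise ToolResultError(f"Tool returned HTTP {status_code}; body withheld from the model")
    if "application/json" not in content_type:
        raise ToolResultError(f"Unexpected content type: {content_type}")
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ToolResultError(f"Response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ToolResultError("Expected a JSON object at the top level")
    return data
With a gate like this, the HTML error page from the earlier example never reaches the model at all; the agent sees a structured failure it must report or retry deliberately, not content it can misread as advice.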
3. Runaway Loops and Resource Exhaustion
The agent gets stuck retrying a failed operation indefinitely.
How it happens: The agent tries to complete a task. A tool returns an error. The agent retries. Same error. The agent tries a different approach that also fails. It loops back to the original approach. This cycle continues until it's made thousands of API calls, racked up significant token costs, and possibly triggered rate limits that affect other users.
Cost implications: If the agent uses a paid LLM API, each loop iteration costs money. A runaway loop over a weekend could generate thousands of dollars in API charges before anyone notices. If it has cloud credentials, it might spawn resources in each iteration ("let me try a bigger instance"), compounding costs exponentially.
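As a back-of-the-envelope illustration (every number below is an assumption, not a measurement), a tight retry loop left alone for a weekend plausibly lands in that range:
# Hypothetical figures for illustration only
calls_per_minute = 10          # a simple retry loop sustains this easily
hours_unattended = 48          # Friday evening to Monday morning
cost_per_call_usd = 0.10       # assumed blended cost of one LLM call

total_calls = calls_per_minute * 60 * hours_unattended   # 28,800 calls
total_cost_usd = total_calls * cost_per_call_usd          # ~$2,880
print(f"{total_calls:,} calls, ~${total_cost_usd:,.0f}")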
Availability impact: The agent consumes compute, memory, and API quota. Other users or tasks are starved. The system becomes unresponsive. What started as a benign task becomes a self-inflicted denial of service.
Why Traditional Fixes Don't Work
"Just make the prompts better"
Better prompting helps at the margins but doesn't solve the fundamental problem. LLMs are probabilistic. They will occasionally produce incorrect outputs regardless of prompt quality. A system prompt saying "never delete files without confirmation" can be overridden by the model's own reasoning: "I confirmed with myself that this is safe."
"Use a better model"
More capable models hallucinate less often, but they still hallucinate. GPT-4 hallucinates less than GPT-3.5, GPT-5 less than GPT-4, and none of them reaches zero. And the more capable the model, the more convincing its hallucinations appear, making them harder to detect.
"Fine-tune for safety"
Fine-tuning can reduce specific failure modes but introduces new ones. The model might become overly cautious (refusing legitimate requests) or develop blind spots for novel scenarios not covered in training. Safety fine-tuning is a layer, not a solution.
Defense Architecture: Design for Unreliability
The only reliable approach: assume the model will make mistakes and build systems that contain the damage.
Separate Decision from Execution
Never let the model directly control actuators. The LLM outputs an intent ("delete file X"), which is validated by deterministic code before execution.
import os

# ✗ Vulnerable: LLM directly executes
def agent_action(llm_output: str):
    if llm_output.startswith("EXECUTE:"):
        command = llm_output.split("EXECUTE:", 1)[1]
        os.system(command)  # Whatever the LLM says, goes

# ✓ Secure: Validate before executing
# (parse_intent, intent_matches_request, DANGEROUS_ACTIONS, and safe_execute
# are application-defined)
def agent_action(llm_output: str, user_request: str):
    intent = parse_intent(llm_output)
    # Validate intent matches user request
    if not intent_matches_request(intent, user_request):
        return "Action doesn't match your request. Skipping."
    # Check against policy
    if intent.action in DANGEROUS_ACTIONS:
        return f"Action '{intent.action}' requires approval."
    # Execute through controlled interface
    return safe_execute(intent)
Verification at Every Step
For factual queries, require source citations. For actions, require justification that traces back to user intent or tool observations.
# ✓ Secure: Require justification chain
def validate_action(action, context):
    # Every action must trace to either:
    # 1. An explicit user request
    # 2. A tool observation that logically leads to this action
    if action.source not in ("user_request", "tool_observation"):  # no traceable origin
        log_warning(f"Unjustified action blocked: {action}")
        return False
    if action.is_destructive and not action.has_user_confirmation:
        return False
    return True
Limit Autonomy Windows
Don't let the agent run indefinitely without checkpoints.
Iteration limits: Maximum number of tool calls per task. After N iterations, stop and ask for human input.
Time limits: Maximum wall-clock time per task. If the agent hasn't completed in 5 minutes, something is likely wrong.
Cost limits: Maximum token spend per task. Circuit-break before a runaway loop becomes expensive.
import time

# ✓ Secure: Circuit breakers on agent execution
class AgentRunner:
    MAX_ITERATIONS = 25
    MAX_DURATION_SECONDS = 300
    MAX_COST_USD = 5.0

    def run(self, task):
        start = time.time()
        iterations = 0
        total_cost = 0.0
        while not task.complete:
            iterations += 1
            if iterations > self.MAX_ITERATIONS:
                return "Task exceeded iteration limit. Stopping."
            if time.time() - start > self.MAX_DURATION_SECONDS:
                return "Task exceeded time limit. Stopping."
            result = self.execute_step(task)
            total_cost += result.cost
            if total_cost > self.MAX_COST_USD:
                return "Task exceeded cost limit. Stopping."
        return "Task complete."
Structured Output Validation
Don't let the model free-form generate commands. Use structured output schemas that constrain what the model can express.
# ✓ Secure: Schema-constrained tool calls
ALLOWED_ACTIONS = {
    "read_file": {"params": ["path"], "validation": validate_path},
    "write_file": {"params": ["path", "content"], "validation": validate_path},
    "search": {"params": ["query"], "validation": validate_query},
}

def execute_tool_call(action_name: str, params: dict):
    if action_name not in ALLOWED_ACTIONS:
        raise ValueError(f"Unknown action: {action_name}")
    schema = ALLOWED_ACTIONS[action_name]
    schema["validation"](params)  # Type and value checks
    return tools[action_name](**params)
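The validation hooks carry the real security weight. As a minimal sketch, here is what a validate_path check might look like under the assumption that the agent is confined to a single workspace directory (the sandbox path and parameter names are illustrative):
from pathlib import Path

SANDBOX_ROOT = Path("/srv/agent-workspace").resolve()  # assumed workspace root

def validate_path(params: dict) -> None:
    # Reject missing values, directory traversal, and symlink escapes.
    raw = params.get("path")
    if not isinstance(raw, str) or not raw:
        raise ValueError("Missing or non-string 'path' parameter")
    resolved = (SANDBOX_ROOT / raw).resolve()
    if not resolved.is_relative_to(SANDBOX_ROOT):  # Python 3.9+
        raise ValueError(f"Path escapes the workspace: {raw}")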
Dual-Model Verification
For high-stakes actions, use a second model (or a simpler rule-based system) to verify the primary model's output.
The primary model proposes an action. The verification model checks: Does this action make sense given the user's request? Is it potentially destructive? Does the justification hold up?
This adds latency and cost but catches hallucinated actions that a single model would execute without hesitation.
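A minimal sketch of that verification pass, assuming a generic call_llm(system_prompt, user_message) helper for whichever provider you use; the prompt wording and the fail-closed default are illustrative choices, not a standard:
import json

VERIFIER_PROMPT = (
    "You are a safety reviewer. Given a user request and a proposed agent action, "
    'reply with JSON: {"approve": true or false, "reason": "..."}. '
    "Reject actions that are destructive, irreversible, or not clearly justified by the request."
)

def verify_action(call_llm, user_request: str, proposed_action: dict) -> bool:
    # call_llm is an assumed helper: (system_prompt, user_message) -> str
    reply = call_llm(
        VERIFIER_PROMPT,
        f"User request: {user_request}\nProposed action: {json.dumps(proposed_action)}",
    )
    try:
        verdict = json.loads(reply)
    except json.JSONDecodeError:
        return False  # fail closed: an unparseable verdict counts as a rejection
    return verdict.get("approve") is True
Treat the verifier as another unreliable component: it lowers risk, it does not guarantee correctness, so destructive actions should still fall back to explicit human approval.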
Monitoring for Model Failures
Detection is harder than prevention because model failures look like normal behavior from the outside.
Behavioral baselines: Track what the agent normally does. If it typically reads files and summarizes, but suddenly starts executing shell commands, flag it, even without an obvious trigger.
Action-to-intent correlation: Log both the user's original request and every action the agent takes. Periodically audit whether actions actually correspond to requests. A gap suggests hallucination.
Error rate tracking: Monitor how often the agent's actions fail. A spike in failures might indicate the agent is pursuing a hallucinated goal that doesn't match reality.
Output anomaly detection: Flag unusual outputs. If the agent normally returns 200-word summaries but suddenly produces a 5,000-word response, or if it includes content that doesn't appear in any source material, investigate.
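A minimal sketch of the action-to-intent logging that makes those audits possible, assuming an append-only JSON-lines log; the record fields are illustrative:
import json
import time
import uuid

def log_agent_action(log_file, request_id: str, user_request: str,
                     action: str, justification: str) -> None:
    # One structured record per action, keyed by the originating request,
    # so an auditor can later ask: which user request led to this command?
    record = {
        "id": str(uuid.uuid4()),
        "request_id": request_id,
        "timestamp": time.time(),
        "user_request": user_request,
        "action": action,
        "justification": justification,
    }
    log_file.write(json.dumps(record) + "\n")
Reviewing these records periodically (does every action trace back to a request it plausibly serves?) is what surfaces hallucinated goals that slipped past runtime checks.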
Conclusion
The LLM is an unreliable component in a system that takes real-world actions. No amount of prompting, fine-tuning, or model improvement eliminates this. The only safe approach is architectural: separate decisions from execution, validate at every step, limit autonomy, and monitor continuously.
Design checklist:
- LLM output validated before any tool execution
- Destructive actions require explicit user confirmation
- Circuit breakers on iterations, time, and cost
- Structured output schemas constrain tool calls
- Action-to-intent correlation logged and auditable
- Behavioral baselines established and monitored
- Human checkpoint required for multi-step critical tasks
The model will make mistakes. Build systems where those mistakes can't cause irreversible harm.
Related Resources
- Open Claw Security Audit: Full Series Overview
- AI Agent Architecture: Threat Modeling Your Attack Surface
- AI Agent Tool Misuse and Over-Privileged Access
- Security Controls and Guardrails Architecture
- Incident Response and Recovery Playbook for AI Agents
- Prompt Injection Attacks: The #1 AI Agent Security Risk