Benchmarking AI Code Security Agents (2026)

Written by the Rafter Team

AI-powered code security tools make bold claims: "finds vulnerabilities traditional scanners miss," "understands code context," "zero configuration." But claims aren't evidence. Traditional scanner benchmarks like OWASP Benchmark and NIST's Juliet Test Suite evaluate known vulnerability patterns—and AI tools should pass those too. But AI agents also need evaluation on dimensions that traditional benchmarks never tested: novel pattern discovery, cross-file data flow reasoning, explanation quality, and consistency across non-deterministic runs. Without rigorous benchmarking, you're choosing security tools based on marketing copy.
This post presents a five-dimension methodology for evaluating AI code security agents, surveys what existing benchmarks actually measure, and explains why you need to run your own evaluation on your own codebase before trusting any tool's self-reported numbers.
Vendor-reported benchmark numbers are often based on contaminated test suites with synthetic, single-file patterns. The only benchmark you can trust is one you run on your own codebase.
Why Traditional Benchmarks Aren't Enough
The security scanning industry has relied on two benchmark families for over a decade: the OWASP Benchmark, a suite of 2,740 Java test cases covering 11 CWE categories, and the NIST SAMATE Juliet Test Suite, with roughly 64,000 test cases across C/C++ and Java. Both provide near-perfect ground truth labels—you know exactly which test cases are vulnerable and which are safe.
These benchmarks served their purpose for traditional SAST tools. But they share fundamental limitations that make them insufficient for evaluating AI agents:
Single-file, single-vulnerability patterns. Every OWASP Benchmark test case is one servlet with one vulnerability. Every Juliet test case is one function with one flaw. Real vulnerabilities span multiple files—user input enters through a route handler, flows through middleware, passes through a service layer, and reaches an unsafe database call three files away. No traditional benchmark tests this multi-file reasoning.
Severe data contamination. OWASP Benchmark has been public since 2015. Juliet since 2012. Every frontier LLM has seen these files in training data. When an AI tool scores well on OWASP Benchmark, you don't know whether it's genuinely reasoning about security or pattern-matching memorized examples. Juliet is worse—function names literally encode the CWE category (CWE89_SQL_Injection__...), making classification trivial for any language model.
No difficulty gradient. Traditional benchmarks mix trivial cases (direct parameter-to-sink) with moderate ones (list manipulation before reaching sink), but they don't label difficulty levels. You can't distinguish "this tool handles easy cases" from "this tool handles hard cases." A tool that scores 80% by catching all the easy cases and missing all the hard ones looks identical to one that catches 80% across the board.
Missing vulnerability classes. OWASP Benchmark covers 11 CWE categories, all web-application focused. No SSRF, no deserialization attacks, no race conditions, no authentication bypass, no business logic flaws. The Juliet suite focuses on memory safety and basic injection—categories where traditional SAST already performs well.
The result: a tool can score impressively on traditional benchmarks while failing on the vulnerabilities that actually matter in production code. Traditional benchmarks test pattern recognition. AI agents claim to do reasoning. You need benchmarks that test reasoning.
A Benchmarking Methodology for AI Security Agents
Evaluating AI code security agents requires measuring five distinct dimensions. No single number—not accuracy, not F1 score, not recall—captures whether a tool is actually useful. Each dimension answers a different question that matters for real-world deployment.
Dimension 1: Detection Rate
The baseline question: does the tool find vulnerabilities?
Measure recall (true positives / total known vulnerabilities) across multiple test categories:
- Known patterns: Standard CWE categories that traditional SAST handles. AI tools should match or exceed SAST recall here. If an AI agent misses an obvious SQL injection that Semgrep catches, it's not adding value—it's subtracting it.
- Novel patterns: Vulnerabilities that don't match any published rule or signature. Cross-file data flows, framework-specific misconfigurations, architectural flaws. This is where AI tools claim differentiation.
- Real-world codebases: Open-source projects with known CVEs. Does the tool find the vulnerability in the full repository context, not just in an isolated code snippet?
Report detection rates per CWE category, not just aggregate. A tool with 85% overall recall might have 95% on SQL injection and 40% on SSRF—and that 40% matters if your application makes outbound HTTP requests.
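The per-category breakdown above is simple to compute once findings are triaged. A minimal sketch, assuming `known` is your ground-truth list and `found` is the set of vulnerability IDs the tool detected (both data shapes are illustrative, not any tool's API):

```python
# Sketch: per-CWE recall from ground-truth vulnerabilities and triaged findings.
from collections import defaultdict


def per_cwe_recall(known: list[dict], found: set) -> dict[str, float]:
    """known: [{"id": ..., "cwe": "CWE-89"}, ...]; found: IDs the tool detected."""
    hits, totals = defaultdict(int), defaultdict(int)
    for vuln in known:
        totals[vuln["cwe"]] += 1
        if vuln["id"] in found:
            hits[vuln["cwe"]] += 1
    return {cwe: hits[cwe] / totals[cwe] for cwe in totals}
```

Reporting this dictionary instead of a single aggregate number is what surfaces the "95% on SQL injection, 40% on SSRF" pattern.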
Dimension 2: False Positive Rate
The question that determines whether developers actually use the tool: how much noise does it generate?
Precision (true positives / total findings) is the metric, but raw precision numbers obscure important distinctions. Break false positives into categories:
- Pattern-matching FPs: The tool flags code that superficially matches a vulnerability pattern but is actually safe. An f-string in a SQL query where the interpolated value comes from an enum, never from user input.
- Context-missing FPs: The tool correctly identifies a potentially dangerous pattern but fails to trace the data flow far enough to see that input validation happens upstream. The code would be vulnerable without the middleware, but the middleware exists.
- Hallucinated FPs: The tool reports a vulnerability that doesn't correspond to any actual code pattern—it's fabricating findings. This is unique to AI tools and doesn't occur with traditional SAST.
Google's static analysis team found that developers stop using tools with false positive rates above roughly 10-15% (Sadowski et al., CACM 2018). An AI agent that finds 20% more real vulnerabilities than SAST but generates 50% more false positives may actually reduce security—developers ignore all findings when trust erodes.
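A small sketch of how to report precision alongside the false-positive breakdown, assuming each finding has been manually triaged into one of the labels above (the label strings are illustrative, not a standard taxonomy):

```python
# Sketch: precision plus a false-positive breakdown by category.
# Labels come from manual triage: "tp", "fp_pattern", "fp_context", "fp_hallucinated".
from collections import Counter


def precision_report(labels: list[str]) -> dict:
    counts = Counter(labels)
    total = len(labels)
    return {
        "precision": counts["tp"] / total if total else 0.0,
        "fp_breakdown": {k: v for k, v in counts.items() if k.startswith("fp_")},
    }
```

The breakdown matters because the categories have different remedies: pattern-matching FPs suggest weak reasoning, context-missing FPs suggest shallow data flow tracing, and any hallucinated FPs at all are a trust problem.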
Dimension 3: Explanation Quality
AI agents don't just flag lines of code—they generate natural language explanations. This is a new dimension that traditional SAST benchmarks never needed to measure, and it matters enormously for developer adoption.
Evaluate explanations on three criteria:
- Accuracy: Does the explanation correctly describe the vulnerability mechanism? A finding that says "SQL injection via unsanitized user input" when the actual issue is a path traversal is worse than no explanation—it sends developers looking in the wrong direction.
- Actionability: Can a developer fix the issue based solely on the explanation? Does it identify the root cause, or just the symptom? Does it suggest a specific remediation, or just restate the problem?
- Hallucination rate: Does the explanation reference code, functions, or behaviors that don't exist? AI tools can generate confident, detailed explanations for vulnerabilities that aren't real—complete with plausible-sounding function names and file paths.
Scoring explanation quality requires human evaluation. Have developers rate explanations on a 1-5 scale for each criterion, or use a binary "would this help me fix the issue: yes/no" metric for faster evaluation.
Dimension 4: Consistency Across Runs
Traditional SAST tools are deterministic—run them twice, get identical results. AI agents are not. LLM-based analysis introduces non-determinism: the same code may produce different findings on different runs due to sampling temperature, context window management, or prompt sensitivity.
Measure consistency by running the same scan three to five times and computing:
- Finding stability: What percentage of findings appear in every run?
- Severity consistency: When the same vulnerability is found across runs, does it receive the same severity rating?
- Missing findings variance: Which vulnerabilities are found in some runs but missed in others?
High variance is a deployment problem. If a critical vulnerability appears in 3 out of 5 runs, your CI/CD pipeline will pass it 40% of the time. Teams need to know whether they should run the scanner multiple times and union the results, or whether a single run is reliable enough.
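The stability metric above reduces to set arithmetic once each run's findings are fingerprinted. A minimal sketch, assuming each run is a set of finding fingerprints such as `(file, line, cwe)` tuples (the fingerprinting scheme is an assumption; real findings usually need fuzzier matching across runs):

```python
# Sketch: finding stability across N runs of the same scan.
def finding_stability(runs: list[set]) -> dict:
    all_findings = set().union(*runs)           # found in at least one run
    in_every_run = set.intersection(*runs)      # found in every run
    return {
        "stable_fraction": len(in_every_run) / len(all_findings) if all_findings else 1.0,
        "intermittent": all_findings - in_every_run,
    }
```

The `intermittent` set is the one to watch: those are the findings a single-run CI gate will sometimes miss.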
Dimension 5: Novel Pattern Discovery
The dimension AI tools most want to be evaluated on—and the hardest to measure rigorously.
Test this with intentionally crafted vulnerability patterns that no published benchmark contains and that don't match standard SAST rules. Categories include:
- Multi-hop data flows: User input enters in file A, passes through transformation in file B, is cached in file C, retrieved in file D, and reaches an unsafe sink in file E. Five or more hops between source and sink.
- Framework-specific misconfigurations: Middleware ordering bugs, decorator interactions, dependency injection misuse—vulnerabilities that require understanding the framework's runtime behavior, not just its syntax.
- AI-generated code patterns: Inconsistent error handling, mixed authentication approaches across files, hardcoded defaults that a human developer would catch during writing but that AI coding assistants generate without flagging.
The spiked-repository approach—injecting known vulnerabilities into real open-source codebases—is the most rigorous method for testing novel pattern discovery. More on this below.
The Test Suite
A credible benchmark uses multiple test categories, each serving a different purpose.
Category 1: Synthetic Benchmarks (Sanity Check)
Run OWASP Benchmark and a de-leaked subset of the Juliet Test Suite. These establish a floor: if an AI agent can't find obvious, single-file vulnerabilities that every commercial SAST tool catches, nothing else matters.
De-leaking Juliet is critical. Strip function names that encode CWE categories, remove NIST-specific macros, and randomize file structure. Without de-leaking, results from LLM-based tools are measuring memorization, not analysis.
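One step of that de-leaking can be sketched as a consistent rename of every identifier that encodes its CWE category. The regex and naming scheme here are assumptions for illustration; full de-leaking also requires removing NIST-specific macros and restructuring files:

```python
# Sketch: rewrite Juliet-style function names (e.g. CWE89_SQL_Injection__...)
# into neutral identifiers, mapping repeated names consistently so that
# callers and callees still link up after the rename.
import re


def strip_cwe_names(source: str) -> str:
    mapping: dict[str, str] = {}

    def neutral(match: re.Match) -> str:
        name = match.group(0)
        if name not in mapping:
            mapping[name] = f"case_{len(mapping) + 1}"
        return mapping[name]

    return re.sub(r"CWE\d+_\w+", neutral, source)
```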
Expect AI agents to score comparably to traditional SAST on these benchmarks. The value isn't differentiation—it's baseline validation.
Category 2: Real-World CVE Repositories
Scan open-source projects with known, patched CVEs. Check out the code at the vulnerable commit and scan it. The ground truth is the CVE itself—you know what the vulnerability is, where it lives, and what the fix looks like.
Datasets like CrossVul provide CVE-fixing commits across 40+ languages, though per-language coverage is thin. The DARPA AIxCC challenge used five real C/C++ projects (Nginx, FreeRDP, among others) with 25 seeded vulnerabilities—the gold standard for project-level evaluation, though most of the data isn't publicly available.
The challenge with real-world CVE testing: label noise. Commit-based heuristics (marking all functions touched by a CVE fix as "vulnerable") produce roughly 36% incorrect labels according to CodeXGLUE analysis. Use curated datasets like PrimeVul's human-verified subset of roughly 300 samples for reliable precision and recall numbers.
Category 3: Spiked Open-Source Projects (Primary Evaluation)
The most rigorous approach: take a large, maintained open-source codebase, inject known vulnerabilities, and score the agent's ability to find them without hints.
A well-designed spiked benchmark includes:
- Difficulty tiers: Tier 1 vulnerabilities are pattern-matchable (1-2 data flow hops). Tier 2 require moderate data flow tracing (3-4 hops). Tier 3 involve deep architectural reasoning (5-6+ hops across multiple files). Target distribution: 20% Tier 1, 40% Tier 2, 40% Tier 3.
- CWE variety: Cover injection, authentication bypass, SSRF, path traversal, insecure deserialization, XSS, and access control flaws. Don't over-index on a single category.
- False positive traps: Include code patterns that look vulnerable but aren't—safe uses of dynamic SQL, validated-but-suspicious input handling, suppression comments that are actually correct. These test precision directly.
- Multi-language coverage: If the tool claims to support 10 languages, test it across at least 5 with representative projects per language.
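A spiked benchmark needs a machine-readable manifest so scoring can be automated. One hypothetical entry might look like this (every field name here is illustrative; the point is that each injection records its tier, CWE, full source-to-sink path, and whether it is a false positive trap):

```python
# Sketch: a manifest entry for one injected vulnerability in a spiked repo.
SPIKE_MANIFEST = [
    {
        "id": "spike-014",
        "cwe": "CWE-918",          # SSRF
        "tier": 2,                 # 3-4 data flow hops
        "path": ["routes.py", "validators.py", "webhooks.py"],
        "sink": {"file": "services/webhooks.py", "symbol": "send_webhook"},
        "is_trap": False,          # True for look-alike safe patterns
    },
]
```

Scoring then reduces to matching each tool finding against manifest entries: a hit on a real spike is a true positive, a hit on an `is_trap` entry is a false positive.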
Category 4: AI-Generated Code Samples
Specifically test how tools handle code produced by AI assistants. AI-generated code has distinct vulnerability patterns: inconsistent error handling across files, mixed authentication approaches, over-permissive defaults, and hardcoded values that a human developer would parameterize.
Generate a set of small applications using popular AI coding tools, introduce no intentional vulnerabilities, and document the security issues that exist. Then score each scanner's ability to identify them.
Example: A Benchmark Test Case
Here's a Tier 2 test case from a spiked benchmark—a multi-file SSRF vulnerability in a Python web application:
```python
# ✗ Vulnerable: SSRF via webhook URL in services/webhooks.py
# Data flow: routes.py → validators.py → webhooks.py (3 hops)
from urllib.parse import urlparse

import httpx


def send_webhook(config: WebhookConfig):
    # config.url was "validated" in validators.py, but validation
    # only checks URL format—not whether it targets internal networks
    parsed = urlparse(config.url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError("Invalid scheme")
    # Attacker supplies url=http://169.254.169.254/latest/meta-data/
    # and retrieves AWS instance credentials
    response = httpx.get(config.url, timeout=10)
    return response.json()
```
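For reference, a hedged sketch of what a fix for this test case could look like: resolve the webhook host and reject private, loopback, and link-local targets before fetching. The helper name is illustrative (not part of httpx), and a production version must also guard against DNS rebinding between check and use:

```python
# Sketch: network-level validation the "validated" URL above is missing.
import ipaddress
import socket
from urllib.parse import urlparse


def is_safe_webhook_url(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        # Resolve the hostname and check every resolved address.
        infos = socket.getaddrinfo(parsed.hostname, parsed.port or 80)
    except socket.gaierror:
        return False
    for info in infos:
        addr = ipaddress.ip_address(info[4][0])
        if addr.is_private or addr.is_loopback or addr.is_link_local:
            return False
    return True
```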
How different tool categories handle this test case:
- Semgrep: Misses it. No built-in rule matches `httpx.get()` with user-controlled URLs across files.
- CodeQL: Catches it only if a custom SSRF query is configured for the `httpx` library and taint tracking spans all three files. Default queries miss it.
- Raw LLM: Flags it in isolation but also flags 12 other `httpx.get()` calls that use hardcoded internal URLs—70% false positive rate on this pattern.
- Agent-scaffolded AI: Traces the data flow from `routes.py` through `validators.py`, identifies the missing network-level validation, and flags only this call. One true positive, zero false positives.
This single test case illustrates why multi-file, multi-tool evaluation matters. Single-file benchmarks would never surface this vulnerability.
Results: Detection Rates
Published data from academic research and industry testing reveals a consistent pattern across AI code security tools.
Traditional SAST baselines: CodeQL achieves roughly 90%+ precision but 8-30% recall on project-level evaluation. On the DARPA AIxCC challenge, CodeQL found 2 of 25 seeded vulnerabilities—8% recall. Semgrep shows high precision (rule-dependent) with estimated 30-50% recall. These tools find what their rules cover and nothing else.
Raw LLM analysis: Sending code directly to frontier models without agent scaffolding produces high recall but unusable noise levels. Research from VulnLLM-R (December 2025) showed GPT-4 generating 1,057 false positives on a single project (FreeRDP), and Claude producing 1,834 false positives on the same codebase—55% of all functions flagged. Raw LLMs have the knowledge to identify vulnerabilities but lack the precision engineering to be deployable.
Agent-scaffolded AI tools: The current generation of AI security agents—which combine LLM reasoning with structured workflows like file prioritization, multi-pass analysis, and adversarial validation—achieves F1 scores in the 0.81-0.87 range on function-level benchmarks (Python, C/C++, Java). On project-level evaluation, the best published results show 84% recall (21/25 vulnerabilities found on AIxCC), with false positive rates dramatically lower than raw LLM analysis.
| Tool Category | Typical Precision | Typical Recall | Sweet Spot |
|---|---|---|---|
| Traditional SAST (CodeQL, Semgrep) | 80-95% | 8-50% | Known patterns, low noise |
| Raw LLM (GPT-4, Claude) | 15-40% | 60-80% | Broad detection, unusable FP rate |
| Agent-scaffolded AI | 70-85% | 60-84% | Balance of coverage and precision |
| SAST + AI hybrid | 75-90% | 50-70% | Best practical tradeoff |
The key insight: agent scaffolding matters more than model quality. The gap between raw GPT-4 (15-40% precision) and GPT-4 with agent infrastructure (70-85% precision) is larger than the gap between different frontier models. How you use the LLM matters more than which LLM you use.
Results: False Positive Rates and Consistency
False positive behavior in AI security agents differs fundamentally from traditional SAST. Static tools produce deterministic false positives—the same scan always generates the same incorrect findings. You can triage them once and suppress them permanently. AI agents introduce two new failure modes.
Non-deterministic findings. Run the same AI scanner twice on identical code. Some findings appear in both runs. Some appear in only one. In testing, finding stability across runs ranges from 70-90% for well-engineered agent systems and drops to 50-60% for raw LLM approaches. A vulnerability that appears in 4 out of 5 runs is probably real. One that appears once might be an artifact of sampling.
Confidence calibration problems. AI agents often assign severity or confidence scores to findings, but these scores may not correlate with actual accuracy. A "high confidence" finding from an AI agent isn't necessarily more likely to be a true positive than a "medium confidence" one. Benchmark evaluation should include calibration analysis: plot precision against reported confidence to see if the scores are meaningful.
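That calibration analysis is straightforward to run once findings are triaged: bucket findings by the tool's reported confidence and compare each bucket's empirical precision. A minimal sketch, assuming triage has labeled each finding true or false (the data shape is illustrative):

```python
# Sketch: calibration check—empirical precision per reported confidence level.
# If "high" doesn't beat "medium", the confidence scores carry no information.
from collections import defaultdict


def calibration(findings: list[dict]) -> dict[str, float]:
    """findings: [{"confidence": "high", "is_tp": True}, ...] from triage."""
    buckets = defaultdict(list)
    for f in findings:
        buckets[f["confidence"]].append(f["is_tp"])
    return {conf: sum(tps) / len(tps) for conf, tps in buckets.items()}
```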
Practical implications for CI/CD integration:
- Run AI scans multiple times if your budget allows, and union the results for maximum recall
- Weight findings by frequency across runs—findings that appear consistently are more likely to be real
- Don't block deployments on single-run findings without human review
- Track false positive rates over time as models update—a model upgrade can change your FP profile overnight
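The first two recommendations above can be combined into one merge step: union the runs for maximum recall, and attach each finding's appearance frequency so the pipeline gates only on findings seen in a majority of runs. A sketch with an illustrative threshold:

```python
# Sketch: merge N scan runs for CI—union for recall, frequency for gating.
from collections import Counter


def merge_runs(runs: list[set], gate_threshold: float = 0.6) -> dict:
    freq = Counter(f for run in runs for f in run)
    n = len(runs)
    return {
        "all_findings": {f: count / n for f, count in freq.items()},
        "gate_on": {f for f, count in freq.items() if count / n >= gate_threshold},
    }
```

Findings below the threshold still surface in `all_findings` for human review; they just don't block the deployment on their own.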
Results: Explanation Quality
Explanation quality is where AI agents provide genuinely new value over traditional SAST. A CodeQL finding tells you "CWE-89: SQL Injection on line 47." An AI agent can tell you why line 47 is vulnerable, how user input reaches it, and what the fix should look like.
But explanation quality varies wildly:
Accurate, actionable explanations (best case): The tool traces the data flow from source to sink, identifies the missing validation, and suggests a specific fix using the project's existing patterns. A developer can act on this explanation without additional investigation.
Accurate but vague (common): "This endpoint may be vulnerable to injection attacks. Consider adding input validation." Technically correct, practically useless. The developer still needs to figure out which input, what validation, and where to apply it.
Hallucinated explanations (worst case): The tool references functions that don't exist, describes data flows that don't happen, or attributes the vulnerability to the wrong mechanism. These are actively harmful—developers waste time investigating phantom issues and lose trust in the tool.
When benchmarking, allocate time for manual evaluation of at least 50 explanations per tool. Score each on: (1) does it correctly identify the root cause? (2) does it trace the actual data flow? (3) can a developer fix the issue using only this explanation? (4) does it reference real code elements?
What the Benchmarks Don't Capture
No benchmark, however well-designed, captures everything that matters for tool selection. Seven factors live outside the scope of any automated evaluation:
Integration quality. Does the tool work with your CI/CD pipeline, your IDE, your PR workflow? A tool with 90% recall that requires manual invocation and produces unstructured text output is less useful than one with 70% recall that posts inline comments on your pull requests.
Speed. Scan time determines whether the tool runs on every commit, every PR, or once a week. Some AI agents take minutes per file; others process entire repositories in under 30 seconds. Benchmark results are meaningless if the tool is too slow for your workflow.
Cost. AI agents consume LLM API tokens proportional to code volume. Scanning a 10,000-file monorepo might cost $50-200 per run depending on the tool and model. Compare this to the effectively zero marginal cost of running open-source SAST tools.
Developer experience. How findings are presented, how false positives are suppressed, how the tool handles incremental scans on changed files—these UX factors determine adoption more than detection metrics.
Ongoing accuracy as models update. The LLM powering your AI scanner will be updated or replaced. Model upgrades can improve detection of some vulnerability classes while introducing regressions in others. Your benchmark results have a shelf life.
Language and framework coverage. Most AI security agents claim broad language support, but coverage depth varies. A tool might handle Python Django applications exceptionally well and produce mostly noise on Go microservices. Test each language you actually use, not just the one the vendor demos.
Remediation quality. Some AI agents generate fix suggestions or even automated patches. The quality of these fixes—whether they're correct, whether they follow your codebase's conventions, whether they introduce new issues—is a separate evaluation dimension from detection quality.
Implications for Tool Selection
Benchmark results are a starting point, not a conclusion. Here's how to use them effectively.
Don't Trust Vendor Benchmarks
Vendors select benchmarks that showcase their strengths. An AI tool reporting 95% detection on OWASP Benchmark is demonstrating that it can find single-file Java vulnerabilities from a decade-old, contaminated test suite. That tells you almost nothing about how it'll perform on your TypeScript microservices.
Ask vendors: What benchmark did you use? Is the benchmark public? Can I reproduce the results? What was the false positive rate? Did you de-leak the test data? If they can't answer these questions, the numbers are marketing.
Run Your Own Evaluation
The most valuable benchmark is one you run on your own codebase. Here's an eight-step methodology:
- Select a representative repository. Choose a project that reflects your typical codebase: language, framework, size, and architectural complexity.
- Establish ground truth. Review the repo's security history. Identify known vulnerabilities from past security reviews, penetration tests, or CVE disclosures. If none exist, inject 10-15 vulnerabilities across difficulty tiers.
- Baseline with SAST. Run CodeQL, Semgrep, or your existing SAST tool. Document every finding. This is your comparison target.
- Run the AI tool. Scan the same repository. Record all findings, explanations, and severity ratings.
- Repeat the AI scan. Run it 3-5 times to measure consistency. Note which findings are stable and which are intermittent.
- Triage all findings. Classify each as true positive, false positive (with category), or uncertain. Have a second reviewer validate a sample.
- Score across all five dimensions. Detection rate, false positive rate, explanation quality, consistency, and novel pattern discovery. Don't collapse these into a single number.
- Calculate total cost of ownership. Include API costs, triage time, false positive investigation time, and integration effort. A free tool that generates 100 false positives per scan costs more than a paid tool that generates 10.
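Step 8 can be sketched as a back-of-the-envelope calculation. All rates below are illustrative inputs, not real pricing:

```python
# Sketch: total cost of ownership per scan. Every finding costs triage time,
# so a noisy "free" tool can cost far more than a precise paid one.
def cost_per_scan(api_cost: float, n_findings: int,
                  minutes_per_finding: float, hourly_rate: float) -> float:
    triage_cost = n_findings * minutes_per_finding / 60 * hourly_rate
    return api_cost + triage_cost


# "Free" tool, 100 findings to triage vs. $20/run tool with 10 findings:
free_tool = cost_per_scan(0, 100, 10, 120)    # ≈ $2,000 in triage time
paid_tool = cost_per_scan(20, 10, 10, 120)    # ≈ $220 total
```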
The Benchmark as Ongoing Practice
Security tool evaluation isn't a one-time exercise. Re-benchmark when:
- The AI tool updates its underlying model
- Your codebase introduces a new language or framework
- Your security requirements change (compliance, new threat models)
- You're considering switching or adding tools
Treat benchmarking as part of your security program, not a procurement exercise.
How Rafter Approaches Benchmarking
Rafter evaluates its own scanning pipeline using a spiked-repository methodology: real open-source codebases with injected vulnerabilities across difficulty tiers, false positive traps, and multi-language coverage. The evaluation includes prompt injection traps—comments designed to manipulate AI scanners into ignoring real vulnerabilities—because adversarial robustness matters as much as detection rate.
Rafter publishes its benchmark methodology transparently because credibility in AI security comes from verifiable claims, not marketing numbers. If a tool can't show you how it was evaluated, you can't trust the evaluation.
Scan your repository with Rafter and see the results on your own code—the only benchmark that truly matters.
Conclusion
AI code security agents represent a genuine advance over traditional SAST—when they work. The gap between marketing claims and reality is wide, and the only way to close it is rigorous, multi-dimensional benchmarking.
Traditional benchmarks test pattern recognition. AI agents claim reasoning. You need benchmarks that distinguish the two: multi-file data flows, novel vulnerability patterns, difficulty tiers, false positive traps, and consistency measurement across runs.
Your evaluation checklist:
- Baseline your codebase with traditional SAST first
- Run AI tools on the same code and measure across all five dimensions
- Test consistency by running multiple times
- Evaluate explanations manually—don't trust automated quality metrics
- Calculate total cost including false positive triage time
- Re-evaluate whenever the underlying model changes
The tools are getting better. The benchmarks need to keep up.