How Static Analysis Finds Vulnerabilities in Your Code

Written by the Rafter Team

Static analysis tools don't run your code—they read it. But "reading code" undersells what actually happens. A modern SAST tool parses your source into an abstract syntax tree, builds control flow and data flow graphs, traces untrusted input through every possible execution path, and matches patterns against thousands of known vulnerability signatures. This pipeline is why SAST catches certain bugs instantly and is completely blind to others.
Understanding how static analysis works under the hood isn't academic curiosity. It directly affects how you write code, how you interpret scanner results, and whether you trust or dismiss the findings your tools produce. Developers who understand the internals write more scannable code, triage faster, and waste less time on false positives.
Static analysis examines source code without executing it. It's the most widely deployed form of automated security testing, running in CI/CD pipelines at companies from solo-dev startups to enterprises with thousands of repositories.
Parsing: From Source Code to Abstract Syntax Tree
Every static analysis scan starts the same way: parsing. The tool reads your source files and converts raw text into a structured representation called an abstract syntax tree (AST).
An AST strips away syntactic noise—whitespace, comments, semicolons—and captures the logical structure of your code. Each node in the tree represents a language construct: a function declaration, a variable assignment, an if statement, a function call.
Consider this Express.js route:
// ✗ Vulnerable: unsanitized user input in SQL query
app.get('/users', (req, res) => {
  const userId = req.query.id;
  const query = `SELECT * FROM users WHERE id = '${userId}'`;
  db.query(query, (err, results) => {
    res.json(results);
  });
});
The parser converts this into a tree where req.query.id is a MemberExpression node, the template literal is a TemplateLiteral node containing that expression, and db.query is a CallExpression receiving the concatenated string. The scanner doesn't see text—it sees structure.
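The tree for the query declaration can be sketched as a plain object in ESTree style (the node-naming convention most JavaScript parsers follow). This is a simplified illustration; a real AST carries many more fields, such as source locations and ranges.

```javascript
// Simplified ESTree-style sketch of the AST for:
//   const query = `SELECT * FROM users WHERE id = '${userId}'`;
const queryDeclaration = {
  type: "VariableDeclaration",
  kind: "const",
  declarations: [{
    type: "VariableDeclarator",
    id: { type: "Identifier", name: "query" },
    init: {
      type: "TemplateLiteral",
      quasis: [
        { type: "TemplateElement", value: { raw: "SELECT * FROM users WHERE id = '" } },
        { type: "TemplateElement", value: { raw: "'" } },
      ],
      expressions: [
        { type: "Identifier", name: "userId" }, // the interpolated value lives here
      ],
    },
  }],
};

// A rule engine walks this structure rather than the raw text, so it can
// ask questions like "does a TemplateLiteral contain a variable expression?"
const init = queryDeclaration.declarations[0].init;
console.log(init.type);                // "TemplateLiteral"
console.log(init.expressions[0].name); // "userId"
```

Structural queries like this are what make AST rules immune to formatting tricks: reordering whitespace or splitting the statement across lines produces the same tree.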
This is also why multi-language support is hard. Every language needs its own parser. JavaScript's AST looks nothing like Python's, which looks nothing like Go's. Tools like Semgrep solve this with language-agnostic pattern matching on top of language-specific parsers—you write one rule pattern and it matches across languages. CodeQL takes a different approach, compiling code into a relational database that you query with a SQL-like language.
The parsing stage is also where the first class of bugs gets caught: syntax-level patterns. Hardcoded secrets, use of banned functions (eval(), exec()), and dangerous API calls can be flagged purely from the AST, without any deeper analysis. These pattern-match rules are fast and cheap—but they're also the shallowest layer of what SAST can do.
Control Flow Graphs: Mapping Every Execution Path
After parsing, the scanner builds a control flow graph (CFG)—a directed graph where each node is a basic block of sequential statements and each edge represents a possible transfer of control: an if branch, a loop iteration, an exception throw, a function return.
The CFG answers a critical question: which paths through the code are actually possible?
In the diagram above, only the path through node D leads to a vulnerability. If the scanner can determine that the "No" branch is reachable—meaning input validation doesn't always occur—it flags the finding. If every path goes through the sanitizer, the scanner suppresses the alert.
Control flow analysis is where the gap between cheap and sophisticated tools widens. A basic regex scanner can't reason about branches. A CFG-aware tool can distinguish between code that sometimes sanitizes input and code that always sanitizes it. This distinction is the difference between a true positive and a false positive.
Loops add complexity. A variable might be safe on the first iteration but tainted on the second (if the loop body reassigns it from user input). Exception handlers create implicit edges—when db.query throws, control jumps to a catch block that might log the raw query, including the unsanitized input. The CFG captures all of this.
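The decision a CFG-aware checker makes can be sketched as a plain graph search: flag a finding only if some path from entry reaches the sink without passing through the sanitizer block. The node names here are hypothetical stand-ins for basic blocks.

```javascript
// Minimal CFG sketch: nodes are basic blocks, edges are possible control transfers.
// The "skip" edge models an input-validation branch that can be bypassed.
const cfg = {
  entry:    ["check"],
  check:    ["sanitize", "skip"], // the "No" branch bypasses the sanitizer
  sanitize: ["sink"],
  skip:     ["sink"],
  sink:     [],
};

// Depth-first search that refuses to walk through the sanitizer block:
// if the sink is still reachable, an unsanitized path exists.
function hasUnsanitizedPath(graph, start, sink, sanitizer) {
  const stack = [start];
  const seen = new Set();
  while (stack.length > 0) {
    const node = stack.pop();
    if (node === sink) return true;
    if (node === sanitizer || seen.has(node)) continue;
    seen.add(node);
    stack.push(...(graph[node] ?? []));
  }
  return false;
}

console.log(hasUnsanitizedPath(cfg, "entry", "sink", "sanitize")); // true: flag it

// Remove the bypass edge and every path is sanitized: suppress the finding.
cfg.check = ["sanitize"];
console.log(hasUnsanitizedPath(cfg, "entry", "sink", "sanitize")); // false
```

Real engines track much richer per-block state, but the core reachability question is the same.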
Data Flow Analysis: Following the Data
Control flow tells the scanner where code can go. Data flow analysis tells it what values variables hold at each point.
There are two directions:
- Forward analysis tracks a value from where it's defined to where it's used. "This variable was assigned req.query.id on line 3—where does it end up?"
- Backward analysis starts from a dangerous operation and asks what values could reach it. "db.query() is called on line 4—where did its argument come from?"
Modern SAST tools combine both. Forward analysis propagates taint (more on that below). Backward analysis prunes impossible paths and reduces false positives.
Data flow analysis also tracks transformations. If userId passes through parseInt(), the scanner knows the value is now a number—it can't contain a SQL injection payload. If it passes through string concatenation, the taint persists. If it passes through a parameterized query (db.query('SELECT * FROM users WHERE id = ?', [userId])), the scanner recognizes the sink as safe and suppresses the finding.
The challenge is interprocedural analysis—tracking data across function boundaries. When userId gets passed to buildQuery(userId) in another file, the scanner needs to follow that call, analyze the called function, and carry the taint information back. This is computationally expensive and is where many free or lightweight tools give up, limiting their analysis to single functions or single files.
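One common way to keep interprocedural analysis tractable is function summaries: analyze buildQuery once, record "taint on argument 0 flows to the return value," and reuse that summary at every call site instead of re-analyzing the body. The summary format below is invented for illustration.

```javascript
// Toy interprocedural taint propagation via function summaries.
// argsToReturn lists which argument indices flow into the return value.
const summaries = {
  buildQuery: { argsToReturn: [0] }, // buildQuery(x) embeds x in its result
  parseInt:   { argsToReturn: [] },  // parseInt sanitizes: no taint flows out
};

function returnIsTainted(fnName, argTaint) {
  const summary = summaries[fnName];
  // Unknown function: over-approximate and assume taint passes through.
  if (!summary) return argTaint.some(Boolean);
  return summary.argsToReturn.some((i) => argTaint[i]);
}

// userId is tainted because it came from req.query.id:
console.log(returnIsTainted("buildQuery", [true])); // true  -> keep tracking
console.log(returnIsTainted("parseInt", [true]));   // false -> taint neutralized
console.log(returnIsTainted("mystery", [true]));    // true  -> worst-case assumption
```

The worst-case fallback for unknown functions is exactly the over-approximation tradeoff discussed later: it avoids false negatives at the cost of false positives.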
Taint Analysis: The Core of Vulnerability Detection
Taint analysis is the engine that powers most SAST vulnerability detection. It formalizes a simple question: can untrusted data reach a dangerous operation without being sanitized?
Three concepts define every taint rule:
| Concept | Definition | Examples |
|---|---|---|
| Source | Where untrusted data enters | req.query, req.body, req.params, process.env, file reads, API responses |
| Sink | Where data becomes dangerous | db.query(), eval(), exec(), res.send(), fs.writeFile(), child_process.spawn() |
| Sanitizer | What neutralizes the taint | parameterized queries, encodeURIComponent(), DOMPurify.sanitize(), parseInt() |
A finding is generated when the scanner traces a path from a source to a sink with no sanitizer in between.
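The source-to-sink rule can be sketched as a tiny intraprocedural checker over a simplified, straight-line statement list. The statement encoding here is invented for illustration; real engines work over the CFG and data flow graphs described above.

```javascript
// Tiny taint checker over a simplified straight-line program.
const SOURCES = new Set(["req.query.id"]);
const SINKS = new Set(["db.query"]);
const SANITIZERS = new Set(["parseInt"]);

function checkTaint(statements) {
  const tainted = new Set();
  const findings = [];
  for (const stmt of statements) {
    if (stmt.kind === "assign") {
      // Taint flows into the target if the value is a source or a tainted
      // variable, unless the assignment passes through a known sanitizer.
      const dirty = SOURCES.has(stmt.from) || tainted.has(stmt.from);
      if (dirty && !SANITIZERS.has(stmt.via)) tainted.add(stmt.to);
      else tainted.delete(stmt.to);
    } else if (stmt.kind === "call" && SINKS.has(stmt.fn)) {
      if (tainted.has(stmt.arg)) findings.push(`${stmt.fn}(${stmt.arg}): tainted`);
    }
  }
  return findings;
}

// Mirrors the vulnerable route: source -> template concatenation -> sink.
const program = [
  { kind: "assign", to: "userId", from: "req.query.id" },
  { kind: "assign", to: "query", from: "userId", via: "concat" },
  { kind: "call", fn: "db.query", arg: "query" },
];
console.log(checkTaint(program)); // [ 'db.query(query): tainted' ]
```

Swap the concatenation step for `via: "parseInt"` and the taint set stays empty at the sink: no finding.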
Here's the full taint trace for our vulnerable Express.js route:
Source: req.query.id [line 3] — user-controlled input
↓
Propagator: template literal concatenation [line 4] — taint preserved
↓
Sink: db.query(query) [line 5] — SQL execution
No sanitizer found on any path from source to sink.
Finding: SQL Injection (CWE-89)
Severity: HIGH
Confidence: HIGH
Now compare with the fixed version:
// ✓ Secure: parameterized query prevents SQL injection
app.get('/users', (req, res) => {
  const userId = req.query.id;
  db.query('SELECT * FROM users WHERE id = ?', [userId], (err, results) => {
    res.json(results);
  });
});
The scanner traces the same source (req.query.id) but recognizes that the parameterized query API (? placeholder with array binding) is a registered sanitizer for SQL injection. The taint is neutralized. No finding generated.
This is why writing parameterized queries matters beyond just "best practice"—it's how your scanner determines whether to flag or suppress a finding. Code that uses string concatenation for queries will generate findings on every scan, even if you've manually validated input elsewhere in ways the scanner can't verify.
Pattern Matching vs Semantic Analysis
SAST tools use two complementary detection strategies: pattern matching and semantic analysis. Understanding the difference explains why some findings are precise and others are noise.
Pattern matching looks for code shapes that match known vulnerability signatures. A Semgrep rule like this:
rules:
  - id: dangerous-eval
    pattern: eval($X)
    message: "eval() with dynamic input is a code injection risk"
    severity: WARNING
    languages: [javascript, typescript]
This matches any call to eval() regardless of context. It's fast, easy to write, and catches obvious cases. But it also flags eval('2 + 2') where the input is a hardcoded string—a false positive.
Semantic analysis goes deeper. Instead of matching syntax, it reasons about what the code means:
- Is the argument to eval() user-controlled or hardcoded?
- Has the input been validated before reaching eval()?
- Is the eval() inside a sandbox or restricted execution context?
| Technique | Strengths | Limitations |
|---|---|---|
| Pattern matching | Fast, easy to write rules, low false negatives for known patterns | High false positives, no context awareness, misses novel variants |
| Semantic analysis (data flow + taint) | Context-aware, lower false positives, catches complex flows | Slower, harder to implement, can miss patterns without explicit rules |
| Combined (modern tools) | Best of both—fast pattern pre-filter, deep analysis for flagged code | Still limited by language support and analysis depth |
The practical implication: when you see a SAST finding, check whether it's pattern-based or data-flow-based. Pattern findings ("this function is dangerous") need manual triage. Data flow findings ("untrusted input reaches this dangerous function via this specific path") are far more likely to be real vulnerabilities.
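The gap between the two strategies can be seen with a toy scanner: a bare regex flags every eval call site, including the harmless hardcoded one, while a crude literal-argument check (a stand-in for real semantic analysis, which would trace data flow instead of matching text) suppresses it.

```javascript
// Naive pattern rule: flag every eval(...) call site.
const patternRule = /eval\s*\(/;

// Crude "semantic" refinement: suppress when the argument is a string
// literal, since a hardcoded string can't carry attacker input.
const literalArg = /eval\s*\(\s*(['"`]).*?\1\s*\)/;

function scan(line) {
  if (!patternRule.test(line)) return "clean";
  return literalArg.test(line) ? "suppressed (hardcoded arg)" : "finding";
}

console.log(scan("eval('2 + 2')"));        // suppressed (hardcoded arg)
console.log(scan("eval(req.query.code)")); // finding
console.log(scan("parseInt(x)"));          // clean
```

Real semantic analysis is far more than a second regex, of course—it follows the argument back through assignments and calls—but the triage effect is the same: context turns a noisy pattern hit into a confident verdict.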
Why Static Analysis Has Limits
Static analysis is powerful but provably incomplete. There are categories of vulnerabilities that no static tool—current or future—can reliably detect.
Runtime state is invisible. Static analysis sees code, not execution. It can't know that process.env.DATABASE_URL points to a misconfigured production database, that a race condition only manifests under high concurrency, or that an external API returns unexpected data. Environment variables, database contents, network responses, and timing behavior are all opaque to static tools.
Business logic is out of scope. Consider this code:
// No scanner will flag this — but it's a vulnerability
app.post('/transfer', async (req, res) => {
  const amount = parseFloat(req.body.amount);
  // Missing: check that amount > 0
  // An attacker can transfer negative amounts to steal money
  await transferFunds(req.user, req.body.recipient, amount);
});
Every line is syntactically correct. No tainted data reaches a dangerous sink. The parameterized query inside transferFunds is safe. But the missing validation that amount > 0 is a critical business logic flaw. SAST tools don't understand your business rules—they understand code patterns.
The halting problem sets a hard ceiling. Rice's theorem proves that no algorithm can determine all non-trivial semantic properties of arbitrary programs. In practice, this means static analysis must either over-approximate (assume some impossible paths are possible, generating false positives) or under-approximate (ignore some real paths, generating false negatives). Every tool makes this tradeoff, and no tool can avoid it entirely.
Dynamic code defeats static analysis. Reflection, eval(), dynamic imports, metaprogramming, and code generation all create execution paths that don't exist in the source code. The scanner sees eval(x) but can't know what x will be at runtime. Monkey-patching in Python or Ruby can redefine methods after the scanner has analyzed them.
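A sketch of the kind of dynamic dispatch that leaves a scanner blind—the handlers table and the action parameter here are hypothetical:

```javascript
// The scanner sees a computed property access; it cannot know at analysis
// time which handler runs, so it cannot follow the data flow into it.
const handlers = {
  list:   (arg) => `listing ${arg}`,
  remove: (arg) => `removing ${arg}`,
};

function dispatch(action, arg) {
  const fn = handlers[action]; // resolved only at runtime
  if (!fn) throw new Error(`unknown action: ${action}`);
  return fn(arg);
}

console.log(dispatch("list", "users")); // listing users
```

A sound tool must assume every handler body is a potential target of whatever flows into `dispatch`, which is another source of over-approximation; an unsound one simply stops tracking at the computed access.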
This isn't an argument against static analysis. It's an argument for understanding what your scanner can and can't do—so you don't treat a clean scan as proof that your code is secure.
Practical Implications for Developers
Understanding the static analysis pipeline changes how you write and review code. Here are five practices that make your code more scannable—meaning your SAST tool produces more accurate results and fewer false positives.
1. Use parameterized queries and framework-native sanitization. Scanners have built-in knowledge of standard sanitization APIs. When you use the canonical approach (parameterized queries for SQL, template escaping for HTML, encodeURIComponent for URLs), the scanner recognizes them automatically. Custom sanitization functions are invisible unless you configure custom rules.
2. Keep data flows short and linear. The more transformations and function hops between a source and a sink, the harder it is for the scanner to maintain taint accuracy. A direct flow like req.query.id → db.query() in the same function is easy to analyze. A flow that passes through five helper functions across three files is harder—and more likely to produce either false positives (scanner loses track of sanitization) or false negatives (scanner gives up tracking).
3. Avoid dynamic code generation. Every eval(), dynamic require(), or computed property access is a point where static analysis goes blind. If you must use dynamic patterns, isolate them—keep the dynamic code in a small, well-defined module so the scanner's blind spot is contained.
4. Favor explicit over implicit. Type annotations, explicit return types, and clear variable naming all help scanners (and humans) trace data flow. TypeScript code is generally more scannable than JavaScript because type information helps the scanner prune impossible paths.
5. Treat scanner warnings as code review comments, not verdicts. A finding isn't a conviction—it's a hypothesis. Triage by checking: is the source actually user-controlled? Is the sink actually dangerous? Is there a sanitizer the scanner didn't recognize? If you suppress a finding, document why with an inline comment so the next developer (and the next scan) has context.
A clean SAST scan doesn't mean your code is secure. It means no known patterns of vulnerability were detected along analyzed paths. Runtime behavior, business logic, and novel attack patterns all live outside the scanner's view.
The Static Analysis Pipeline: End to End
Putting it all together, the complete pipeline runs from source code to actionable finding: parse to an AST, build the CFG, trace data flow, apply taint and pattern rules, then score and report.
Each stage filters and enriches. The parser catches syntax errors. The CFG identifies unreachable code (dead code elimination reduces noise). Data flow analysis prunes impossible taint paths. Pattern matching against rule databases generates candidate findings. Severity scoring prioritizes the output.
The whole pipeline runs in seconds to minutes depending on codebase size. A 100,000-line Node.js application typically takes 30-90 seconds for Semgrep and 2-5 minutes for CodeQL. This speed is what makes SAST viable for CI/CD—it can gate every pull request without slowing developers down significantly.
How Rafter Approaches Static Analysis
Traditional SAST tools are deterministic—they find what their rules describe, nothing more. This works well for known patterns like SQL injection and XSS, but it struggles with the novel, context-dependent patterns that appear frequently in AI-generated code.
Rafter combines a traditional static analysis layer with an AI-powered contextual analysis engine. The static layer runs rule-based detection for known vulnerability classes—the same taint analysis and pattern matching described above. The AI layer reads the code as a developer would, understanding intent and identifying security issues that don't match any existing rule.
This hybrid approach matters because AI-generated code often introduces subtle vulnerabilities that are syntactically valid and pass pattern-based checks but are logically insecure—missing authorization checks, improper error handling that leaks sensitive data, or trust boundary violations between components. The AI layer catches these while the static layer handles the deterministic baseline.
You can run a scan on your repositories at rafter.so to see how both layers analyze your codebase.
Conclusion
Static analysis is a pipeline: parse source into an AST, build control flow graphs, trace data flow, apply taint analysis, and match patterns against vulnerability rules. Each stage adds precision, and each stage has limits. Understanding this pipeline helps you write code that your scanner can analyze accurately and triage findings that your scanner might get wrong.
The key takeaway: SAST is powerful but provably incomplete. It excels at finding known vulnerability patterns in code it can fully analyze. It fails at runtime behavior, business logic, and novel attack patterns. Treat it as a floor—an automated baseline that catches the automatable—not as a ceiling.
Next steps:
- Run Semgrep or CodeQL on your main repository today—both are free for open source
- Review the findings with the taint analysis model in mind: source → propagation → sink
- Write custom rules for your framework's specific sanitization functions
- Read the security tool comparison guide to understand where static analysis fits alongside other scanning approaches
- Try Rafter to see how AI-powered analysis complements traditional SAST