Shannon Entropy Wasn't Enough: Why Betterleaks Replaces Gitleaks

Written by the Rafter Team

For about a decade, the answer to "how do you tell whether a string is a secret?" had three parts: keyword filters, regex patterns, and a Shannon entropy score above some threshold. Gitleaks shipped that recipe in 2018 and the rest of the ecosystem — TruffleHog, detect-secrets, git-secrets — converged on close variants.
We trusted that recipe. Most of the industry did. Then the original author of Gitleaks, Zach Rice, quietly admitted it wasn't good enough and shipped a successor that lifts recall from 70.4% to 98.6% on the canonical benchmark.
The new tool is called Betterleaks. It's open source, sponsored by Aikido, and a drop-in replacement for Gitleaks. The interesting part isn't the rebrand — it's what made the rebrand necessary.
On the CredData dataset, Shannon-entropy filtering hits 70.4% recall for catching real secrets. Betterleaks's new "Token Efficiency" filter, built on the same byte-pair encoding tokenizer GPT‑4 uses, hits 98.6%. That's a 28-point jump from a model swap. It also pulls precision from 21.1% to 57.3%.
What actually happened
Zach Rice created Gitleaks roughly eight years ago and led it through hundreds of contributors and millions of installs. In early 2026 he joined Aikido Security as Head of Secrets Scanning. A few weeks later, he posted a short, transparent note: "I don't have full control over the Gitleaks repo and name anymore."
So instead of fighting over the name, he started over.
Betterleaks v1 shipped in March 2026. It's pure Go, MIT licensed, parallelized, and reads existing gitleaks.toml configs without changes. The CLI flags Gitleaks users have muscle memory for — detect, protect, --source, --report-format sarif — all carry over. If you swap gitleaks for betterleaks in your CI step today, it just works.
What's new is what's underneath the regex layer.
The Shannon entropy story (and why it felt right)
Shannon entropy measures the average unpredictability of the characters in a string. A long string of aaaaaa has near-zero entropy. A long string of kj3J*&hd9!Mz has high entropy. The intuition is obvious — secrets look random, randomness has high entropy, so you set a threshold and flag anything above it.
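In code, the per-character version scanners typically use is only a few lines (a minimal sketch; real scanners differ in windowing and normalization details):

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Average bits of information per character of s."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(shannon_entropy("aaaaaaaaaaaa"))  # 0.0 — one symbol, zero surprise
print(shannon_entropy("kj3J*&hd9!Mz"))  # ~3.58 — twelve distinct characters
```

A string of one repeated character scores exactly zero; a string whose characters are all distinct scores log2 of its alphabet size, which is where the "set a threshold" idea comes from.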
This was a beautiful idea on paper. It's a single number. It has a respectable information-theory pedigree. It's fast to compute. For years it was the second line of defense behind regex matching, and for years we — and most of the industry — assumed the combination was Good Enough.
The problem is that "looks random to a byte counter" and "is actually a secret" are two very different claims.
Where entropy quietly fails
Look at four real strings you'd find in a typical codebase:
```text
3f0a5e6e-9b5a-4c8e-87a4-2f9ec1f8d0a1   # UUID v4
sha256-AbC123XyZ4...                   # subresource integrity hash
SGVsbG8sIFdvcmxkIQ==                   # base64 of "Hello, World!"
sk-proj-7Yh2K...                       # real OpenAI API key
```
To Shannon entropy, all four look like high-entropy strings. UUIDs are random. SRI hashes are random. Base64 of anything looks random at the byte level. Real keys are random.
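You can check this directly on the complete examples above (the truncated ones are left out, and the "secret" here is a made-up placeholder, not a real key format):

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    # Standard per-character Shannon entropy.
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

samples = {
    "uuid":   "3f0a5e6e-9b5a-4c8e-87a4-2f9ec1f8d0a1",
    "base64": "SGVsbG8sIFdvcmxkIQ==",
    "secret": "sk-fake-Zq8wLm2Np5Kx",  # hypothetical placeholder key
}
for name, s in samples.items():
    print(f"{name:7s} {shannon_entropy(s):.2f} bits/char")
```

All three land in the same band, comfortably above any threshold you'd realistically set, even though only one of them is something you'd want to flag.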
A pure entropy filter has no way to distinguish the secret from the three things that just happen to look like one. Either you set the threshold low and drown in false positives, or you set it high and miss real keys. On the CredData benchmark, Rice's analysis puts entropy-only detection at 21.1% precision and 70.4% recall — meaning roughly four out of five entropy hits are noise, and roughly three out of every ten real secrets slip through entirely.
We built a lot of pipelines on top of those numbers and didn't notice.
What Token Efficiency actually is
The replacement signal is unfamiliar at first because it borrows from a different field — language model tokenization.
Byte-pair encoding (BPE) is the algorithm that turns text into the integer "tokens" GPT‑4 and Claude consume. The training procedure is simple: scan a huge corpus of real-world text and code, find the most common adjacent byte pair, merge it into a new token, and repeat tens of thousands of times. The vocabulary that comes out reflects, by construction, how often character sequences actually appear in human-written text and source code.
Common things get cheap, multi-character tokens. The string import is one token. function is one token. userIdValidationToken decomposes into userId, Validation, Token — a few long tokens for a long string.
Genuinely rare or random sequences get expensive. They have no useful merges in the vocabulary, so they're broken into many short tokens — often almost one token per character.
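The training loop is simple enough to sketch end to end. This is a toy Python version of the merge procedure described above, not the actual cl100k_base trainer:

```python
from collections import Counter

def bpe_merges(corpus: str, num_merges: int):
    """Greedy BPE: repeatedly merge the most frequent adjacent token pair."""
    tokens = list(corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats anymore; stop merging
        merges.append(a + b)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, toks = bpe_merges("import os import sys import json", 6)
print(merges)  # frequent substrings get merged into ever-longer tokens
```

On a real corpus the loop runs tens of thousands of times, and the resulting merges become the ranked vocabulary the tokenizer applies at inference time.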
That gap is the signal.
The metric Rice settled on is one line of math:
```text
token_efficiency = len(string) / len(tokens)
```
Run a string through the cl100k_base tokenizer (the one GPT‑4 uses) and divide its character length by the number of tokens it produces. Natural language and structured identifiers compress well — they score high. Real secrets compress badly — they score low. As Rice puts it:
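Here is a sketch of the metric using a toy greedy longest-match tokenizer in place of cl100k_base (the tiny vocabulary is hand-picked for illustration, not learned):

```python
# Toy stand-in for a BPE vocabulary. cl100k_base has ~100k entries learned
# from real text and code; this one just makes the mechanics visible.
VOCAB = sorted(
    ["import", "function", "user", "Id", "Validation", "Token", "config"],
    key=len, reverse=True,  # longest match first
)

def toy_tokenize(s: str) -> list[str]:
    tokens, i = [], 0
    while i < len(s):
        match = next((w for w in VOCAB if s.startswith(w, i)), None)
        if match:
            tokens.append(match)
            i += len(match)
        else:
            tokens.append(s[i])  # no merge available: one char per token
            i += 1
    return tokens

def token_efficiency(s: str) -> float:
    return len(s) / len(toy_tokenize(s))

print(token_efficiency("userIdValidationToken"))  # 5.25 — 21 chars, 4 tokens
print(token_efficiency("kj3J*&hd9!Mz"))           # 1.0 — no vocab hits at all
```

In practice you'd swap toy_tokenize for a real BPE tokenizer such as OpenAI's tiktoken: enc = tiktoken.get_encoding("cl100k_base"), then len(s) / len(enc.encode(s)).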
> Secrets are rare. A b64 encoded string, a UUID, an actual secret, and a weird-looking dependency string can have similar entropy scores despite being fundamentally different in how often they appear in the real world.
Token efficiency captures the difference entropy can't, because it's measuring how surprised a model trained on the actual internet is by your string — not just how unpredictable the bytes look in isolation.
The numbers, with the caveat
Betterleaks ships token efficiency with a default cutoff of 2.5 for strings 12 characters or longer (with a slightly looser 2.1 fallback for shorter strings that don't contain three-character words).
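Read as pseudocode, those defaults amount to something like the following. This is a reconstruction from the numbers above, not Betterleaks' actual source, and the three-character-word check is a guessed placeholder heuristic:

```python
def has_three_char_word(s: str) -> bool:
    # Placeholder heuristic: any alphabetic run of length >= 3.
    run = 0
    for ch in s:
        run = run + 1 if ch.isalpha() else 0
        if run >= 3:
            return True
    return False

def flags_as_secret(s: str, efficiency: float) -> bool:
    """Low efficiency = compresses badly = candidate secret."""
    if len(s) >= 12:
        return efficiency < 2.5
    # Shorter strings get the looser 2.1 cutoff, unless they contain
    # a three-character word (which suggests natural language).
    return efficiency < 2.1 and not has_three_char_word(s)

print(flags_as_secret("kj3J*&hd9!Mz", 1.0))           # True: random-looking
print(flags_as_secret("userIdValidationToken", 5.25)) # False: compresses well
```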
On CredData, here is what that single change does to the headline numbers:
| Filter | Precision | Recall | F1 |
|---|---|---|---|
| Shannon entropy alone | 21.1% | 70.4% | 0.776 (entropy in the full Gitleaks config) |
| Token efficiency alone | 57.3% | 98.6% | 0.725 |
| Token efficiency + entropy combined | — | — | 0.892 |
A few things worth flagging honestly. Token efficiency alone has a lower F1 than entropy in the full Gitleaks config because the rest of Gitleaks does a lot of work — keyword filtering, regex specificity, allowlists. The headline result is the combined model: F1 climbs from 0.776 to 0.892, and the recall ceiling moves from 70.4% to 98.6%. That's where the practical accuracy gains live.
For teams running Gitleaks in CI today, the most honest summary is: same config, same flags, ~28 points more recall and a precision lift large enough to materially cut alert fatigue.
What else Betterleaks ships
Token efficiency is the big idea, but it's not the only change. From the release notes and Help Net Security's coverage:
- CEL-based rule validation. Rules can express conditional logic — file path, git author, surrounding context — using Google's Common Expression Language instead of patching it into Go.
- Doubly and triply encoded secrets. Real-world secrets are often base64'd inside JSON inside a YAML config. Betterleaks unwraps those layers automatically.
- Pure Go, no CGO, no Hyperscan. Easier to build, easier to ship in air-gapped environments, no glibc surprises.
- Parallelized git scanning. Repository scans are noticeably faster on multi-core machines, especially on large monorepos.
- Aho–Corasick keyword filtering. Pre-filters candidates before the regex engine touches them, which keeps the wall-clock cost of token-efficiency checks reasonable.
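The multi-layer decoding idea is easy to picture: keep attempting a strict decode until it stops producing valid text. A minimal base64-only sketch, not Betterleaks' implementation:

```python
import base64
import binascii

def unwrap_base64(s: str, max_layers: int = 5) -> str:
    """Peel nested base64 layers until decoding fails or yields non-text."""
    for _ in range(max_layers):
        try:
            decoded = base64.b64decode(s, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError, ValueError):
            break  # not valid base64 (or not text): stop unwrapping
        s = decoded
    return s

# A doubly encoded placeholder value (not a real key):
doubly = base64.b64encode(base64.b64encode(b"sk-example-secret")).decode()
print(unwrap_base64(doubly))  # sk-example-secret
```

Whatever falls out of the last successful layer is what the detection rules and the token-efficiency filter actually see.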
None of these alone would justify a rebrand. Together with the BPE shift, they describe a tool that's been rethought from the regex outward.
What this means for teams using Gitleaks
The migration story is, by design, almost a non-event. betterleaks reads gitleaks.toml. The CLI surface matches. SARIF output is identical, so existing dashboards and PR-comment integrations don't need code changes. CI configs need a binary swap, nothing else.
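In a GitHub Actions-style config the swap is literally one word (the step name is illustrative; the flags are the ones listed above):

```yaml
# Before:
# - run: gitleaks detect --source . --report-format sarif
# After — same gitleaks.toml, same flags, different binary:
- name: Scan for secrets
  run: betterleaks detect --source . --report-format sarif
```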
What changes is what shows up in the report. Two patterns we've already seen in the wild:
- Quieter PRs. The precision lift means fewer false-positive findings on UUIDs, content hashes, and lock-file artifacts that used to clog reviews.
- More real catches on private fixture data. The recall lift specifically helps with secrets that look "structured but rare" — short tokens, hex-encoded keys, custom-format API keys — exactly the categories entropy was tuned to miss.
If you're already on Gitleaks, the answer is to run both in parallel for a sprint and diff the findings. If you're starting fresh, Betterleaks is the default we'd pick today.
The uncomfortable part
The thing we keep coming back to isn't that Betterleaks is faster or more accurate. It's that we — and a lot of practitioners we respect — assumed Shannon entropy plus regex was already a solved problem. We were wrong, in the most ordinary way: a metric that felt right and measured something real also happened to confuse "rare" with "random." The benchmark was sitting there. Nobody pointed at it for years.
The fix didn't come from a new product or a new vendor. It came from the same person who shipped the original recipe, working on the same problem with a different signal, in the open. That's the pattern worth taking seriously.
We were comfortable with a 70% recall ceiling because the only available alternative was paying a vendor for a black box. Now the alternative is brew install betterleaks and a config file you already have.
What we run at Rafter
Rafter's secret-scanning pipeline ships Betterleaks (formerly Gitleaks under the hood) alongside our regex layer and AI-assisted triage. Every finding includes the file path, the rule it matched, the token-efficiency and entropy scores, and a plain-English fix prompt you can paste into your coding agent. If you'd like to see what your repos actually leak, run a fast scan — no setup, no signup walls, results in 30 seconds.
Further reading
- Rare Not Random — Zach Rice's full technical writeup
- Betterleaks announcement (Aikido blog)
- Betterleaks coverage at Help Net Security
- Top API Key Leak Detection Tools (2026)
- Secret Scanning in CI/CD: detect-secrets vs Betterleaks vs TruffleHog
- Pre-Commit Hooks for Secret Detection
- GitHub Secret Scanning: What It Catches and What It Misses