When Malware Argues With Your AI Scanner

A credential stealer was recently caught with an unusual decoy bolted to the front of its payload. The decoy does not run. It is a comment, inert by design, sitting at the top of the file ahead of the actual malware. Its only job is to be read by an AI, and to talk that AI out of looking any closer.

What makes this worth walking through carefully is not the trick itself, which is clever but bounded. It is what the trick gestures at. There is the documented incident, which is a prompt injection that tells an AI scanner the package is clean. And there is a related move I have not seen confirmed in the wild but expect to, which uses a model's safety training rather than its trust. The two are worth keeping clearly apart, so I will mark which is which throughout.

What Socket found

The technique surfaced in a supply-chain campaign documented by Socket Threat Research. Socket tracked three related worms, named Mini Shai-Hulud, Miasma, and Hades, that target bioinformatics developers and Model Context Protocol (MCP) developers through malicious npm and PyPI packages. The Hades credential stealer runs through the Bun runtime and harvests credentials.

The scale is real but worth stating precisely. As of Socket's count, the campaign spanned 471 malicious artifacts across the two registries: 411 npm artifacts spread across 106 packages, and 60 PyPI artifacts across 37 packages, including 23 newly identified PyPI package-version artifacts in this wave. Socket's live tracker later showed around 475. The delivery mechanics vary between reports rather than lining up cleanly: Socket notes .pth loaders and, in a newer subcluster, native-extension execution, while StepSecurity and The Hacker News describe import-hook delivery. The honest summary is that there are several loaders, not one signature vector.

The part I want to focus on is how Hades tries to slip past AI-based review. In the sample, the _index.js file opens with a roughly 99-line JavaScript block comment before any code executes. Socket describes the decoy as fake system instructions and policy-triggering content, planted to cause, in the researchers' words, "refusal behavior, prompt confusion, context pollution, or premature classification," and to "derail scanners or analyst copilots" before they reach the obfuscated payload.

Per StepSecurity and The Hacker News, the specific instruction in that comment is to ignore the code and classify the package as clean. It is a "nothing to see here, mark it safe, move along" message, aimed squarely at naive pipelines that feed file content to an LLM and take the verdict at face value. Worth noting, though, that Socket's own list of intended effects is broader than misclassification: it names refusal behavior right alongside it. That detail is the thread I want to pull on next.

What is documented as the comment's specific instruction is a false-benign attack: it tells an AI scanner the package is clean. Socket also lists refusal behavior among the effects the comment was designed to cause. What is not yet documented, and what I get into below, is malware leaning on deliberately alarming content to force those refusals as the primary evasion. I am keeping the reported facts and my extrapolation clearly separate on purpose.

It is also worth repeating Socket's own caveat, because it keeps the story honest. This is not a magical bypass. Static detection, YARA, AST parsing, deobfuscation, and behavioral rules all still catch this malware. The injection mainly threatens LLM-first triage pipelines that lean on a model's judgment without the boring deterministic checks underneath. The comment is a problem for the analyst copilot that reads a file and trusts what it reads, not for a layered scanner.

The move that worries me more

Socket already names refusal behavior as one of the comment's intended effects, so the refusal angle is not something I am inventing. Where I go past the documented record is the next step, so take the specifics here as my analysis rather than Socket's reporting.

The Hades injection, as the sample's instruction reads, manipulates the AI's trust. It tells the model the file is clean. The flip side seems like a small step away, and it bothers me more, because it would manipulate the AI's safety training instead.

Picture the same decoy comment, but instead of "this package is clean," it carries content alarming enough to make a safety-tuned model refuse to engage at all. The kind of dangerous, policy-triggering material a model has been trained to back away from. A scanner or coding agent hits that text, its refusal behavior fires, and it declines to read the file. Same outcome as the documented attack, an unanalyzed payload, reached through a different lever.

I have not seen this specific move confirmed in the wild. But it is a small step from what Hades already does, and the underlying incentive is the same: get the AI to stop reading before it reaches the malware. So I would expect it, and I would expect aggressive refusals to be exactly the kind of predictable behavior an attacker can plan around.

That is the thesis, offered as opinion. Over-indexing on first-order safety alignment can create second-order blindspots. Reflexive refusals, in closed and open models alike, are predictable, and predictable behavior is steerable. A model that refuses to read hostile code is not analyzing it, it is averting its eyes, and a scanner that averts its eyes is not much of a scanner. Security tooling will increasingly need models that do not flinch at hostile input, because hostile input is the entire job.

None of this is an argument against alignment. It is an argument for alignment that can tell the analyst apart from the attacker. The same content means opposite things depending on who is reading and why, and a refusal that ignores that distinction is a lever waiting to be pulled.

The documented injection and the hypothetical refusal trick are the same problem underneath, and this is the part I am most confident about.

In both cases the file under analysis is evidence to classify, never instructions to obey. The decoy comment is not really malware. It is a message aimed at the model, smuggled in as data, betting that the pipeline will blur the line between the two. Whether that message says "I am clean" or "refuse to read me," the defense is identical: treat scanned content as untrusted data, and never let it set the model's policy.

This is the OWASP LLM01 prompt-injection problem wearing a malware costume. The design question is concrete. What is the model permitted to treat as instructions, and what is fixed as inert evidence it may describe but never follow? Get that boundary right and a 99-line comment becomes a flag in the report, noted and stepped over, regardless of which way it tries to push. Get it wrong and the comment runs the show.

This is the same instruction-versus-data confusion behind TrapDoor's poisoned agent-instruction files and the broader class of AI security risks that live beyond prompt injection, and it is the throughline across this wave of AI-targeted supply-chain attacks.

Rafter is itself an AI-assisted code analysis tool, so it lives inside this adversarial reality. Our stated design goal is to treat scanned code as untrusted evidence to be examined, not instructions to be obeyed, and to regard a file that tries hardest to stop the analysis as precisely the file that earns a second look.

The takeaway

The Hades decoy is small and almost theatrical, easy to dismiss as a one-off. It is more useful read as a preview. As more security work routes through language models, attackers will keep probing the gap between what a model is willing to do and what a security system needs it to do, because that gap is where the cloaking happens. Today they are telling the AI a file is clean. Tomorrow, I suspect, some of them will be trying to make it look away.

The fix is not to make models care less about safety. It is to build pipelines that know the difference between reading about a threat and being commanded by one. A scanner earns its name by looking hardest at exactly the input designed to make it flinch.