Nexurion Field Notes · Vol. I · Apr 2026 · Agentic AI · Pentesting · GRC · Jack Giordano · 14 min read

When the attacker is no longer human.

Agentic AI pentesting reached production in 2025. XBOW hit #1 on HackerOne. Anthropic's Mythos found thousands of high-severity vulnerabilities in every major operating system and browser. The GRC implications aren't theoretical: they are 2026 audit findings waiting to be written.

A pentest used to be a person.
Now it is a swarm.

Agentic Offensive AI · XBOW · Mythos · ARTEMIS

  • XBOW: a 40-hour pentest in 28 minutes, matching a 20-year veteran across 104 challenges.
  • ARTEMIS: agent at $18/hr vs. human at $60/hr, on a live 8,000-host network.
  • XBOW: 1,060+ valid HackerOne submissions; #1 on HackerOne, June 2025.
  • DARPA AIxCC · DEF CON 33: $152 per task; detection rate 37 % → 86 %; all 7 systems open-sourced.

For twenty years, "annual pentest" meant a person, on a laptop, on a sixty-day engagement, with a rate card that ran into five figures and a final report that arrived three weeks after the work ended. That definition stopped being economically rational sometime between June 2025 and April 2026.

The receipts are public. An autonomous AI agent took the #1 spot on HackerOne. Another beat nine of ten human pentesters on a live enterprise network at less than a third of the hourly rate. A third, Anthropic's, run on an unreleased model, found "thousands of high-severity vulnerabilities in every major OS and browser" and was deemed too dangerous to release outside forty hand-picked organizations. This is the regulatory pretext for a generational shift in how risk analysis, vulnerability management, and continuous monitoring get audited.

The GRC question isn't whether agentic pentesting works. It is whether your risk register, your control environment, and your evidence locker reflect that it does. This is a field reading on what changed, who built it, and the seven additions every register needs before its next SOC 2, ISO 27001, HIPAA, CMMC, or AI-Act-adjacent audit.

§ 02 · The receipts

Three years from "clunky proof of concept" to "too capable to release."

The timeline below is not predictive. Every entry is a public, citable event with a CVE, a leaderboard, a press release, or an academic paper attached. Read it once before you read your auditor's next request list.

Late 2023 · NTU Singapore

PentestGPT: the proof of concept.

A team at Nanyang Technological University released PentestGPT, an LLM-driven pentest assistant that needed a human at the keyboard for every command. It was clunky. It also proved an LLM could reason about attack paths at all. Every system that follows traces back to this paper.

Source: arXiv:2308.06782 · NTU 2023
Nov 2024 · Google Project Naptime

Big Sleep finds a real zero-day in SQLite.

Google's AI-vulnerability-research project, Big Sleep, surfaced a stack-buffer-underflow CVE in SQLite that OSS-Fuzz had been missing for years. The first AI-discovered zero-day in production software. Not a benchmark. Not a CTF. A real bug in a real database used by every browser on the planet.

Source: Google Project Zero blog · Nov 2024
June 2025 · XBOW · HackerOne

XBOW takes #1 on HackerOne.

XBOW's autonomous agent, built by the team that created GitHub Copilot and Semmle/CodeQL, became the first non-human to top HackerOne's US leaderboard, then the global leaderboard shortly after. It logged 1,060+ valid submissions, including a 48-step exploit chain that escalated a low-severity blind SSRF into full compromise. The headline was the leaderboard. The thesis is the chain.

Source: HackerOne, Aug 2025 · Black Hat USA · Brendan Dolan-Gavitt (NYU/XBOW)
Aug 2025 · DEF CON 33

DARPA AIxCC finals: Team Atlanta wins $4M.

The DARPA AI Cyber Challenge final round at DEF CON 33 ran seven autonomous cyber-reasoning systems against real codebases. Detection rate jumped from 37 % to 86 %. Cost: $152 per task. All seven systems open-sourced as a condition of the prize. The next year of OSS pentest agents is downstream of this dataset.

Source: DARPA AIxCC · DEF CON 33 · Aug 2025
Nov 2025 · XBOW · Pentest On-Demand

The traditional pentest contract stops being economically rational.

XBOW launched Pentest On-Demand, which compresses a 35–100-day pentest cycle into hours at a fraction of the $10K–$35K typical cost. The point isn't price. The point is that "annual" stopped being a defensible cadence the moment continuous became cheap. Every piece of "annual pentest" language in your control set is now stale.

Source: XBOW launch announcement · Nov 2025
Dec 2025 · ARTEMIS

ARTEMIS beats 9 of 10 human pentesters at $18/hr.

The first head-to-head AI-vs-human comparison on a live 8,000-host enterprise network. ARTEMIS beat nine of ten human pentesters at $18/hour vs. the human rate of $60/hour. The economic argument for "we run an annual external pentest" is now an argument against your own auditor.

Source: ARTEMIS public benchmark · Dec 2025
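The two headline comparisons quoted so far reduce to simple ratios. A minimal sketch of the arithmetic follows, using only figures cited in this article; the constant and function names are illustrative, and the two results come from different systems (ARTEMIS for rate, XBOW for elapsed time), so they should not be multiplied together into a single "savings" number.

```python
# Figures quoted in this article; all names here are illustrative.
AGENT_RATE_USD_PER_HR = 18   # ARTEMIS, live 8,000-host network
HUMAN_RATE_USD_PER_HR = 60   # human pentesters in the same comparison
HUMAN_PENTEST_HOURS = 40     # XBOW comparison: human effort
AGENT_PENTEST_MINUTES = 28   # XBOW comparison: agent elapsed time


def rate_ratio(agent_rate: float, human_rate: float) -> float:
    """Agent hourly rate as a fraction of the human hourly rate."""
    return agent_rate / human_rate


def speedup(human_hours: float, agent_minutes: float) -> float:
    """How many times faster the agent finished, in elapsed time."""
    return human_hours * 60 / agent_minutes


cost_ratio = rate_ratio(AGENT_RATE_USD_PER_HR, HUMAN_RATE_USD_PER_HR)
time_speedup = speedup(HUMAN_PENTEST_HOURS, AGENT_PENTEST_MINUTES)

print(f"hourly-rate ratio: {cost_ratio:.2f}")      # 0.30, i.e. under a third
print(f"elapsed-time speedup: {time_speedup:.1f}x")  # 85.7x
```

Hourly rate alone understates or overstates total cost depending on how many hours each side actually billed, which the quoted figures do not say.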
Apr 2026 · Anthropic Project Glasswing

Mythos Preview: too capable to release.

Anthropic's unreleased model, code-named Claude Mythos, ran inside Project Glasswing and found thousands of high-severity vulnerabilities in every major OS and browser. Anthropic judged it too capable to release broadly and limited it to forty hand-picked organizations. The first time an AI lab has restricted a security tool's distribution on the basis of its own offensive capability. Read the usage-policy implications twice.

Source: Anthropic blog · April 2026 · Project Glasswing
Apr 2026 · XBOW Series C

$237M total raised · $1B+ valuation.

XBOW closed a $120M Series C in March 2026 at a unicorn valuation, bringing total raised to $237M. RunSybil raised $40M; PentAGI passed 14,700 GitHub stars; Hadrian launched Nova; 39+ open-source agents are now publicly cataloged. The capital and the OSS depth together are what move this from research to procurement.

Source: SecurityWeek · AppSecSanta agent census · 2026
§ 03 · The cast

Nine systems your auditor will eventually name.

Profiles below cite only public material: leaderboards, papers, press releases, and the vendors' own platform pages. We have engaged with three of these on client tabletops. None on production code. Where a system has not produced a publicly defensible result, we say so.

XBOW
Commercial · Closed-source · Production

Founded 2024 by Oege de Moor (creator of GitHub Copilot, founder of Semmle/CodeQL). Runs thousands of short-lived agents in parallel, with deterministic validators that confirm exploitability before findings are surfaced. CTF-style "canary" instrumentation suppresses LLM hallucination. #1 HackerOne · June 2025 · 1,060+ valid submissions.

$237M raised · xbow.com
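The "deterministic validator plus canary" idea in the XBOW profile can be illustrated generically. The sketch below is not XBOW's implementation, which is closed-source; it assumes a simple pattern in which a random token is planted somewhere only a successful exploit could read, and a finding is surfaced only if that token appears in the captured evidence. Every name in it (`Finding`, `plant_canary`, `validate`, `surface`) is hypothetical.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    target: str
    claim: str     # what the agent says it achieved
    evidence: str  # raw output the agent captured during the attempt


def plant_canary() -> str:
    """Generate a random token to hide where only a real exploit can read it."""
    return "canary-" + secrets.token_hex(16)


def validate(finding: Finding, canary: str) -> bool:
    """Deterministic check: accept only if the evidence contains the canary."""
    return canary in finding.evidence


def surface(findings: list[Finding], canary: str) -> list[Finding]:
    """Drop every finding the validator cannot confirm."""
    return [f for f in findings if validate(f, canary)]


canary = plant_canary()
candidates = [
    # The agent actually read the protected value, so the canary is present.
    Finding("app.example.test", "SSRF reads internal file", f"file contents: {canary}"),
    # The agent only asserts success; nothing in the evidence proves it.
    Finding("app.example.test", "SQL injection in /search", "I believe this is exploitable"),
]
confirmed = surface(candidates, canary)
print([f.claim for f in confirmed])
```

The design choice worth noting is that the model's own confidence never enters the decision: a plausible-sounding but unproven claim is filtered out by a plain string check.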
Anthropic Mythos / Glasswing
Restricted · 40 orgs · Limited

An unreleased Anthropic model, run inside Project Glasswing. Found thousands of high-severity vulnerabilities in every major OS and browser. Distribution restricted on the basis of its own offensive capability: a regulatory and commercial first. The pattern shifts the locus of cyber-insurance and AI-Act conversations from "what can the model do" to "what is the lab willing to release."

Apr 2026 · anthropic.com
Google Big Sleep
Project Zero · Internal · Research

Google's AI-driven vulnerability research, an evolution of Project Naptime. First AI-discovered zero-day in production software: a SQLite stack-buffer-underflow that OSS-Fuzz had been missing for years. The thesis: agents read code differently than fuzzers, and find bugs fuzzers can't.

Nov 2024 · googleprojectzero.blogspot.com
ARTEMIS
Benchmark · Public results · Emerging

The first head-to-head AI-vs-human comparison on a live 8,000-host enterprise network, December 2025. Beat nine of ten human pentesters at $18/hour vs. $60/hour. The "live network" qualifier matters: most prior benchmarks were CTF or synthesized. ARTEMIS is what your insurer will cite when you renew next year.

Dec 2025 · Public benchmark
PentAGI
Open source · 14,700+ ★ · OSS

The dominant open-source pentest agent. Orchestrates four sub-agents in Docker sandboxes inside a ReAct loop. 14,700+ GitHub stars makes it the OSS reference for "what an unsophisticated attacker can deploy in a weekend." Read: the floor of capability is rising in public.

OSS · MIT · github.com
Shannon
OSS agent · 96.15% · OSS

Currently leads XBOW's own 104-challenge Validation Benchmark with a 96.15 % success rate (100/104 exploits). An open-source agent outperforming a closed commercial one on the latter's own benchmark is not a normal market shape. The implication: the "moat" in agentic pentesting may be data and orchestration, not the agent itself.

OSS · XBOW Validation Benchmark
RunSybil
Commercial · Series A · Funded

Raised $40M in 2026 for autonomous offensive security. The market validation matters more than the architecture: two unicorns in agentic pentesting in eighteen months means insurance carriers, board audit committees, and procurement teams will start asking about it within the year.

$40M · runsybil.com
Hadrian Nova
Commercial · Continuous · Production

Hadrian's agentic continuous-pentesting product, launched 2026. The relevant GRC angle is not the platform: it is the "continuous" qualifier. If a vendor sells a continuous pentest, your "annual external penetration test" SOC 2 control just inverted from prudent to insufficient.

2026 · hadrian.io
Bishop Fox Cosmos / CAST AI
Commercial · Hybrid

The incumbent's response. Bishop Fox's Cosmos platform layered AI agents on top of the firm's existing managed offensive security service. The hybrid model (agent for exploration, human for validation) is the bet most likely to age well through the next two audit cycles. It is also the model the SEC, OCR, and FDA are most likely to accept on first reading.

Hybrid · bishopfox.com
"Even right now, after one year, I don't know any other company that is at least close to XBOW in terms of agentic pentesting."
- Public testimonial · xbow.com · cited Aug 2025
§ 04 · The register

Seven additions to your risk register before the next audit.

Each row maps a specific 2025–2026 capability to a control family that is now stale, the GRC framework where the staleness will be cited first, and the artifact a thoughtful auditor will start asking for. None of the artifacts are exotic. All seven are deliverable in a quarter if the trigger is named today.

# · The new addition · Why, and what auditors will ask · Framework

01 · "Annual pentest" cadence is no longer reasonable.
Why: Pentest On-Demand compresses a 35–100-day cycle into hours. SOC 2 CC7.1 and ISO 27001 A.8.8 ask for a vulnerability-management program "commensurate with risk." Continuous is now the commensurate cadence; annual is not. Expect language change in the auditor's request list, not the standard.
Framework: SOC 2 CC7.1 · ISO A.8.8

02 · Threat model must name agentic offensive AI as a threat actor.
Why: A threat catalog that stops at "external attacker · insider · nation-state" omits the actor that beat 9 of 10 human pentesters at $18/hr. ISO 27001 A.5.7 (threat intelligence) and the HIPAA Security Rule's 164.308(a)(1)(ii)(A) "accurate and thorough" risk analysis both require the model to reflect current capability. It does not.
Framework: ISO A.5.7 · HIPAA 164.308

03 · Vulnerability-scan frequency belongs in days, not quarters.
Why: CMMC SI.L2-3.14.1 and PCI DSS 11.3 set quarterly/annual floors. Floors are not ceilings; auditors will start treating the 2026 floors as 2014 floors. Document a continuous scanning cadence and the exception process for the windows where it doesn't run.
Framework: CMMC 3.14.1 · PCI 11.3

04 · Vendor-pentest evidence must be dated, not annual.
Why: Third-party risk programs that accept "vendor's last pentest report" satisfy SOC 2 CC9.2 and HIPAA 164.308(b) on paper. They no longer satisfy them in fact: a six-month-old report describes a network state two XBOW runs ago. Ask vendors when they last tested, not whether.
Framework: SOC 2 CC9.2 · HIPAA 164.308(b)

05 · Use of an agentic pentester is itself a processing activity.
Why: An agent reading production data is a data-processing activity under GDPR Art. 30 and a workforce-access event under HIPAA 164.308(a)(4). Sub-processor lists, BAAs, and the records-of-processing register all need a row that names the agent vendor, the data classes touched, and the retention.
Framework: GDPR Art. 30 · HIPAA 164.308(a)(4)

06 · AI-system inventory must list offensive AI tooling under your control.
Why: EU AI Act Title III dual-use scrutiny and NIST AI RMF GenAI Profile (NIST AI 600-1) §3 cyber-offense risks both need the inventory to identify any agentic offensive AI you operate. ISO 42001 A.6.1.4 (AI system impact assessment) is the right place to record the assessment.
Framework: EU AI Act · ISO 42001 · NIST AI 600-1

07 · Insurability: your carrier's renewal questionnaire already has this row.
Why: Cyber-insurance underwriters that didn't ask about agentic pentesting in 2024 will ask in 2026. Treat the renewal questionnaire as your forward-looking control catalog. The honest answers are usually "no" today; the roadmap answer is what the carrier wants to read.
Framework: Carrier policy · Sched. A
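For teams that keep the register in code or export it from a GRC tool, the seven rows above can be held as plain data and diffed against what is already tracked. This is a minimal sketch, not a schema any framework prescribes: the `RegisterRow` type, the shortened titles, and both helper functions are assumptions made for illustration, while the row numbers and framework tags are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterRow:
    number: str
    addition: str                 # shortened title, paraphrased from the table
    frameworks: tuple[str, ...]   # framework tags exactly as listed in the table


SEVEN_ADDITIONS = (
    RegisterRow("01", "Annual pentest cadence is stale", ("SOC 2 CC7.1", "ISO A.8.8")),
    RegisterRow("02", "Agentic offensive AI as a threat actor", ("ISO A.5.7", "HIPAA 164.308")),
    RegisterRow("03", "Scan frequency in days, not quarters", ("CMMC 3.14.1", "PCI 11.3")),
    RegisterRow("04", "Vendor-pentest evidence must be dated", ("SOC 2 CC9.2", "HIPAA 164.308(b)")),
    RegisterRow("05", "Agentic pentester as a processing activity", ("GDPR Art. 30", "HIPAA 164.308(a)(4)")),
    RegisterRow("06", "AI-system inventory lists offensive tooling", ("EU AI Act", "ISO 42001", "NIST AI 600-1")),
    RegisterRow("07", "Insurability and the renewal questionnaire", ("Carrier policy", "Sched. A")),
)


def missing_rows(tracked_numbers: list[str]) -> list[RegisterRow]:
    """Return the additions whose row number is not yet in the register."""
    tracked = set(tracked_numbers)
    return [row for row in SEVEN_ADDITIONS if row.number not in tracked]


def rows_for_framework(prefix: str) -> list[str]:
    """Row numbers that cite any framework tag starting with the given prefix."""
    return [
        row.number
        for row in SEVEN_ADDITIONS
        if any(tag.startswith(prefix) for tag in row.frameworks)
    ]


# Example: a register that already tracks five of the seven rows.
gaps = missing_rows(["01", "03", "04", "06", "07"])
print([row.number for row in gaps])   # ['02', '05']
print(rows_for_framework("HIPAA"))    # ['02', '04', '05']
```

Filtering by framework prefix is a convenience for scoping: an organization with no HIPAA exposure can see at a glance which rows still apply to it.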
§ 06 · The new audit question

The question your auditor will start asking in 2026.

Audit firms are conservative: they read what the standards say, then they read what the market does, and the request list follows the second. The mock below is the question we expect to see in management representation letters by mid-2026. Read it as a forward-looking control test, not a prediction.

If the answer is "we have not run an agentic pentest," that is fine, today, with the right compensating language. The wrong answer is silence.

Auditor / management-rep · 2026 Q3 · draft
07.4: Describe the cadence and scope of any agentic AI penetration testing performed against in-scope systems during the audit period, including the named system (e.g., XBOW, Hadrian Nova, Bishop Fox Cosmos), the engagement authorization, and the disposition of findings.
Maps to: CC7.1 · CC4.1
Evidence type: Agent-run log + remediation register
If not performed: Compensating control narrative
Status: draft · expected 2026-Q3
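The mock question implies a small, checkable response shape: either the named system, cadence, scope, authorization, and findings disposition are all present, or a compensating control narrative is. The sketch below encodes that rule under our own assumptions; the `AgenticPentestResponse` type and its field names are hypothetical and are not drawn from any audit firm's template.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgenticPentestResponse:
    question_id: str
    performed: bool
    system_name: Optional[str] = None            # e.g. the named agent platform
    cadence: Optional[str] = None
    scope: Optional[str] = None
    authorization_ref: Optional[str] = None      # pointer to the engagement authorization
    findings_disposition: Optional[str] = None   # pointer to the remediation register
    compensating_control_narrative: Optional[str] = None

    def gaps(self) -> list[str]:
        """Names of the fields an auditor would find empty for this answer."""
        if self.performed:
            required = ("system_name", "cadence", "scope",
                        "authorization_ref", "findings_disposition")
        else:
            required = ("compensating_control_narrative",)
        return [name for name in required if not getattr(self, name)]


# Silence: no test was run and nothing explains why.
silent = AgenticPentestResponse("07.4", performed=False)
print(silent.gaps())    # ['compensating_control_narrative']

# Acceptable today: no test was run, but the narrative is on file.
narrated = AgenticPentestResponse(
    "07.4",
    performed=False,
    compensating_control_narrative="Quarterly external scans plus annual manual pentest.",
)
print(narrated.gaps())  # []
```

The check deliberately treats "not performed" as a complete answer once the narrative exists, which mirrors the point above that the wrong answer is silence rather than "no."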
§ 07 · Retractions

Four positions we are willing to retract.

The systems and benchmarks in §02 and §03 are public and citable. The thesis, that agentic pentesting is a 2026 audit-finding lever, is ours. If the next twelve months show otherwise, we will say so in print, in the next volume's masthead.

  • If XBOW's 28-min-vs-40-hr result fails to replicate on a second independent benchmark this year, the §04 row #01 ("annual is stale") softens to "annual is not yet stale."
  • If Anthropic releases Mythos broadly without restriction, the §05 Q-03 dual-use posture argument needs to be re-grounded against a different precedent: likely OpenAI's o-series cyber-offense evals or a future EU AI Office determination.
  • If the AICPA explicitly clarifies that "annual pentest" remains the SOC 2 CC7.1 expectation regardless of agent capability, §04 row #01 inverts to a nice-to-have rather than a finding.
  • If a published court decision rejects Kovel-style structure for AI-agent pentest engagements in the First Circuit, §05 Q-04's privilege guidance is stale on issuance and needs to be rewritten before relying on it.

The 2025–2026 data (XBOW at #1, ARTEMIS at $18/hr, Mythos restricted) is unambiguous on capability. The GRC framing is the part that depends on how auditors and regulators move next. We will revise here if 2026 inverts it.

Run the seven additions against your register?

A 45-minute call. We walk your current control set against the seven rows in §04 and tell you which two are missing. No deck, no nurture sequence, no follow-up unless you reply.