AI Pentesting vs. AI Code Scanning

A bake-off between a next-gen pentester and a code scanner

Penetration testing and code scanning have long been staples in the defender’s toolkit. Automated code scanners can uncover coding errors at scale, but they’re notorious for being noisy. Manual penetration testing, on the other hand, excels at finding the vulnerabilities real attackers exploit, but it’s expensive and difficult to scale. Both approaches have their place in modern software development, and both are now being reshaped by AI.

When Anthropic rolled out a security scanner in Claude Code, I decided to pit it head-to-head against Shinobi, our AI-powered penetration tester. I wanted to see if AI-powered code scanning had finally closed the gap with AI pentesting.

You’d think that if AI can truly understand code, it should be able to spot vulnerabilities once invisible to legacy scanners. What I found instead was that the gap remains wide. There’s still something about testing a live, running application that uncovers serious bugs that code scanning alone can’t catch.

The Setup

We recently added an application to Shinobi’s “dojo”, the sandbox where it trains and practices. This was a complex application with hundreds of API endpoints and ten user roles, and it had never been tested before. I set Shinobi loose on it for a pentest, and I also ran Claude Code’s security scanner on its codebase.

There were no intentionally planted vulnerabilities. It was just “vibe coded,” and I was curious to see how each tool would fare in the wild.

Findings

Rather surprisingly, Claude’s security scanner flagged exactly one issue: an API key exposed in logs.

Given the state of vibe-coded apps, I had expected a goldmine of vulnerabilities. Instead, the scan results were almost eerily sparse. I checked which vulnerability classes the scanner was designed to detect (quite a lot, as it turns out), and yet it still surfaced only a single finding.
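For readers unfamiliar with the one class of bug the scanner did catch, here is a minimal sketch of how an API key typically leaks into logs, and one common way to redact it. The function names and field names are illustrative assumptions, not code from the actual application:

```python
import logging

logging.basicConfig(level=logging.INFO)

def redact(params: dict, secret_keys=("api_key", "token", "password")) -> dict:
    """Return a copy of `params` that is safe to log: secret values are masked."""
    return {k: ("***" if k in secret_keys else v) for k, v in params.items()}

params = {"endpoint": "/v1/users", "api_key": "sk-live-123"}

# The flaw: the whole params dict, secret included, ends up in the log file.
logging.info("outbound request: %s", params)

# The fix: mask secrets before they cross into the logging layer.
logging.info("outbound request: %s", redact(params))
```

A scanner can flag this pattern statically because the leak is visible in the code itself, which is exactly why it is the kind of finding code scanning is good at.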

Shinobi’s pentest results, however, were a different story. Within minutes, it had uncovered a registration flaw that allowed anyone to create an account with any role, including admin. This was like the tactical nuke of vulnerabilities, the one bug to rule the entire app. At this point it was pretty much game over, and I stopped looking at the rest of the pentest findings.

Essentially, Shinobi had reasoned, “are there any privilege escalation paths in this application?”, then proved there were by launching targeted attacks against the running application.
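The post doesn’t show the vulnerable code, but registration flaws of this class are often a form of mass assignment: the handler copies the entire request body into the new account record, so a caller can smuggle in a `role` field. A hypothetical sketch, with made-up function and field names, assuming that’s the shape of the bug:

```python
DEFAULT_ROLE = "user"

def register(request_body: dict) -> dict:
    """Vulnerable sketch: trusts every field in the request body."""
    account = {"role": DEFAULT_ROLE}
    account.update(request_body)  # attacker-supplied "role": "admin" overwrites the default
    return account

def register_fixed(request_body: dict) -> dict:
    """Safer sketch: only whitelisted fields cross the trust boundary."""
    allowed = {"username", "email", "password"}
    account = {k: v for k, v in request_body.items() if k in allowed}
    account["role"] = DEFAULT_ROLE  # role is always assigned server-side
    return account
```

The code looks innocuous in isolation, which is why a live probe (POST a registration request with `"role": "admin"` and see what comes back) surfaces it so much more reliably than reading the source.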

Out of curiosity, I asked Claude Code the same question. It still didn’t find the vulnerability, though it did suggest some hardening measures.

Thoughts

AI-powered code scanners will continue to improve at spotting technical flaws and coding errors. But the vulnerabilities that most often compromise modern applications aren’t syntax issues; they’re failures in authentication, authorization, and business logic, i.e. semantic vulnerabilities.

Detecting them requires understanding the developer’s intent, which is far easier to infer once the application is deployed and running than by merely reading the code.

AI code scanning has promise, especially for catching low-hanging fruit, and it’s already more powerful than today’s static analysis tools. But AI-powered penetration testing remains unmatched for uncovering the kind of high-impact flaws that affect modern applications. And now it’s faster, more surgical, and built to scale like never before.
