84% on Firefox 147: the number that changes the conversation
On a real Firefox 147 exploitation benchmark, Claude Mythos Preview achieves 84.0% success. Claude Opus 4.6, the previous leading model, stops at 15.2%.
This is not an incremental improvement. It is a qualitative leap that places Mythos in a different category.
On CyberGym Vulnerability Reproduction — a benchmark of real vulnerabilities to reproduce in a controlled environment — Mythos reaches 83.1% versus 66.6% for Opus 4.6. The gap is clear but narrower. The Firefox benchmark is the more striking result: almost six times the success rate of its predecessor, on end-to-end exploitation of a modern browser with all protections active.
To understand what this means, you need to look at the method. Not the performance, the method.
The vulnerability categories Mythos identifies
The profile documented by Anthropic covers several categories, including some of the most complex in the offensive security landscape.
Buffer overflow with signed integer overflow. A concrete example: the 27-year-old bug in OpenBSD, where a null-pointer dereference arises from an overflow in a sequence-number comparison. Not an obvious error, but the kind of bug that survives decades of review because it only surfaces under specific conditions.
Use-after-free and out-of-bounds read/write. Memory accesses after deallocation, reads and writes outside bounds — the source of most critical vulnerabilities in modern browsers.
Heap corruption with cross-cache reclamation. Techniques that exploit memory allocator behavior to overwrite critical data structures.
Combined multi-vulnerability attacks: JIT heap spray combining four distinct vulnerabilities, browser sandbox escape with renderer-to-OS privilege escalation, ROP chain distributed across multiple network packets. All on hardened systems with ASLR, stack protection and W^X active.
For logic vulnerabilities: authentication bypass, CSRF, injection, weaknesses in TLS/AES-GCM/SSH. For the kernel: KASLR bypass via deliberate kernel pointer disclosure to userspace.
One case stands out for its technical clarity: the FFmpeg H.264 bug, where a slice-number sentinel collision causes a heap out-of-bounds write due to a mismatch between 16-bit and 32-bit handling of the counter.
How the process works: from analysis to working exploit
The method has a recognizable structure. Understanding it is useful not just for evaluating Mythos' capabilities, but for understanding how a technical team can use similar approaches with models available today.
The first phase is source code analysis with hypothesis generation. The model does not mechanically scan for known patterns. It builds a mental model of the system — how components interact, where data flows, what implicit assumptions might be violated — and generates hypotheses about where problems might be hiding.
The second phase is dynamic testing with a debugger. Hypotheses are verified in a containerized environment, with runtime behavior analysis.
The third phase, the one that distinguishes Mythos from the previous model, is triage. Sonnet 4.6 improves only when the main bugs are removed from its context — it lacks an effective mechanism for autonomously filtering the most promising leads. Mythos immediately identifies the most effective vectors, automatically discards low-criticality findings, and converges on the vulnerabilities worth developing.
At industrial scale: approximately 1,000 scans on OpenBSD at a cost of $20,000, with dozens of real findings as a result.
Reverse engineering from binaries: a new and important capability
Among the documented capabilities, one deserves particular attention for its practical implications.
Mythos can reconstruct plausible source code from stripped binaries — executables from which debug information has been removed. Starting from machine code, it reconstructs the program's logic, data structures, and the programmer's implicit assumptions. Then it looks for vulnerabilities in this reconstruction.
What this means in practice: security research becomes possible on closed-source firmware, on libraries distributed only in compiled form, and on third-party components for which you do not have the source.
This changes the perimeter of code review. You are no longer limited to code you own. Any binary that enters the system — a dependency, a hardware component, a plugin — becomes analyzable.
For teams working on supply chain security or analysis of legacy components, this capability opens a scenario that until recently required specialized experts and much longer timeframes.
What changes for code review and secure development in teams
Mythos is not available in production. But the capabilities it demonstrates indicate a direction that technical teams can start pursuing with models available today.
Pre-commit and pull request review: integrating systematic security analysis into the development workflow, not as occasional manual review but as an automatic process on every change.
Vulnerability triage: when working on legacy codebases or analyzing dependencies, the ability to prioritize findings by real impact — rather than nominal severity — reduces time wasted on theoretical problems with low exploitation probability.
Contextual training: understanding how an exploit works on code similar to what you write every day changes how you write secure code. It is not abstract theory, it is pattern recognition applied to your own context.
Prompt engineering for code security is an area where investment in know-how produces measurable results in a short time.
How to train your team on Claude for code security
The gap between what AI models can do for code security and what technical teams actually use is still wide. Not because of a lack of model capabilities, but because of a lack of method and practice.
Using Claude for code review requires knowing how to structure requests, how to provide context, how to interpret results, and where the model tends to go wrong. It is not complicated, but it requires practice on real cases.
Maverick AI workshops for technical teams start here: not from theory about models, but from direct application to the team's code. We work on Claude Code for source analysis, build an AI-assisted code review workflow suited to the specific context, and practice on vulnerability types analogous to those present in the codebase.
The goal is not for the team to know what Mythos can do. It is for them to use Claude every day to write more secure code, find problems before they reach production, and reduce time spent on manual code review of patterns that a model recognizes in seconds.
If you want to understand how to structure such a path for your team, let's discuss it.