CursorBench 70%: what it measures and why it matters
CursorBench is the benchmark developed by Cursor — one of the most widely used AI-assisted code editors in the world — to measure a model's ability to complete real programming tasks. It's not an academic benchmark: it measures the percentage of tasks successfully completed on concrete code problems, with the development environment as context.
Opus 4.7 reaches 70% on CursorBench, against Opus 4.6's 58%. Twelve percentage points of difference. To contextualize: every percentage point on this type of benchmark corresponds to a class of real problems that the model can or cannot solve autonomously. A gain of 12 points on a task completion benchmark is significant.
70% doesn't mean 30% of responses are wrong: it means 30% of tasks require human intervention, review or additional iteration. The practical question for a development team is: which tasks? For routine tasks — completing functions, writing unit tests, documenting existing code — the actual success rate is typically higher. For structurally complex tasks — refactoring legacy systems, debugging race conditions, optimizing queries on unfamiliar codebases — the rate may be lower.
Using Claude in the development cycle is covered in the Claude Code for enterprises article.
Rakuten-SWE-Bench: 3x tasks solved in production
The most significant result in Opus 4.7's coding benchmarks is not CursorBench, but Rakuten-SWE-Bench. This benchmark, a variant of SWE-Bench adapted by Rakuten, measures the ability to resolve real issues on production repositories: not synthetic problems constructed for evaluation, but bug reports and feature requests drawn from real GitHub repositories.
Opus 4.7 solves 3x more tasks than Opus 4.6 on this benchmark. Tripling the resolution rate on real problems is a qualitative, not just quantitative, leap. It means an entire class of problems that Opus 4.6 couldn't resolve autonomously — because they required understanding the repository context, multi-file reasoning or analysis of complex dependencies — is now within reach of Opus 4.7.
The practical meaning for development teams is direct: if you use Claude to assist in resolving bugs or implementing features on existing codebases, Opus 4.7 will produce complete and correct solutions for a significantly larger number of cases. Fewer iterations, less manual review of output, more tasks resolved autonomously.
An element to keep in mind: Rakuten-SWE-Bench measures resolution on real production repositories, which tend to be more complex than sample projects. This makes the benchmark more predictive of actual behavior than benchmarks built on academic problems.
Want to integrate Claude Opus 4.7 in your development process?
30 minutes to discuss your specific case.
CodeRabbit and automated code review: +10% recall
CodeRabbit is a platform specialized in AI-powered automated code review. It integrates Claude to analyze pull requests, identify code problems, suggest improvements and verify compliance with team standards. Their benchmark measures recall — the percentage of problems actually present in the code that Claude can identify.
Opus 4.7 records an improvement of more than 10% in recall compared to Opus 4.6 on this benchmark. Read as percentage points, that means: out of 100 problems present in the code, Opus 4.7 identifies roughly 10 more than its predecessor. For code review, recall is the critical metric: an unidentified problem is a bug that reaches production.
The types of problems where improvement is most marked include: security patterns (injection, XSS, CSRF, hardcoded credentials), race conditions and concurrency issues, violations of architectural invariants, non-obvious performance problems, and inconsistencies between documentation and implementation.
For teams using Claude for code review — directly via API or through platforms like CodeRabbit — Opus 4.7 means fewer false negatives, meaning fewer problems that pass unnoticed. Integrating Claude into the code review process is discussed in the Claude vs Copilot article, which also analyzes the differences in approach between the two tools.
How the development workflow changes with Opus 4.7
Translated into practice, these benchmarks suggest specific usage patterns where Opus 4.7 delivers the largest gains.
The most immediate use case is debugging on complex codebases. Given the 3x result on Rakuten-SWE-Bench, Opus 4.7 can resolve issues that require contextual understanding of the repository: complex stack traces, bugs emerging from interactions between components, regressions that don't reproduce in isolation. The typical workflow: collect the relevant context (stack trace, source of the modules involved, failing tests), ask Opus 4.7 to diagnose the problem and propose a solution, then review the proposed fix.
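The debugging workflow above can be sketched in a few lines. This is a minimal illustration, not a documented integration: the prompt structure is an assumption, and the model identifier `claude-opus-4-7` in the commented API call is a placeholder, not a confirmed name.

```python
# Assemble the repository context described in the workflow: stack trace,
# source of the modules involved, and the failing test, in one prompt.

def build_debug_prompt(stack_trace: str, module_sources: dict[str, str],
                       failing_test: str) -> str:
    """Build a single diagnostic prompt from the relevant debugging context."""
    parts = ["Diagnose the bug below and propose a fix.", "",
             "## Stack trace", stack_trace, ""]
    for path, source in module_sources.items():
        parts += [f"## {path}", source, ""]
    parts += ["## Failing test", failing_test]
    return "\n".join(parts)

# Sending it to the model (requires the `anthropic` package and an API key;
# the model name is a placeholder assumption):
#
# import anthropic
# client = anthropic.Anthropic()
# reply = client.messages.create(
#     model="claude-opus-4-7",
#     max_tokens=4096,
#     messages=[{"role": "user",
#                "content": build_debug_prompt(trace, sources, test)}],
# )
# print(reply.content[0].text)
```

The value of the pattern is that the model sees the same evidence a human debugger would start from, in one coherent request.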
The second use case is structural refactoring. Opus 4.7 with a 1 million token context window can analyze entire modules or services, identify architectural improvement opportunities and propose a coherent refactoring plan. Not the type of refactoring done on 200 lines of code, but the kind requiring a holistic view across thousands of lines.
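Before sending entire modules for refactoring analysis, it helps to check that they actually fit in the context window. A rough sketch, assuming the common 4-characters-per-token rule of thumb (an approximation; a real tokenizer gives exact counts):

```python
import pathlib

CONTEXT_LIMIT = 1_000_000   # 1M-token window mentioned above
CHARS_PER_TOKEN = 4         # rule-of-thumb estimate, not a tokenizer count

def estimate_tokens(paths: list[str]) -> int:
    """Approximate the token count of a set of source files."""
    total_chars = sum(
        len(pathlib.Path(p).read_text(encoding="utf-8", errors="ignore"))
        for p in paths
    )
    return total_chars // CHARS_PER_TOKEN

def fits_in_context(paths: list[str], reserve: int = 50_000) -> bool:
    """Leave `reserve` tokens of headroom for the prompt and the answer."""
    return estimate_tokens(paths) <= CONTEXT_LIMIT - reserve
```

If the module set does not fit, the practical fallback is to send interfaces plus the files under change rather than full sources.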
The third use case is test generation. With 70% on CursorBench, Opus 4.7 can generate comprehensive test suites for complex functions, including non-obvious edge cases, performance tests and mocks of external dependencies. On complex code, the quality of the generated tests is significantly higher than what Opus 4.6 produced.
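As an illustration of the edge-case coverage one would ask the model to produce: `parse_version` below is a hypothetical example function, not from the article, and the suite shows the non-obvious cases (empty input, malformed segments, extra components) a generated test set should include.

```python
def parse_version(s: str) -> tuple:
    """Parse a strict 'major.minor.patch' version string into a tuple of ints."""
    parts = s.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"invalid version string: {s!r}")
    return (int(parts[0]), int(parts[1]), int(parts[2]))

def test_parse_version():
    # Happy path and boundary values
    assert parse_version("1.2.3") == (1, 2, 3)
    assert parse_version("0.0.0") == (0, 0, 0)
    assert parse_version("10.04.2") == (10, 4, 2)  # leading zeros accepted
    # Malformed inputs must all be rejected
    for bad in ["", "1.2", "1.2.3.4", "1.x.3", "-1.2.3", "1..3"]:
        try:
            parse_version(bad)
        except ValueError:
            pass
        else:
            raise AssertionError(f"expected ValueError for {bad!r}")

test_parse_version()
```

The rejected-input loop is the part routine prompting tends to miss; asking explicitly for "inputs that must fail" is what turns a generated suite from superficial into useful.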
For code review in pull requests, the +10% improvement in recall translates to more complete feedback and fewer problems reaching merge. Direct integration with GitHub, GitLab or Bitbucket via API or tools like CodeRabbit is the most effective way to insert this into the existing development pipeline.
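One way to feed a pull-request diff to the model is file by file, which keeps each review prompt focused. A minimal sketch: splitting on `diff --git` headers follows the standard unified-diff layout, but the per-file strategy and the prompt wording are illustrative assumptions, not a documented CodeRabbit or GitHub integration.

```python
def split_unified_diff(diff: str) -> list[str]:
    """Split a unified diff into one chunk per changed file."""
    chunks, current = [], []
    for line in diff.splitlines():
        # Each changed file in a git diff starts with a "diff --git" header.
        if line.startswith("diff --git") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

def build_review_prompts(diff: str) -> list[str]:
    """One review prompt per changed file, targeting the problem classes above."""
    return [
        "Review this change. Flag security issues, race conditions, "
        "architectural violations, and performance problems:\n\n" + chunk
        for chunk in split_unified_diff(diff)
    ]
```

Per-file prompts trade some cross-file context for focus; for changes whose risk spans files, sending the whole diff in one request is the alternative.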
When to use Opus 4.7 vs Sonnet for coding
Opus 4.7's coding benchmarks are impressive, but they don't imply it should be used for everything. The choice between Opus 4.7 and Sonnet for coding tasks depends on task complexity and cost constraints.
Opus 4.7 is the right choice for: debugging on complex codebases with many dependencies, structural refactoring on large systems, code review of pull requests on critical code, code generation for systems where correctness is essential (security, financial data, critical infrastructure), analysis of poorly documented legacy codebases.
Sonnet remains the optimal choice for: completing simple functions, generating boilerplate, documenting existing code, writing tests for functions with clear interfaces, answering medium-complexity coding questions. Sonnet has a significantly lower cost per token and produces comparable quality output on these tasks.
An effective pattern for development teams is complexity-based model routing: a simple classifier determines whether the task requires Opus 4.7 (multi-file codebase, complex bugs, architectural analysis) or whether Sonnet is sufficient. This approach optimizes the quality-cost ratio without sacrificing quality where it matters. For teams wanting to deepen Claude integration in the development cycle, Maverick AI offers specific consulting.
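The routing pattern above can be sketched with a simple heuristic. The keyword list and the model identifiers are illustrative assumptions; a production router might instead use file counts from the task metadata, a cheap classifier model, or repository signals.

```python
# Signals that suggest a task needs the stronger model: structural work,
# concurrency, legacy code, multi-file scope. Purely illustrative.
COMPLEX_SIGNALS = (
    "refactor", "race condition", "deadlock", "architecture",
    "legacy", "regression", "multi-file", "migration",
)

def route_model(task_description: str, files_touched: int = 1) -> str:
    """Pick a model for a coding task (model names are placeholders)."""
    text = task_description.lower()
    if files_touched > 3 or any(sig in text for sig in COMPLEX_SIGNALS):
        return "claude-opus-4-7"   # complex: structural, multi-file, risky
    return "claude-sonnet"         # routine: boilerplate, docs, simple tests
```

Even a crude router like this moves the bulk of routine traffic to the cheaper model while reserving the stronger one for the tasks where the benchmark gains actually matter.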