The Notion Agent benchmark: what the +14% measures
Notion developed a dedicated benchmark to measure AI models' ability to complete multi-step workflows within its environment. It's not about answering a single question: the model must complete sequences of related actions that require planning, memory of context between steps, and the ability to correct intermediate errors.
On this benchmark, Opus 4.7 records a 14% improvement over Opus 4.6. For an agentic workflow, a 14% gain in completion rate translates to a significantly larger number of tasks the agent completes autonomously without requiring human intervention.
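To see why a completion-rate gain matters more than it looks, consider the effect on human interventions. The article does not report the baseline rate, so the 70% used below is an assumption purely for illustration:

```python
# Hypothetical numbers: the article reports a 14% gain but not the
# baseline completion rate, so 0.70 here is an assumption.
baseline_rate = 0.70
improved_rate = baseline_rate + 0.14  # treating the gain as absolute points

tasks = 1000
interventions_before = tasks * (1 - baseline_rate)  # tasks needing a human
interventions_after = tasks * (1 - improved_rate)

reduction = 1 - interventions_after / interventions_before
print(f"{reduction:.0%} fewer human interventions")  # prints "47% fewer human interventions"
```

With these assumed numbers, a 14-point gain nearly halves the tasks that fall back to a human, which is why completion rate compounds in agentic settings.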
The types of tasks measured by Notion Agent include: researching and synthesizing information from multiple sources, updating documents based on complex instructions, creating document structures from specifications, and managing multi-step editorial workflows. These are representative of daily enterprise use of an AI assistant on knowledge management tools.
For those using Claude in agentic mode — not as a chatbot but as an agent completing tasks autonomously — this +14% is the most relevant Opus 4.7 data point, alongside the introduction of the `xhigh` level in effort control. The Claude Managed Agents article covers the general framework for building agents on Claude.
Multi-day enterprise workflows: spreadsheets, slides and documents
Anthropic identified a specific category of use cases for Opus 4.7: multi-day enterprise workflows on productivity tools — complex spreadsheets, PowerPoint presentations, policy documents or structured reports.
What makes these workflows different from a simple conversational interaction? Duration and required coherence. An agent working on a complex financial model must maintain consistency between assumptions used in different file sections, remember decisions made in previous steps and adapt subsequent work based on intermediate results. A standard conversational session isn't designed for this type of task.
Opus 4.7 is built to handle this type of scenario: the 1 million token context window ensures the entire workflow history remains accessible to the model, and improvements to reasoning allow maintaining coherence over long and complex reasoning chains.
Concrete cases include: iterative construction and updating of financial models (relevant for the workflow described in the Claude for financial modelling article), preparing due diligence reports on multiple documents, iterative generation and revision of executive presentations, and managing document update workflows in compliance and regulatory contexts.
Building AI agents for your enterprise?
30 minutes to discuss your specific case.
Effort control xhigh: calibrating reasoning and latency
The `xhigh` level is the most significant technical innovation in Opus 4.7 for those building AI agents. To understand its value, it helps to review Claude's effort control system.
Effort control is the parameter that determines how much the model invests in reasoning before responding. At the `low` level, the model responds quickly with minimal reasoning, which suits simple tasks. At the `max` level, the model invests the most in reasoning, which suits the most complex tasks but comes with higher latency. The intermediate levels let you calibrate this tradeoff.
With Opus 4.7, the levels are: `low`, `medium`, `high`, `xhigh`, `max`. The new `xhigh` sits between `high` and `max`, offering a balance point that didn't exist before.
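A minimal sketch of what setting the effort level might look like in a request. The `effort` field name and the `claude-opus-4-7` model id are assumptions for illustration; check the current API reference for the actual parameter names:

```python
def build_request(prompt: str, effort: str = "high") -> dict:
    """Build a request payload with an explicit effort level.

    The field names here are hypothetical, not a confirmed SDK schema.
    """
    levels = ("low", "medium", "high", "xhigh", "max")
    if effort not in levels:
        raise ValueError(f"effort must be one of {levels}, got {effort!r}")
    return {
        "model": "claude-opus-4-7",  # placeholder model id
        "effort": effort,            # assumed parameter name
        "messages": [{"role": "user", "content": prompt}],
    }
```

Validating the level client-side avoids silently falling back to a default when a typo like `"xHigh"` slips into a pipeline config.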
In an agentic workflow, this is particularly useful. Steps in an agent don't all have the same criticality: some require deep reasoning (analyzing a complex document, deciding how to proceed when facing ambiguity), others require only rapid execution (formatting output, calling an API, saving a result). With `xhigh`, critical steps can use more reasoning without paying the `max` latency cost for the entire pipeline.
Recommended pattern: use `max` only for initial workflow planning and critical decisions; `xhigh` for complex analysis steps; `high` for standard execution. This approach optimizes both output quality and total task completion time.
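The recommended calibration can be expressed as a simple lookup from step type to effort level. The step-type names below are illustrative, not part of any SDK:

```python
# Map each step type in the pipeline to an effort level,
# following the pattern described above.
EFFORT_BY_STEP = {
    "planning": "max",     # initial workflow planning, critical decisions
    "analysis": "xhigh",   # complex analysis steps
    "execution": "high",   # standard execution
    "formatting": "low",   # trivial output shaping, API calls, saves
}

def effort_for(step_type: str) -> str:
    # Default to "high" for unknown step types rather than failing mid-run.
    return EFFORT_BY_STEP.get(step_type, "high")
```

Keeping this mapping in one place makes the latency/quality tradeoff auditable: changing one dictionary entry retunes the whole pipeline.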
Building enterprise agents with Opus 4.7: practical patterns
Opus 4.7's improvements for agents translate into specific implementation patterns worth knowing.
The first pattern is hierarchical planning: instead of giving the agent a complex task to execute in a single step, use Opus 4.7 with `max` effort to decompose the task into a structured plan with explicit steps, then execute steps with effort levels calibrated to their complexity. This approach improves overall workflow coherence and reduces intermediate errors.
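A sketch of hierarchical planning under stated assumptions: `call_model` is a stub standing in for a real Opus 4.7 API call, and the plan steps are hardcoded where production code would parse the planner's output:

```python
from dataclasses import dataclass

@dataclass
class Step:
    description: str
    effort: str

def call_model(prompt: str, effort: str) -> str:
    # Stub: a real implementation would call the API here.
    return f"[{effort}] response to: {prompt}"

def plan_task(task: str) -> list[Step]:
    # One max-effort call decomposes the task into explicit steps.
    call_model(f"Decompose into steps: {task}", effort="max")
    # In practice you would parse the planner's output; hardcoded here.
    return [
        Step("analyze source documents", "xhigh"),
        Step("draft the report", "high"),
        Step("format the output", "low"),
    ]

def run_task(task: str) -> list[str]:
    # Execute each step at the effort level the plan assigned it.
    return [call_model(s.description, s.effort) for s in plan_task(task)]
```

The key property is that the expensive `max` call happens exactly once, at planning time, while execution steps run at calibrated lower levels.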
The second pattern is intermediate verification: at defined intervals in the workflow, use Opus 4.7 to verify the coherence of the current state relative to the initial objective. With Notion Agent's +14%, Opus 4.7 is significantly more reliable in detecting deviations from the objective and proposing course corrections.
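The verification pattern can be sketched as a checkpoint inserted every N steps. Here `verify` is a stub heuristic; in a real workflow it would be a model call comparing the current state against the initial objective:

```python
def verify(objective: str, state: str) -> bool:
    # Stub: a real check would be an Opus 4.7 call, not substring matching.
    return objective.lower() in state.lower()

def run_with_checkpoints(objective: str, steps, interval: int = 2) -> str:
    """Run steps sequentially, verifying coherence every `interval` steps."""
    state = objective
    for i, step in enumerate(steps, start=1):
        state = step(state)
        if i % interval == 0 and not verify(objective, state):
            # In production: trigger a course correction, not just an error.
            raise RuntimeError(f"drift detected at step {i}")
    return state
```

The interval is a cost lever: tighter checkpoints catch drift earlier but add verification calls to the token budget.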
The third pattern is specialization-based multi-agent: different specialized agents (coding, document analysis, research, writing) orchestrated by a coordinator agent using Opus 4.7 for routing decisions. The 1 million token context window allows the coordinator agent to maintain full context of all specialized agents.
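A minimal sketch of the coordinator pattern. The agent names and the keyword routing below are illustrative assumptions; in production the routing decision would itself be an Opus 4.7 call with the full shared context:

```python
# Specialized agents as simple callables (stubs for real sub-agents).
AGENTS = {
    "coding": lambda task: f"coding agent handles: {task}",
    "research": lambda task: f"research agent handles: {task}",
    "writing": lambda task: f"writing agent handles: {task}",
}

def route(task: str) -> str:
    # Keyword heuristic standing in for a model-based routing decision.
    lowered = task.lower()
    if any(k in lowered for k in ("bug", "refactor", "code")):
        return "coding"
    if any(k in lowered for k in ("find", "sources", "research")):
        return "research"
    return "writing"

def dispatch(task: str) -> str:
    return AGENTS[route(task)](task)
```

Separating `route` from `dispatch` keeps the coordinator's decision observable, which matters for the oversight concerns discussed later in this article.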
For enterprises wanting to build on these patterns, the how to implement AI agents with Claude guide is the operational reference. Maverick AI designs and implements enterprise agentic architectures with Claude Opus 4.7 — if you're building agentic workflows for your organization, let's talk.
Current limitations of agents with Opus 4.7
Opus 4.7's improvements don't eliminate the structural limitations of current agentic workflows. An honest assessment is necessary before planning production deployments.
The first limitation is cost. An agent using Opus 4.7 for long, complex tasks consumes a significant volume of tokens — both in input (the entire workflow context) and output (reasoning steps and intermediate outputs). For high-volume workflows, the cost can be substantial. API budget planning is a necessary step before deployment.
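A back-of-envelope estimate makes the cost dynamics concrete. The per-token prices below are placeholders, not actual Opus 4.7 pricing:

```python
# Placeholder prices in USD per million tokens -- NOT real pricing.
PRICE_IN_PER_MTOK = 15.0
PRICE_OUT_PER_MTOK = 75.0

def workflow_cost(steps: int, in_tokens_per_step: int, out_tokens_per_step: int) -> float:
    """Estimate the API cost of a multi-step agentic workflow."""
    total_in = steps * in_tokens_per_step
    total_out = steps * out_tokens_per_step
    return (total_in / 1e6) * PRICE_IN_PER_MTOK + (total_out / 1e6) * PRICE_OUT_PER_MTOK

# A 40-step workflow carrying 200k tokens of rolling context per step:
print(f"${workflow_cost(40, 200_000, 4_000):.2f}")
```

Note that input tokens dominate: the rolling context is re-sent at every step, so long workflows scale in cost with context size times step count, not just with output length.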
The second limitation is latency. Even with `xhigh` instead of `max`, complex agentic workflows have latencies of minutes, not seconds. For tasks where time-to-result is critical (real-time responses, interactive user interfaces), agents with Opus 4.7 are not the appropriate solution.
The third limitation is error handling. Notion Agent's +14% means Opus 4.7 completes more tasks successfully — but the remaining failure rate on complex tasks is not negligible. Production agentic workflows must include robust error handling, manual fallbacks and output monitoring.
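A minimal sketch of the fallback logic this implies: bounded retries around each agent step, with escalation to a human queue when retries are exhausted. `StepFailed` and the callable signatures are assumptions for illustration:

```python
class StepFailed(Exception):
    """Raised when an agent step produces an unusable result."""

def with_retries(step, max_attempts: int = 3, on_exhausted=None):
    """Run `step`, retrying on failure; escalate when attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except StepFailed:
            if attempt == max_attempts:
                if on_exhausted is not None:
                    return on_exhausted()  # e.g. queue for human review
                raise
```

The `on_exhausted` hook is where the "manual fallback" lives: rather than failing silently or looping forever, the workflow hands the task off with its context attached.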
The fourth limitation is oversight. Agents that complete long tasks autonomously produce output that is difficult to verify in its entirety. Defining control checkpoints and output verification criteria is an essential part of responsible agentic workflow design.