The visual acuity jump: what XBOW measured and what it means
XBOW is a benchmark specialized in measuring visual acuity — the ability of an AI model to correctly interpret visual content in documents: handwritten text, non-natively digital tables, charts, diagrams, chemical structures, technical schemas. It is one of the most relevant tests for those using AI in professional document contexts where not everything is natively digital.
Anthropic's published results for Opus 4.7 are unambiguous: 98.5% against 54.5% for Opus 4.6, a gain of 44 percentage points. To put that in context: at 54.5%, nearly half of visual interpretations (45.5%) were wrong or incomplete, a reliability level insufficient for any professional use on critical documents. At 98.5%, the error rate on the benchmark falls to 1.5%, which approaches the precision needed for operational use.
The improvement is not just in the raw accuracy rate. Opus 4.7 supports images up to 2,576 pixels on the long side, equivalent to about 3.75 megapixels — more than three times the limit of previous Claude models. This means A4 documents scanned at high resolution, whiteboard photos, complex interface screenshots and detailed diagram images can be processed without needing prior resolution reduction.
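As a rough illustration of what the pixel limit means in practice, the sketch below computes the dimensions a scan would need to be reduced to so that its long side fits within 2,576 pixels. The function name and the decision to preserve aspect ratio are assumptions made for this example; the article does not describe how oversized images are actually handled.

```python
MAX_LONG_SIDE = 2576  # pixel limit quoted above for Opus 4.7


def fit_long_side(width, height, max_long_side=MAX_LONG_SIDE):
    """Return (width, height) scaled down so the long side fits the limit.

    Images already within the limit are returned unchanged.
    Aspect ratio is preserved (an assumption of this sketch).
    """
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    long_side = max(width, height)
    if long_side <= max_long_side:
        return width, height
    scale = max_long_side / long_side
    return max(1, round(width * scale)), max(1, round(height * scale))


# An A4 page scanned at 300 DPI is 2480 x 3508 pixels and needs reducing;
# the same page at 200 DPI (1654 x 2339) fits as it is.
a4_300dpi = fit_long_side(2480, 3508)
a4_200dpi = fit_long_side(1654, 2339)
```

In other words, an A4 scan at 200 DPI passes through untouched, while a 300 DPI scan exceeds the limit on its long side.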
For those working with physical documents or historical archives that were never digitized natively, this change turns Opus 4.7 from an experimental tool into an operational one.
Paper invoices and administrative documents: the most immediate use case
Processing paper invoices is one of the most labor-intensive business processes in any organization. Many companies — especially SMEs and companies with suppliers that haven't yet adopted electronic invoicing — still handle a significant volume of scanned paper documents.
With Opus 4.6, results on data extraction from scanned invoices were variable: non-standard layouts, unusual fonts, low scan quality or document rotation frequently generated errors in extracting amounts, dates, tax codes or item descriptions. An error rate of about 45.5%, implied by the 54.5% accuracy on XBOW, required systematic human review of all documents.
With Opus 4.7's 98.5%, the workflow changes. Human review isn't eliminated for critical documents — an invoice with a wrong amount has real consequences — but systematic review can be reduced to sampling review on at-risk documents. For companies processing hundreds of invoices per month, the time saving is measurable.
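One possible reading of "sampling review" is sketched below: every document marked as at-risk is always reviewed, and a random fraction of the remaining documents is spot-checked. The function name, the 10% default rate and the idea of sampling the non-flagged documents are assumptions made for illustration, not a procedure described by Anthropic or in this article.

```python
import random


def select_for_review(doc_ids, at_risk_ids, sample_rate=0.1, seed=0):
    """Pick every at-risk document plus a random sample of the rest.

    A fixed seed keeps the selection reproducible for audit purposes.
    """
    known = set(doc_ids)
    at_risk = set(at_risk_ids) & known
    rest = [d for d in doc_ids if d not in at_risk]
    rng = random.Random(seed)
    k = min(len(rest), round(len(rest) * sample_rate))
    sampled = rng.sample(rest, k)
    return sorted(at_risk) + sorted(sampled)


# Hypothetical month of 100 invoices, 5 of them flagged as at-risk.
invoices = [f"inv-{i:03d}" for i in range(100)]
flagged = invoices[:5]
to_review = select_for_review(invoices, flagged, sample_rate=0.2, seed=1)
```

With these illustrative numbers, 24 documents go to a human instead of 100; the right sampling rate depends on how costly an undetected error is.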
A practical workflow: scan the document, send to Opus 4.7 with a structured prompt for extracting relevant fields (supplier, date, amount, VAT, description), output in structured JSON, automatic insertion into the management system with flags for human review on low-confidence cases. This type of integration requires custom development, but the article on how to integrate Claude in your company offers a practical starting point.
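The prompt and the post-processing step of this workflow can be sketched as follows. This is a minimal example under stated assumptions: the field names mirror the list above, the prompt wording is illustrative, and the call that actually sends the scan to the model is deliberately left out. The function names build_prompt and parse_invoice_reply are hypothetical.

```python
import json

REQUIRED_FIELDS = ("supplier", "date", "amount", "vat", "description")


def build_prompt():
    """Illustrative extraction prompt; the wording is an assumption."""
    fields = ", ".join(REQUIRED_FIELDS)
    return (
        "Extract the following fields from the attached invoice scan and "
        f"reply with a single JSON object with the keys: {fields}. "
        "Use null for any field you cannot read."
    )


def parse_invoice_reply(reply_text):
    """Parse the model's JSON reply and flag it for human review if needed."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        return {"fields": {}, "needs_review": True,
                "reasons": ["reply is not valid JSON"]}
    if not isinstance(data, dict):
        return {"fields": {}, "needs_review": True,
                "reasons": ["reply is not a JSON object"]}
    # A missing, null or empty field is a reason for review.
    reasons = [f"missing field: {name}" for name in REQUIRED_FIELDS
               if data.get(name) in (None, "")]
    amount = data.get("amount")
    if amount is not None and (isinstance(amount, bool)
                               or not isinstance(amount, (int, float))):
        reasons.append("amount is not a number")
    return {"fields": data, "needs_review": bool(reasons), "reasons": reasons}


# Hypothetical replies, for illustration only.
ok = parse_invoice_reply(
    '{"supplier": "Acme Srl", "date": "2024-03-01", "amount": 1220.0, '
    '"vat": 220.0, "description": "Consulting"}'
)
incomplete = parse_invoice_reply('{"supplier": "Acme Srl", "amount": "1.220,00"}')
```

Records with needs_review set to True would be held back from automatic insertion into the management system.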
Scanned contracts and legal archives: the impact for law firms
For law firms and corporate legal departments, a significant portion of document work involves historical contracts never natively digitized — paper archives of years or decades of activity. Searching these archives, extracting specific clauses or building contractual timelines were exclusively manual activities.
Opus 4.7 with 98.5% visual acuity makes automated analysis of these scanned archives viable. Combined with the 1 million token context window, a long scanned contract — even dozens of pages — can be analyzed entirely in a single call, with relevant clauses extracted and categorized.
The most immediate use cases for the legal sector include: reviewing historical contracts to identify clauses requiring updates (for regulatory alignment, for example), building clause databases from historical archives, searching for contractual precedents on specific conditions, and due diligence on acquisition target archives. For a specific look at using Claude in the legal sector, the Claude AI for law firms article is the most complete reference.
A critical element to consider: the 98.5% visual acuity is a benchmark on a test dataset. Contracts with particularly irregular writing, faded ink or physical damage may produce lower results. Validation on your specific archive before operational deployment remains necessary.
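Validation on your own archive can be as simple as comparing the model's extractions with a hand-labelled sample. The sketch below computes per-field accuracy using exact matching after light normalization; the field names, the normalization rule and the function names are assumptions for this example, and real contracts may need fuzzier matching.

```python
def _norm(value):
    """Normalize a value for comparison: trim, lowercase, None -> ''."""
    return "" if value is None else str(value).strip().lower()


def field_accuracy(predictions, ground_truth):
    """Share of exact matches per field over a hand-labelled sample.

    predictions and ground_truth are parallel lists of dicts;
    only fields present in the ground truth are scored.
    """
    totals, hits = {}, {}
    for pred, truth in zip(predictions, ground_truth):
        for field, expected in truth.items():
            totals[field] = totals.get(field, 0) + 1
            if _norm(pred.get(field)) == _norm(expected):
                hits[field] = hits.get(field, 0) + 1
    return {field: hits.get(field, 0) / totals[field] for field in totals}


# Hypothetical two-contract sample, for illustration only.
labels = [
    {"party": "Alfa SpA", "signed": "1998-05-12"},
    {"party": "Beta Srl", "signed": "2003-11-30"},
]
extracted = [
    {"party": "alfa spa ", "signed": "1998-05-12"},
    {"party": "Beta Srl", "signed": "2003-11-03"},
]
scores = field_accuracy(extracted, labels)
```

Here the party name is read correctly in both contracts while the signing date is wrong in one, so the two fields score 1.0 and 0.5 respectively. A per-field breakdown like this shows where the archive departs from the benchmark.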
Technical diagrams, chemical structures and blueprints: specialized applications
Anthropic explicitly identifies three categories of specialized visual documents as primary use cases for Opus 4.7's vision capabilities: chemical structures, technical diagrams and engineering blueprints.
Chemical structures are graphical representations of molecules and compounds that follow standardized drawing conventions (Lewis structures, skeletal formulas) and can be transcribed into text notations such as SMILES. Correctly interpreting a chemical structure requires not just reading the text, but understanding the topology of the molecular graph, a task that requires genuine visual understanding rather than OCR alone. For pharmaceutical, chemical or scientific research companies, this capability opens automation possibilities for tasks that previously did not scale.
Technical diagrams — P&ID (Piping and Instrumentation Diagrams), electrical schemas, process flow diagrams — are fundamental in manufacturing, energy and utilities industries. The ability to automatically interpret them translates into possibilities for comparative analysis, automatic documentation updates and compliance verification.
Architectural and engineering blueprints are relevant in real estate, construction and plant engineering. Automatic extraction of floor areas, technical specifications and layouts from project documents is a task with a broad application market.
For all these specialized use cases, Opus 4.7 is the technical starting point — but the real value depends on integration with existing systems and processes. Maverick AI supports the design of these workflows for industrial, legal and pharmaceutical sectors.
Limitations and validation: what cannot yet be taken for granted
The 98.5% accuracy on XBOW is a relevant result, but it must be interpreted correctly. Every benchmark tests a specific dataset — and the real documents your organization processes may differ from the benchmark dataset in format, scan quality, language, layout and content type.
Factors that can reduce accuracy below the benchmark: low scan quality (below 150 DPI is problematic), documents with faded ink or physical damage, handwriting with very irregular script, complex tabular layouts with many merged cells, documents in languages with non-Latin writing systems. In these cases, actual accuracy may be below 98.5%.
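The scan-quality factor is the easiest one to check automatically. The sketch below estimates the effective DPI of a scan from its pixel dimensions, assuming the page is A4 (210 x 297 mm), and compares it with the 150 DPI figure mentioned above. The function names and the A4 assumption are illustrative; scans of other page sizes need their own dimensions.

```python
MM_PER_INCH = 25.4
A4_MM = (210, 297)


def estimated_dpi(width_px, height_px, page_mm=A4_MM):
    """Estimate scan resolution from pixel size, assuming a known page size.

    Orientation is ignored: short side is matched to short side.
    The result is rounded because pixel counts are truncated by scanners.
    """
    short_px, long_px = sorted((width_px, height_px))
    short_in, long_in = sorted(side / MM_PER_INCH for side in page_mm)
    return round(min(short_px / short_in, long_px / long_in))


def is_low_quality(width_px, height_px, min_dpi=150):
    """True when the estimated resolution falls below the threshold."""
    return estimated_dpi(width_px, height_px) < min_dpi


good_scan = estimated_dpi(2480, 3508)      # typical 300 DPI A4 scan
poor_scan = is_low_quality(1000, 1414)     # roughly 121 DPI
```

Scans that fail this check are candidates for rescanning or for direct routing to human review.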
The operational suggestion is to build a pipeline with confidence scores: Opus 4.7 can be prompted to return not only the extracted output but also an estimate of its confidence in the interpretation. A self-reported estimate of this kind is not guaranteed to be calibrated, so any threshold should be checked against a hand-verified sample. Low-confidence documents should be routed to human review; high-confidence ones can proceed automatically. This hybrid approach is more robust than full automation on all documents.
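A minimal sketch of this routing step, assuming the model has been asked to include a per-field "confidence" object with values between 0 and 1 in its JSON reply. That reply shape, the 0.9 threshold and the function name are assumptions for illustration; they are not a documented feature of the model.

```python
def route(extraction, threshold=0.9):
    """Return 'auto' or 'human_review' based on the weakest field confidence.

    Missing or malformed confidence values always go to human review.
    """
    confidences = extraction.get("confidence")
    if not isinstance(confidences, dict) or not confidences:
        return "human_review"
    values = list(confidences.values())
    for value in values:
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return "human_review"
        if not 0 <= value <= 1:
            return "human_review"
    return "auto" if min(values) >= threshold else "human_review"


# Hypothetical extractions, for illustration only.
clear_doc = {"fields": {"amount": 1220.0},
             "confidence": {"amount": 0.99, "date": 0.97}}
faded_doc = {"fields": {"amount": 1220.0},
             "confidence": {"amount": 0.62, "date": 0.95}}
decisions = [route(clear_doc), route(faded_doc), route({"fields": {}})]
```

Routing on the weakest field rather than the average is a deliberately conservative choice: a single unreadable amount is enough to send the whole document to a person.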
For organizations evaluating this type of workflow, the starting point is an audit of the typical documents to be processed — type, volume, average scan quality — before designing the architecture. If you are evaluating Claude for document processing in your organization, the Maverick AI team can guide you through the technical evaluation.