How Centara's multi-model OCR pipeline transforms unstructured documents into fully structured, routable electronic invoices — automatically.
AI Extraction Pipeline
Centara’s OCR engine automatically transforms unstructured inbound email attachments into fully structured, routable electronic documents. Powered by a coordinated ensemble of AI models, the pipeline combines document intelligence, large language models, and vision AI to achieve extraction accuracy that no single model can match.
Note: Structured formats (PEPPOL BIS 3.0, EDI, XML) bypass this pipeline entirely — their fields are parsed directly from the message. The AI pipeline handles PDFs, scans, email attachments, and other unstructured inputs.
Pipeline overview
| Step | Model | Role |
|---|---|---|
| 1. Pre-screening | Mistral Medium | Fast classification and deduplication |
| 2. Structural extraction | Azure Document Intelligence | Reliable baseline with per-field confidence scores |
| 3. PDF rendering | Mistral OCR | High-fidelity text and layout extraction to markdown |
| 4. Deep understanding | Claude Opus 4.6 | Nuanced extraction, vendor discovery, classification |
| 5. Fallback extraction | Mistral Medium | Safety net if primary extraction fails |
| 6. Vendor resolution | Priority-based engine | Identity matching across multiple sources |
| 7. Conversion & routing | Centara pipeline | Structured document enters standard processing |
Step 1: Pre-screening & smart deduplication
Before any expensive AI call is made, every incoming attachment is fingerprinted with a SHA-256 hash and checked against a registry of known documents.
- Previously-rejected files — logos, signatures, letterheads — are instantly discarded without touching an API
- New documents pass through Mistral Medium for rapid classification: Is this a processable invoice? A multi-page bundle? Or noise to be filtered?
- Multi-document PDFs are automatically detected and split by page range using Gotenberg, creating individual processing records for each invoice within the bundle
This step eliminates wasted processing on non-invoice content and catches duplicates before they enter the pipeline.
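The fingerprint check described above can be sketched as follows. This is a minimal illustration, not Centara's implementation: the `FingerprintRegistry` class, its in-memory dictionary, and the outcome labels are all assumptions standing in for whatever persistent registry the pipeline actually uses.

```python
import hashlib
from typing import Dict, Optional


class FingerprintRegistry:
    """In-memory stand-in for the registry of known documents (hypothetical)."""

    def __init__(self) -> None:
        # Maps SHA-256 hex digest -> earlier outcome, e.g. "rejected" or "processed".
        self._seen: Dict[str, str] = {}

    @staticmethod
    def fingerprint(data: bytes) -> str:
        """Return the SHA-256 hex digest of the raw attachment bytes."""
        return hashlib.sha256(data).hexdigest()

    def lookup(self, data: bytes) -> Optional[str]:
        """Return the recorded outcome for these bytes, or None if unseen."""
        return self._seen.get(self.fingerprint(data))

    def record(self, data: bytes, outcome: str) -> None:
        """Remember the outcome so identical bytes are recognized next time."""
        self._seen[self.fingerprint(data)] = outcome


def needs_classification(registry: FingerprintRegistry, attachment: bytes) -> bool:
    """Return True only when the attachment has never been seen before."""
    return registry.lookup(attachment) is None
```

Because the hash is computed over the raw bytes, this kind of check only catches byte-identical files; a re-saved or re-scanned copy of the same invoice would produce a different fingerprint.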
Step 2: Azure Document Intelligence — structural extraction
Every document is submitted to Azure Document Intelligence (prebuilt-invoice model), which extracts structured fields with per-field confidence scores:
- Vendor identity and customer details
- Line items with descriptions, quantities, and amounts
- Tax amounts and categories
- Dates (invoice date, due date, delivery date)
- Addresses and contact information
This layer provides a reliable, always-on structural baseline that downstream models build on. Even when other models improve or change, Azure Document Intelligence ensures a consistent extraction foundation.
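One way downstream code might use per-field confidence scores is to separate fields that can be trusted from those that need a second opinion. The sketch below assumes a simplified field shape of `{"value": ..., "confidence": float}` and a threshold of 0.8; both are illustrative choices, not the Azure SDK's response types or a documented Centara setting.

```python
from typing import Any, Dict, Tuple

# Hypothetical simplified shape: {"value": ..., "confidence": float}
Field = Dict[str, Any]


def split_by_confidence(
    fields: Dict[str, Field], threshold: float = 0.8
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Partition extracted fields into (trusted, needs_review) by confidence."""
    trusted: Dict[str, Any] = {}
    needs_review: Dict[str, Any] = {}
    for name, field in fields.items():
        # A missing confidence counts as 0.0 so the field is never silently trusted.
        confidence = field.get("confidence", 0.0)
        bucket = trusted if confidence >= threshold else needs_review
        bucket[name] = field.get("value")
    return trusted, needs_review
```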
Step 3: Mistral OCR — PDF-to-markdown conversion
In parallel with the Azure step, the PDF is passed through Mistral OCR, which renders the document into rich, page-structured markdown, preserving the layout, tables, and text formatting that matter most for invoice understanding.
- When pre-screening has already produced markdown for the document, that cached output is reused, eliminating redundant processing
- Table structures, column alignment, and hierarchical content are preserved
- Multi-page documents maintain page boundaries
If Mistral OCR encounters an issue, the pipeline gracefully falls back to the Azure-extracted JSON, ensuring extraction continues without interruption.
Step 4: Claude Opus 4.6 — the intelligence layer
Claude is the heart of the extraction engine. It receives the Mistral markdown pages alongside the Azure-structured data and produces a unified, deeply structured extraction.
What Claude extracts
- Document type classification: Invoice, Credit Memo, Sales Order, Customs List, Delivery Note
- All header fields: dates, invoice numbers, amounts, currencies, payment terms
- Full line-item breakdown: per-line quantity, unit price, tax rates, descriptions
- Vendor identity discovery: company registration numbers, VAT numbers, and national IDs extracted from the document itself, not just from a known database
Why Claude is the primary extractor
Claude’s output is the primary extraction result. Unlike template-based or rule-based extraction systems, Claude:
- Understands document context and intent, not just field positions
- Handles invoices in any language without language-specific configuration
- Interprets ambiguous fields (e.g., distinguishing a PO reference from an invoice number)
- Normalizes vendor names and maps them to Business Central entities
- Adapts to new invoice layouts without retraining
Cost and reliability: Prompt caching keeps costs low on repeated document patterns, and automatic retry with exponential backoff handles transient API errors transparently.
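Retry with exponential backoff can be sketched generically as below. The `TransientError` class, the default delays, and the attempt limit are assumptions for illustration; the random jitter is a common addition that the text above does not mention, and a real client would map its own timeout and rate-limit exceptions onto the retryable case.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, rate limits)."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on TransientError with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay ceiling doubles each attempt: 0.5s, 1s, 2s, ... up to max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Jitter (an addition here) keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the behaviour can be exercised in tests without actually waiting.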
Step 5: Intelligent fallback — Mistral Medium
If Claude’s JSON output is unparseable for any reason, Mistral Medium steps in as a fallback extraction layer — independently parsing the same source material and returning a valid structured result.
No document gets stuck due to a single model hiccup. The fallback is transparent: the document enters the same downstream pipeline regardless of which model produced the final extraction.
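The fallback decision itself is simple to express. In this sketch the two callables stand in for the primary and fallback model calls (their names and the returned producer labels are hypothetical), and the only trigger is the one described above: output that does not parse as a JSON object.

```python
import json
from typing import Any, Callable, Dict, Tuple


def extract_with_fallback(
    primary: Callable[[], str],
    fallback: Callable[[], str],
) -> Tuple[Dict[str, Any], str]:
    """Return (extraction, producer), using the fallback if primary JSON is invalid."""
    for producer, call in (("primary", primary), ("fallback", fallback)):
        try:
            parsed = json.loads(call())
        except json.JSONDecodeError:
            continue  # unparseable output: try the next model
        if isinstance(parsed, dict):
            return parsed, producer
    raise ValueError("neither model returned a parseable JSON object")
```

Returning the producer label alongside the result lets the caller log which model supplied the extraction while passing the same structure downstream either way.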
Step 6: Vendor resolution & party discovery
With extraction complete, a priority-based identity resolution engine identifies the sending vendor:
| Priority | Source | Confidence |
|---|---|---|
| 1 | LLM-discovered company ID | Highest |
| 2 | LLM-discovered VAT number | High |
| 3 | Azure-extracted identifiers | Medium |
| 4 | Vendor name match | Lower |
| 5 | Email domain pseudo-identifier | Last resort |
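
The priority order in the table above amounts to trying each source in turn and stopping at the first one that resolves to a known vendor. A minimal sketch, assuming hypothetical source labels and a caller-supplied `lookup` function in place of the real vendor directory:

```python
from typing import Callable, Dict, List, Optional, Tuple

# Ordered from most to least trusted, mirroring the priority table.
# These key names are illustrative labels, not Centara field names.
PRIORITY: List[str] = [
    "llm_company_id",
    "llm_vat_number",
    "azure_identifier",
    "vendor_name",
    "email_domain",
]


def resolve_vendor(
    candidates: Dict[str, Optional[str]],
    lookup: Callable[[str, str], Optional[str]],
) -> Optional[Tuple[str, str]]:
    """Return (vendor_id, matched_source) for the first source that resolves."""
    for source in PRIORITY:
        value = candidates.get(source)
        if not value:
            continue  # this source produced nothing for the document
        vendor_id = lookup(source, value)
        if vendor_id is not None:
            return vendor_id, source
    return None
```

The matched source is returned with the vendor ID so that the resolution path can be written to the audit trail.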
Country-specific logic
The vendor resolution engine includes built-in support for national identifier schemes:
- Iceland: kennitala parsing (six digits, a hyphen, four digits), distinguishing company IDs from personal IDs, which share the same format
- Sweden: organisationsnummer format recognition
- Norway, Denmark, Finland: national organization number schemes
- Other countries: handled via standard VAT number and company registration formats
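
For the Icelandic case, a minimal sketch might look like the following. It relies on the publicly documented kennitala convention in which legal entities have 40 added to the day-of-month digits, so a company ID begins with a day value of 41 to 71 while a personal ID begins with 01 to 31. The function name and return labels are illustrative, and check-digit validation is deliberately omitted.

```python
import re
from typing import Optional

# Six digits, an optional hyphen, then four digits.
_KENNITALA = re.compile(r"([0-9]{6})-?([0-9]{4})")


def classify_kennitala(raw: str) -> Optional[str]:
    """Return "company", "person", or None if the text is not kennitala-shaped."""
    match = _KENNITALA.fullmatch(raw.strip())
    if match is None:
        return None
    day = int(match.group(1)[:2])
    if 41 <= day <= 71:
        return "company"  # legal entities carry the day-of-month plus 40
    if 1 <= day <= 31:
        return "person"
    return None  # day digits fall outside both ranges
```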
Step 7: Conversion & routing
The resolved invoice is converted to a structured Invoice or Order record — complete with line items, tax categories, and validated totals — and enters the standard Centara routing pipeline:
- Smart routing via Vendor Posting Setup assigns GL accounts, dimensions, and cost centres
- Purchase order matching compares against open POs
- Approval workflow routes for review (or auto-approves if rules are met)
- Auto-post to Business Central
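
The text does not say how totals are validated during conversion. One plausible check, shown here purely as an assumption, is to recompute the total from the line items and compare it with the stated header total within a small rounding tolerance. The tuple layout, the function name, and the tolerance are all illustrative.

```python
from decimal import Decimal
from typing import Iterable, Tuple


def totals_match(
    lines: Iterable[Tuple[str, str, str]],
    stated_total: str,
    tolerance: str = "0.01",
) -> bool:
    """Check that sum(quantity * unit_price + tax) matches the header total.

    Each line is (quantity, unit_price, tax_amount), given as decimal strings
    so that no binary floating-point rounding enters the comparison.
    """
    computed = sum(
        (Decimal(qty) * Decimal(price) + Decimal(tax) for qty, price, tax in lines),
        Decimal("0"),
    )
    return abs(computed - Decimal(stated_total)) <= Decimal(tolerance)
```

Using `Decimal` rather than `float` avoids spurious mismatches caused by binary rounding of currency amounts.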
Audit trail
Every step is recorded in a rich audit trail of timestamped events:
- Pre-screening result (accepted/rejected/split, with reason)
- Azure Document Intelligence confidence scores per field
- Mistral OCR processing status
- Claude extraction result (full JSON)
- Fallback activation (if triggered)
- Vendor resolution path (which source matched)
- Conversion result
This gives full traceability from raw email attachment to delivered document — essential for compliance and debugging.
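An audit trail of this kind can be modelled as an append-only list of timestamped events. The schema below (field names, step labels, JSON-lines export) is a hypothetical sketch and not Centara's actual event format.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass(frozen=True)
class AuditEvent:
    """One timestamped pipeline event (hypothetical schema)."""

    document_id: str
    step: str
    outcome: str
    detail: Dict[str, Any] = field(default_factory=dict)
    # UTC timestamp in ISO 8601 form, set when the event is created.
    at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


class AuditTrail:
    """Append-only collection of events, kept in recording order."""

    def __init__(self) -> None:
        self._events: List[AuditEvent] = []

    def record(self, document_id: str, step: str, outcome: str, **detail: Any) -> AuditEvent:
        event = AuditEvent(document_id, step, outcome, detail)
        self._events.append(event)
        return event

    def for_document(self, document_id: str) -> List[AuditEvent]:
        """Return every event for one document, oldest first."""
        return [e for e in self._events if e.document_id == document_id]

    def to_json_lines(self) -> str:
        """Serialize all events as one JSON object per line."""
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self._events)
```

For example, `trail.record("doc-1", "vendor_resolution", "matched", source="llm_vat_number")` would capture which source matched, corresponding to the vendor resolution path listed above.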
Why multi-model?
No single AI model is best at everything:
| Challenge | Best model |
|---|---|
| Fast classification and filtering | Mistral Medium |
| Reliable structural field extraction | Azure Document Intelligence |
| High-fidelity text and layout from PDFs | Mistral OCR |
| Nuanced understanding and vendor discovery | Claude Opus 4.6 |
By combining purpose-built document AI with frontier LLMs, each covering the others’ blind spots, the pipeline achieves extraction quality that’s robust across invoice formats, languages, and edge cases.
The result: more invoices processed automatically, fewer exceptions requiring human review, and a system whose extraction quality can improve as the underlying models improve.
Supported input formats
| Source | Format | How it enters |
|---|---|---|
| Email attachment | PDF, image (PNG/JPG/TIFF) | Forwarded to Centara inbox address |
| Portal upload | PDF, image | Uploaded via CentaraIQ portal |
| Scanned paper | PDF (from scanner) | Uploaded or emailed |
| E-commerce | Structured order data | API from Shopify, WooCommerce |
| API | XML, JSON | Direct push to CentaraIQ API |