How Centara's multi-model OCR pipeline transforms unstructured documents into fully structured, routable electronic invoices — automatically.

AI Extraction Pipeline

Centara’s OCR engine automatically transforms unstructured inbound email attachments into fully structured, routable electronic documents. Powered by a coordinated ensemble of AI models, the pipeline combines document intelligence, large language models, and vision AI to reach higher extraction accuracy than any single model delivers on its own.

Note: Structured formats (PEPPOL BIS 3.0, EDI, XML) bypass this pipeline entirely — their fields are parsed directly from the message. The AI pipeline handles PDFs, scans, email attachments, and other unstructured inputs.

Pipeline overview

Step | Model | Role
1. Pre-screening | Mistral Medium | Fast classification and deduplication
2. Structural extraction | Azure Document Intelligence | Reliable baseline with per-field confidence scores
3. PDF rendering | Mistral OCR | High-fidelity text and layout extraction to markdown
4. Deep understanding | Claude Opus 4.6 | Nuanced extraction, vendor discovery, classification
5. Fallback extraction | Mistral Medium | Safety net if primary extraction fails
6. Vendor resolution | Priority-based engine | Identity matching across multiple sources
7. Conversion & routing | Centara pipeline | Structured document enters standard processing

Step 1: Pre-screening & smart deduplication

Before any expensive AI call is made, every incoming attachment is fingerprinted with a SHA-256 hash and checked against a registry of known documents.

  • Previously rejected files (logos, signatures, letterheads) are discarded instantly without touching an API
  • New documents pass through Mistral Medium for rapid classification: Is this a processable invoice? A multi-page bundle? Or noise to be filtered?
  • Multi-document PDFs are automatically detected and split by page range using Gotenberg, creating individual processing records for each invoice within the bundle

This step eliminates wasted processing on non-invoice content and catches duplicates before they enter the pipeline.
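
As a minimal sketch of the fingerprint check described above: the registry here is a plain in-memory dictionary and the verdict strings are illustrative assumptions, not Centara's actual storage or vocabulary. Only the use of SHA-256 comes from the description.

```python
import hashlib

# Hypothetical in-memory registry: SHA-256 hex digest -> earlier verdict.
# A real deployment would presumably use a persistent store instead.
known_documents: dict[str, str] = {}


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw attachment bytes."""
    return hashlib.sha256(data).hexdigest()


def prescreen(data: bytes) -> str:
    """Return 'rejected', 'duplicate', or 'new' without calling any model."""
    verdict = known_documents.get(fingerprint(data))
    if verdict == "rejected":
        return "rejected"   # e.g. a logo or signature that was seen before
    if verdict is not None:
        return "duplicate"  # an identical file was already accepted
    return "new"            # unseen bytes: hand over to the classifier


def record_verdict(data: bytes, verdict: str) -> None:
    """Remember the outcome so identical bytes are recognised next time."""
    known_documents[fingerprint(data)] = verdict
```

Because the hash is computed over the raw bytes, only byte-identical files are caught this way; a re-scanned or re-exported copy of the same invoice would produce a different digest.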

Step 2: Azure Document Intelligence — structural extraction

Every document is submitted to Azure Document Intelligence (prebuilt-invoice model), which extracts structured fields with per-field confidence scores:

  • Vendor identity and customer details
  • Line items with descriptions, quantities, and amounts
  • Tax amounts and categories
  • Dates (invoice date, due date, delivery date)
  • Addresses and contact information

This layer provides a reliable, always-on structural baseline that downstream models build on. Even when other models improve or change, Azure Document Intelligence ensures a consistent extraction foundation.
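
One way downstream steps might use the per-field confidence scores is to flag fields that need a closer look. The sketch below is an assumption about usage, not a description of Centara's logic: the `ExtractedField` type and the 0.8 threshold are hypothetical, and no Azure SDK call is shown.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # 0.0 (no confidence) to 1.0 (certain)


def low_confidence(fields: list[ExtractedField], threshold: float = 0.8) -> list[str]:
    """Return the names of fields whose confidence falls below the threshold."""
    return [f.name for f in fields if f.confidence < threshold]


fields = [
    ExtractedField("VendorName", "Acme ehf.", 0.97),
    ExtractedField("InvoiceDate", "2024-03-01", 0.91),
    ExtractedField("DueDate", "2024-03-31", 0.42),
]
flagged = low_confidence(fields)
```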

Step 3: Mistral OCR — PDF-to-markdown conversion

In parallel with Step 2, the PDF is passed through Mistral OCR, which renders the document into rich, page-structured markdown that preserves the layout, tables, and text formatting that matter most for invoice understanding.

  • The markdown output is cached from the pre-screening phase when available, eliminating redundant processing
  • Table structures, column alignment, and hierarchical content are preserved
  • Multi-page documents maintain page boundaries

If Mistral OCR encounters an issue, the pipeline gracefully falls back to the Azure-extracted JSON, ensuring extraction continues without interruption.
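
The fallback described above can be sketched as follows. The OCR call is passed in as a plain callable because the real client is not shown in this document; `build_llm_context` and the source labels are hypothetical names.

```python
import json
from typing import Callable


def build_llm_context(
    pdf: bytes,
    azure_fields: dict,
    run_ocr: Callable[[bytes], str],
) -> tuple[str, str]:
    """Return (source, text): OCR markdown when available, else Azure JSON."""
    try:
        markdown = run_ocr(pdf)
        if markdown.strip():
            return "mistral_ocr", markdown
    except Exception:
        # Any OCR failure is treated as recoverable in this sketch;
        # a real pipeline would likely log it to the audit trail.
        pass
    return "azure_json", json.dumps(azure_fields, ensure_ascii=False, indent=2)
```

Returning the source label alongside the text lets later steps record which input the extraction was actually based on.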

Step 4: Claude Opus 4.6 — the intelligence layer

The heart of the extraction engine. Claude receives the Mistral markdown pages alongside the Azure-structured data and produces a unified, deeply structured extraction:

What Claude extracts

  • Invoice type classification — Invoice, Credit Memo, Sales Order, Customs List, Delivery Note
  • All header fields — dates, invoice numbers, amounts, currencies, payment terms
  • Full line-item breakdown — per-line quantity, unit price, tax rates, descriptions
  • Vendor identity discovery — company registration numbers, VAT numbers, and national IDs extracted from the document itself, not just from a known database

Why Claude is the primary extractor

Claude’s output is the primary extraction result. Unlike template-based or rule-based extraction systems, Claude:

  • Understands document context and intent, not just field positions
  • Handles invoices in any language without language-specific configuration
  • Interprets ambiguous fields (e.g., distinguishing a PO reference from an invoice number)
  • Normalizes vendor names and maps them to Business Central entities
  • Adapts to new invoice layouts without retraining

Cost optimization: Prompt caching keeps costs low on repeated document patterns. Automatic retry with exponential backoff handles transient API issues transparently.
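
Retry with exponential backoff, as mentioned above, typically looks something like the sketch below. The `TransientError` class, the attempt limit, the base delay, and the jitter are all illustrative assumptions; the actual retry policy is not documented here.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, rate limits)."""


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call`, doubling the wait after each transient failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, ...
            # Random jitter spreads out retries from concurrent workers.
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists only so the function can be exercised without real waiting; in normal use the default `time.sleep` applies.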

Step 5: Intelligent fallback — Mistral Medium

If Claude’s JSON output is unparseable for any reason, Mistral Medium steps in as a fallback extraction layer — independently parsing the same source material and returning a valid structured result.

No document gets stuck due to a single model hiccup. The fallback is transparent: the document enters the same downstream pipeline regardless of which model produced the final extraction.
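
A minimal sketch of this parse-or-fall-back behaviour, with both models represented by plain callables that return raw text. The function name and the "primary"/"fallback" labels are hypothetical.

```python
import json
from typing import Callable


def extract_with_fallback(
    primary: Callable[[], str],
    fallback: Callable[[], str],
) -> tuple[dict, str]:
    """Return (extraction, producer); use the fallback if the primary
    output is not a parseable JSON object."""
    raw = primary()
    try:
        parsed = json.loads(raw)
        if isinstance(parsed, dict):
            return parsed, "primary"
    except json.JSONDecodeError:
        pass  # malformed or truncated output: try the fallback model
    # If the fallback output is also malformed, the error propagates.
    return json.loads(fallback()), "fallback"
```

Both paths return the same shape, so downstream steps do not need to know which model produced the result; the label is kept only for the audit trail.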

Step 6: Vendor resolution & party discovery

With extraction complete, a priority-based identity resolution engine identifies the sending vendor:

Priority | Source | Confidence
1 | LLM-discovered company ID | Highest
2 | LLM-discovered VAT number | High
3 | Azure-extracted identifiers | Medium
4 | Vendor name match | Lower
5 | Email domain pseudo-identifier | Last resort
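
The priority order in the table can be sketched as a simple first-match loop. The source keys and the `lookup` callable are hypothetical stand-ins for whatever vendor directory the engine queries.

```python
from typing import Callable, Optional

# Sources in the priority order from the table; the key names are hypothetical.
PRIORITY = [
    "llm_company_id",
    "llm_vat_number",
    "azure_identifier",
    "vendor_name",
    "email_domain",
]


def resolve_vendor(
    candidates: dict[str, str],
    lookup: Callable[[str, str], Optional[str]],
) -> Optional[tuple[str, str]]:
    """Return (vendor_id, matched_source) for the highest-priority hit."""
    for source in PRIORITY:
        value = candidates.get(source)
        if not value:
            continue  # nothing was extracted for this source
        vendor_id = lookup(source, value)
        if vendor_id is not None:
            return vendor_id, source
    return None  # unresolved: presumably routed to manual review
```

Returning the matched source together with the vendor makes it straightforward to record the resolution path.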

Country-specific logic

The vendor resolution engine includes built-in support for national identifier schemes:

  • Iceland: kennitala parsing. Company and personal IDs share the same shape (6 digits, optional hyphen, 4 digits); company IDs are recognized by a leading digit of 4 to 7, because 40 is added to the day component
  • Sweden: organisationsnummer format recognition
  • Norway, Denmark, Finland: national organization number schemes
  • Other countries: handled via standard VAT number and company registration formats
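
As an illustration of the Icelandic case, the sketch below separates company from personal kennitala using the convention that company IDs add 40 to the day component, so the first two digits fall between 41 and 71. It deliberately omits month and check-digit validation, and the function name is hypothetical.

```python
import re
from typing import Optional

# Six digits, an optional hyphen, then four digits.
_KENNITALA = re.compile(r"^(\d{6})-?(\d{4})$")


def classify_kennitala(raw: str) -> Optional[str]:
    """Return 'company', 'person', or None if the shape is not a kennitala."""
    match = _KENNITALA.match(raw.strip())
    if match is None:
        return None
    day = int(match.group(1)[:2])
    if 41 <= day <= 71:
        return "company"  # day of month plus 40
    if 1 <= day <= 31:
        return "person"
    return None  # the day component is out of range for both kinds
```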

Step 7: Conversion & routing

The resolved invoice is converted to a structured Invoice or Order record — complete with line items, tax categories, and validated totals — and enters the standard Centara routing pipeline:

  1. Smart routing via Vendor Posting Setup assigns GL accounts, dimensions, and cost centres
  2. Purchase order matching compares against open POs
  3. Approval workflow routes for review (or auto-approves if rules are met)
  4. Auto-post to Business Central
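
The "validated totals" mentioned for the conversion step could be checked along these lines before a record enters routing. This is an assumption about what validation means here: the tuple layout, the one-cent tolerance, and the function name are all hypothetical.

```python
from decimal import Decimal


def totals_match(
    lines: list[tuple[Decimal, Decimal]],  # (quantity, unit price) per line
    tax_total: Decimal,
    invoice_total: Decimal,
    tolerance: Decimal = Decimal("0.01"),
) -> bool:
    """Check that line amounts plus tax equal the stated invoice total."""
    net = sum((qty * price for qty, price in lines), Decimal("0"))
    return abs(net + tax_total - invoice_total) <= tolerance


lines = [(Decimal("2"), Decimal("10.00")), (Decimal("1"), Decimal("5.50"))]
ok = totals_match(lines, Decimal("6.12"), Decimal("31.62"))
```

`Decimal` is used instead of `float` so that currency amounts compare exactly rather than accumulating binary rounding error.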

Audit trail

Every step is recorded in a rich audit trail of timestamped events:

  • Pre-screening result (accepted/rejected/split, with reason)
  • Azure Document Intelligence confidence scores per field
  • Mistral OCR processing status
  • Claude extraction result (full JSON)
  • Fallback activation (if triggered)
  • Vendor resolution path (which source matched)
  • Conversion result

This gives full traceability from raw email attachment to delivered document — essential for compliance and debugging.
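
A timestamped audit event of the kind listed above might be modelled like this. The field names, step labels, and in-memory list are illustrative assumptions rather than Centara's schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditEvent:
    document_id: str
    step: str      # e.g. "pre_screening", "fallback", "vendor_resolution"
    outcome: str   # e.g. "accepted", "rejected", "triggered"
    detail: dict = field(default_factory=dict)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class AuditTrail:
    """Append-only list of events; a real system would persist them."""

    def __init__(self) -> None:
        self.events: list[AuditEvent] = []

    def record(self, document_id: str, step: str, outcome: str, **detail) -> AuditEvent:
        event = AuditEvent(document_id, step, outcome, detail)
        self.events.append(event)
        return event

    def for_document(self, document_id: str) -> list[AuditEvent]:
        """Return one document's events in the order they were recorded."""
        return [e for e in self.events if e.document_id == document_id]
```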

Why multi-model?

No single AI model is best at everything:

Challenge | Best model
Fast classification and filtering | Mistral Medium
Reliable structural field extraction | Azure Document Intelligence
High-fidelity text and layout from PDFs | Mistral OCR
Nuanced understanding and vendor discovery | Claude Opus 4.6

By combining purpose-built document AI with frontier LLMs — each covering the other’s blind spots — the pipeline achieves extraction quality that’s robust across invoice formats, languages, and edge cases.

The result: more invoices processed automatically, fewer exceptions requiring human review, and a system that gets smarter as the underlying models improve.

Supported input formats

Source | Format | How it enters
Email attachment | PDF, image (PNG/JPG/TIFF) | Forwarded to Centara inbox address
Portal upload | PDF, image | Uploaded via CentaraIQ portal
Scanned paper | PDF (from scanner) | Uploaded or emailed
E-commerce | Structured order data | API from Shopify, WooCommerce
API | XML, JSON | Direct push to CentaraIQ API