
How Centara's multi-model OCR pipeline transforms unstructured documents into fully structured, routable electronic invoices — automatically.

AI Extraction Pipeline

Centara’s OCR engine automatically transforms unstructured inbound email attachments into fully structured, routable electronic documents. The pipeline coordinates an ensemble of AI models (document intelligence, large language models, and vision AI) to reach extraction accuracy that no single model achieves on its own.

Note: Structured formats (PEPPOL BIS 3.0, EDI, XML) bypass this pipeline entirely — their fields are parsed directly from the message. The AI pipeline handles PDFs, scans, email attachments, and other unstructured inputs.

Pipeline overview

Step	Model	Role
1. Pre-screening	Mistral Medium	Fast classification and deduplication
2. Structural extraction	Azure Document Intelligence	Reliable baseline with per-field confidence scores
3. PDF rendering	Mistral OCR	High-fidelity text and layout extraction to markdown
4. Deep understanding	Claude Opus 4.6	Nuanced extraction, vendor discovery, classification
5. Fallback extraction	Mistral Medium	Safety net if primary extraction fails
6. Vendor resolution	Priority-based engine	Identity matching across multiple sources
7. Conversion & routing	Centara pipeline	Structured document enters standard processing

Step 1: Pre-screening & smart deduplication

Before any expensive AI call is made, every incoming attachment is fingerprinted with a SHA-256 hash and checked against a registry of known documents.

Previously rejected files (logos, signatures, letterheads) are discarded immediately, without calling any API
New documents pass through Mistral Medium for rapid classification: Is this a processable invoice? A multi-page bundle? Or noise to be filtered?
Multi-document PDFs are automatically detected and split by page range using Gotenberg, creating individual processing records for each invoice within the bundle

This step eliminates wasted processing on non-invoice content and catches duplicates before they enter the pipeline.
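The fingerprint check described above can be sketched as follows. This is a minimal illustration, not Centara's implementation: the `FingerprintRegistry` class, its in-memory dictionary, and the status strings are all assumptions made for the example. Only the use of SHA-256 comes from the text.

```python
import hashlib


class FingerprintRegistry:
    """Hypothetical in-memory registry; a real system would persist this."""

    def __init__(self) -> None:
        # Maps SHA-256 hex digest -> last known status (assumed values:
        # "rejected" for noise such as logos, anything else for seen documents).
        self._known: dict[str, str] = {}

    @staticmethod
    def fingerprint(data: bytes) -> str:
        """Return the SHA-256 hex digest of the raw attachment bytes."""
        return hashlib.sha256(data).hexdigest()

    def check(self, data: bytes) -> str:
        """Classify an attachment as 'rejected', 'duplicate', or 'new'."""
        status = self._known.get(self.fingerprint(data))
        if status == "rejected":
            return "rejected"
        if status is not None:
            return "duplicate"
        return "new"

    def record(self, data: bytes, status: str) -> None:
        """Remember the outcome for this exact byte content."""
        self._known[self.fingerprint(data)] = status
```

Because the hash is computed over the raw bytes, only byte-identical files match. A re-scanned copy of the same invoice would produce a different fingerprint and would still reach the classification model.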

Step 2: Azure Document Intelligence — structural extraction

Every document is submitted to Azure Document Intelligence (prebuilt-invoice model), which extracts structured fields with per-field confidence scores:

Vendor identity and customer details
Line items with descriptions, quantities, and amounts
Tax amounts and categories
Dates (invoice date, due date, delivery date)
Addresses and contact information

This layer provides a reliable, always-on structural baseline that downstream models build on. Even when other models improve or change, Azure Document Intelligence ensures a consistent extraction foundation.
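One way downstream code might use the per-field confidence scores is to flag fields that fall below a threshold. The sketch below assumes a simplified dictionary shape (`{"FieldName": {"value": ..., "confidence": ...}}`) and a threshold of 0.8. Both are illustrative assumptions, not the Azure SDK's response format or Centara's actual settings.

```python
from typing import Any


def low_confidence_fields(
    fields: dict[str, dict[str, Any]], threshold: float = 0.8
) -> list[str]:
    """Return the names of fields whose confidence is below the threshold.

    Fields with no confidence value are treated as confidence 0.0, so they
    are always flagged (an assumption chosen to fail safe).
    """
    return sorted(
        name
        for name, field in fields.items()
        if field.get("confidence", 0.0) < threshold
    )


# Hypothetical extraction result in the assumed shape.
sample = {
    "InvoiceId": {"value": "INV-1001", "confidence": 0.98},
    "DueDate": {"value": "2025-03-01", "confidence": 0.61},
    "VendorName": {"value": "Example ehf."},
}
flagged = low_confidence_fields(sample)
```

Here `flagged` contains `DueDate` and `VendorName`, the two fields a later model would be asked to look at more closely.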

Step 3: Mistral OCR — PDF-to-markdown conversion

In parallel with Step 2, the PDF is passed through Mistral OCR, which renders the document into rich, page-structured markdown, preserving the layout, tables, and text formatting that matter most for invoice understanding.

The markdown output is cached from the pre-screening phase when available, eliminating redundant processing
Table structures, column alignment, and hierarchical content are preserved
Multi-page documents maintain page boundaries

If Mistral OCR encounters an issue, the pipeline gracefully falls back to the Azure-extracted JSON, ensuring extraction continues without interruption.
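The fallback can be pictured as a small function that picks the text source for the next step. This is a sketch under assumptions: the function name, the `[page N]` markers, and the JSON serialization are hypothetical and only illustrate the "markdown if available, otherwise Azure JSON" rule.

```python
import json
from typing import Any, Optional


def build_llm_context(
    markdown_pages: Optional[list[str]], azure_fields: dict[str, Any]
) -> str:
    """Prefer page-structured markdown; fall back to the Azure JSON.

    `markdown_pages` is None or empty when OCR failed or returned nothing.
    """
    if markdown_pages:
        # Keep page boundaries explicit so multi-page documents stay separable.
        return "\n\n".join(
            f"[page {number}]\n{page}"
            for number, page in enumerate(markdown_pages, start=1)
        )
    # Fallback: serialize the structured fields deterministically.
    return json.dumps(azure_fields, sort_keys=True)
```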

Step 4: Claude Opus 4.6 — the intelligence layer

Claude is the heart of the extraction engine. It receives the Mistral markdown pages alongside the Azure-structured data and produces a single, deeply structured extraction:

What Claude extracts

Document type classification: Invoice, Credit Memo, Sales Order, Customs List, Delivery Note
All header fields: dates, invoice numbers, amounts, currencies, payment terms
Full line-item breakdown: per-line quantity, unit price, tax rates, descriptions
Vendor identity discovery: company registration numbers, VAT numbers, and national IDs extracted from the document itself, not just from a known database

Why Claude is the primary extractor

Claude’s output is the primary extraction result. Unlike template-based or rule-based extraction systems, Claude:

Understands document context and intent, not just field positions
Handles invoices in any language without language-specific configuration
Interprets ambiguous fields (e.g., distinguishing a PO reference from an invoice number)
Normalizes vendor names and maps them to Business Central entities
Adapts to new invoice layouts without retraining

Cost and reliability: Prompt caching keeps costs low on repeated document patterns, and automatic retry with exponential backoff handles transient API errors transparently.
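Retry with exponential backoff is a general technique, and a minimal version looks like the sketch below. The `TransientError` class, the default delays, and the 10% jitter are assumptions for illustration. The source does not state which errors are retried or what limits apply.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, rate limits)."""


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on TransientError with doubling delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2**attempt)
            # Small random jitter avoids many workers retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the function can be exercised in tests without actually waiting.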

Step 5: Intelligent fallback — Mistral Medium

If Claude’s JSON output is unparseable for any reason, Mistral Medium steps in as a fallback extraction layer — independently parsing the same source material and returning a valid structured result.

No document gets stuck due to a single model hiccup. The fallback is transparent: the document enters the same downstream pipeline regardless of which model produced the final extraction.
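The parse-or-fall-back behaviour can be sketched as below. The two extractor callables stand in for the primary and fallback models and are hypothetical. The sketch also assumes a valid extraction is a JSON object, which the source does not specify.

```python
import json
from typing import Any, Callable


def extract_with_fallback(
    source: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> tuple[dict[str, Any], str]:
    """Return (extraction, producer), where producer names the model used.

    The fallback runs only when the primary output is not parseable JSON
    or does not decode to an object.
    """
    try:
        parsed = json.loads(primary(source))
        if isinstance(parsed, dict):
            return parsed, "primary"
    except json.JSONDecodeError:
        pass  # Unparseable primary output: fall through to the fallback.
    # If the fallback output is also invalid, the error propagates here.
    return json.loads(fallback(source)), "fallback"
```

Returning the producer alongside the result lets the caller log which model produced the final extraction while sending both outcomes into the same downstream path.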

Step 6: Vendor resolution & party discovery

With extraction complete, a priority-based identity resolution engine identifies the sending vendor:

Priority	Source	Confidence
1	LLM-discovered company ID	Highest
2	LLM-discovered VAT number	High
3	Azure-extracted identifiers	Medium
4	Vendor name match	Lower
5	Email domain pseudo-identifier	Last resort
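The priority table above maps naturally onto an ordered lookup: try each source in turn and stop at the first match. In this sketch the source keys, the `lookup` dictionary, and the vendor codes are hypothetical stand-ins for whatever vendor master data the real engine consults.

```python
from typing import Optional

# Assumed source keys, listed in the priority order of the table above.
PRIORITY = [
    "llm_company_id",
    "llm_vat_number",
    "azure_identifier",
    "vendor_name",
    "email_domain",
]


def resolve_vendor(
    candidates: dict[str, Optional[str]],
    lookup: dict[tuple[str, str], str],
) -> Optional[tuple[str, str]]:
    """Return (vendor_code, matching_source) or None if nothing matches.

    `candidates` holds the identifier found per source, if any.
    `lookup` maps (source, identifier) to a known vendor code.
    """
    for source in PRIORITY:
        value = candidates.get(source)
        if not value:
            continue  # This source produced no identifier.
        vendor = lookup.get((source, value))
        if vendor:
            return vendor, source  # First match wins.
    return None
```

Returning the matching source with the vendor code is what makes it possible to record the resolution path later.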

Country-specific logic

The vendor resolution engine includes built-in support for national identifier schemes:

Iceland: kennitala parsing, distinguishing company IDs from personal IDs. Both use the 6-digit + hyphen + 4-digit layout; company IDs are recognizable because 40 is added to the day, so the first digit is 4 to 7
Sweden: organisationsnummer format recognition
Norway, Denmark, Finland: national organization number schemes
Other countries: handled via standard VAT number and company registration formats
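As an illustration of the Icelandic case, the sketch below classifies a kennitala by its day field: personal IDs carry a day of 01 to 31, while company IDs have 40 added to the day (41 to 71). The function name and return values are hypothetical, and the check-digit validation a production parser would need is deliberately left out.

```python
import re
from typing import Optional

# Six digits, an optional hyphen, then four digits.
_KENNITALA = re.compile(r"(\d{6})-?(\d{4})")


def classify_kennitala(raw: str) -> Optional[str]:
    """Return 'person', 'company', or None if the format is not plausible.

    Only the layout, month, and day range are checked. The check digit
    is NOT verified in this sketch.
    """
    match = _KENNITALA.fullmatch(raw.strip())
    if match is None:
        return None
    digits = match.group(1) + match.group(2)
    day, month = int(digits[:2]), int(digits[2:4])
    if not 1 <= month <= 12:
        return None
    if 1 <= day <= 31:
        return "person"
    if 41 <= day <= 71:
        return "company"  # Companies: day of registration plus 40.
    return None
```

The values used to exercise this function should be synthetic, since a kennitala is a real identifier for a person or organization.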

Step 7: Conversion & routing

The resolved invoice is converted to a structured Invoice or Order record — complete with line items, tax categories, and validated totals — and enters the standard Centara routing pipeline:

Smart routing via Vendor Posting Setup assigns GL accounts, dimensions, and cost centres
Purchase order matching compares against open POs
Approval workflow routes for review (or auto-approves if rules are met)
Auto-post to Business Central
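The "validated totals" check mentioned above could look like the following sketch, which verifies that line amounts plus tax equal the stated total. The function name, the string inputs, and the 0.01 tolerance are assumptions. The source does not describe how totals are validated.

```python
from decimal import Decimal


def totals_match(
    line_amounts: list[str],
    tax_amount: str,
    stated_total: str,
    tolerance: str = "0.01",
) -> bool:
    """Check that sum(lines) + tax equals the stated total within tolerance.

    Amounts are passed as strings and parsed with Decimal so that values
    such as 0.10 are not distorted by binary floating point.
    """
    net = sum((Decimal(amount) for amount in line_amounts), Decimal("0"))
    difference = abs(net + Decimal(tax_amount) - Decimal(stated_total))
    return difference <= Decimal(tolerance)
```

For example, lines of 100.00 and 50.50 with tax of 36.12 match a stated total of 186.62, while a stated total of 190.00 does not and would be a candidate for human review.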

Audit trail

Every step is recorded in a rich audit trail of timestamped events:

Pre-screening result (accepted/rejected/split, with reason)
Azure Document Intelligence confidence scores per field
Mistral OCR processing status
Claude extraction result (full JSON)
Fallback activation (if triggered)
Vendor resolution path (which source matched)
Conversion result

This gives full traceability from raw email attachment to delivered document — essential for compliance and debugging.
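A timestamped event log of this kind can be modelled with a small append-only structure. The class names, field names, and step labels below are hypothetical. They only illustrate the idea of one immutable event per pipeline step.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class AuditEvent:
    """One immutable record of a pipeline step's outcome."""

    step: str
    outcome: str
    detail: dict[str, Any] = field(default_factory=dict)
    # UTC timestamps keep ordering unambiguous across time zones.
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class AuditTrail:
    """Append-only list of events for a single document."""

    def __init__(self) -> None:
        self._events: list[AuditEvent] = []

    def record(self, step: str, outcome: str, **detail: Any) -> AuditEvent:
        event = AuditEvent(step=step, outcome=outcome, detail=detail)
        self._events.append(event)
        return event

    def events(self) -> list[AuditEvent]:
        # Return a copy so callers cannot reorder or drop recorded events.
        return list(self._events)
```

A call such as `trail.record("vendor_resolution", "matched", source="llm_company_id")` would capture which source matched, mirroring the "vendor resolution path" entry in the list above.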

Why multi-model?

No single AI model is best at everything:

Challenge	Best model
Fast classification and filtering	Mistral Medium
Reliable structural field extraction	Azure Document Intelligence
High-fidelity text and layout from PDFs	Mistral OCR
Nuanced understanding and vendor discovery	Claude Opus 4.6

By combining purpose-built document AI with frontier LLMs, each covering the others’ blind spots, the pipeline achieves extraction quality that is robust across invoice formats, languages, and edge cases.

The result: more invoices processed automatically, fewer exceptions requiring human review, and a system that gets smarter as the underlying models improve.

Supported input formats

Source	Format	How it enters
Email attachment	PDF, image (PNG/JPG/TIFF)	Forwarded to Centara inbox address
Portal upload	PDF, image	Uploaded via CentaraIQ portal
Scanned paper	PDF (from scanner)	Uploaded or emailed
E-commerce	Structured order data	API from Shopify, WooCommerce
API	XML, JSON	Direct push to CentaraIQ API
