AI Document Processing Examples
Practical examples of using AI to extract, classify, and process documents — from invoices and contracts to medical records and regulatory filings.
Invoice Data Extraction Pipeline
Difficulty: intermediate
An end-to-end pipeline that ingests invoices in PDF, image, or email format, extracts key fields (vendor, amount, line items, dates), validates against purchase orders, and pushes structured data to the accounting system.
// Invoice extraction pipeline
// Note: llm, extractPdfText, runOCR and findMatchingPO are placeholders for your own integrations.
async function processInvoice(document) {
  // Step 1: Convert to text (OCR if needed)
  const text = document.type === "pdf"
    ? await extractPdfText(document)
    : await runOCR(document);

  // Step 2: Extract structured fields with LLM
  // (assumes llm.chat resolves to the parsed JSON object)
  const extracted = await llm.chat({
    messages: [{
      role: "system",
      content: "Extract invoice fields as JSON: vendor, invoiceNumber, date, lineItems[], subtotal, tax, total, paymentTerms"
    }, { role: "user", content: text }],
    responseFormat: { type: "json_object" }
  });

  // Step 3: Validate against PO
  const po = await findMatchingPO(extracted.vendor, extracted.total);
  return { ...extracted, poMatch: po?.id, confidence: po ? "high" : "review" };
}
Key takeaway: Combining OCR with LLM extraction can reach 95%+ accuracy on structured documents, which is typically well above what traditional template-based approaches achieve.
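Before matching against a purchase order, it can help to check that the extracted numbers agree with each other. The following is a minimal sketch, not part of the original pipeline: it assumes each line item carries `quantity` and `unitPrice` fields (the extraction prompt above does not specify the shape of `lineItems`), and `validateInvoiceTotals` and its tolerance are hypothetical.

```javascript
// Hypothetical helper: checks that extracted invoice amounts are internally consistent.
// Assumes lineItems look like { quantity, unitPrice }; adjust to your extraction schema.
function validateInvoiceTotals(invoice, tolerance = 0.01) {
  const issues = [];
  const lineSum = invoice.lineItems.reduce(
    (sum, item) => sum + item.quantity * item.unitPrice,
    0
  );
  if (Math.abs(lineSum - invoice.subtotal) > tolerance) {
    issues.push(
      `Line items sum to ${lineSum.toFixed(2)}, subtotal is ${invoice.subtotal.toFixed(2)}`
    );
  }
  if (Math.abs(invoice.subtotal + invoice.tax - invoice.total) > tolerance) {
    issues.push("subtotal + tax does not equal total");
  }
  return { valid: issues.length === 0, issues };
}

const sampleInvoice = {
  lineItems: [
    { quantity: 2, unitPrice: 50 },
    { quantity: 1, unitPrice: 25 },
  ],
  subtotal: 125,
  tax: 25,
  total: 150,
};
console.log(validateInvoiceTotals(sampleInvoice)); // { valid: true, issues: [] }
```

An invoice that fails this check could be routed to review regardless of whether a PO match is found.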
Contract Clause Detection and Risk Scoring
Difficulty: advanced
A system that analyses legal contracts to identify key clauses (indemnity, liability caps, termination, IP assignment), flags risky language, and generates a risk summary for legal review. Risk scoring is driven by company-specific risk criteria.
// Contract risk analysis
// Company-specific patterns; these can drive a keyword pre-scan alongside the LLM pass
const riskCriteria = {
  unlimitedLiability: { severity: "high", pattern: "unlimited liability" },
  autoRenewal: { severity: "medium", pattern: "automatic renewal" },
  ipAssignment: { severity: "high", pattern: "assign.*intellectual property" },
  nonCompete: { severity: "medium", pattern: "non-compete|restrictive covenant" },
};
// Illustrative weights and review threshold (assumed values; tune to your own risk criteria)
const riskWeights = { low: 1, medium: 3, high: 5 };
const threshold = 10;

async function analyseContract(contractText) {
  // Assumes llm.chat resolves to the parsed JSON object, here { clauses: [...] }
  const { clauses } = await llm.chat({
    messages: [{
      role: "system",
      content: `Identify and classify all clauses in this contract. Respond as JSON of the form {"clauses": [...]}. For each clause, provide: type, text, riskLevel (low/medium/high), and explanation.`
    }, { role: "user", content: contractText }],
    responseFormat: { type: "json_object" }
  });
  // Unknown risk levels contribute 0 rather than producing NaN
  const riskScore = clauses.reduce((sum, c) => sum + (riskWeights[c.riskLevel] ?? 0), 0);
  return { clauses, riskScore, requiresReview: riskScore > threshold };
}
Key takeaway: AI contract review can cut first-pass review time by up to 80%, but always keep human lawyers for final sign-off on high-value agreements.
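The `riskCriteria` patterns above are not used by `analyseContract` itself. One way they might be applied is as a cheap keyword pre-scan that runs alongside the LLM pass. This is a minimal sketch under the assumption that the `pattern` strings are meant as case-insensitive regular expressions; `scanForRiskPatterns` is a hypothetical name.

```javascript
// Same criteria as in the example above, repeated so this block runs on its own.
const riskCriteria = {
  unlimitedLiability: { severity: "high", pattern: "unlimited liability" },
  autoRenewal: { severity: "medium", pattern: "automatic renewal" },
  ipAssignment: { severity: "high", pattern: "assign.*intellectual property" },
  nonCompete: { severity: "medium", pattern: "non-compete|restrictive covenant" },
};

// Hypothetical helper: returns the criteria whose pattern occurs in the text.
function scanForRiskPatterns(contractText, criteria) {
  return Object.entries(criteria)
    .filter(([, rule]) => new RegExp(rule.pattern, "i").test(contractText))
    .map(([name, rule]) => ({ name, severity: rule.severity }));
}

const patternHits = scanForRiskPatterns(
  "This agreement is subject to automatic renewal. The Supplier accepts unlimited liability.",
  riskCriteria
);
// Matches the unlimitedLiability and autoRenewal criteria
console.log(patternHits.map((hit) => hit.name));
```

A pre-scan like this is deterministic, so it can act as a safety net when the model misses or mislabels a clause, though plain patterns will not catch paraphrased language.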
Medical Record Summarisation
Difficulty: advanced
Processes patient medical records to extract diagnoses, medications, lab results, and care history into a structured timeline. Handles handwritten notes via OCR and reconciles information across multiple document sources.
// Medical record summarisation
async function summarisePatientRecord(documents) {
  const timeline = [];
  for (const doc of documents) {
    const text = doc.handwritten ? await medicalOCR(doc) : await extractText(doc);
    // Assumes llm.chat resolves to the parsed JSON object
    const extracted = await llm.chat({
      messages: [{
        role: "system",
        content: "Extract as JSON: date, provider, diagnoses[], medications[], labResults[], procedures[], notes. Use SNOMED codes where possible."
      }, { role: "user", content: text }],
      responseFormat: { type: "json_object" }
    });
    // Keep a link back to the source document for traceability
    timeline.push({ ...extracted, sourceDoc: doc.id });
  }
  // Reconcile and deduplicate across sources
  return reconcileTimeline(timeline);
}
Key takeaway: Medical document processing requires strict data governance. Process records on-premises or in a healthcare-compliant cloud environment, and never send patient data to public APIs.
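`reconcileTimeline` is left undefined above. The following is a minimal sketch of one possible approach, assuming each entry has an ISO `date` string and a `diagnoses` array of strings, and treating two entries as duplicates when date, provider, and diagnoses all match. That rule is a simplification for illustration; real reconciliation would likely need clinically reviewed matching rules.

```javascript
// Hypothetical implementation: deduplicates entries and sorts them chronologically.
function reconcileTimeline(timeline) {
  const byKey = new Map();
  for (const entry of timeline) {
    const diagnoses = [...(entry.diagnoses ?? [])].sort().join("|");
    const key = `${entry.date}::${entry.provider}::${diagnoses}`;
    const existing = byKey.get(key);
    if (existing) {
      // Keep one entry but remember every source document that reported it
      existing.sourceDocs.push(entry.sourceDoc);
    } else {
      const { sourceDoc, ...rest } = entry;
      byKey.set(key, { ...rest, sourceDocs: [sourceDoc] });
    }
  }
  // ISO 8601 dates (YYYY-MM-DD) sort correctly as plain strings
  return [...byKey.values()].sort((a, b) => a.date.localeCompare(b.date));
}

const mergedTimeline = reconcileTimeline([
  { date: "2024-03-02", provider: "Clinic B", diagnoses: ["asthma"], sourceDoc: "doc-2" },
  { date: "2024-01-15", provider: "Clinic A", diagnoses: ["hypertension"], sourceDoc: "doc-1" },
  { date: "2024-03-02", provider: "Clinic B", diagnoses: ["asthma"], sourceDoc: "doc-3" },
]);
console.log(mergedTimeline.map((e) => `${e.date} ${e.sourceDocs.join(",")}`));
// [ '2024-01-15 doc-1', '2024-03-02 doc-2,doc-3' ]
```

Keeping every `sourceDoc` on the merged entry preserves an audit trail back to the original records.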
Automated Expense Report Processing
Difficulty: beginner
Processes expense receipts submitted via email or mobile app, categorises spending, checks against company policy limits, and routes for approval. Handles multiple currencies and receipt formats.
// Expense receipt processing
async function processExpenseReceipt(receipt) {
  const extracted = await llm.extractStructured(receipt.image, {
    fields: ["merchant", "date", "amount", "currency", "category", "items"]
  });

  // Check against policy, converting to the policy currency first
  // (convertCurrency and policy.currency are assumed helpers, not shown here)
  const policy = await getPolicyLimits(receipt.submitterId);
  const amount = await convertCurrency(extracted.amount, extracted.currency, policy.currency);
  const violations = [];
  if (amount > policy.maxSingleExpense) {
    violations.push("Exceeds single expense limit");
  }
  if (!policy.allowedCategories.includes(extracted.category)) {
    violations.push("Category not in approved list");
  }

  return {
    ...extracted,
    policyCompliant: violations.length === 0,
    violations,
    approvalRoute: violations.length > 0 ? "manager" : "auto"
  };
}
Key takeaway: Automated expense processing tends to pay for itself quickly, typically within 3 to 6 months, through reduced manual processing time and faster reimbursements.
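Because receipts can arrive in several currencies, amounts need converting to a single policy currency before they are compared with a limit. The following is a minimal sketch, assuming a static rate table into one policy currency (GBP here). The helper names are hypothetical and the rates are illustrative, not real exchange rates.

```javascript
// Illustrative rates into GBP; a real system would fetch current rates from a rates service.
const ratesToGBP = { GBP: 1, USD: 0.8, EUR: 0.85 };

// Hypothetical helper: converts an amount into the policy currency (GBP here).
function toPolicyCurrency(amount, currency, rates) {
  const rate = rates[currency];
  if (rate === undefined) {
    throw new Error(`No exchange rate for ${currency}`);
  }
  // Round to 2 decimal places to avoid floating-point noise in comparisons
  return Math.round(amount * rate * 100) / 100;
}

// Compares the converted amount with the single-expense limit
function exceedsLimit(expense, policy, rates) {
  return toPolicyCurrency(expense.amount, expense.currency, rates) > policy.maxSingleExpense;
}

const samplePolicy = { maxSingleExpense: 100 };
console.log(exceedsLimit({ amount: 120, currency: "USD" }, samplePolicy, ratesToGBP)); // false (96 GBP)
console.log(exceedsLimit({ amount: 120, currency: "EUR" }, samplePolicy, ratesToGBP)); // true (102 GBP)
```

Throwing on an unknown currency, rather than silently skipping the check, keeps an unconvertible receipt from being auto-approved.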
Regulatory Filing Document Assembly
Difficulty: advanced
Automates the assembly of regulatory filings by pulling data from multiple internal systems, populating templates, cross-referencing prior submissions for consistency, and flagging discrepancies for human review.
// Regulatory filing assembly
async function assembleRegFiling(filingType, period, previousPeriod) {
  // Gather data from multiple sources
  const [financials, riskData, complianceNotes] = await Promise.all([
    fetchFinancialData(period),
    fetchRiskAssessments(period),
    fetchComplianceNotes(period),
  ]);

  // Check consistency with prior filings
  const priorFiling = await getPriorFiling(filingType, previousPeriod);
  // Assumes llm.chat resolves to the parsed JSON object, here { inconsistencies: [...] }
  const { inconsistencies } = await llm.chat({
    messages: [{
      role: "system",
      content: "Compare current data with prior filing. Flag any material changes or inconsistencies that need explanation. Respond as JSON of the form {\"inconsistencies\": [...]}."
    }, { role: "user", content: JSON.stringify({ current: financials, prior: priorFiling }) }],
    responseFormat: { type: "json_object" }
  });

  // Populate template
  const filing = await populateTemplate(filingType, { financials, riskData, complianceNotes });
  return { filing, inconsistencies, requiresReview: inconsistencies.length > 0 };
}
Key takeaway: Document assembly is a high-value AI use case because errors in regulatory filings carry real penalties, and AI can catch inconsistencies that humans miss under time pressure.
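Numeric discrepancies can also be checked deterministically, before or alongside the LLM comparison. The following is a minimal sketch, assuming the current and prior financials are flat objects of numeric fields and that a 10% change counts as material. Both are assumptions for illustration, since materiality thresholds depend on the filing regime, and `findMaterialChanges` is a hypothetical name.

```javascript
// Hypothetical helper: flags fields whose value changed by more than `threshold` (a fraction).
function findMaterialChanges(current, prior, threshold = 0.1) {
  const flagged = [];
  for (const [field, value] of Object.entries(current)) {
    if (!(field in prior)) {
      flagged.push({ field, reason: "missing from prior filing" });
      continue;
    }
    const previous = prior[field];
    // Avoid division by zero: any move away from zero is treated as material
    const change =
      previous === 0
        ? (value === 0 ? 0 : Infinity)
        : (value - previous) / Math.abs(previous);
    if (Math.abs(change) > threshold) {
      flagged.push({ field, reason: "material change", previous, value });
    }
  }
  return flagged;
}

const flaggedFields = findMaterialChanges(
  { revenue: 1200, operatingCosts: 805, headcount: 52 },
  { revenue: 1000, operatingCosts: 800 }
);
// Flags revenue (20% change) and headcount (absent from the prior filing)
console.log(flaggedFields.map((f) => `${f.field}: ${f.reason}`));
```

Fields flagged this way could be passed to the model with a request to explain the change, rather than asking it to find the change in the first place.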
Patterns
Key patterns to follow
- Combine OCR with LLM extraction for best results: OCR alone misses context, and text-only LLMs cannot read scanned images
- Always include a human review step for high-stakes documents (legal, medical, financial)
- Structured output formats (JSON schemas) dramatically improve extraction reliability
- Validation against existing data (POs, policies, prior filings) catches errors that pure extraction misses
- On-premises or private cloud processing is essential for sensitive documents
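The structured-output and validation patterns above can be combined by checking the model's JSON against an expected shape before it reaches downstream systems. The following is a minimal sketch using a hand-rolled map of field checks, which is a hypothetical stand-in for a full JSON Schema validator; the field names are borrowed from the invoice example.

```javascript
// Expected shape of an extracted invoice: field name -> validity check
const invoiceShape = {
  vendor: (v) => typeof v === "string" && v.length > 0,
  invoiceNumber: (v) => typeof v === "string" && v.length > 0,
  lineItems: (v) => Array.isArray(v),
  total: (v) => typeof v === "number" && Number.isFinite(v),
};

// Hypothetical helper: parses raw LLM output and reports missing or malformed fields.
function parseAndValidate(rawJson, shape) {
  let data;
  try {
    data = JSON.parse(rawJson);
  } catch {
    return { ok: false, errors: ["Output is not valid JSON"] };
  }
  if (data === null || typeof data !== "object") {
    return { ok: false, errors: ["Output is not a JSON object"] };
  }
  const errors = Object.entries(shape)
    .filter(([field, isValid]) => !isValid(data[field]))
    .map(([field]) => `Missing or invalid field: ${field}`);
  return { ok: errors.length === 0, data, errors };
}

// The model returned the total as a string, which the shape check rejects
const validation = parseAndValidate(
  '{"vendor":"Acme Ltd","invoiceNumber":"INV-001","lineItems":[],"total":"150"}',
  invoiceShape
);
console.log(validation.errors); // [ 'Missing or invalid field: total' ]
```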
FAQ
Frequently asked questions
How accurate is AI document extraction?
Modern AI extraction typically achieves 90-98% accuracy on structured documents such as invoices and receipts. Accuracy drops to 80-90% for handwritten text and is also lower for complex layouts. Always implement confidence scoring and human review for low-confidence extractions.
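The confidence scoring and review routing mentioned above could look like the following. This is a minimal sketch, assuming the extraction step returns a per-field confidence between 0 and 1 (not every model or OCR engine provides one) and that 0.9 is an acceptable cut-off. Both are assumptions, and `routeByConfidence` is a hypothetical name.

```javascript
// Hypothetical helper: routes an extraction to human review if any field falls below the threshold.
function routeByConfidence(fields, threshold = 0.9) {
  const lowConfidence = Object.entries(fields)
    .filter(([, field]) => field.confidence < threshold)
    .map(([name]) => name);
  return {
    route: lowConfidence.length === 0 ? "auto" : "review",
    lowConfidence,
  };
}

const routingDecision = routeByConfidence({
  vendor: { value: "Acme Ltd", confidence: 0.99 },
  total: { value: 150, confidence: 0.72 },
});
console.log(routingDecision); // { route: 'review', lowConfidence: [ 'total' ] }
```

Returning the list of low-confidence fields lets a reviewer focus on those fields instead of re-checking the whole document.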
Can AI process handwritten documents?
Yes, but with lower accuracy than printed text. Specialised medical and legal OCR models handle domain-specific handwriting better than general-purpose OCR. Expect 80-90% accuracy on clear handwriting and less on poor handwriting.
Which document formats are supported?
Most systems handle PDF, images (JPEG, PNG, TIFF), Microsoft Office formats (DOCX, XLSX), and email (EML, MSG). Some also handle scanned documents via OCR. The key is building a robust ingestion pipeline that normalises formats before processing.
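The normalisation step can be pictured as a dispatch on file type. The following is a minimal sketch, assuming routing by file extension; the route names are hypothetical labels for whatever extractors a given pipeline uses, not real library calls.

```javascript
// Maps file extensions to a (hypothetical) extraction route for that format.
const routesByExtension = new Map([
  ["pdf", "pdf-text"],
  ["jpg", "ocr"], ["jpeg", "ocr"], ["png", "ocr"], ["tif", "ocr"], ["tiff", "ocr"],
  ["docx", "office"], ["xlsx", "office"],
  ["eml", "email"], ["msg", "email"],
]);

function chooseRoute(filename) {
  const dot = filename.lastIndexOf(".");
  const ext = dot === -1 ? "" : filename.slice(dot + 1).toLowerCase();
  const route = routesByExtension.get(ext);
  if (!route) {
    throw new Error(`Unsupported format: ${filename}`);
  }
  return route;
}

console.log(chooseRoute("invoice-0042.PDF")); // pdf-text
console.log(chooseRoute("receipt.jpeg")); // ocr
```

Extensions are only a first guess. A scanned PDF has no text layer, so a production pipeline would likely fall back to OCR when text extraction returns nothing.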
How should sensitive documents be secured?
Use on-premises or private cloud deployment, encrypt data at rest and in transit, apply access controls, maintain audit logs, and ensure compliance with relevant regulations (GDPR, HIPAA, SOX). Never send sensitive documents to public AI APIs without proper data processing agreements.
What ROI can be expected from AI document processing?
Typical first-year ROI ranges from 200% to 500%, driven by reduced manual processing time (a 60-80% reduction), fewer errors, faster turnaround, and improved compliance. Most organisations see payback within 3-6 months.