
AI Data Extraction Examples

Examples of using AI to extract structured data from unstructured sources — websites, PDFs, emails, images, and free-text documents — with validation and quality assurance patterns.

Web Page Data Extraction Pipeline

Difficulty: intermediate

A pipeline that scrapes web pages, uses AI to identify and extract specific data points (prices, specifications, contact information), structures the output as JSON, and validates against expected schemas.

// AI web data extraction
async function extractFromWebPage(url, schema) {
  const html = await fetchPage(url);
  const cleanText = stripHTML(html);

  const extracted = await llm.chat({
    messages: [{
      role: "system",
      content: `Extract data matching this schema from the page content: ${JSON.stringify(schema)}. Return valid JSON.`
    }, { role: "user", content: cleanText }],
    responseFormat: { type: "json_object" }
  });

  // Validate against schema
  const validation = validateSchema(extracted, schema);
  if (!validation.valid) {
    return { data: extracted, errors: validation.errors, confidence: "low" };
  }

  return { data: extracted, confidence: "high" };
}
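The `validateSchema` helper above is left undefined. A minimal sketch is below, assuming a flat schema of the form `{ fields: { fieldName: jsTypeName } }` — that shape is an illustrative assumption, not part of the pipeline above. A production system would more likely use a JSON Schema validator such as Ajv.

```javascript
// Minimal schema validator (sketch): checks required fields and their types.
// Assumes schema = { fields: { price: "number", name: "string", ... } }.
function validateSchema(data, schema) {
  const errors = [];
  for (const [field, expectedType] of Object.entries(schema.fields || {})) {
    if (!(field in data) || data[field] == null) {
      errors.push(`Missing field: ${field}`);
    } else if (typeof data[field] !== expectedType) {
      errors.push(`Field ${field}: expected ${expectedType}, got ${typeof data[field]}`);
    }
  }
  return { valid: errors.length === 0, errors };
}
```

Even this simple check catches the most common extraction failure: the model returning a string like `"£9.99"` where a number was expected.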

Key takeaway: AI extraction is more resilient to website layout changes than CSS-selector-based scraping — it understands content semantics rather than relying on DOM structure.

Business Card and Contact Information Extraction

Difficulty: beginner

An image-based extraction system that processes photos of business cards, extracts contact details (name, title, company, phone, email, address), and creates structured contact records in the CRM.

// Business card extraction
async function extractBusinessCard(image) {
  const contact = await llm.chat({
    messages: [{
      role: "system",
      content: "Extract contact information from this business card image. Return JSON with: name, title, company, email, phone, address, website."
    }],
    images: [image],
    responseFormat: { type: "json_object" }
  });

  // Validate email format, phone number format
  const validated = validateContactFields(contact);

  // Check for duplicates in CRM
  const duplicate = await crm.findDuplicate(validated.email, validated.phone);

  return { contact: validated, isDuplicate: !!duplicate, existingRecord: duplicate };
}
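The `validateContactFields` helper is assumed above. A minimal sketch follows: it flags malformed fields rather than rejecting the whole record, and normalises the phone number to bare digits. The regex and length bounds are illustrative assumptions, not a full E.164 or RFC 5322 implementation.

```javascript
// Field-level validation for extracted contacts (sketch).
// Flags suspect fields in validationIssues instead of discarding the record.
function validateContactFields(contact) {
  const issues = [];

  const emailOk = typeof contact.email === "string" &&
    /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(contact.email);
  if (!emailOk) issues.push("email");

  // Normalise the phone number to digits and check a plausible length
  const digits = (contact.phone || "").replace(/\D/g, "");
  if (digits.length < 7 || digits.length > 15) issues.push("phone");

  return { ...contact, phone: digits || contact.phone, validationIssues: issues };
}
```

Keeping the record and listing its issues lets the CRM show a "needs review" state instead of silently dropping a lead.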

Key takeaway: Multi-modal AI models that process images directly typically outperform OCR-then-extract pipelines for business card extraction, because OCR errors would otherwise compound before the extraction step even begins.

Financial Statement Data Extraction

Difficulty: advanced

Extracts line items, totals, and key ratios from financial statements (balance sheets, income statements, cash flow statements) in PDF format. Handles various layouts and formats from different reporting standards.

// Financial statement extraction
async function extractFinancials(pdfDocument) {
  const pages = await pdfToText(pdfDocument);

  const financials = await llm.chat({
    messages: [{
      role: "system",
      content: `Extract financial data from this statement. Return JSON with:
      - statementType (balance_sheet | income_statement | cash_flow)
      - period, currency
      - lineItems: [{ name, value, category }]
      - totals: { revenue, netIncome, totalAssets, etc. }`
    }, { role: "user", content: pages.join("\n") }],
    responseFormat: { type: "json_object" }
  });

  // Cross-validation: do the line items sum to the stated totals?
  const validated = crossValidateFinancials(financials);

  return { data: financials, validationErrors: validated.errors, confidence: validated.confidence };
}
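The `crossValidateFinancials` helper is assumed above. A minimal sketch is below: it sums line items per category and compares each sum against the stated total. It assumes line-item categories share names with keys in `totals` — an illustrative convention, not a guarantee of the model's output.

```javascript
// Cross-validation sketch: do line items in each category sum to the stated total?
// Assumes lineItems = [{ name, value, category }] and totals = { category: number }.
function crossValidateFinancials(financials) {
  const errors = [];
  const sums = {};

  for (const item of financials.lineItems || []) {
    sums[item.category] = (sums[item.category] || 0) + item.value;
  }

  for (const [category, stated] of Object.entries(financials.totals || {})) {
    if (category in sums && Math.abs(sums[category] - stated) > 0.01) {
      errors.push(`${category}: items sum to ${sums[category]}, statement says ${stated}`);
    }
  }

  return { errors, confidence: errors.length === 0 ? "high" : "low" };
}
```

A mismatch here usually means the model missed a line item or misread a digit, both of which OCR-quality issues cause routinely.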

Key takeaway: Financial extraction requires domain-specific validation — cross-check extracted totals against calculated sums to catch errors.

Resume and CV Parsing System

Difficulty: beginner

Extracts structured candidate information from resumes in various formats (PDF, DOCX, plain text). Identifies skills, experience, education, and certifications regardless of layout or format variations.

// Resume parsing
async function parseResume(document) {
  const text = await extractText(document);

  const parsed = await llm.chat({
    messages: [{
      role: "system",
      content: `Parse this resume into structured data:
      { name, email, phone, location, summary,
        experience: [{ company, title, startDate, endDate, responsibilities[] }],
        education: [{ institution, degree, field, year }],
        skills: [], certifications: [] }`
    }, { role: "user", content: text }],
    responseFormat: { type: "json_object" }
  });

  // Normalise skills against taxonomy
  parsed.skills = normaliseSkills(parsed.skills);

  return parsed;
}
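The `normaliseSkills` helper is assumed above. A minimal sketch follows: it maps free-text variants onto canonical names and de-duplicates the result. The tiny taxonomy here is purely illustrative; real systems maintain a much larger, curated one.

```javascript
// Skill normalisation sketch: map free-text variants onto a canonical taxonomy
// and drop duplicates. The taxonomy below is illustrative only.
const SKILL_TAXONOMY = {
  "js": "JavaScript",
  "javascript": "JavaScript",
  "node": "Node.js",
  "nodejs": "Node.js",
  "py": "Python",
  "python": "Python",
};

function normaliseSkills(skills) {
  const seen = new Set();
  const result = [];
  for (const raw of skills || []) {
    const key = raw.trim().toLowerCase();
    const canonical = SKILL_TAXONOMY[key] || raw.trim();
    if (!seen.has(canonical)) {
      seen.add(canonical);
      result.push(canonical);
    }
  }
  return result;
}
```

Without this step, "JS", "Javascript" and "javascript" count as three different skills when matching candidates against a role.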

Key takeaway: AI resume parsing handles creative and non-standard formats that break template-based parsers — critical for reducing bias in candidate screening.

Multi-Source Data Aggregation and Reconciliation

Difficulty: advanced

A pipeline that extracts related data from multiple sources (database, spreadsheets, PDFs, emails), reconciles discrepancies, and produces a unified dataset. Uses AI to resolve conflicts and fill gaps.

// Multi-source data reconciliation
async function reconcileData(sources) {
  // Extract from each source
  const datasets = await Promise.all(
    sources.map(async (source) => ({
      source: source.name,
      data: await extractFromSource(source),
    }))
  );

  // Find conflicts
  const conflicts = findConflicts(datasets);

  // Resolve conflicts with AI
  const resolutions = await Promise.all(
    conflicts.map(async (conflict) => {
      const resolution = await llm.chat({
        messages: [{
          role: "system",
          content: "Resolve this data conflict. Consider source reliability, recency, and consistency."
        }, { role: "user", content: JSON.stringify(conflict) }]
      });
      return { conflict, resolution, requiresReview: resolution.confidence < 0.8 };
    })
  );

  return { unified: mergeDatasets(datasets, resolutions), conflicts: resolutions };
}
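The `findConflicts` helper is assumed above. A minimal sketch is below: it treats each top-level field as a candidate for conflict and reports any field where sources disagree. It assumes each dataset's `data` is a flat object of comparable fields — nested records would need a recursive comparison.

```javascript
// Conflict detection sketch: for each field, collect the distinct values
// reported by different sources; more than one distinct value is a conflict.
function findConflicts(datasets) {
  const byField = {};
  for (const { source, data } of datasets) {
    for (const [field, value] of Object.entries(data)) {
      (byField[field] = byField[field] || []).push({ source, value });
    }
  }
  return Object.entries(byField)
    .filter(([, entries]) =>
      new Set(entries.map((e) => JSON.stringify(e.value))).size > 1)
    .map(([field, entries]) => ({ field, values: entries }));
}
```

Each conflict object carries the field name and the per-source values, which is exactly the context the resolution prompt above needs.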

Key takeaway: AI reconciliation across multiple sources catches data quality issues that single-source extraction misses — the real value is in finding and resolving conflicts.

Receipt and Expense Data Extraction

Difficulty: beginner

Processes photos of receipts to extract merchant, date, items, amounts, tax, and total. Handles crumpled receipts, partial text, and various formats. Categorises expenses automatically based on merchant and items.

// Receipt extraction
async function extractReceipt(receiptImage) {
  const data = await llm.chat({
    messages: [{
      role: "system",
      content: `Extract receipt data. Return JSON:
      { merchant, date, items: [{ name, quantity, price }], subtotal, tax, total, currency, paymentMethod }
      If any field is unclear, set it to null.`
    }],
    images: [receiptImage],
    responseFormat: { type: "json_object" }
  });

  // Auto-categorise
  data.category = await categoriseExpense(data.merchant, data.items);

  // Validate: do the item prices sum to the stated subtotal?
  const calculatedTotal = (data.items || []).reduce((sum, i) => sum + i.price * i.quantity, 0);
  data.subtotalValidated = Math.abs(calculatedTotal - data.subtotal) < 0.01;

  return data;
}
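The `categoriseExpense` call above presumably goes to the model, but a cheap rule-based first pass can handle well-known merchants before spending an LLM call. The sketch below is an illustrative assumption, not the implementation above; the categories and patterns are examples only.

```javascript
// Rule-based expense categorisation sketch: a cheap first pass by merchant
// keyword, falling back to "uncategorised" (where an LLM call could take over).
const MERCHANT_RULES = [
  { pattern: /rail|train|uber|taxi/i, category: "travel" },
  { pattern: /hotel|hostel/i, category: "accommodation" },
  { pattern: /cafe|coffee|restaurant/i, category: "meals" },
];

function categoriseByRules(merchant) {
  for (const rule of MERCHANT_RULES) {
    if (rule.pattern.test(merchant || "")) return rule.category;
  }
  return "uncategorised";
}
```

Routing the easy 80% of receipts through rules keeps per-document cost down and reserves the model for genuinely ambiguous merchants.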

Key takeaway: Receipt extraction accuracy improves dramatically when you provide the model with expected categories and typical merchant names for your use case.

Patterns

Key patterns to follow

  • Always validate extracted data against schemas and cross-check calculated fields against stated totals
  • Multi-modal models that process images directly often outperform OCR-then-extract pipelines
  • Providing expected categories and field types in the extraction prompt significantly improves accuracy
  • Confidence scoring and human review for low-confidence extractions is essential for production systems
  • Multi-source reconciliation adds more value than single-source extraction alone

FAQ

Frequently asked questions

How accurate is AI data extraction?

Accuracy ranges from 85% for complex unstructured documents to 98% for well-structured formats. Key factors are document quality, consistency of format, and specificity of extraction instructions. Always implement validation and human review for business-critical data.

Can AI extract handwritten text?

Yes, multi-modal AI models can process handwritten text directly from images. Accuracy depends on handwriting legibility — clear handwriting achieves 85-90% accuracy, while poor handwriting may drop to 60-70%. For critical documents, always include human verification.

How do I handle documents in multiple formats?

Build a format detection layer that identifies file types and routes them to appropriate processors. For PDFs, use text extraction or OCR. For images, use multi-modal models directly. For Office formats, convert to text first. The AI extraction layer can be format-agnostic if pre-processing handles the conversion.
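That routing layer can be sketched in a few lines. The route names below are illustrative labels, not a real library's API; real detection would also sniff magic bytes rather than trusting the extension.

```javascript
// Format routing sketch: pick a processor by file extension so the downstream
// AI extraction layer only ever sees plain text or an image.
const ROUTES = {
  ".pdf": "pdf-text-or-ocr",
  ".png": "multimodal",
  ".jpg": "multimodal",
  ".jpeg": "multimodal",
  ".docx": "office-to-text",
  ".txt": "plain-text",
};

function routeDocument(filename) {
  const ext = filename.slice(filename.lastIndexOf(".")).toLowerCase();
  return ROUTES[ext] || "unsupported";
}
```

An explicit "unsupported" route matters: failing loudly on an unknown format beats feeding garbage bytes to the extraction prompt.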

Is AI data extraction compliant with data protection regulations such as GDPR?

The technology itself is neutral — compliance depends on how you implement it. Ensure you have a lawful basis for processing, implement data minimisation, use secure processing environments, maintain audit logs, and handle data retention properly. Use on-premises or compliant cloud solutions for personal data.

How much does AI data extraction cost?

Costs range from £0.01-0.10 per document for simple extraction to £0.50-2.00 for complex multi-page documents requiring multi-modal processing. At scale, batch processing and model optimisation can reduce costs by 50-70%.

Need custom AI implementation?

Our team can help you build production-ready AI solutions. Book a free strategy call.