If you've ever spoken to a data scientist, you've heard the refrain: data preparation is 80% of the work. Having delivered dozens of AI projects across industries, we can confirm this is not an exaggeration. The difference between a working AI system and an expensive failure almost always comes down to the quality and shape of the data going in.
Yet most businesses underestimate the effort involved. They assume their existing data is "good enough" or that AI will somehow work around gaps and inconsistencies. It won't. Here's what you actually need to do.
Why Data Prep Is 80% of the Work
Modern AI models are remarkably capable, but they're only as good as the data they consume. A state-of-the-art language model fed inconsistent, duplicate, or poorly structured data will produce inconsistent, duplicate, and poorly structured outputs. No amount of prompt engineering will fix fundamentally broken inputs.
Data preparation dominates the effort because it encompasses everything from initial collection and cleaning through to formatting, enrichment, and validation. Each step introduces its own challenges. A customer database might have three different formats for phone numbers. Product descriptions might mix languages. Financial records might have gaps spanning entire quarters. These are not edge cases — they are the norm.
The upside is significant: get your data right and the AI implementation itself becomes dramatically simpler. We've seen projects where a two-week data cleaning sprint turned a failing prototype into a production-ready system overnight.
Common Data Quality Issues
After years of working with enterprise data, we see the same problems repeatedly. Being aware of them before you start saves enormous amounts of time.
- Inconsistent formatting: Dates in multiple formats, addresses with varying structures, names stored as "First Last" in one system and "Last, First" in another. These seem trivial but cause significant issues at scale.
- Missing values: Fields that are technically required but routinely left blank. The question is not whether you have missing data, but how much and whether it's systematic (which biases your model) or random (which is more manageable).
- Duplicates: The same entity represented multiple times, often with slight variations. Customer "John Smith" and "J. Smith" at the same address are probably the same person, but your system doesn't know that.
- Stale data: Records that were accurate when created but haven't been updated. A model trained on outdated pricing or discontinued products will confidently give wrong answers.
- Siloed data: Relevant information trapped in separate systems with no easy way to join them. Sales data in one CRM, support tickets in another, and product information in a spreadsheet on someone's desktop.
Cleaning Strategies That Actually Work
Data cleaning is not glamorous, but it is where most of the value is created. Here are the strategies we use consistently.
Start with profiling: Before cleaning anything, understand what you have. Run statistical profiles on every field — distributions, null rates, cardinality, outliers. This tells you where to focus your effort. Tools like Great Expectations or even simple Pandas profiling scripts can do this in minutes.
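As a rough illustration, here is a minimal profiling pass using plain pandas; the file name and columns are hypothetical, and a dedicated tool like Great Expectations adds much more on top (expectations, reports, CI integration):

```python
import pandas as pd

# Load a raw export (hypothetical file name) and profile every column.
df = pd.read_csv("customers.csv")

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_rate": df.isna().mean().round(3),   # share of missing values per column
    "cardinality": df.nunique(),              # distinct values per column
})
print(profile.sort_values("null_rate", ascending=False))

# Numeric columns: quick distribution and outlier check.
print(df.describe(percentiles=[0.01, 0.5, 0.99]).T)
```

Even this crude profile usually surfaces the worst offenders — the near-empty columns, the suspicious cardinalities, the outliers that turn out to be data-entry errors.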
Standardise early: Pick canonical formats for dates, currencies, names, and addresses, then enforce them. This is tedious but prevents cascading issues later. Automated standardisation rules handle 90% of cases; the remaining 10% need manual review.
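A minimal sketch of what "standardise early" looks like in pandas; the column names (signup_date, phone, last_name) and the canonical formats chosen here are illustrative, not prescriptive:

```python
import pandas as pd

def standardise(df: pd.DataFrame) -> pd.DataFrame:
    """Apply canonical formats; column names are illustrative."""
    out = df.copy()

    # Dates: parse whatever format arrives, store as ISO 8601 (YYYY-MM-DD).
    out["signup_date"] = pd.to_datetime(out["signup_date"], errors="coerce").dt.strftime("%Y-%m-%d")

    # Phone numbers: strip everything except digits and a leading '+'.
    out["phone"] = out["phone"].astype(str).str.replace(r"[^\d+]", "", regex=True)

    # Names: collapse stray whitespace and use consistent casing.
    out["last_name"] = out["last_name"].str.strip().str.title()
    return out
```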
Deduplicate intelligently: Simple exact-match deduplication catches obvious duplicates, but fuzzy matching (using techniques like Levenshtein distance or embedding similarity) catches the rest. We typically use a two-pass approach: automated matching followed by human review of uncertain matches.
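Here is a rough sketch of the two-pass idea using Python's standard-library difflib as a stand-in for a proper Levenshtein or embedding-based matcher; the records, blocking key, and thresholds are all illustrative:

```python
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "name": "John Smith", "postcode": "SW1A 1AA"},
    {"id": 2, "name": "J. Smith",   "postcode": "SW1A 1AA"},
    {"id": 3, "name": "Jane Doe",   "postcode": "E1 6AN"},
]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

auto_merge, needs_review = [], []
for r1, r2 in combinations(records, 2):
    if r1["postcode"] != r2["postcode"]:   # cheap blocking key to limit comparisons
        continue
    score = similarity(r1["name"], r2["name"])
    if score >= 0.9:
        auto_merge.append((r1["id"], r2["id"]))            # confident match: merge automatically
    elif score >= 0.6:
        needs_review.append((r1["id"], r2["id"], score))   # uncertain: queue for human review

print(auto_merge, needs_review)
```

In this toy example "John Smith" and "J. Smith" land in the review queue rather than being merged blindly — exactly the behaviour you want for borderline matches.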
Handle missing data deliberately: Don't just drop rows with missing values. Understand why they're missing. If a field is missing 60% of the time, it's probably not a useful feature. If it's missing 5% of the time, imputation or simply excluding those records may be appropriate. The key is making an explicit decision rather than letting your tooling handle it silently.
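A small sketch of making those decisions explicitly in pandas, using the thresholds from the paragraph above; the file name and the choice of median or "unknown" imputation are assumptions you would tailor per field:

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical export

# Decide per column, explicitly, instead of letting defaults decide for you.
null_rates = df.isna().mean()

drop_cols = null_rates[null_rates > 0.6].index.tolist()                      # too sparse to be useful
impute_cols = null_rates[(null_rates > 0) & (null_rates <= 0.05)].index.tolist()

df = df.drop(columns=drop_cols)
for col in impute_cols:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())   # record the imputation choice in your docs
    else:
        df[col] = df[col].fillna("unknown")
```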
Labelling Approaches and When to Use Synthetic Data
For supervised learning and evaluation, you need labelled data. This is often the most expensive part of data preparation, but there are ways to manage costs.
Manual labelling remains the gold standard for quality. For classification tasks, have at least two annotators label each example independently, then measure inter-annotator agreement. If your annotators can't agree, your model won't be able to learn the task either — which is a useful signal.
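Measuring agreement is a one-liner with scikit-learn's cohen_kappa_score; the labels below are toy data, and the 0.6 cut-off mentioned in the comment is a common heuristic rather than a hard rule:

```python
from sklearn.metrics import cohen_kappa_score

# Labels from two annotators for the same ten examples (toy data).
annotator_a = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham", "ham"]
annotator_b = ["spam", "ham",  "ham", "ham", "spam", "ham", "spam", "spam", "ham", "ham"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # below ~0.6, revisit the labelling guidelines before labelling more
```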
Semi-automated labelling uses a model to generate initial labels, which humans then review and correct. This is typically 3-5x faster than manual labelling from scratch, and the quality is surprisingly good when paired with a well-designed review interface.
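A minimal sketch of the review split: any classifier exposing predict_proba (a scikit-learn pipeline, for example) proposes labels, and low-confidence predictions are routed to humans. The threshold is an assumption to tune per project, not a recommendation:

```python
def propose_labels(texts, model, review_threshold=0.8):
    """Model proposes labels; anything below the confidence threshold goes to human review.

    `model` is any fitted classifier with predict_proba; the threshold is project-specific.
    """
    probs = model.predict_proba(texts)
    predictions = probs.argmax(axis=1)
    confidence = probs.max(axis=1)

    auto_labelled = [(t, p) for t, p, c in zip(texts, predictions, confidence) if c >= review_threshold]
    for_review = [(t, p, c) for t, p, c in zip(texts, predictions, confidence) if c < review_threshold]
    return auto_labelled, for_review
```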
Synthetic data has matured significantly. For tasks where real data is scarce, sensitive, or expensive to collect, generating synthetic training examples using large language models can bootstrap your dataset. We use this approach regularly for document extraction and classification tasks. The key is validating synthetic data against a small set of real examples to ensure it captures genuine patterns rather than hallucinated ones.
A practical rule of thumb: start with 200-500 high-quality labelled examples. Train a baseline model, evaluate it, and then decide whether you need more data or whether your model's errors point to a different problem entirely (often they point to labelling inconsistency rather than data volume).
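For a text classification task, the baseline can be as simple as TF-IDF plus logistic regression; the toy examples below stand in for your few hundred reviewed labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-ins for 200-500 reviewed examples.
texts = [
    "invoice overdue, please pay", "payment received with thanks",
    "quarterly budget attached", "expense claim approved",
    "meeting moved to friday", "lunch on tuesday?",
    "can we reschedule the call", "agenda for tomorrow's standup",
]
labels = ["finance"] * 4 + ["scheduling"] * 4

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
print(classification_report(y_test, baseline.predict(X_test)))
```

The per-class breakdown in the report is what tells you whether to collect more data, fix inconsistent labels, or rethink the task definition.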
Building Data Pipelines for Production
One-off data cleaning is necessary but insufficient. For AI systems that operate continuously, you need pipelines that clean and validate data as it flows through.
A production data pipeline for AI typically includes these stages:
- Ingestion: Pull data from source systems via APIs, database connections, or file drops. Establish clear contracts about format and frequency.
- Validation: Check incoming data against expected schemas and quality rules. Reject or quarantine records that fail validation rather than letting bad data propagate (see the sketch after this list).
- Transformation: Standardise formats, enrich records with derived fields, and join data from multiple sources into a unified representation.
- Quality monitoring: Track data quality metrics over time. Set alerts for drift — if the distribution of incoming data shifts significantly, your model may need retraining.
- Versioning: Track changes to your training data and processing logic. When model performance degrades, you need to understand whether the data changed, the pipeline changed, or the world changed.
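To make the validation stage concrete, here is a minimal sketch of a "reject or quarantine" check in pandas; the required columns and rules are illustrative placeholders for your own schema:

```python
import pandas as pd

# Illustrative schema; your columns and rules will differ.
REQUIRED_COLUMNS = {"customer_id", "order_date", "amount"}

def validate_batch(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split an incoming batch into clean rows and quarantined rows."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Batch rejected: missing columns {missing}")

    bad = (
        df["customer_id"].isna()
        | df["amount"].lt(0)
        | pd.to_datetime(df["order_date"], errors="coerce").isna()
    )
    return df[~bad], df[bad]   # clean rows flow on; quarantined rows get logged and reviewed
```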
You don't need to build all of this from day one. Start with validation and transformation, then add monitoring and versioning as your system matures. The important thing is designing your pipeline so these capabilities can be added incrementally.
For most of our clients, we recommend starting with a simple orchestration tool like Prefect or Airflow for batch pipelines, paired with schema validation at ingestion. This covers 80% of requirements without over-engineering. For our full approach, see our data preparation services.
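As a sketch of how little code this can take, here is a minimal batch flow assuming Prefect 2.x, with hypothetical file paths and a deliberately simple validation step standing in for the fuller checks above:

```python
import pandas as pd
from prefect import flow, task

@task(retries=2)
def ingest(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

@task
def validate(df: pd.DataFrame) -> pd.DataFrame:
    assert {"customer_id", "amount"} <= set(df.columns), "schema mismatch"
    return df.dropna(subset=["customer_id"])

@task
def transform(df: pd.DataFrame) -> pd.DataFrame:
    df["amount"] = df["amount"].astype(float)
    return df

@flow
def daily_batch(path: str = "exports/orders.csv"):
    raw = ingest(path)
    clean = transform(validate(raw))
    clean.to_parquet("warehouse/orders.parquet")

if __name__ == "__main__":
    daily_batch()
```

Retries, scheduling, logging, and alerting can then be layered on through the orchestrator's configuration rather than bolted onto your own scripts.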
Data preparation is not the exciting part of AI, but it is the part that determines whether your project succeeds or fails. If you're not sure where your data stands or what it will take to get it ready, get in touch for a free data assessment and we'll give you an honest picture of what's needed.