GroveAI

What data do I need for AI?

Quick Answer

The data you need depends on your AI use case. For retrieval-augmented generation (RAG) systems, you need a well-organised knowledge base of documents, policies, or procedures. For predictive models, you need historical records with clear outcome labels. Quality matters more than quantity: clean, consistent, representative data produces better results than large volumes of messy data.

Key takeaways

  • Data requirements vary significantly by AI use case and approach
  • Quality and relevance matter more than raw volume for most applications
  • RAG-based systems can work with existing document repositories
  • Data preparation typically consumes 40-60% of total AI project effort

Data Requirements by AI Application Type

Different AI applications have different data requirements. RAG systems, which power knowledge bases and intelligent assistants, work with your existing documents, policies, and procedures. They require well-structured, up-to-date content but do not need labelled training data. Document processing AI needs a representative sample of the documents you want to process, typically 100 to 500 examples covering the variety you encounter. Predictive models need historical data with clear outcomes, such as past sales figures for demand forecasting or historical maintenance records for predictive maintenance. Classification and categorisation systems need labelled examples showing the categories you want the AI to recognise. Text generation and summarisation can leverage pre-trained models with minimal custom data, especially when using prompt engineering techniques.

Data Quality Over Quantity

A common misconception about AI data is that you need enormous volumes. In practice, data quality matters far more: clean, accurate, and representative data produces better results than larger volumes of noisy, incomplete, or biased data. The key quality dimensions are accuracy (data correctly represents reality), completeness (few missing values and gaps), consistency (uniform formats and standards), timeliness (data reflects current conditions), and representativeness (data covers the full range of scenarios the AI will encounter). Before starting an AI project, conduct a data audit to assess these dimensions and identify the remediation work needed.
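As an illustration, the sketch below shows how a simple audit might report the share of records with issues in each dimension. It is a minimal Python example under stated assumptions: the sample records, field names, allowed country codes, and one-year freshness threshold are all hypothetical. It checks only completeness, consistency, and timeliness, because accuracy and representativeness usually need a trusted reference or a domain review to assess.

```python
from datetime import date

# Hypothetical sample records; the field names are illustrative assumptions.
RECORDS = [
    {"id": 1, "amount": "120.50", "country": "GB", "updated": date(2024, 5, 1)},
    {"id": 2, "amount": None, "country": "GB", "updated": date(2024, 4, 12)},
    {"id": 3, "amount": "75.00", "country": "United Kingdom", "updated": date(2019, 1, 3)},
    {"id": 4, "amount": "40.00", "country": "FR", "updated": date(2024, 3, 20)},
]


def audit(records, required, allowed_countries, as_of, max_age_days):
    """Return the percentage of records that fail each quality check."""
    issues = {"completeness": 0, "consistency": 0, "timeliness": 0}
    total = len(records)
    if total == 0:
        return {name: 0.0 for name in issues}
    for record in records:
        # Completeness: any required field missing or empty.
        if any(record.get(field) in (None, "") for field in required):
            issues["completeness"] += 1
        # Consistency: country not written as one of the agreed codes.
        if record.get("country") not in allowed_countries:
            issues["consistency"] += 1
        # Timeliness: record not updated within the allowed window.
        if (as_of - record["updated"]).days > max_age_days:
            issues["timeliness"] += 1
    return {name: round(100 * count / total, 1) for name, count in issues.items()}


report = audit(
    RECORDS,
    required=["amount", "country"],
    allowed_countries={"GB", "FR"},
    as_of=date(2024, 6, 1),
    max_age_days=365,
)
print(report)
```

In this sample, one record in four fails each check, so every dimension reports 25.0. On real data, the checks and thresholds would need to reflect your own formats and business rules.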

Frequently asked questions

AI can tolerate some data quality issues, but performance degrades with poor data. Data cleaning and preparation should be factored into any AI project plan. Some approaches like RAG are more tolerant of imperfect data than traditional machine learning models.

The amount of data you need varies by use case. RAG systems can start with a few hundred documents. Simple classification may need 50 to 100 labelled examples per category. Complex predictive models may require thousands of historical records. Start with what you have and assess whether more is needed.
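For classification, a quick way to assess whether more data is needed is to count labelled examples per category. The sketch below is a minimal Python illustration. The function name, the sample labels, and the default minimum of 50 (the lower end of the range mentioned above) are assumptions, not fixed rules.

```python
from collections import Counter


def categories_below_minimum(labels, minimum=50):
    """Return categories with fewer labelled examples than the minimum."""
    counts = Counter(labels)
    return {category: n for category, n in sorted(counts.items()) if n < minimum}


# Hypothetical label set for a document classifier.
labels = ["invoice"] * 120 + ["contract"] * 35 + ["receipt"] * 60

short = categories_below_minimum(labels)
print(short)  # {'contract': 35}
```

Here only the "contract" category falls short, which suggests where extra labelling effort would be best spent.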

You do not necessarily need in-house data engineers for initial projects. AI consultancies typically handle data preparation as part of the engagement. For ongoing AI operations, having someone with data engineering skills becomes important to maintain data pipelines and quality.

Many AI approaches work with limited data. RAG systems work with existing documents, pre-trained models require minimal custom data, and few-shot learning techniques can deliver results from small numbers of examples.

To assess whether your data is ready, audit a representative sample across five dimensions: accuracy, completeness, consistency, timeliness, and representativeness. Identify the percentage of records with issues in each dimension. This assessment informs the data preparation effort needed before starting an AI project.
