What data do I need for AI?
Quick Answer
The data you need depends on your AI use case. For retrieval-augmented generation (RAG) systems, you need a well-organised knowledge base of documents, policies, or procedures. For predictive models, you need historical records with clear outcome labels. Quality matters more than quantity: clean, consistent, representative data produces better results than large volumes of messy data.
Summary
Key takeaways
- Data requirements vary significantly by AI use case and approach
- Quality and relevance matter more than raw volume for most applications
- RAG-based systems can work with existing document repositories
- Data preparation typically consumes 40-60% of total AI project effort
Data Requirements by AI Application Type
Data Quality Over Quantity
FAQ
AI can tolerate some data quality issues, but performance degrades with poor data. Data cleaning and preparation should be factored into any AI project plan. Some approaches like RAG are more tolerant of imperfect data than traditional machine learning models.
The amount of data you need varies by use case. RAG systems can start with a few hundred documents. Simple classification may need 50 to 100 labelled examples per category. Complex predictive models may require thousands of historical records. Start with what you have and assess whether more is needed.
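As a rough illustration of the per-category guideline for classification, the sketch below counts labelled examples and flags categories that fall under the lower end of the 50 to 100 range. The category names, the counts, and the `underrepresented` helper are hypothetical, and the threshold is a starting point rather than a hard rule.

```python
from collections import Counter

# Lower end of the "50 to 100 labelled examples per category" guideline.
MIN_PER_CATEGORY = 50


def underrepresented(labels, minimum=MIN_PER_CATEGORY):
    """Return the categories with fewer than `minimum` labelled examples."""
    counts = Counter(labels)
    return {category: n for category, n in counts.items() if n < minimum}


# Hypothetical label column from a small support-ticket dataset.
labels = ["invoice"] * 120 + ["complaint"] * 35 + ["enquiry"] * 60

gaps = underrepresented(labels)
print(gaps)  # categories that likely need more labelled examples
```

A check like this only measures volume. It says nothing about whether the examples in each category are clean or representative.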
You do not necessarily need in-house data engineering skills for initial projects. AI consultancies typically handle data preparation as part of the engagement. For ongoing AI operations, having someone with data engineering skills becomes important to maintain data pipelines and quality.
Many AI approaches work with limited data. RAG systems work with existing documents, pre-trained models require minimal custom data, and few-shot learning techniques can deliver results from small numbers of examples. Start with what you have and assess whether more is needed.
Audit a representative sample across five dimensions: accuracy, completeness, consistency, timeliness, and relevance. Identify the percentage of records with issues in each dimension. This assessment informs the data preparation effort needed before starting an AI project.
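One way to run such an audit is sketched below: each dimension gets a rule that flags a record as having an issue, and the report gives the percentage of flagged records per dimension. The record fields, the sample values, and the individual rules are assumptions made for illustration. Real rules depend on your data and use case.

```python
from datetime import date

# Hypothetical sample of records; field names are illustrative only.
records = [
    {"email": "a@example.com", "country": "UK", "updated": date(2024, 5, 1), "type": "customer"},
    {"email": "", "country": "United Kingdom", "updated": date(2019, 1, 15), "type": "customer"},
    {"email": "c@example.com", "country": "UK", "updated": date(2024, 3, 9), "type": "test"},
    {"email": "not-an-email", "country": "UK", "updated": date(2021, 11, 2), "type": "customer"},
]

# One assumed rule per dimension; each returns True when the record has an issue.
checks = {
    "accuracy": lambda r: r["email"] != "" and "@" not in r["email"],
    "completeness": lambda r: r["email"] == "",
    "consistency": lambda r: r["country"] != "UK",  # mixed country spellings
    "timeliness": lambda r: r["updated"] < date(2022, 1, 1),
    "relevance": lambda r: r["type"] != "customer",
}


def audit(records, checks):
    """Return the percentage of records with an issue in each dimension."""
    total = len(records)
    if total == 0:
        return {dimension: 0.0 for dimension in checks}
    return {
        dimension: 100.0 * sum(1 for r in records if has_issue(r)) / total
        for dimension, has_issue in checks.items()
    }


report = audit(records, checks)
for dimension, pct in report.items():
    print(f"{dimension}: {pct:.0f}% of sampled records have issues")
```

The percentages point to where preparation effort is likely to concentrate. In this made-up sample, timeliness is the weakest dimension.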