AI Data Audit Template
A structured template for auditing the quality, completeness, and governance of data that will feed AI models. Helps data teams identify and remediate quality issues before they become model performance problems.
Data Inventory
Audit scope:
Audit date:
Auditor(s):
AI use case:
Data Sources
| # | Dataset Name | Source System | Format | Size | Records | Update Frequency | Owner |
|---|---|---|---|---|---|---|---|
| 1 | | | CSV/DB/API/JSON | GB | rows | Real-time/Daily/Weekly/Static | |
| 2 | | | | GB | rows | | |
| 3 | | | | GB | rows | | |
| 4 | | | | GB | rows | | |
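To fill in the Size and Records columns, a quick profiling pass over each file-based source saves manual counting. A minimal sketch for local CSV files (the function name and the CSV assumption are illustrative, not part of the template):

```python
import csv
import os

def profile_csv(path):
    """Return (size_in_bytes, record_count) for a CSV file.

    Assumes the first row is a header, so it is excluded
    from the record count.
    """
    size = os.path.getsize(path)
    with open(path, newline="") as f:
        records = sum(1 for _ in csv.reader(f)) - 1
    return size, records
```

Database- or API-backed datasets would need an equivalent query (e.g. `SELECT COUNT(*)`) instead.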
Data Lineage
For each dataset, document where the data comes from and how it is transformed:
Dataset 1:
- Origin:
- Transformations applied:
- Joins/merges with:
- Known limitations:
Data Access
| Dataset | Access Method | Authentication | Latency | Rate Limits |
|---|---|---|---|---|
| | API / DB query / File | | ms | |
| | | | ms | |
Data Quality Assessment
Rate each dimension from 1 (Poor) to 5 (Excellent) for each dataset:
Completeness
Are all expected records and fields present?
| Dataset | Total Records | Expected Records | Missing Records (%) | Null Fields (%) | Score (1-5) |
|---|---|---|---|---|---|
| | | | % | % | |
| | | | % | % | |
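The two percentages in this table can be computed directly. A minimal sketch in plain Python, assuming records are dicts and `required_fields` lists the fields that must be populated (both names are illustrative):

```python
def completeness_metrics(records, expected_count, required_fields):
    """Compute missing-record % and null-field % for the completeness table."""
    missing_pct = max(0.0, (expected_count - len(records)) / expected_count * 100)
    # Count cells that are empty or None across the required fields
    null_cells = sum(
        1 for row in records for field in required_fields
        if row.get(field) in (None, "")
    )
    total_cells = len(records) * len(required_fields)
    null_pct = null_cells / total_cells * 100 if total_cells else 0.0
    return missing_pct, null_pct
```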
Accuracy
Are the values correct and trustworthy?
| Dataset | Sample Size | Errors Found | Error Rate (%) | Validation Method | Score (1-5) |
|---|---|---|---|---|---|
| | | | % | | |
| | | | % | | |
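Error rate is usually estimated from a random sample rather than the full dataset. A hedged sketch, where `validate` is any record-level rule you define (the function name and the example rule are illustrative):

```python
import random

def error_rate(records, validate, sample_size, seed=0):
    """Estimate the error rate (%) by validating a random sample of records."""
    rng = random.Random(seed)  # fixed seed so the audit is reproducible
    sample = rng.sample(records, min(sample_size, len(records)))
    errors = sum(1 for record in sample if not validate(record))
    return errors / len(sample) * 100
```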
Consistency
Are values consistent across systems and over time?
| Dataset | Duplicate Records (%) | Format Inconsistencies | Cross-system Discrepancies | Score (1-5) |
|---|---|---|---|---|
| | % | | | |
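Duplicate rate on a chosen key is straightforward to measure. A minimal sketch, assuming records are dicts and `key_fields` names the columns that should uniquely identify a record (both names are illustrative):

```python
def duplicate_rate(records, key_fields):
    """Percentage of records whose key fields repeat an earlier record."""
    seen, dupes = set(), 0
    for row in records:
        key = tuple(row.get(field) for field in key_fields)
        if key in seen:
            dupes += 1
        else:
            seen.add(key)
    return dupes / len(records) * 100 if records else 0.0
```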
Timeliness
Is data fresh enough for the AI use case?
| Dataset | Required Freshness | Actual Freshness | Lag | Score (1-5) |
|---|---|---|---|---|
| | < hours | hours | hours | |
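Lag can be computed from each dataset's last-update timestamp using the standard library. A minimal sketch (function names are illustrative):

```python
from datetime import datetime, timezone

def freshness_lag_hours(last_updated, now=None):
    """Hours elapsed since the dataset was last updated."""
    now = now or datetime.now(timezone.utc)
    return (now - last_updated).total_seconds() / 3600

def meets_freshness(last_updated, required_hours, now=None):
    """True if the dataset is fresh enough for the stated requirement."""
    return freshness_lag_hours(last_updated, now) <= required_hours
```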
Governance
Is data properly governed and documented?
| Dataset | Owner Defined | Documentation | Privacy Classification | Consent Basis | Score (1-5) |
|---|---|---|---|---|---|
| | Yes/No | Complete/Partial/None | Public/Internal/Confidential/Personal | | |
Remediation Plan
Issues Found
| # | Dataset | Issue | Severity | Impact on AI | Remediation Action | Owner | Deadline | Status |
|---|---|---|---|---|---|---|---|---|
| 1 | | | Critical/High/Medium/Low | | | | | Open |
| 2 | | | | | | | | Open |
| 3 | | | | | | | | Open |
| 4 | | | | | | | | Open |
| 5 | | | | | | | | Open |
Ongoing Monitoring
| Check | Frequency | Automated? | Alert Threshold | Owner |
|---|---|---|---|---|
| Completeness (null rate) | Daily | Yes/No | > % nulls | |
| Freshness (data lag) | Hourly | Yes/No | > hours | |
| Volume (record count) | Daily | Yes/No | +/- % from baseline | |
| Schema changes | On deployment | Yes/No | Any change | |
| Duplicate rate | Weekly | Yes/No | > % | |
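The checks in this table reduce to a single threshold comparison once the metrics are collected. A minimal sketch, where `metrics` holds observed values and `thresholds` mirrors the Alert Threshold column (all names are illustrative):

```python
def run_monitoring_checks(metrics, thresholds):
    """Compare observed metrics against alert thresholds.

    Both arguments are dicts keyed by check name with numeric values.
    Returns the list of checks that breached their threshold.
    """
    alerts = []
    for check, limit in thresholds.items():
        observed = metrics.get(check)
        if observed is not None and observed > limit:
            alerts.append(check)
    return alerts

def volume_deviation_pct(record_count, baseline):
    """Absolute % deviation of today's record count from the baseline."""
    return abs(record_count - baseline) / baseline * 100
```

In practice these would run on a scheduler (daily/hourly per the table) and push alerts to the named owner.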
Audit Summary
| Dimension | Average Score | Status |
|---|---|---|
| Completeness | /5 | Red/Amber/Green |
| Accuracy | /5 | Red/Amber/Green |
| Consistency | /5 | Red/Amber/Green |
| Timeliness | /5 | Red/Amber/Green |
| Governance | /5 | Red/Amber/Green |
| Overall | /5 | Red/Amber/Green |
Recommendation: Proceed / Proceed with conditions / Remediate first
Conditions (if applicable):
How to use this template
Inventory all data sources
List every dataset that will feed the AI model. Include source system, format, size, and update frequency.
Assess quality across five dimensions
Work through completeness, accuracy, consistency, timeliness, and governance for each dataset. Use quantitative measures wherever possible.
Prioritise issues by AI impact
Not all quality issues affect AI equally. Focus remediation on issues that will directly impact model performance.
Set up automated monitoring
Implement automated quality checks that run daily. Catching issues early prevents model degradation in production.
Frequently asked questions
**How long does a data audit take?**
A focused audit for a single AI use case typically takes 1-2 weeks. An organisation-wide data quality audit can take 4-8 weeks depending on the number of data sources.

**How good does the data need to be?**
There is no universal threshold, but aim for: less than 5% missing values in critical fields, less than 2% error rate, and consistent formatting. The required level depends on your use case — medical AI needs near-perfect data; a recommendation engine can tolerate more noise.

**Should we fix data at the source or clean it downstream?**
Fix the source whenever possible. Cleaning data downstream is a temporary fix that needs to be repeated with every data refresh. Improving data quality at the source provides permanent benefits.

**How do we audit unstructured data?**
For unstructured data, focus on: completeness (are all expected documents present?), quality (are documents readable and not corrupted?), metadata accuracy (are labels and tags correct?), and representativeness (does the data cover the full range of expected inputs?).

**Which parts of the audit can be automated?**
Many checks can be automated: null rates, duplicate detection, schema validation, freshness monitoring, and statistical distribution checks. Human review is still needed for accuracy assessment and governance evaluation.
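As an example of the automatable checks mentioned above, a minimal schema-validation sketch in plain Python (the function name and the `{field: type}` schema format are illustrative):

```python
def validate_schema(records, schema):
    """Check each record against an expected {field: type} schema.

    Returns a list of (row_index, field, problem) tuples.
    """
    problems = []
    for i, row in enumerate(records):
        for field, expected_type in schema.items():
            if field not in row:
                problems.append((i, field, "missing"))
            elif not isinstance(row[field], expected_type):
                problems.append((i, field, "wrong type"))
    return problems
```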