AI Model Evaluation Template
A systematic framework for evaluating and comparing AI models across accuracy, cost, latency, and fairness dimensions. Helps engineering teams make evidence-based model selection decisions rather than relying on general benchmarks or marketing claims.
Overview
What's included
Evaluation Setup
- Use case:
- Evaluation date:
- Evaluator(s):
Models Being Evaluated
| Model | Provider | Version | API Endpoint | Notes |
|---|---|---|---|---|
| | | | | |
| | | | | |
| | | | | |
Evaluation Criteria
Rank these criteria by importance for your use case (1 = most important):
| # | Criterion | Weight | Minimum Threshold |
|---|---|---|---|
| | Output quality / accuracy | % | |
| | Latency (response time) | % | < ms (p95) |
| | Cost per request | % | < £ /1k requests |
| | Context window size | % | > tokens |
| | Instruction following | % | |
| | Safety / content filtering | % | |
| | Multilingual support | % | |
| | Fine-tuning availability | % | |
| | Total | 100% | |
Test Dataset
- Size: test cases
- Source: (production samples / manually crafted / synthetic)
- Categories covered:
- Happy path scenarios
- Edge cases
- Adversarial inputs
- Domain-specific terminology
- Multi-step reasoning
- Ground truth: (human-labelled / expert-reviewed)
Quality & Accuracy Assessment
Scoring Rubric
For each test case, score the model output on a 1-5 scale:
| Score | Label | Definition |
|---|---|---|
| 5 | Excellent | Correct, complete, well-formatted, no issues |
| 4 | Good | Correct with minor formatting or verbosity issues |
| 3 | Acceptable | Mostly correct but missing some detail or slightly off |
| 2 | Poor | Partially correct but significant errors or omissions |
| 1 | Failed | Incorrect, hallucinated, or did not follow instructions |
Quality Results
| Test Category | # Cases | Model A Avg | Model B Avg | Model C Avg |
|---|---|---|---|---|
| Happy path | | /5 | /5 | /5 |
| Edge cases | | /5 | /5 | /5 |
| Adversarial | | /5 | /5 | /5 |
| Domain-specific | | /5 | /5 | /5 |
| Multi-step | | /5 | /5 | /5 |
| Overall | ___ | ___/5 | ___/5 | ___/5 |
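Once each test case has a rubric score, the per-category averages above are simple means. A minimal sketch (scores and model names are hypothetical placeholders, not real results):

```python
from statistics import mean

# Hypothetical per-case rubric scores (1-5), grouped by category and model.
scores = {
    "happy_path": {"model_a": [5, 4, 5], "model_b": [4, 4, 3]},
    "edge_cases": {"model_a": [3, 2, 4], "model_b": [4, 3, 3]},
}

def category_averages(scores):
    """Average rubric score per (category, model), rounded to 2 dp."""
    return {
        cat: {model: round(mean(vals), 2) for model, vals in models.items()}
        for cat, models in scores.items()
    }

avgs = category_averages(scores)
print(avgs["happy_path"]["model_a"])  # 4.67
```

The overall row is the mean across all cases, not the mean of category averages, unless every category has the same number of cases.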
Error Analysis
For each model, categorise the errors found:
| Error Type | Model A | Model B | Model C |
|---|---|---|---|
| Hallucination (made up facts) | cases | cases | cases |
| Instruction not followed | cases | cases | cases |
| Incomplete response | cases | cases | cases |
| Format/structure error | cases | cases | cases |
| Factual error | cases | cases | cases |
| Refused valid request | cases | cases | cases |
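Tallying error categories is a one-liner with `collections.Counter`; the labels below are hypothetical examples of what a reviewer might assign:

```python
from collections import Counter

# Hypothetical error labels assigned while reviewing Model A's failed cases.
model_a_errors = [
    "hallucination", "format_error", "hallucination",
    "incomplete", "instruction_not_followed",
]

tally = Counter(model_a_errors)
print(tally["hallucination"])  # 2
```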
Cost & Latency Benchmarks
Pricing Comparison
| Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Avg Tokens/Request | Cost per Request |
|---|---|---|---|---|
| | £ | £ | In: Out: | £ |
| | £ | £ | In: Out: | £ |
| | £ | £ | In: Out: | £ |
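The last column and the monthly projection below both follow from the per-1M-token prices. A quick sketch, using illustrative prices and token counts only:

```python
def cost_per_request(in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Blended cost of one request, with prices quoted per 1M tokens."""
    return (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000

def monthly_cost(per_request, requests_per_day, days=30):
    """Projected monthly spend at a steady daily volume."""
    return per_request * requests_per_day * days

# Illustrative figures: £2.50/1M input, £10.00/1M output, 1,200 in / 400 out tokens.
c = cost_per_request(1_200, 400, 2.50, 10.00)
print(round(c, 4))                         # 0.007 (i.e. £7 per 1k requests)
print(round(monthly_cost(c, 10_000), 2))   # 2100.0
```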
Monthly Cost Projection
| Volume | Model A | Model B | Model C |
|---|---|---|---|
| 1,000 requests/day | £ /month | £ /month | £ /month |
| 10,000 requests/day | £ /month | £ /month | £ /month |
| 100,000 requests/day | £ /month | £ /month | £ /month |
Latency Results
Measured over ___ requests per model:
| Metric | Model A | Model B | Model C |
|---|---|---|---|
| p50 latency | ms | ms | ms |
| p95 latency | ms | ms | ms |
| p99 latency | ms | ms | ms |
| Time to first token | ms | ms | ms |
| Throughput (tokens/sec) | | | |
| Error rate | % | % | % |
| Rate limit (requests/min) | | | |
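The p50/p95/p99 rows can be computed from raw latency samples with the standard library. A minimal sketch over synthetic measurements:

```python
import statistics

def latency_percentiles(samples_ms):
    """p50/p95/p99 latency from raw samples, via inclusive quantiles."""
    qs = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

# 100 hypothetical latency samples from 100 ms to 1090 ms.
samples = list(range(100, 1100, 10))
p = latency_percentiles(samples)
print(p["p50"])  # 595.0
```

Tail percentiles are noisy at small sample sizes; for a stable p99 you need well over 100 requests per model.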
Final Model Scorecard
Weighted Comparison
| Criterion | Weight | Model A Score | Model A Weighted | Model B Score | Model B Weighted | Model C Score | Model C Weighted |
|---|---|---|---|---|---|---|---|
| Quality/Accuracy | % | /5 | | /5 | | /5 | |
| Latency | % | /5 | | /5 | | /5 | |
| Cost | % | /5 | | /5 | | /5 | |
| Context window | % | /5 | | /5 | | /5 | |
| Safety | % | /5 | | /5 | | /5 | |
| Total | 100% | | ___ | | ___ | | ___ |
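Each weighted cell is score x weight, and the total is their sum. A sketch with placeholder weights and scores (not real model results):

```python
# Weights are fractions summing to 1.0; scores use the 1-5 rubric above.
weights = {"quality": 0.40, "latency": 0.20, "cost": 0.20,
           "context": 0.10, "safety": 0.10}

model_scores = {
    "model_a": {"quality": 4.2, "latency": 3.0, "cost": 2.5,
                "context": 4.0, "safety": 5.0},
}

def weighted_total(scores, weights):
    """Sum of score * weight across criteria; weights must total 100%."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 100%"
    return round(sum(scores[c] * w for c, w in weights.items()), 2)

print(weighted_total(model_scores["model_a"], weights))  # 3.68
```

Apply minimum thresholds first: a model that fails a hard threshold is eliminated regardless of its weighted total.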
Recommendation
Selected model:
Rationale:
Caveats / Conditions:
Re-evaluation triggers: re-evaluate if:
- Accuracy drops below % in production
- A new model version is released by your provider
- Monthly cost exceeds £
- Requirements change significantly
Instructions
How to use this template
Build a representative test dataset
Use real production queries or craft synthetic ones that cover your full range of use cases. Include easy, moderate, and hard examples.
Define pass/fail criteria upfront
Set minimum thresholds for quality, latency, and cost before running evaluations. This prevents rationalising poor results after the fact.
Run evaluations in parallel
Send the same test cases to all models simultaneously to ensure fair comparison. Control for variables like time of day and API load.
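A parallel run can be sketched with a thread pool; `call_model` below is a hypothetical stand-in for your actual API client, not a real library call:

```python
from concurrent.futures import ThreadPoolExecutor

def call_model(model, prompt):
    """Hypothetical placeholder for an API call; replace with your client."""
    return f"{model}: response to {prompt!r}"

def evaluate_in_parallel(models, test_cases, max_workers=8):
    """Send every test case to every model concurrently, keyed by (model, case index)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            (m, i): pool.submit(call_model, m, case)
            for m in models
            for i, case in enumerate(test_cases)
        }
        return {key: f.result() for key, f in futures.items()}

results = evaluate_in_parallel(["model_a", "model_b"], ["What is 2+2?"])
print(len(results))  # 2
```

In a real run, cap `max_workers` below each provider's rate limit so that throttling does not skew the latency comparison.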
Have multiple evaluators score quality
Use at least 2 human evaluators (or a combination of human and LLM-as-judge) to reduce scoring bias. Average scores across evaluators.
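Averaging across evaluators, and flagging cases where they disagree sharply, might look like this (scores are hypothetical):

```python
from statistics import mean

# Hypothetical scores from two human evaluators and one LLM judge (1-5 rubric).
evaluator_scores = {"human_1": 4, "human_2": 5, "llm_judge": 4}

avg = round(mean(evaluator_scores.values()), 2)
spread = max(evaluator_scores.values()) - min(evaluator_scores.values())

print(avg)  # 4.33
if spread >= 2:  # large disagreement: send the case back for discussion
    print("Re-score after discussion")
```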
Watch Out
Common mistakes to avoid
FAQ
Frequently asked questions
How many test cases do I need?
Aim for at least 50-100 test cases, with more for critical use cases. The cases should cover the full distribution of your production queries, including edge cases and adversarial inputs.
Should I use human evaluators or LLM-as-judge?
Both. Use LLM-as-judge (e.g. GPT-4 evaluating model outputs) for scalable initial screening and human evaluation for final quality assessment and calibration. LLM judges are cheaper and faster but can miss subtle quality issues.
How often should I re-evaluate?
Re-evaluate when: a new model version is released by your provider, a major competitor launches a new model, your use case changes significantly, or production metrics show degradation. At minimum, plan a quarterly review.
What if no single model meets all my criteria?
Consider a routing approach: use a cheaper/faster model for simple requests and a more capable model for complex ones. Alternatively, fine-tune a smaller model for your specific domain to close performance gaps.
Should I include open-source models in the comparison?
Yes, if you have the infrastructure to deploy them. Open-source models can offer better cost economics at scale and more control over data privacy. Include total cost of ownership (hosting, maintenance) in the comparison.