What is AI inference?
Quick Answer
AI inference is the process of using a trained model to generate outputs (predictions, classifications, text, or decisions) from new input data. It is the production phase of AI, where models deliver value, as opposed to training, where models learn. Inference cost, speed, and reliability are critical production concerns that directly affect your AI system's operational effectiveness and total cost of ownership.
Summary
Key takeaways
- Inference is the production phase where trained models generate useful outputs
- Speed, cost, and reliability are the key operational concerns for inference
- Inference can run on cloud APIs, dedicated servers, or edge devices
- Optimisation techniques can significantly reduce inference costs and latency
Understanding AI Inference
Optimising Inference for Production
FAQ
Frequently asked questions
How much does AI inference cost?
Cloud API inference costs vary by model: GPT-4o costs approximately $2.50 per million input tokens, while smaller models cost 10-50x less. Local inference has no per-query cost but requires hardware investment. Volume discounts and model optimisation can significantly reduce costs.
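To make per-token pricing concrete, here is a minimal sketch of estimating a monthly API bill from query volume. The prices and token counts are illustrative assumptions, not current vendor pricing; substitute your provider's published rates.

```python
# Hedged sketch: estimate a monthly cloud API inference bill.
# Prices are assumed placeholders, not actual vendor pricing.
def monthly_cost_usd(queries_per_day: int,
                     input_tokens: int,
                     output_tokens: int,
                     price_in_per_m: float = 2.50,    # $/1M input tokens (assumption)
                     price_out_per_m: float = 10.00   # $/1M output tokens (assumption)
                     ) -> float:
    # Cost of a single query, then scale to a 30-day month.
    per_query = (input_tokens * price_in_per_m +
                 output_tokens * price_out_per_m) / 1_000_000
    return round(per_query * queries_per_day * 30, 2)

print(monthly_cost_usd(1_000, 500, 300))  # 1,000 queries/day, 500 in + 300 out tokens
```

At these assumed rates, output tokens dominate the bill, which is why trimming response length is often the quickest cost lever.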
How fast is AI inference?
Large language model inference typically takes 1 to 30 seconds depending on model size and output length. Classification and prediction models respond in milliseconds. Smaller models and optimised deployments reduce latency significantly.
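The wide 1-30 second range follows from how autoregressive models generate text one token at a time. A rough back-of-envelope model, under the assumption of a fixed time-to-first-token plus a constant decode speed, looks like this:

```python
# Hedged latency sketch: total time ≈ time-to-first-token + tokens / decode speed.
# Both parameters are assumptions that vary widely by model and hardware.
def estimated_latency_s(output_tokens: int,
                        tokens_per_second: float,
                        ttft_s: float = 0.5) -> float:
    return round(ttft_s + output_tokens / tokens_per_second, 2)

print(estimated_latency_s(400, 50))  # a 400-token answer at 50 tokens/s
```

This is why short classifications return almost instantly while long generated answers take many seconds: latency scales with output length, not just model size.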
Can I run AI inference on my own hardware?
Yes. Open-source models can run on your own GPUs using tools like vLLM, Ollama, or text-generation-inference. This provides data privacy and predictable costs at scale, though it requires hardware investment and technical management.
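Whether self-hosting pays off depends on volume. A minimal break-even sketch, assuming a flat monthly hardware cost and a known per-query API price (both placeholder figures):

```python
import math

# Hedged sketch: query volume above which owning hardware beats per-query
# API pricing. Both inputs are assumed figures to be replaced with real ones.
def breakeven_queries_per_month(hardware_monthly_usd: float,
                                api_cost_per_query_usd: float) -> int:
    """Queries per month above which local inference is cheaper."""
    return math.ceil(hardware_monthly_usd / api_cost_per_query_usd)

# e.g. an assumed $800/month GPU server vs. $0.004 per API query
print(breakeven_queries_per_month(800, 0.004))
```

Below the break-even volume, pay-per-token APIs are usually cheaper; above it, self-hosting with vLLM or Ollama starts to win, before accounting for engineering time.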
What is the difference between training and inference?
Training is the process of teaching a model by adjusting its parameters using large datasets, which is computationally expensive and done infrequently. Inference is the process of using the trained model to generate outputs from new inputs, which happens with every user interaction.
Can inference costs be reduced?
Yes. Techniques include caching frequent responses, batching requests, applying model quantisation to reduce computation, compressing prompts, and routing simple queries to smaller models. Together these optimisations can reduce costs by 30-70%, with little or no effect on output quality.
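Two of the optimisations above, response caching and model routing, can be sketched in a few lines. The `call_small_model` and `call_large_model` functions here are hypothetical stand-ins for real API calls, and the word-count routing rule is a deliberately crude heuristic:

```python
from functools import lru_cache

# Hypothetical stand-ins for real inference calls (assumptions, not real APIs).
def call_small_model(prompt: str) -> str:
    return f"small:{prompt}"

def call_large_model(prompt: str) -> str:
    return f"large:{prompt}"

@lru_cache(maxsize=1024)          # exact-match caching of frequent prompts
def answer(prompt: str) -> str:
    # Crude routing heuristic: send short prompts to the cheaper model.
    if len(prompt.split()) <= 20:
        return call_small_model(prompt)
    return call_large_model(prompt)
```

Production systems typically use semantic (embedding-based) caches and learned routers rather than exact string matching and word counts, but the cost-saving structure is the same: answer repeats for free, and reserve the expensive model for queries that need it.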
Have more questions about AI?
Our team can help you navigate the AI landscape. Book a free strategy call.