Implementation Guides
Step-by-step guides to optimize your AI costs and performance
Each guide includes detailed instructions, code examples, and best practices to help you implement cost-saving strategies.
Edge Proxy
Learn how to implement an edge proxy for AI APIs: route traffic, balance load, enforce policies, and cut latency. This Onaro™ guide covers architecture patterns, provider configuration, and safe rollout for high-volume OpenAI and Anthropic workloads.
Best for: Organizations with >100K API calls/month
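The core of an edge proxy is the routing decision: map an incoming request to a provider pool, then pick a backend by weight. A minimal sketch of that decision, assuming hypothetical backend URLs and weights (the names below are illustrative, not from this guide):

```python
import random

# Hypothetical backend pools with load-balancing weights; URLs are
# placeholders, not real endpoints.
BACKENDS = {
    "openai": [("https://oai-edge-1.example.com", 3),
               ("https://oai-edge-2.example.com", 1)],
    "anthropic": [("https://anthropic-edge-1.example.com", 1)],
}

def route(model: str) -> str:
    """Pick a provider pool from the model name, then a backend by weight."""
    provider = "anthropic" if model.startswith("claude") else "openai"
    pool = BACKENDS[provider]
    urls = [url for url, _ in pool]
    weights = [weight for _, weight in pool]
    return random.choices(urls, weights=weights, k=1)[0]
```

In a real deployment this function would sit in front of retry, policy, and logging layers; weighted random selection is one of several balancing strategies (round-robin and least-connections are common alternatives).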
Circuit Breakers
Add circuit breakers around LLM calls to stop cascading failures, shed load during outages, and avoid runaway spend when APIs degrade. Step-by-step patterns for retries, fallbacks, and observability in production AI systems.
Best for: Production systems with high availability requirements
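The classic pattern has three states: closed (calls pass through), open (calls fail fast), and half-open (one trial call after a cooldown). A minimal sketch, with thresholds chosen for illustration only:

```python
import time

class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, rejects calls for
    `reset_timeout` seconds, then allows a trial call (half-open)."""

    def __init__(self, max_failures: int = 3, reset_timeout: float = 30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

Wrapping each provider's client in its own breaker keeps one degraded API from consuming the retry budget (and spend) of the whole system.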
Semantic Caching
Implement semantic caching so similar prompts hit a cache instead of the model—often cutting API cost dramatically. Covers embeddings, similarity thresholds, invalidation, and when caching is safe for your use case.
Best for: Applications with repetitive or similar queries
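The lookup itself is simple: embed the incoming prompt and return a stored response if any cached embedding is within a similarity threshold. A sketch using cosine similarity, with the embedding function left pluggable (in practice it would call a provider's embeddings API; the 0.9 threshold is an assumption you should tune):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    """Return a cached response when a new prompt's embedding is within
    `threshold` cosine similarity of a stored one."""

    def __init__(self, embed, threshold: float = 0.9):
        self.embed = embed          # caller-supplied embedding function
        self.threshold = threshold
        self.entries = []           # list of (embedding, response)

    def get(self, prompt):
        vec = self.embed(prompt)
        for stored_vec, response in self.entries:
            if cosine(vec, stored_vec) >= self.threshold:
                return response
        return None                 # cache miss: call the model

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))
```

A linear scan is fine for small caches; at scale you would swap in an approximate-nearest-neighbor index, and invalidation policy matters as much as the threshold.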
Model Switching
Route tasks to the right model tier: cheap models for simple work, premium models where quality matters. Practical routing rules, evaluation tips, and examples to lower spend without surprising regressions.
Best for: Multi-task AI applications
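A routing rule can be as simple as a lookup on task type plus a length heuristic. A sketch with illustrative model names and thresholds (not recommendations from this guide):

```python
def pick_model(task: str, prompt: str) -> str:
    """Route to a model tier by task type and prompt length.
    Tier names and the 4000-char threshold are illustrative only."""
    CHEAP, MID, PREMIUM = "gpt-4o-mini", "gpt-4o", "o3"
    if task in {"classification", "extraction", "routing"}:
        return CHEAP          # simple, well-scoped work
    if task in {"code", "analysis"} or len(prompt) > 4000:
        return PREMIUM        # quality-sensitive or long-context work
    return MID                # default tier
```

The key discipline is evaluation: run a sample of each task type through both tiers before committing a route, so the cheaper model's regressions are measured rather than discovered.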
Prompt Compression
Compress prompts and context to cut token usage by 30–50% while preserving answer quality: summarization, structured extraction, trimming policies, and measurement so savings show up in your real traffic.
Best for: Applications with long context windows
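One of the simplest trimming policies is a token budget on conversation history: keep the system message and the newest turns that fit. A sketch, using a rough chars/4 estimate in place of a real tokenizer (an assumption; production code should count tokens with the provider's tokenizer):

```python
def compress_history(messages, budget,
                     estimate=lambda m: len(m["content"]) // 4):
    """Keep system messages plus the most recent turns that fit `budget`
    estimated tokens, dropping the oldest turns first."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept, used = [], sum(estimate(m) for m in system)
    for msg in reversed(rest):        # walk newest-first
        cost = estimate(msg)
        if used + cost > budget:
            break                     # oldest remaining turns are dropped
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))  # restore chronological order
```

Summarizing the dropped turns into a single short message, instead of discarding them outright, is a common refinement when older context still carries signal.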
Response Streaming
Stream model responses to users for snappier UX without raising token cost. Covers SSE patterns, client handling, backpressure, and provider-specific streaming options for chat and agent interfaces.
Best for: All user-facing AI applications
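On the client side, SSE handling mostly means extracting `data:` payloads from the event stream and stopping at the end-of-stream sentinel. A sketch of that parsing step, assuming the `[DONE]` sentinel several providers use (payload shapes vary by provider):

```python
def parse_sse(lines):
    """Yield the `data:` payloads of a Server-Sent Events stream,
    stopping at the `[DONE]` sentinel."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                      # skip comments and blank keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return                        # end-of-stream marker
        yield payload
```

In practice each payload is a JSON chunk containing a token delta; rendering deltas as they arrive is what makes the UX feel fast even though total tokens (and cost) are unchanged.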
Batch Processing
Batch LLM and embedding jobs to unlock provider batch discounts and ease rate-limit pressure. When to batch, how to chunk inputs, idempotency, and monitoring so throughput goes up and per-token cost goes down.
Best for: Applications with bulk processing needs
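Chunking and idempotency go together: give each chunk a deterministic id so a retried submission can be recognized and deduplicated instead of billed twice. A sketch using a content hash as that id (the id scheme is an illustration, not a provider requirement):

```python
import hashlib
import json

def make_batches(items, size):
    """Chunk `items` into batches of at most `size`, each tagged with a
    deterministic id derived from its contents, so resubmitting the same
    inputs yields the same ids."""
    batches = []
    for i in range(0, len(items), size):
        chunk = items[i:i + size]
        digest = hashlib.sha256(
            json.dumps(chunk, sort_keys=True).encode()
        ).hexdigest()
        batches.append({"id": digest[:12], "items": chunk})
    return batches
```

Before submitting, check each batch id against a record of completed submissions; on retry after a crash, only the unrecorded batches go out again.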