Semantic Caching Implementation Guide
Cache AI Responses to Save 30-50% on Duplicate Queries
Difficulty: Beginner to Intermediate
Time Required: 1-3 hours
Potential Savings: $1,500-5,000/month (30-50% reduction on duplicate queries)
Best For: Applications with repeated questions (FAQ bots, documentation search, customer support)
What is Semantic Caching?
Traditional caching only works for exact matches:
Query: "What is your return policy?"
Cache: HIT ✅
Query: "What is your refund policy?"
Cache: MISS ❌ (even though it's the same question!)
Semantic caching understands meaning:
Query: "What is your return policy?"
Cache: HIT ✅
Query: "What is your refund policy?"
Cache: HIT ✅ (semantically similar, same answer!)
Query: "How do I return an item?"
Cache: HIT ✅ (also similar!)
How It Works:
- Convert query to embedding (vector representation)
- Check if similar query exists in cache (using cosine similarity)
- If similar enough (>90%) → return the cached response
- If not similar → call the AI, cache the new response
Result: 30-50% of queries are cache hits, saving API costs and reducing latency.
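The decision in step 3 comes down to cosine similarity between embedding vectors. A minimal sketch with made-up 3-dimensional vectors (real embeddings from text-embedding-3-small have 1,536 dimensions; the numbers here are purely illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 = identical direction, ~0.0 = unrelated."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings - in practice these come from an embedding model
return_policy = [0.9, 0.1, 0.2]
refund_policy = [0.88, 0.12, 0.25]   # nearly the same direction
capital_of_france = [0.1, 0.9, 0.3]  # very different direction

THRESHOLD = 0.9
print(cosine_similarity(return_policy, refund_policy))      # above threshold -> HIT
print(cosine_similarity(return_policy, capital_of_france))  # well below -> MISS
```

Queries that point in nearly the same direction score close to 1.0 and share a cached answer; unrelated queries score far below the threshold and go to the API.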
Why You Need This
Cost Savings Example:
Before Caching:
- 100,000 requests/month
- Average cost: $0.01 per request
- Total: $1,000/month
After Semantic Caching:
- 40% cache hit rate
- 40,000 cached requests: $0/month
- 60,000 API requests: $600/month
- Total: $600/month (40% savings)
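The arithmetic behind these numbers is simple enough to check directly:

```python
requests_per_month = 100_000
cost_per_request = 0.01   # $0.01 average
hit_rate = 0.40           # 40% of queries served from cache

baseline = requests_per_month * cost_per_request
api_calls = requests_per_month * (1 - hit_rate)
after_caching = api_calls * cost_per_request
savings = baseline - after_caching

print(f"Before: ${baseline:,.0f}/month")
print(f"After:  ${after_caching:,.0f}/month")
print(f"Saved:  ${savings:,.0f}/month ({savings / baseline:.0%})")
```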
Additional Benefits:
- 10x faster responses (cache: 10ms, API: 500ms)
- Less pressure on API rate limits (fewer actual calls count against your quota)
- Better user experience (instant responses)
- More consistent answers (same question = same answer)
Prerequisites
Before implementing:
- Redis or similar cache store (for storing embeddings + responses)
- Embedding model access (OpenAI, Anthropic, or local model)
- Python 3.8+ (for code examples)
- Basic understanding of vector similarity
Recommended Setup:
- Redis with vector search (RedisStack or Redis Cloud)
- OpenAI text-embedding-3-small (cheap, fast embeddings)
Implementation Steps
Step 1: Choose Your Caching Approach
Option A: Use LiteLLM Proxy (Recommended if you have Edge Proxy)
Semantic caching built-in, just configure.
Option B: Application-Level Cache (For direct integrations)
Implement caching in your application code.
Option C: Managed Service (GPTCache, Helicone)
Hosted caching service, no infrastructure.
We'll cover Options A and B in this guide.
Option A: Semantic Caching in LiteLLM Proxy
If you have an Edge Proxy (see Edge Proxy Implementation Guide), semantic caching is built-in.
Step 1: Install Redis
Using Docker:
```shell
# Pull Redis with vector search support
docker pull redis/redis-stack:latest

# Run Redis
docker run -d \
  --name redis-cache \
  -p 6379:6379 \
  redis/redis-stack:latest
```
Or use Redis Cloud (free tier):
- Sign up at https://redis.com/try-free/
- Get connection string:
redis://username:password@host:port
Step 2: Configure Semantic Caching
Update your litellm_config.yaml:
```yaml
model_list:
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

# Semantic caching configuration
cache_config:
  type: redis

  # Redis connection
  redis_host: localhost
  redis_port: 6379
  # redis_password: os.environ/REDIS_PASSWORD  # If password protected

  # Semantic similarity settings
  similarity_threshold: 0.9  # 90% similarity = cache hit

  # Embedding model for semantic similarity
  embedding_model: text-embedding-3-small  # OpenAI's cheapest embedding model
  embedding_api_key: os.environ/OPENAI_API_KEY

  # Cache TTL (time to live)
  ttl: 3600  # Cache for 1 hour (adjust based on content freshness needs)

  # What to cache
  cache_responses: true   # Cache LLM responses
  cache_embeddings: true  # Cache query embeddings (faster lookups)

  # Logging
  log_cache_hits: true  # Log when cache is hit (for monitoring)

# Router settings
router_settings:
  routing_strategy: least-cost
```
Step 3: Restart Proxy
```shell
docker restart litellm-proxy

# Verify caching is enabled
curl http://localhost:4000/health | grep cache
# Should show: "cache": "enabled"
```
Step 4: Test Semantic Caching
First request (cache miss):
```shell
time curl -X POST http://localhost:4000/chat/completions \
  -H "Authorization: Bearer sk-1234567890abcdef" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is your return policy?"}]
  }'

# Response time: ~500ms
# Response headers: X-LiteLLM-Cache: MISS
```
Second request (semantically similar, cache hit):
```shell
time curl -X POST http://localhost:4000/chat/completions \
  -H "Authorization: Bearer sk-1234567890abcdef" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is your refund policy?"}]
  }'

# Response time: ~10ms (50x faster!)
# Response headers: X-LiteLLM-Cache: HIT
# Same response as the first request
```
Third request (different topic, cache miss):
```shell
curl -X POST http://localhost:4000/chat/completions \
  -H "Authorization: Bearer sk-1234567890abcdef" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'

# Response headers: X-LiteLLM-Cache: MISS
# Different topic, not similar enough to cached queries
```
Option B: Application-Level Semantic Cache
If you're making direct API calls, implement semantic caching in your application.
Step 1: Install Dependencies
```shell
pip install redis openai numpy scikit-learn --break-system-packages
```
Step 2: Create Semantic Cache Class
Create semantic_cache.py:
```python
import hashlib
import json
import logging
import os
from datetime import datetime
from typing import Optional, Dict, Any, List

import numpy as np
import redis
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity

logger = logging.getLogger(__name__)


class SemanticCache:
    """
    Semantic caching for AI responses using vector similarity.

    How it works:
    1. Convert query to embedding (vector)
    2. Check if a similar query exists in cache
    3. If similar (>90% similarity), return cached response
    4. Otherwise, call AI and cache the new response
    """

    def __init__(
        self,
        redis_client: redis.Redis,
        embedding_model: str = "text-embedding-3-small",
        similarity_threshold: float = 0.9,
        ttl: int = 3600,  # 1 hour default
        openai_api_key: Optional[str] = None
    ):
        self.redis = redis_client
        self.embedding_model = embedding_model
        self.similarity_threshold = similarity_threshold
        self.ttl = ttl
        self.openai_client = OpenAI(api_key=openai_api_key)

        # Cache statistics
        self.hits = 0
        self.misses = 0

    def get_embedding(self, text: str) -> List[float]:
        """Convert text to an embedding vector"""
        try:
            response = self.openai_client.embeddings.create(
                model=self.embedding_model,
                input=text
            )
            return response.data[0].embedding
        except Exception as e:
            logger.error(f"Failed to get embedding: {e}")
            return []

    def _create_cache_key(self, query: str, model: str) -> str:
        """Create a unique cache key for the query+model combination"""
        combined = f"{model}:{query}"
        return f"semantic_cache:{hashlib.sha256(combined.encode()).hexdigest()}"

    def _find_similar_cached_query(
        self,
        query_embedding: List[float],
        model: str
    ) -> Optional[Dict[str, Any]]:
        """
        Search the cache for similar queries using vector similarity.
        Returns the cached entry if similarity exceeds the threshold.

        Note: this is a linear scan over all keys - fine for small caches,
        but use Redis vector search for large ones.
        """
        # Get all cached entries (model filtering happens below)
        cached_keys = self.redis.keys("semantic_cache:*")
        if not cached_keys:
            return None

        query_vec = np.array(query_embedding).reshape(1, -1)
        best_similarity = 0.0
        best_match = None

        for key in cached_keys:
            try:
                cached_data = json.loads(self.redis.get(key))

                # Only compare queries for the same model
                if cached_data.get('model') != model:
                    continue

                # Get the cached query embedding
                cached_embedding = cached_data.get('embedding')
                if not cached_embedding:
                    continue

                # Calculate cosine similarity
                cached_vec = np.array(cached_embedding).reshape(1, -1)
                similarity = cosine_similarity(query_vec, cached_vec)[0][0]

                # Track the best match
                if similarity > best_similarity:
                    best_similarity = similarity
                    best_match = cached_data

                logger.debug(f"Similarity to cached query: {similarity:.4f}")
            except Exception as e:
                logger.warning(f"Error checking cached key {key}: {e}")
                continue

        # Return if similarity exceeds the threshold
        if best_similarity >= self.similarity_threshold:
            logger.info(f"Cache HIT (similarity: {best_similarity:.4f})")
            best_match['similarity'] = float(best_similarity)
            return best_match

        logger.info(f"Cache MISS (best similarity: {best_similarity:.4f})")
        return None

    def get(self, query: str, model: str, **kwargs) -> Optional[Dict[str, Any]]:
        """
        Check cache for a semantically similar query.
        Returns the cached response if found, None otherwise.
        """
        query_embedding = self.get_embedding(query)
        if not query_embedding:
            logger.warning("Failed to get query embedding, skipping cache")
            return None

        # Search for a similar cached query
        cached_result = self._find_similar_cached_query(query_embedding, model)
        if cached_result:
            self.hits += 1
            return {
                'response': cached_result['response'],
                'cached': True,
                'cache_key': cached_result.get('cache_key'),
                'similarity': cached_result.get('similarity', 1.0)
            }

        self.misses += 1
        return None

    def set(self, query: str, model: str, response: str, **kwargs):
        """Cache an AI response with its query embedding for semantic lookup."""
        query_embedding = self.get_embedding(query)
        if not query_embedding:
            logger.warning("Failed to get query embedding, skipping cache")
            return

        # Create the cache entry
        cache_key = self._create_cache_key(query, model)
        cache_data = {
            'query': query,
            'model': model,
            'response': response,
            'embedding': query_embedding,
            'cache_key': cache_key,
            'cached_at': datetime.utcnow().isoformat()
        }

        # Store in Redis with TTL
        try:
            self.redis.setex(cache_key, self.ttl, json.dumps(cache_data))
            logger.info(f"Cached response for query: {query[:50]}...")
        except Exception as e:
            logger.error(f"Failed to cache response: {e}")

    def invalidate(self, pattern: str = "*"):
        """Clear cache entries matching the pattern"""
        keys = self.redis.keys(f"semantic_cache:{pattern}")
        if keys:
            self.redis.delete(*keys)
        logger.info(f"Invalidated {len(keys)} cache entries")

    def get_stats(self) -> Dict[str, Any]:
        """Get cache statistics"""
        total = self.hits + self.misses
        hit_rate = (self.hits / total * 100) if total > 0 else 0
        return {
            'hits': self.hits,
            'misses': self.misses,
            'total_requests': total,
            'hit_rate_pct': round(hit_rate, 2),
            'cache_size': len(self.redis.keys('semantic_cache:*'))
        }


class CachedAIProvider:
    """Wrapper that adds semantic caching to AI provider calls"""

    def __init__(self, cache: SemanticCache, openai_api_key: str):
        self.cache = cache
        self.client = OpenAI(api_key=openai_api_key)

    def chat_completion(
        self,
        messages: List[Dict[str, str]],
        model: str = "gpt-4o-mini",
        **kwargs
    ) -> Dict[str, Any]:
        """
        Chat completion with automatic semantic caching.
        Checks the cache first, calls the API on a miss.
        """
        # Extract the user query (last message)
        query = messages[-1]['content'] if messages else ""

        # Check cache
        cached_response = self.cache.get(query, model)
        if cached_response:
            return {
                'content': cached_response['response'],
                'model': model,
                'cached': True,
                'usage': {'prompt_tokens': 0, 'completion_tokens': 0, 'total_tokens': 0}
            }

        # Cache miss - call the API
        logger.info(f"Cache miss, calling {model}")
        response = self.client.chat.completions.create(
            model=model,
            messages=messages,
            **kwargs
        )
        content = response.choices[0].message.content

        # Cache the response
        self.cache.set(query, model, content)

        return {
            'content': content,
            'model': model,
            'cached': False,
            'usage': {
                'prompt_tokens': response.usage.prompt_tokens,
                'completion_tokens': response.usage.completion_tokens,
                'total_tokens': response.usage.total_tokens
            }
        }


# Singleton instances
redis_client = redis.Redis(
    host='localhost',
    port=6379,
    decode_responses=False  # We'll handle JSON encoding
)

semantic_cache = SemanticCache(
    redis_client=redis_client,
    similarity_threshold=0.9,
    ttl=3600,  # 1 hour
    openai_api_key=os.environ.get('OPENAI_API_KEY')
)

cached_ai = CachedAIProvider(
    cache=semantic_cache,
    openai_api_key=os.environ.get('OPENAI_API_KEY')
)
```
Step 3: Update Your Application Code
Before (No caching):
```python
import openai

client = openai.OpenAI(api_key="sk-...")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is your return policy?"}]
)

print(response.choices[0].message.content)
```
After (With semantic caching):
```python
from semantic_cache import cached_ai

# First call - cache miss, calls API
response1 = cached_ai.chat_completion(
    messages=[{"role": "user", "content": "What is your return policy?"}],
    model="gpt-4o-mini"
)
print(f"Response 1 (cached: {response1['cached']}): {response1['content']}")
# Output: Response 1 (cached: False): [AI response...]

# Second call - semantically similar, cache hit!
response2 = cached_ai.chat_completion(
    messages=[{"role": "user", "content": "What is your refund policy?"}],
    model="gpt-4o-mini"
)
print(f"Response 2 (cached: {response2['cached']}): {response2['content']}")
# Output: Response 2 (cached: True): [Same response, 50x faster!]

# Different query - cache miss
response3 = cached_ai.chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    model="gpt-4o-mini"
)
print(f"Response 3 (cached: {response3['cached']}): {response3['content']}")
# Output: Response 3 (cached: False): [New AI response]
```
Step 4: Monitor Cache Performance
```python
# Get cache statistics
stats = semantic_cache.get_stats()

print(f"""
Cache Performance:
- Hits: {stats['hits']}
- Misses: {stats['misses']}
- Hit Rate: {stats['hit_rate_pct']}%
- Cache Size: {stats['cache_size']} entries
""")

# Expected after 1000 requests in a support bot:
# Hits: 400
# Misses: 600
# Hit Rate: 40%
# Cache Size: 150 entries (unique questions)
```
Advanced Configuration
1. Adjust Similarity Threshold
```python
# Conservative (fewer cache hits, more accurate)
cache = SemanticCache(redis_client, similarity_threshold=0.95)  # 95% similar

# Aggressive (more cache hits, may be less accurate)
cache = SemanticCache(redis_client, similarity_threshold=0.85)  # 85% similar

# Recommended for most use cases
cache = SemanticCache(redis_client, similarity_threshold=0.9)   # 90% similar
```
How to choose:
- 0.95+ - FAQ bots where exact answers matter
- 0.90 - General customer support
- 0.85 - Documentation search where close matches are ok
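One way to choose is to replay a handful of logged query pairs and see how each threshold would have classified them. A toy sketch (the similarity scores below are invented for illustration):

```python
# Hypothetical similarity scores between incoming queries and a cached
# "What is your return policy?" entry
pairs = [
    ("What is your refund policy?",    0.94),  # paraphrase - same answer
    ("How do I return an item?",       0.91),  # related question
    ("Can I return shoes I've worn?",  0.87),  # edge case - needs a different answer!
    ("What is the capital of France?", 0.27),  # unrelated
]

for threshold in (0.85, 0.90, 0.95):
    hits = [q for q, sim in pairs if sim >= threshold]
    print(f"threshold {threshold}: {len(hits)} hits")
```

Note the trade-off: at 0.85 the "worn shoes" edge case would be served the generic return-policy answer, while at 0.95 even clean paraphrases miss the cache.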
2. Different TTL by Use Case
```python
# Short TTL for time-sensitive content
cache_news = SemanticCache(redis_client, ttl=300)    # 5 minutes

# Long TTL for static content
cache_docs = SemanticCache(redis_client, ttl=86400)  # 24 hours

# For effectively permanent content, use a very long TTL - SETEX requires
# a positive integer, so ttl=None would need a plain SET in the class
cache_faq = SemanticCache(redis_client, ttl=60 * 60 * 24 * 365)  # ~1 year
```
3. Namespace Caches by Context
```python
# Separate caches for different parts of your app
# (assumes SemanticCache is extended with a namespace parameter that
# prefixes its Redis keys, e.g. "semantic_cache:support:<hash>")
cache_support = SemanticCache(redis_client, namespace="support")
cache_sales = SemanticCache(redis_client, namespace="sales")
cache_internal = SemanticCache(redis_client, namespace="internal")

# Each has independent cache storage
```
4. Exclude Certain Queries from Caching
```python
def should_cache(query: str, model: str) -> bool:
    """Decide if a query should be cached"""
    # Don't cache personalized queries
    if any(word in query.lower() for word in ['my account', 'my order', 'my name']):
        return False

    # Don't cache creative generation
    if 'write a' in query.lower() or 'generate' in query.lower():
        return False

    # Don't cache very long queries (embeddings are expensive)
    if len(query) > 1000:
        return False

    return True

# Use in your code:
if should_cache(query, model):
    cached = cache.get(query, model)
```
5. Cache Warming
Pre-populate cache with common queries:
```python
# Common questions worth pre-caching (e.g. loaded from common_queries.txt)
common_queries = [
    "What is your return policy?",
    "How do I track my order?",
    "What are your business hours?",
    "Do you ship internationally?",
    # ... 50 more common questions
]

# Warm the cache
for query in common_queries:
    response = cached_ai.chat_completion(
        messages=[{"role": "user", "content": query}],
        model="gpt-4o-mini"
    )
    print(f"Cached: {query[:50]}...")

print(f"Cache warmed with {len(common_queries)} queries")
```
Monitoring & Optimization
Key Metrics to Track
1. Cache Hit Rate:
- Target: 30-50% for support bots
- Target: 60-80% for FAQ bots
- Target: 10-20% for general chat

2. Cost Savings:

```python
savings = cache_hits * avg_cost_per_request
savings_pct = cache_hits / total_requests * 100
```

3. Latency Improvement:
- Cache hit: ~10ms
- API call: ~500ms
- Improvement: ~50x faster

4. Cache Size:
- Monitor growth over time
- Set limits if needed
- Evict old entries (LRU policy)
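The effective average latency follows directly from the hit rate; a quick check using the ~10ms cache / ~500ms API figures above:

```python
cache_latency_ms = 10
api_latency_ms = 500

# Effective average latency at the hit rates typical of each workload
for hit_rate in (0.12, 0.40, 0.65):  # general chat / support / FAQ bot
    avg = hit_rate * cache_latency_ms + (1 - hit_rate) * api_latency_ms
    print(f"hit rate {hit_rate:.0%}: ~{avg:.0f}ms average")
```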
Dashboard Example
```python
import json

from flask import Flask, jsonify
from semantic_cache import semantic_cache, redis_client

app = Flask(__name__)

@app.route('/cache/stats')
def cache_stats():
    stats = semantic_cache.get_stats()

    # Calculate cost savings
    avg_cost = 0.0001  # $0.0001 per request
    cost_saved = stats['hits'] * avg_cost

    return jsonify({
        **stats,
        'cost_saved_usd': round(cost_saved, 4),
        'estimated_monthly_savings': round(cost_saved * 30, 2)  # assumes stats cover one day
    })

@app.route('/cache/top-queries')
def top_queries():
    """Show a sample of currently cached queries"""
    # Get all cache keys
    keys = redis_client.keys('semantic_cache:*')

    queries = []
    for key in keys[:50]:  # First 50
        data = json.loads(redis_client.get(key))
        queries.append({
            'query': data['query'][:100],
            'model': data['model'],
            'cached_at': data['cached_at']
        })

    return jsonify(queries)
```
Testing Checklist
Before deploying to production:
- Cache correctly identifies similar queries (>90% similarity)
- Cache correctly rejects dissimilar queries (<90% similarity)
- TTL works (entries expire after specified time)
- Cache hit rate is acceptable (>30%)
- Latency improvement is significant (>10x faster)
- Redis memory usage is reasonable
- Cache invalidation works
- Statistics tracking is accurate
- No stale data issues (TTL set appropriately)
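The first two checklist items can be exercised without a live Redis or an embedding API by stubbing both out. A sketch under stated assumptions: FakeRedis and fake_embed below are test doubles invented for this example, not part of any library.

```python
import json

import numpy as np

class FakeRedis:
    """Minimal in-memory stand-in for the Redis calls the cache uses."""
    def __init__(self):
        self.store = {}
    def setex(self, key, ttl, value):
        self.store[key] = value  # TTL ignored in this stub
    def get(self, key):
        return self.store.get(key)
    def keys(self, pattern="*"):
        return list(self.store.keys())

def fake_embed(text):
    """Toy 'embedding': return/refund queries point one way, everything else another."""
    if "return" in text.lower() or "refund" in text.lower():
        return [0.9, 0.1]
    return [0.1, 0.9]

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

THRESHOLD = 0.9

def cache_set(r, query, response):
    entry = {"query": query, "response": response, "embedding": fake_embed(query)}
    r.setex(f"semantic_cache:{query}", 3600, json.dumps(entry))

def cache_get(r, query):
    vec = fake_embed(query)
    for key in r.keys():
        entry = json.loads(r.get(key))
        if cosine(vec, entry["embedding"]) >= THRESHOLD:
            return entry["response"]
    return None

r = FakeRedis()
cache_set(r, "What is your return policy?", "30-day returns.")

assert cache_get(r, "What is your refund policy?") == "30-day returns."  # similar -> HIT
assert cache_get(r, "What is the capital of France?") is None            # dissimilar -> MISS
print("similarity checklist items pass")
```

The same pattern extends to the other items: swap the stub for a real Redis in integration tests to verify TTL expiry and invalidation.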
Production Deployment
Redis Setup for Production
Option 1: Redis Cloud (Recommended)
- Fully managed
- Automatic backups
- High availability
- Free tier: 30MB (good for testing)
- Paid: $0.12/GB/month
Option 2: Self-Hosted Redis
```yaml
# docker-compose.yml
version: '3.8'

services:
  redis:
    image: redis/redis-stack:latest
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru
    restart: unless-stopped

volumes:
  redis-data:
```
Memory Planning:
- Each cached entry: ~30KB when the 1,536-dimension embedding is stored as JSON text (the response itself is usually only 1-2KB)
- 10,000 cached queries: ~300MB
- 100,000 cached queries: ~3GB (at this scale, pack embeddings as float32 at ~6KB each, or use Redis vector search)
- Recommended: 1-2GB Redis for production
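Entry size is worth double-checking against your embedding model, since the embedding dominates. A back-of-envelope estimate for text-embedding-3-small (1,536 dimensions), assuming roughly 20 bytes per float when serialized as JSON text:

```python
dims = 1536                # text-embedding-3-small output dimensions
bytes_per_float_json = 20  # rough: '0.0123456789012345,' as text
embedding_json_kb = dims * bytes_per_float_json / 1024

response_kb = 2            # typical short answer
entry_kb = embedding_json_kb + response_kb

print(f"~{entry_kb:.0f}KB per cached entry")
print(f"10,000 entries: ~{entry_kb * 10_000 / 1024:.0f}MB")

# Packing the embedding as raw float32 instead of JSON text shrinks it ~7x:
print(f"float32-packed embedding: ~{dims * 4 / 1024:.0f}KB")
```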
High Availability
For critical applications:
```yaml
# Redis Sentinel for automatic failover
services:
  redis-master:
    image: redis:latest
  redis-replica:
    image: redis:latest
    command: redis-server --replicaof redis-master 6379
  redis-sentinel:
    image: redis:latest
    command: redis-sentinel /etc/redis/sentinel.conf
```
Expected Results
Real-World Performance
FAQ Bot (High repetition):
- Cache hit rate: 65%
- Cost savings: $3,200/month
- Latency: 500ms → 12ms (40x faster)
Customer Support (Medium repetition):
- Cache hit rate: 35%
- Cost savings: $1,400/month
- Latency: 600ms → 15ms (40x faster)
General Chat (Low repetition):
- Cache hit rate: 12%
- Cost savings: $400/month
- Latency: 550ms → 18ms (30x faster)
Cost Breakdown
Before Caching:
- 100,000 requests/month
- $0.01 average per request
- Total: $1,000/month
After Caching (40% hit rate):
- 40,000 cached (free): $0
- 60,000 API calls: $600
- Embedding costs: $5 (negligible)
- Redis hosting: $15/month
- Total: $620/month
- Savings: $380/month (38%)
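The same arithmetic as before, extended with the embedding and hosting line items:

```python
requests = 100_000
hit_rate = 0.40
cost_per_request = 0.01

api_cost = requests * (1 - hit_rate) * cost_per_request  # 60,000 calls
embedding_cost = 5    # per the breakdown above
redis_hosting = 15

total = api_cost + embedding_cost + redis_hosting
baseline = requests * cost_per_request
savings = baseline - total

print(f"Total:   ${total:,.0f}/month")
print(f"Savings: ${savings:,.0f}/month ({savings / baseline:.0%})")
```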
Troubleshooting
Issue: Low cache hit rate (<20%)
Causes:
- Similarity threshold too high
- Queries are truly unique
- Not enough traffic volume
Fix:
```python
# Lower the threshold
cache = SemanticCache(redis_client, similarity_threshold=0.85)

# Check query diversity
stats = semantic_cache.get_stats()
print(f"Unique queries: {stats['cache_size']}")
print(f"Total requests: {stats['total_requests']}")
# If unique queries ≈ total requests, queries are very diverse
```
Issue: Cache returning wrong answers
Cause: Similarity threshold too low
Fix:
```python
# Increase the threshold
cache = SemanticCache(redis_client, similarity_threshold=0.95)

# Or disable caching for ambiguous queries
def is_ambiguous(query):
    # Add logic to detect ambiguous queries
    return len(query.split()) < 3  # Very short queries

if not is_ambiguous(query):
    cached = cache.get(query, model)
```
Issue: Redis memory growing too large
Cause: No eviction policy or TTL too long
Fix:
```shell
# Set a Redis eviction policy
redis-cli CONFIG SET maxmemory 1gb
redis-cli CONFIG SET maxmemory-policy allkeys-lru
```

```python
# Or reduce the TTL
cache = SemanticCache(redis_client, ttl=1800)  # 30 minutes instead of 1 hour
```
Issue: Embeddings are expensive
Cause: Generating embeddings for every request
Fix:
```yaml
# Cache embeddings too (LiteLLM proxy config; already implemented
# in the SemanticCache class for Option B)
cache_config:
  cache_embeddings: true  # Reuse embeddings for the same query
```
Next Steps
After implementing semantic caching:
- Monitor cache hit rate for 1 week
- Tune similarity threshold based on accuracy
- Implement cache warming for common queries
- Add cache analytics dashboard
- Consider hybrid caching (exact + semantic)
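Hybrid caching means checking a cheap exact-match key (a hash of the normalized query) before paying for an embedding and a semantic lookup. A minimal sketch with a plain dict standing in for Redis; the function names here are illustrative, not an existing API:

```python
import hashlib

exact_cache = {}  # stand-in for a Redis exact-match lookup

def exact_key(query, model):
    """Hash of the normalized query - catches repeats that differ only in case/whitespace."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(f"{model}:{normalized}".encode()).hexdigest()

def hybrid_get(query, model, semantic_lookup):
    # Tier 1: exact match - no embedding call needed
    key = exact_key(query, model)
    if key in exact_cache:
        return exact_cache[key]
    # Tier 2: fall back to semantic lookup (costs one embedding call)
    return semantic_lookup(query, model)

def hybrid_set(query, model, response):
    exact_cache[exact_key(query, model)] = response
    # ...also store in the semantic cache here

# Usage: an identical repeat skips the embedding step entirely
hybrid_set("What is your return policy?", "gpt-4o-mini", "30-day returns.")
hit = hybrid_get("what is your RETURN policy?  ", "gpt-4o-mini", lambda q, m: None)
print(hit)  # -> 30-day returns.
```

The exact tier serves identical repeats for free, so the embedding cost only applies to genuinely new phrasings.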
Additional Resources
- GPTCache: https://github.com/zilliztech/GPTCache
- Redis Vector Search: https://redis.io/docs/stack/search/reference/vectors/
- OpenAI Embeddings: https://platform.openai.com/docs/guides/embeddings
- Semantic Search Guide: https://www.sbert.net/
Support
Questions about semantic caching?
- Onaro Support: support@onaro.io
- Book implementation call: https://onaro.io/support
Estimated Implementation Time: 1-3 hours
Difficulty: ★★☆☆☆ (2/5)
Impact: ★★★★☆ (4/5 - High ROI for repetitive queries)
Last Updated: January 26, 2026
Tested with: Redis 7.2, OpenAI SDK 1.12.0, text-embedding-3-small