πŸ’Ύ

Semantic Caching

Cache similar queries to reduce API costs by up to 80%

Time: 4-6 hoursDifficulty: AdvancedPotential Savings: $1,000-5,000/month

Best For: Applications with repetitive or similar queries

Semantic Caching Implementation Guide

Cache AI Responses to Save 30-50% on Duplicate Queries

Difficulty: Beginner to Intermediate
Time Required: 1-3 hours
Potential Savings: $1,500-5,000/month (30-50% reduction on duplicate queries)
Best For: Applications with repeated questions (FAQ bots, documentation search, customer support)


What is Semantic Caching?

Traditional caching only works for exact matches:

Query: "What is your return policy?"
Cache: HIT βœ“

Query: "What is your refund policy?"
Cache: MISS βœ— (even though it's the same question!)

Semantic caching understands meaning:

Query: "What is your return policy?"
Cache: HIT βœ“

Query: "What is your refund policy?"
Cache: HIT βœ“ (semantically similar, same answer!)

Query: "How do I return an item?"
Cache: HIT βœ“ (also similar!)

How It Works:

  1. Convert query to embedding (vector representation)
  2. Check if similar query exists in cache (using cosine similarity)
  3. If similar enough (>90%) β†’ Return cached response
  4. If not similar β†’ Call AI, cache the new response

Result: 30-50% of queries are cache hits, saving API costs and reducing latency.
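The lookup in step 2 reduces to a cosine-similarity comparison against the threshold in step 3. A minimal sketch of that decision rule, using toy 3-dimensional vectors in place of real embeddings (production embedding models return vectors with ~1,500 dimensions, but the math is identical):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: dot product of the two vectors, normalized."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

THRESHOLD = 0.9  # step 3: >=90% similar counts as a cache hit

# Toy vectors standing in for query embeddings
cached = [0.9, 0.1, 0.2]      # "What is your return policy?"
similar = [0.85, 0.15, 0.25]  # "What is your refund policy?"
unrelated = [0.1, 0.9, 0.1]   # "What is the capital of France?"

print(cosine_similarity(cached, similar) >= THRESHOLD)    # cache hit
print(cosine_similarity(cached, unrelated) >= THRESHOLD)  # cache miss
```

Note that identical vectors score exactly 1.0, so exact-match caching falls out as a special case of the semantic version.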


Why You Need This

Cost Savings Example:

Before Caching:

  • 100,000 requests/month
  • Average cost: $0.01 per request
  • Total: $1,000/month

After Semantic Caching:

  • 40% cache hit rate
  • 40,000 cached requests: $0/month
  • 60,000 API requests: $600/month
  • Total: $600/month (40% savings)
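The arithmetic above, as a quick sanity check you can adapt to your own traffic and hit rate:

```python
requests_per_month = 100_000
cost_per_request = 0.01  # $0.01 average
hit_rate = 0.40          # 40% of queries served from cache

baseline = requests_per_month * cost_per_request
api_calls = requests_per_month * (1 - hit_rate)
after_caching = api_calls * cost_per_request

print(f"${baseline:,.0f} -> ${after_caching:,.0f}/month "
      f"({(baseline - after_caching) / baseline:.0%} savings)")
# $1,000 -> $600/month (40% savings)
```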

Additional Benefits:

  • 50x faster responses (cache: ~10ms, API: ~500ms)
  • Less pressure on API rate limits (fewer actual API calls)
  • Better user experience (instant responses)
  • More consistent answers (same question = same answer)

Prerequisites

Before implementing:

  • Redis or similar cache store (for storing embeddings + responses)
  • Embedding model access (OpenAI, Anthropic, or local model)
  • Python 3.8+ (for code examples)
  • Basic understanding of vector similarity

Recommended Setup:

  • Redis with vector search (RedisStack or Redis Cloud)
  • OpenAI text-embedding-3-small (cheap, fast embeddings)

Implementation Steps

Step 1: Choose Your Caching Approach

Option A: Use LiteLLM Proxy (Recommended if you have Edge Proxy)

Semantic caching is built in; you only need to configure it.

Option B: Application-Level Cache (For direct integrations)

Implement caching in your application code.

Option C: Managed Service (GPTCache, Helicone)

Hosted caching service, no infrastructure.

We'll cover Options A and B in this guide.


Option A: Semantic Caching in LiteLLM Proxy

If you have an Edge Proxy (see Edge Proxy Implementation Guide), semantic caching is built-in.

Step 1: Install Redis

Using Docker:

# Pull Redis with vector search support
docker pull redis/redis-stack:latest

# Run Redis
docker run -d \
  --name redis-cache \
  -p 6379:6379 \
  redis/redis-stack:latest

Or use Redis Cloud (free tier).

Step 2: Configure Semantic Caching

Update your litellm_config.yaml:

model_list:
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

# Semantic caching configuration
cache_config:
  type: redis

  # Redis connection
  redis_host: localhost
  redis_port: 6379
  # redis_password: os.environ/REDIS_PASSWORD  # If password protected

  # Semantic similarity settings
  similarity_threshold: 0.9  # 90% similarity = cache hit

  # Embedding model for semantic similarity
  embedding_model: text-embedding-3-small  # OpenAI's cheapest embedding model
  embedding_api_key: os.environ/OPENAI_API_KEY

  # Cache TTL (time to live)
  ttl: 3600  # Cache for 1 hour (adjust based on content freshness needs)

  # What to cache
  cache_responses: true   # Cache LLM responses
  cache_embeddings: true  # Cache query embeddings (faster lookups)

  # Logging
  log_cache_hits: true  # Log when cache is hit (for monitoring)

# Router settings
router_settings:
  routing_strategy: least-cost

Step 3: Restart Proxy

docker restart litellm-proxy

# Verify caching is enabled
curl http://localhost:4000/health | grep cache
# Should show: "cache": "enabled"

Step 4: Test Semantic Caching

First request (cache miss):

time curl -X POST http://localhost:4000/chat/completions \
  -H "Authorization: Bearer sk-1234567890abcdef" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is your return policy?"}]
  }'

# Response time: ~500ms
# Response headers: X-LiteLLM-Cache: MISS

Second request (semantically similar, cache hit):

time curl -X POST http://localhost:4000/chat/completions \
  -H "Authorization: Bearer sk-1234567890abcdef" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is your refund policy?"}]
  }'

# Response time: ~10ms (50x faster!)
# Response headers: X-LiteLLM-Cache: HIT
# Same response as first request

Third request (different topic, cache miss):

curl -X POST http://localhost:4000/chat/completions \
  -H "Authorization: Bearer sk-1234567890abcdef" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'

# Response headers: X-LiteLLM-Cache: MISS
# Different topic, not similar enough to cached queries

Option B: Application-Level Semantic Cache

If you're making direct API calls, implement semantic caching in your application.

Step 1: Install Dependencies

# Preferably inside a virtual environment
pip install redis openai numpy scikit-learn

Step 2: Create Semantic Cache Class

Create semantic_cache.py:

import hashlib
import json
import logging
import os
from datetime import datetime
from typing import Optional, Dict, Any, List

import numpy as np
import redis
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity

logger = logging.getLogger(__name__)


class SemanticCache:
    """
    Semantic caching for AI responses using vector similarity.

    How it works:
    1. Convert query to embedding (vector)
    2. Check if similar query exists in cache
    3. If similar (>90% similarity), return cached response
    4. Otherwise, call AI and cache new response
    """

    def __init__(
        self,
        redis_client: redis.Redis,
        embedding_model: str = "text-embedding-3-small",
        similarity_threshold: float = 0.9,
        ttl: Optional[int] = 3600,  # 1 hour default; None = never expire
        openai_api_key: Optional[str] = None
    ):
        self.redis = redis_client
        self.embedding_model = embedding_model
        self.similarity_threshold = similarity_threshold
        self.ttl = ttl
        self.openai_client = OpenAI(api_key=openai_api_key)

        # Cache statistics
        self.hits = 0
        self.misses = 0

    def get_embedding(self, text: str) -> List[float]:
        """Convert text to embedding vector"""
        try:
            response = self.openai_client.embeddings.create(
                model=self.embedding_model,
                input=text
            )
            return response.data[0].embedding
        except Exception as e:
            logger.error(f"Failed to get embedding: {e}")
            return []

    def _create_cache_key(self, query: str, model: str) -> str:
        """Create unique cache key for query+model combination"""
        combined = f"{model}:{query}"
        return f"semantic_cache:{hashlib.sha256(combined.encode()).hexdigest()}"

    def _find_similar_cached_query(
        self,
        query_embedding: List[float],
        model: str
    ) -> Optional[Dict[str, Any]]:
        """
        Search cache for similar queries using vector similarity.
        Returns cached response if similarity > threshold.
        """
        # Get all cached queries for this model
        pattern = "semantic_cache:*"
        cached_keys = self.redis.keys(pattern)

        if not cached_keys:
            return None

        query_vec = np.array(query_embedding).reshape(1, -1)
        best_similarity = 0
        best_match = None

        for key in cached_keys:
            try:
                cached_data = json.loads(self.redis.get(key))

                # Only compare queries for same model
                if cached_data.get('model') != model:
                    continue

                # Get cached query embedding
                cached_embedding = cached_data.get('embedding')
                if not cached_embedding:
                    continue

                # Calculate cosine similarity
                cached_vec = np.array(cached_embedding).reshape(1, -1)
                similarity = cosine_similarity(query_vec, cached_vec)[0][0]

                # Track best match
                if similarity > best_similarity:
                    best_similarity = similarity
                    best_match = cached_data

                logger.debug(f"Similarity to cached query: {similarity:.4f}")

            except Exception as e:
                logger.warning(f"Error checking cached key {key}: {e}")
                continue

        # Return if similarity exceeds threshold
        if best_similarity >= self.similarity_threshold:
            logger.info(f"Cache HIT (similarity: {best_similarity:.4f})")
            best_match['similarity'] = best_similarity
            return best_match

        logger.info(f"Cache MISS (best similarity: {best_similarity:.4f})")
        return None

    def get(
        self,
        query: str,
        model: str,
        **kwargs
    ) -> Optional[Dict[str, Any]]:
        """
        Check cache for semantically similar query.
        Returns cached response if found, None otherwise.
        """
        # Get query embedding
        query_embedding = self.get_embedding(query)
        if not query_embedding:
            logger.warning("Failed to get query embedding, skipping cache")
            return None

        # Search for similar cached query
        cached_result = self._find_similar_cached_query(query_embedding, model)

        if cached_result:
            self.hits += 1
            return {
                'response': cached_result['response'],
                'cached': True,
                'cache_key': cached_result.get('cache_key'),
                'similarity': cached_result.get('similarity', 1.0)
            }

        self.misses += 1
        return None

    def set(
        self,
        query: str,
        model: str,
        response: str,
        **kwargs
    ):
        """
        Cache AI response with query embedding for semantic lookup.
        """
        # Get query embedding
        query_embedding = self.get_embedding(query)
        if not query_embedding:
            logger.warning("Failed to get query embedding, skipping cache")
            return

        # Create cache entry
        cache_key = self._create_cache_key(query, model)
        cache_data = {
            'query': query,
            'model': model,
            'response': response,
            'embedding': query_embedding,
            'cache_key': cache_key,
            'cached_at': str(datetime.utcnow())
        }

        # Store in Redis, with TTL if one is configured
        try:
            payload = json.dumps(cache_data)
            if self.ttl:
                self.redis.setex(cache_key, self.ttl, payload)
            else:
                self.redis.set(cache_key, payload)  # No expiration
            logger.info(f"Cached response for query: {query[:50]}...")
        except Exception as e:
            logger.error(f"Failed to cache response: {e}")

    def invalidate(self, pattern: str = "*"):
        """Clear cache entries matching pattern"""
        keys = self.redis.keys(f"semantic_cache:{pattern}")
        if keys:
            self.redis.delete(*keys)
            logger.info(f"Invalidated {len(keys)} cache entries")

    def get_stats(self) -> Dict[str, Any]:
        """Get cache statistics"""
        total = self.hits + self.misses
        hit_rate = (self.hits / total * 100) if total > 0 else 0
        return {
            'hits': self.hits,
            'misses': self.misses,
            'total_requests': total,
            'hit_rate_pct': round(hit_rate, 2),
            'cache_size': len(self.redis.keys('semantic_cache:*'))
        }


# Example usage wrapper
class CachedAIProvider:
    """Wrapper that adds semantic caching to AI provider calls"""

    def __init__(self, cache: SemanticCache, openai_api_key: str):
        self.cache = cache
        self.client = OpenAI(api_key=openai_api_key)

    def chat_completion(
        self,
        messages: List[Dict[str, str]],
        model: str = "gpt-4o-mini",
        **kwargs
    ) -> Dict[str, Any]:
        """
        Chat completion with automatic semantic caching.
        Checks cache first, calls API if cache miss.
        """
        # Extract user query (last message)
        query = messages[-1]['content'] if messages else ""

        # Check cache
        cached_response = self.cache.get(query, model)
        if cached_response:
            return {
                'content': cached_response['response'],
                'model': model,
                'cached': True,
                'usage': {'prompt_tokens': 0, 'completion_tokens': 0, 'total_tokens': 0}
            }

        # Cache miss - call API
        logger.info(f"Cache miss, calling {model}")
        response = self.client.chat.completions.create(
            model=model,
            messages=messages,
            **kwargs
        )

        content = response.choices[0].message.content

        # Cache the response
        self.cache.set(query, model, content)

        return {
            'content': content,
            'model': model,
            'cached': False,
            'usage': {
                'prompt_tokens': response.usage.prompt_tokens,
                'completion_tokens': response.usage.completion_tokens,
                'total_tokens': response.usage.total_tokens
            }
        }


# Singleton instances
redis_client = redis.Redis(
    host='localhost',
    port=6379,
    decode_responses=False  # We'll handle JSON encoding
)

semantic_cache = SemanticCache(
    redis_client=redis_client,
    similarity_threshold=0.9,
    ttl=3600  # 1 hour
)

cached_ai = CachedAIProvider(
    cache=semantic_cache,
    openai_api_key=os.environ.get('OPENAI_API_KEY')
)

Step 3: Update Your Application Code

Before (No caching):

import openai

client = openai.OpenAI(api_key="sk-...")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is your return policy?"}]
)

print(response.choices[0].message.content)

After (With semantic caching):

from semantic_cache import cached_ai

# First call - cache miss, calls API
response1 = cached_ai.chat_completion(
    messages=[{"role": "user", "content": "What is your return policy?"}],
    model="gpt-4o-mini"
)
print(f"Response 1 (cached: {response1['cached']}): {response1['content']}")
# Output: Response 1 (cached: False): [AI response...]

# Second call - semantically similar, cache hit!
response2 = cached_ai.chat_completion(
    messages=[{"role": "user", "content": "What is your refund policy?"}],
    model="gpt-4o-mini"
)
print(f"Response 2 (cached: {response2['cached']}): {response2['content']}")
# Output: Response 2 (cached: True): [Same response, 50x faster!]

# Different query - cache miss
response3 = cached_ai.chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    model="gpt-4o-mini"
)
print(f"Response 3 (cached: {response3['cached']}): {response3['content']}")
# Output: Response 3 (cached: False): [New AI response]

Step 4: Monitor Cache Performance

from semantic_cache import semantic_cache

# Get cache statistics
stats = semantic_cache.get_stats()

print(f"""
Cache Performance:
- Hits: {stats['hits']}
- Misses: {stats['misses']}
- Hit Rate: {stats['hit_rate_pct']}%
- Cache Size: {stats['cache_size']} entries
""")

# Expected after 1000 requests in a support bot:
# Hits: 400
# Misses: 600
# Hit Rate: 40%
# Cache Size: 150 entries (unique questions)

Advanced Configuration

1. Adjust Similarity Threshold

# Conservative (fewer cache hits, more accurate)
cache = SemanticCache(redis_client, similarity_threshold=0.95)  # 95% similar

# Aggressive (more cache hits, may be less accurate)
cache = SemanticCache(redis_client, similarity_threshold=0.85)  # 85% similar

# Recommended for most use cases
cache = SemanticCache(redis_client, similarity_threshold=0.9)  # 90% similar

How to choose:

  • 0.95+ - FAQ bots where exact answers matter
  • 0.90 - General customer support
  • 0.85 - Documentation search where close matches are ok

2. Different TTL by Use Case

# Short TTL for time-sensitive content
cache_news = SemanticCache(redis_client, ttl=300)  # 5 minutes

# Long TTL for static content
cache_docs = SemanticCache(redis_client, ttl=86400)  # 24 hours

# No expiration for permanent content
cache_faq = SemanticCache(redis_client, ttl=None)  # Never expires

3. Namespace Caches by Context

# Separate caches for different parts of your app
# (assumes SemanticCache is extended with a namespace parameter
# that prefixes cache keys, e.g. "semantic_cache:support:...")
cache_support = SemanticCache(redis_client, namespace="support")
cache_sales = SemanticCache(redis_client, namespace="sales")
cache_internal = SemanticCache(redis_client, namespace="internal")

# Each has independent cache storage

4. Exclude Certain Queries from Caching

def should_cache(query: str, model: str) -> bool:
    """Decide if query should be cached"""

    # Don't cache personalized queries
    if any(word in query.lower() for word in ['my account', 'my order', 'my name']):
        return False

    # Don't cache creative generation
    if 'write a' in query.lower() or 'generate' in query.lower():
        return False

    # Don't cache very long queries (embeddings are expensive)
    if len(query) > 1000:
        return False

    return True

# Use in your code:
if should_cache(query, model):
    cached = cache.get(query, model)

5. Cache Warming

Pre-populate cache with common queries:

# Common queries to pre-cache (could also be loaded from a file)
common_queries = [
    "What is your return policy?",
    "How do I track my order?",
    "What are your business hours?",
    "Do you ship internationally?",
    # ... 50 more common questions
]

# Warm the cache
for query in common_queries:
    response = cached_ai.chat_completion(
        messages=[{"role": "user", "content": query}],
        model="gpt-4o-mini"
    )
    print(f"Cached: {query[:50]}...")

print(f"Cache warmed with {len(common_queries)} queries")

Monitoring & Optimization

Key Metrics to Track

  1. Cache Hit Rate:

    • Target: 30-50% for support bots
    • Target: 60-80% for FAQ bots
    • Target: 10-20% for general chat
  2. Cost Savings:

    savings = cache_hits * avg_cost_per_request
    savings_pct = cache_hits / total_requests * 100
  3. Latency Improvement:

    • Cache hit: ~10ms
    • API call: ~500ms
    • Improvement: 50x faster
  4. Cache Size:

    • Monitor growth over time
    • Set limits if needed
    • Evict old entries (LRU policy)

Dashboard Example

import json

from flask import Flask, jsonify

from semantic_cache import semantic_cache, redis_client

app = Flask(__name__)

@app.route('/cache/stats')
def cache_stats():
    stats = semantic_cache.get_stats()

    # Calculate cost savings
    avg_cost = 0.0001  # $0.0001 per request
    cost_saved = stats['hits'] * avg_cost

    return jsonify({
        **stats,
        'cost_saved_usd': round(cost_saved, 4),
        'estimated_monthly_savings': round(cost_saved * 30, 2)
    })

@app.route('/cache/top-queries')
def top_queries():
    """Show most frequently cached queries"""
    # Get all cache keys
    keys = redis_client.keys('semantic_cache:*')

    queries = []
    for key in keys[:50]:  # Top 50
        data = json.loads(redis_client.get(key))
        queries.append({
            'query': data['query'][:100],
            'model': data['model'],
            'cached_at': data['cached_at']
        })

    return jsonify(queries)

Testing Checklist

Before deploying to production:

  • Cache correctly identifies similar queries (>90% similarity)
  • Cache correctly rejects dissimilar queries (<90% similarity)
  • TTL works (entries expire after specified time)
  • Cache hit rate is acceptable (>30%)
  • Latency improvement is significant (>10x faster)
  • Redis memory usage is reasonable
  • Cache invalidation works
  • Statistics tracking is accurate
  • No stale data issues (TTL set appropriately)
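The first two checklist items can be exercised without Redis or an API key by testing the threshold decision in isolation. A sketch under that assumption (the `decide` helper below is hypothetical; in the real class the equivalent comparison lives in `_find_similar_cached_query`):

```python
THRESHOLD = 0.9

def decide(similarity):
    """Mirror of the cache decision: hit iff similarity >= threshold."""
    return "HIT" if similarity >= THRESHOLD else "MISS"

def test_similar_queries_hit():
    # "return policy" vs "refund policy" typically scores well above 0.9
    assert decide(0.95) == "HIT"

def test_dissimilar_queries_miss():
    # Unrelated topics typically score far below the threshold
    assert decide(0.30) == "MISS"

def test_boundary_is_inclusive():
    # The class uses >=, so exactly 0.9 counts as a hit
    assert decide(0.90) == "HIT"

if __name__ == "__main__":
    test_similar_queries_hit()
    test_dissimilar_queries_miss()
    test_boundary_is_inclusive()
    print("all checks passed")
```

For the TTL and invalidation items you still need a real (or containerized) Redis, since expiry is enforced server-side.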

Production Deployment

Redis Setup for Production

Option 1: Redis Cloud (Recommended)

  • Fully managed
  • Automatic backups
  • High availability
  • Free tier: 30MB (good for testing)
  • Paid: $0.12/GB/month

Option 2: Self-Hosted Redis

# docker-compose.yml
version: '3.8'

services:
  redis:
    image: redis/redis-stack:latest
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru
    restart: unless-stopped

volumes:
  redis-data:

Memory Planning:

  • Each cached query: ~2KB (embedding + response)
  • 10,000 cached queries: ~20MB
  • 100,000 cached queries: ~200MB
  • Recommended: 1-2GB Redis for production
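These estimates can be reproduced with a one-line helper. Note that the ~2 KB/entry figure is an assumption that depends heavily on encoding: a JSON-encoded 1,536-dimension embedding is considerably larger, so measure against real cache entries before committing to a memory budget.

```python
def redis_memory_mb(num_entries, bytes_per_entry=2_000):
    """Estimate cache memory footprint in MB (assumes ~2 KB per entry)."""
    return num_entries * bytes_per_entry / 1_000_000

print(f"{redis_memory_mb(10_000):.0f} MB")   # 10,000 entries
print(f"{redis_memory_mb(100_000):.0f} MB")  # 100,000 entries
```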

High Availability

For critical applications:

# Redis Sentinel for automatic failover
services:
  redis-master:
    image: redis:latest

  redis-replica:
    image: redis:latest
    command: redis-server --replicaof redis-master 6379

  redis-sentinel:
    image: redis:latest
    command: redis-sentinel /etc/redis/sentinel.conf

Expected Results

Real-World Performance

FAQ Bot (High repetition):

  • Cache hit rate: 65%
  • Cost savings: $3,200/month
  • Latency: 500ms β†’ 12ms (40x faster)

Customer Support (Medium repetition):

  • Cache hit rate: 35%
  • Cost savings: $1,400/month
  • Latency: 600ms β†’ 15ms (40x faster)

General Chat (Low repetition):

  • Cache hit rate: 12%
  • Cost savings: $400/month
  • Latency: 550ms β†’ 18ms (30x faster)

Cost Breakdown

Before Caching:

  • 100,000 requests/month
  • $0.01 average per request
  • Total: $1,000/month

After Caching (40% hit rate):

  • 40,000 cached (free): $0
  • 60,000 API calls: $600
  • Embedding costs: $5 (negligible)
  • Redis hosting: $15/month
  • Total: $620/month
  • Savings: $380/month (38%)

Troubleshooting

Issue: Low cache hit rate (<20%)

Causes:

  1. Similarity threshold too high
  2. Queries are truly unique
  3. Not enough traffic volume

Fix:

# Lower threshold
cache = SemanticCache(redis_client, similarity_threshold=0.85)

# Check query diversity
stats = semantic_cache.get_stats()
print(f"Unique queries: {stats['cache_size']}")
print(f"Total requests: {stats['total_requests']}")

# If unique queries ≈ total requests, queries are very diverse

Issue: Cache returning wrong answers

Cause: Similarity threshold too low

Fix:

# Increase threshold
cache = SemanticCache(redis_client, similarity_threshold=0.95)

# Or disable caching for ambiguous queries
def is_ambiguous(query):
    # Add logic to detect ambiguous queries
    return len(query.split()) < 3  # Very short queries

if not is_ambiguous(query):
    cached = cache.get(query, model)

Issue: Redis memory growing too large

Cause: No eviction policy or TTL too long

Fix:

# Set Redis eviction policy (shell)
redis-cli CONFIG SET maxmemory 1gb
redis-cli CONFIG SET maxmemory-policy allkeys-lru

# Or reduce TTL (Python)
cache = SemanticCache(redis_client, ttl=1800)  # 30 minutes instead of 1 hour

Issue: Embeddings are expensive

Cause: Generating embeddings for every request

Fix:

# Cache embeddings too (the SemanticCache class already stores them;
# for the LiteLLM proxy, enable it in litellm_config.yaml):
cache_config:
  cache_embeddings: true  # Reuse embeddings for same query

Next Steps

After implementing semantic caching:

  1. Monitor cache hit rate for 1 week
  2. Tune similarity threshold based on accuracy
  3. Implement cache warming for common queries
  4. Add cache analytics dashboard
  5. Consider hybrid caching (exact + semantic)

Estimated Implementation Time: 1-3 hours
Difficulty: β­β­β˜†β˜†β˜† (2/5)
Impact: πŸš€πŸš€πŸš€πŸš€β˜† (4/5 - High ROI for repetitive queries)


Last Updated: January 26, 2026
Tested with: Redis 7.2, OpenAI SDK 1.12.0, text-embedding-3-small