Semantic Caching Implementation Guide
Cache AI Responses to Save 30-50% on Duplicate Queries
Difficulty: Beginner to Intermediate
Time Required: 1-3 hours
Potential Savings: $1,500-5,000/month (30-50% reduction on duplicate queries)
Best For: Applications with repeated questions (FAQ bots, documentation search, customer support)
What is Semantic Caching?
Traditional caching only works for exact matches:
Query: "What is your return policy?"
Cache: HIT ✅
Query: "What is your refund policy?"
Cache: MISS ❌ (even though it's the same question!)
Semantic caching understands meaning:
Query: "What is your return policy?"
Cache: HIT ✅
Query: "What is your refund policy?"
Cache: HIT ✅ (semantically similar, same answer!)
Query: "How do I return an item?"
Cache: HIT ✅ (also similar!)
How It Works:
- Convert query to embedding (vector representation)
- Check if similar query exists in cache (using cosine similarity)
- If similar enough (>90%) → return the cached response
- If not similar → call the AI, cache the new response
Result: 30-50% of queries are cache hits, saving API costs and reducing latency.
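The decision in step 3 comes down to cosine similarity between embedding vectors. A minimal sketch with made-up 3-dimensional vectors (real embeddings from text-embedding-3-small have 1,536 dimensions; the numbers here are purely illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 = identical direction, ~0.0 = unrelated."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings - in practice these come from an embedding model
return_policy = [0.9, 0.1, 0.2]
refund_policy = [0.88, 0.12, 0.25]   # nearly the same direction
capital_of_france = [0.1, 0.9, 0.3]  # very different direction

THRESHOLD = 0.9
print(cosine_similarity(return_policy, refund_policy))      # above threshold -> HIT
print(cosine_similarity(return_policy, capital_of_france))  # well below -> MISS
```

Queries that point in nearly the same direction score close to 1.0 and share a cached answer; unrelated queries score far below the threshold and go to the API.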
Why You Need This
Cost Savings Example:
Before Caching:
- 100,000 requests/month
- Average cost: $0.01 per request
- Total: $1,000/month
After Semantic Caching:
- 40% cache hit rate
- 40,000 cached requests: $0/month
- 60,000 API requests: $600/month
- Total: $600/month (40% savings)
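The arithmetic behind these numbers is simple enough to check directly:

```python
requests_per_month = 100_000
cost_per_request = 0.01   # $0.01 average
hit_rate = 0.40           # 40% of queries served from cache

baseline = requests_per_month * cost_per_request
api_calls = requests_per_month * (1 - hit_rate)
after_caching = api_calls * cost_per_request
savings = baseline - after_caching

print(f"Before: ${baseline:,.0f}/month")
print(f"After:  ${after_caching:,.0f}/month")
print(f"Saved:  ${savings:,.0f}/month ({savings / baseline:.0%})")
```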
Additional Benefits:
- 10x faster responses (cache: 10ms, API: 500ms)
- Less pressure on API rate limits (fewer actual calls count against your quota)
- Better user experience (instant responses)
- More consistent answers (same question = same answer)
Prerequisites
Before implementing:
- Redis or similar cache store (for storing embeddings + responses)
- Embedding model access (OpenAI, Anthropic, or local model)
- Python 3.8+ (for code examples)
- Basic understanding of vector similarity
Recommended Setup:
- Redis with vector search (RedisStack or Redis Cloud)
- OpenAI text-embedding-3-small (cheap, fast embeddings)
Implementation Steps
Step 1: Choose Your Caching Approach
Option A: Use LiteLLM Proxy (Recommended if you have Edge Proxy)
Semantic caching built-in, just configure.
Option B: Application-Level Cache (For direct integrations)
Implement caching in your application code.
Option C: Managed Service (GPTCache, Helicone)
Hosted caching service, no infrastructure.
We'll cover Options A and B in this guide.
Option A: Semantic Caching in LiteLLM Proxy
If you have an Edge Proxy (see Edge Proxy Implementation Guide), semantic caching is built-in.
Step 1: Install Redis
Using Docker:
```shell
# Pull Redis with vector search support
docker pull redis/redis-stack:latest

# Run Redis
docker run -d \
  --name redis-cache \
  -p 6379:6379 \
  redis/redis-stack:latest
```
Or use Redis Cloud (free tier):
- Sign up at https://redis.com/try-free/
- Get connection string:
redis://username:password@host:port
Step 2: Configure Semantic Caching
Update your litellm_config.yaml:
```yaml
model_list:
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

# Semantic caching configuration
cache_config:
  type: redis

  # Redis connection
  redis_host: localhost
  redis_port: 6379
  # redis_password: os.environ/REDIS_PASSWORD  # If password protected

  # Semantic similarity settings
  similarity_threshold: 0.9  # 90% similarity = cache hit

  # Embedding model for semantic similarity
  embedding_model: text-embedding-3-small  # OpenAI's cheapest embedding model
  embedding_api_key: os.environ/OPENAI_API_KEY

  # Cache TTL (time to live)
  ttl: 3600  # Cache for 1 hour (adjust based on content freshness needs)

  # What to cache
  cache_responses: true   # Cache LLM responses
  cache_embeddings: true  # Cache query embeddings (faster lookups)

  # Logging
  log_cache_hits: true  # Log when cache is hit (for monitoring)

# Router settings
router_settings:
  routing_strategy: least-cost
```
Step 3: Restart Proxy
```shell
docker restart litellm-proxy

# Verify caching is enabled
curl http://localhost:4000/health | grep cache
# Should show: "cache": "enabled"
```
Step 4: Test Semantic Caching
First request (cache miss):
```shell
time curl -X POST http://localhost:4000/chat/completions \
  -H "Authorization: Bearer sk-1234567890abcdef" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is your return policy?"}]
  }'

# Response time: ~500ms
# Response headers: X-LiteLLM-Cache: MISS
```
Second request (semantically similar, cache hit):
```shell
time curl -X POST http://localhost:4000/chat/completions \
  -H "Authorization: Bearer sk-1234567890abcdef" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is your refund policy?"}]
  }'

# Response time: ~10ms (50x faster!)
# Response headers: X-LiteLLM-Cache: HIT
# Same response as the first request
```
Third request (different topic, cache miss):
```shell
curl -X POST http://localhost:4000/chat/completions \
  -H "Authorization: Bearer sk-1234567890abcdef" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'

# Response headers: X-LiteLLM-Cache: MISS
# Different topic, not similar enough to cached queries
```
Option B: Application-Level Semantic Cache
If you're making direct API calls, implement semantic caching in your application.
Step 1: Install Dependencies
```shell
pip install redis openai numpy scikit-learn --break-system-packages
```
Step 2: Create Semantic Cache Class
Create semantic_cache.py:
```python
import hashlib
import json
import logging
import os
from datetime import datetime
from typing import Optional, Dict, Any, List

import numpy as np
import redis
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity

logger = logging.getLogger(__name__)


class SemanticCache:
    """
    Semantic caching for AI responses using vector similarity.

    How it works:
    1. Convert query to embedding (vector)
    2. Check if a similar query exists in cache
    3. If similar (>90% similarity), return cached response
    4. Otherwise, call AI and cache the new response
    """

    def __init__(
        self,
        redis_client: redis.Redis,
        embedding_model: str = "text-embedding-3-small",
        similarity_threshold: float = 0.9,
        ttl: int = 3600,  # 1 hour default
        openai_api_key: Optional[str] = None
    ):
        self.redis = redis_client
        self.embedding_model = embedding_model
        self.similarity_threshold = similarity_threshold
        self.ttl = ttl
        self.openai_client = OpenAI(api_key=openai_api_key)

        # Cache statistics
        self.hits = 0
        self.misses = 0

    def get_embedding(self, text: str) -> List[float]:
        """Convert text to an embedding vector"""
        try:
            response = self.openai_client.embeddings.create(
                model=self.embedding_model,
                input=text
            )
            return response.data[0].embedding
        except Exception as e:
            logger.error(f"Failed to get embedding: {e}")
            return []

    def _create_cache_key(self, query: str, model: str) -> str:
        """Create a unique cache key for the query+model combination"""
        combined = f"{model}:{query}"
        return f"semantic_cache:{hashlib.sha256(combined.encode()).hexdigest()}"

    def _find_similar_cached_query(
        self,
        query_embedding: List[float],
        model: str
    ) -> Optional[Dict[str, Any]]:
        """
        Search the cache for similar queries using vector similarity.
        Returns the cached entry if similarity exceeds the threshold.

        Note: this is a linear scan over all keys - fine for small caches,
        but use Redis vector search for large ones.
        """
        # Get all cached entries (model filtering happens below)
        cached_keys = self.redis.keys("semantic_cache:*")
        if not cached_keys:
            return None

        query_vec = np.array(query_embedding).reshape(1, -1)
        best_similarity = 0.0
        best_match = None

        for key in cached_keys:
            try:
                cached_data = json.loads(self.redis.get(key))

                # Only compare queries for the same model
                if cached_data.get('model') != model:
                    continue

                # Get the cached query embedding
                cached_embedding = cached_data.get('embedding')
                if not cached_embedding:
                    continue

                # Calculate cosine similarity
                cached_vec = np.array(cached_embedding).reshape(1, -1)
                similarity = cosine_similarity(query_vec, cached_vec)[0][0]

                # Track the best match
                if similarity > best_similarity:
                    best_similarity = similarity
                    best_match = cached_data

                logger.debug(f"Similarity to cached query: {similarity:.4f}")
            except Exception as e:
                logger.warning(f"Error checking cached key {key}: {e}")
                continue

        # Return if similarity exceeds the threshold
        if best_similarity >= self.similarity_threshold:
            logger.info(f"Cache HIT (similarity: {best_similarity:.4f})")
            best_match['similarity'] = float(best_similarity)
            return best_match

        logger.info(f"Cache MISS (best similarity: {best_similarity:.4f})")
        return None

    def get(self, query: str, model: str, **kwargs) -> Optional[Dict[str, Any]]:
        """
        Check cache for a semantically similar query.
        Returns the cached response if found, None otherwise.
        """
        query_embedding = self.get_embedding(query)
        if not query_embedding:
            logger.warning("Failed to get query embedding, skipping cache")
            return None

        # Search for a similar cached query
        cached_result = self._find_similar_cached_query(query_embedding, model)
        if cached_result:
            self.hits += 1
            return {
                'response': cached_result['response'],
                'cached': True,
                'cache_key': cached_result.get('cache_key'),
                'similarity': cached_result.get('similarity', 1.0)
            }

        self.misses += 1
        return None

    def set(self, query: str, model: str, response: str, **kwargs):
        """Cache an AI response with its query embedding for semantic lookup."""
        query_embedding = self.get_embedding(query)
        if not query_embedding:
            logger.warning("Failed to get query embedding, skipping cache")
            return

        # Create the cache entry
        cache_key = self._create_cache_key(query, model)
        cache_data = {
            'query': query,
            'model': model,
            'response': response,
            'embedding': query_embedding,
            'cache_key': cache_key,
            'cached_at': datetime.utcnow().isoformat()
        }

        # Store in Redis with TTL
        try:
            self.redis.setex(cache_key, self.ttl, json.dumps(cache_data))
            logger.info(f"Cached response for query: {query[:50]}...")
        except Exception as e:
            logger.error(f"Failed to cache response: {e}")

    def invalidate(self, pattern: str = "*"):
        """Clear cache entries matching the pattern"""
        keys = self.redis.keys(f"semantic_cache:{pattern}")
        if keys:
            self.redis.delete(*keys)
        logger.info(f"Invalidated {len(keys)} cache entries")

    def get_stats(self) -> Dict[str, Any]:
        """Get cache statistics"""
        total = self.hits + self.misses
        hit_rate = (self.hits / total * 100) if total > 0 else 0
        return {
            'hits': self.hits,
            'misses': self.misses,
            'total_requests': total,
            'hit_rate_pct': round(hit_rate, 2),
            'cache_size': len(self.redis.keys('semantic_cache:*'))
        }


class CachedAIProvider:
    """Wrapper that adds semantic caching to AI provider calls"""

    def __init__(self, cache: SemanticCache, openai_api_key: str):
        self.cache = cache
        self.client = OpenAI(api_key=openai_api_key)

    def chat_completion(
        self,
        messages: List[Dict[str, str]],
        model: str = "gpt-4o-mini",
        **kwargs
    ) -> Dict[str, Any]:
        """
        Chat completion with automatic semantic caching.
        Checks the cache first, calls the API on a miss.
        """
        # Extract the user query (last message)
        query = messages[-1]['content'] if messages else ""

        # Check cache
        cached_response = self.cache.get(query, model)
        if cached_response:
            return {
                'content': cached_response['response'],
                'model': model,
                'cached': True,
                'usage': {'prompt_tokens': 0, 'completion_tokens': 0, 'total_tokens': 0}
            }

        # Cache miss - call the API
        logger.info(f"Cache miss, calling {model}")
        response = self.client.chat.completions.create(
            model=model,
            messages=messages,
            **kwargs
        )
        content = response.choices[0].message.content

        # Cache the response
        self.cache.set(query, model, content)

        return {
            'content': content,
            'model': model,
            'cached': False,
            'usage': {
                'prompt_tokens': response.usage.prompt_tokens,
                'completion_tokens': response.usage.completion_tokens,
                'total_tokens': response.usage.total_tokens
            }
        }


# Singleton instances
redis_client = redis.Redis(
    host='localhost',
    port=6379,
    decode_responses=False  # We'll handle JSON encoding
)

semantic_cache = SemanticCache(
    redis_client=redis_client,
    similarity_threshold=0.9,
    ttl=3600,  # 1 hour
    openai_api_key=os.environ.get('OPENAI_API_KEY')
)

cached_ai = CachedAIProvider(
    cache=semantic_cache,
    openai_api_key=os.environ.get('OPENAI_API_KEY')
)
```
Step 3: Update Your Application Code
Before (No caching):
```python
import openai

client = openai.OpenAI(api_key="sk-...")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is your return policy?"}]
)

print(response.choices[0].message.content)
```
After (With semantic caching):
```python
from semantic_cache import cached_ai

# First call - cache miss, calls API
response1 = cached_ai.chat_completion(
    messages=[{"role": "user", "content": "What is your return policy?"}],
    model="gpt-4o-mini"
)
print(f"Response 1 (cached: {response1['cached']}): {response1['content']}")
# Output: Response 1 (cached: False): [AI response...]

# Second call - semantically similar, cache hit!
response2 = cached_ai.chat_completion(
    messages=[{"role": "user", "content": "What is your refund policy?"}],
    model="gpt-4o-mini"
)
print(f"Response 2 (cached: {response2['cached']}): {response2['content']}")
# Output: Response 2 (cached: True): [Same response, 50x faster!]

# Different query - cache miss
response3 = cached_ai.chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    model="gpt-4o-mini"
)
print(f"Response 3 (cached: {response3['cached']}): {response3['content']}")
# Output: Response 3 (cached: False): [New AI response]
```
Step 4: Monitor Cache Performance
```python
# Get cache statistics
stats = semantic_cache.get_stats()

print(f"""
Cache Performance:
- Hits: {stats['hits']}
- Misses: {stats['misses']}
- Hit Rate: {stats['hit_rate_pct']}%
- Cache Size: {stats['cache_size']} entries
""")

# Expected after 1000 requests in a support bot:
# Hits: 400
# Misses: 600
# Hit Rate: 40%
# Cache Size: 150 entries (unique questions)
```
Advanced Configuration
1. Adjust Similarity Threshold
```python
# Conservative (fewer cache hits, more accurate)
cache = SemanticCache(redis_client, similarity_threshold=0.95)  # 95% similar

# Aggressive (more cache hits, may be less accurate)
cache = SemanticCache(redis_client, similarity_threshold=0.85)  # 85% similar

# Recommended for most use cases
cache = SemanticCache(redis_client, similarity_threshold=0.9)   # 90% similar
```
How to choose:
- 0.95+ - FAQ bots where exact answers matter
- 0.90 - General customer support
- 0.85 - Documentation search where close matches are ok
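One way to choose is to replay a handful of logged query pairs and see how each threshold would have classified them. A toy sketch (the similarity scores below are invented for illustration):

```python
# Hypothetical similarity scores between incoming queries and a cached
# "What is your return policy?" entry
pairs = [
    ("What is your refund policy?",    0.94),  # paraphrase - same answer
    ("How do I return an item?",       0.91),  # related question
    ("Can I return shoes I've worn?",  0.87),  # edge case - needs a different answer!
    ("What is the capital of France?", 0.27),  # unrelated
]

for threshold in (0.85, 0.90, 0.95):
    hits = [q for q, sim in pairs if sim >= threshold]
    print(f"threshold {threshold}: {len(hits)} hits")
```

Note the trade-off: at 0.85 the "worn shoes" edge case would be served the generic return-policy answer, while at 0.95 even clean paraphrases miss the cache.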
2. Different TTL by Use Case
```python
# Short TTL for time-sensitive content
cache_news = SemanticCache(redis_client, ttl=300)    # 5 minutes

# Long TTL for static content
cache_docs = SemanticCache(redis_client, ttl=86400)  # 24 hours

# For effectively permanent content, use a very long TTL - SETEX requires
# a positive integer, so ttl=None would need a plain SET in the class
cache_faq = SemanticCache(redis_client, ttl=60 * 60 * 24 * 365)  # ~1 year
```
3. Namespace Caches by Context
```python
# Separate caches for different parts of your app
# (assumes SemanticCache is extended with a namespace parameter that
# prefixes its Redis keys, e.g. "semantic_cache:support:<hash>")
cache_support = SemanticCache(redis_client, namespace="support")
cache_sales = SemanticCache(redis_client, namespace="sales")
cache_internal = SemanticCache(redis_client, namespace="internal")

# Each has independent cache storage
```
4. Exclude Certain Queries from Caching
```python
def should_cache(query: str, model: str) -> bool:
    """Decide if a query should be cached"""
    # Don't cache personalized queries
    if any(word in query.lower() for word in ['my account', 'my order', 'my name']):
        return False

    # Don't cache creative generation
    if 'write a' in query.lower() or 'generate' in query.lower():
        return False

    # Don't cache very long queries (embeddings are expensive)
    if len(query) > 1000:
        return False

    return True

# Use in your code:
if should_cache(query, model):
    cached = cache.get(query, model)
```
5. Cache Warming
Pre-populate cache with common queries:
```python
# Common questions worth pre-caching (e.g. loaded from common_queries.txt)
common_queries = [
    "What is your return policy?",
    "How do I track my order?",
    "What are your business hours?",
    "Do you ship internationally?",
    # ... 50 more common questions
]

# Warm the cache
for query in common_queries:
    response = cached_ai.chat_completion(
        messages=[{"role": "user", "content": query}],
        model="gpt-4o-mini"
    )
    print(f"Cached: {query[:50]}...")

print(f"Cache warmed with {len(common_queries)} queries")
```
Monitoring & Optimization
Key Metrics to Track
1. Cache Hit Rate:
- Target: 30-50% for support bots
- Target: 60-80% for FAQ bots
- Target: 10-20% for general chat

2. Cost Savings:

```python
savings = cache_hits * avg_cost_per_request
savings_pct = cache_hits / total_requests * 100
```

3. Latency Improvement:
- Cache hit: ~10ms
- API call: ~500ms
- Improvement: ~50x faster

4. Cache Size:
- Monitor growth over time
- Set limits if needed
- Evict old entries (LRU policy)
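The effective average latency follows directly from the hit rate; a quick check using the ~10ms cache / ~500ms API figures above:

```python
cache_latency_ms = 10
api_latency_ms = 500

# Effective average latency at the hit rates typical of each workload
for hit_rate in (0.12, 0.40, 0.65):  # general chat / support / FAQ bot
    avg = hit_rate * cache_latency_ms + (1 - hit_rate) * api_latency_ms
    print(f"hit rate {hit_rate:.0%}: ~{avg:.0f}ms average")
```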
Dashboard Example
```python
import json

from flask import Flask, jsonify
from semantic_cache import semantic_cache, redis_client

app = Flask(__name__)

@app.route('/cache/stats')
def cache_stats():
    stats = semantic_cache.get_stats()

    # Calculate cost savings
    avg_cost = 0.0001  # $0.0001 per request
    cost_saved = stats['hits'] * avg_cost

    return jsonify({
        **stats,
        'cost_saved_usd': round(cost_saved, 4),
        'estimated_monthly_savings': round(cost_saved * 30, 2)  # assumes stats cover one day
    })

@app.route('/cache/top-queries')
def top_queries():
    """Show a sample of currently cached queries"""
    # Get all cache keys
    keys = redis_client.keys('semantic_cache:*')

    queries = []
    for key in keys[:50]:  # First 50
        data = json.loads(redis_client.get(key))
        queries.append({
            'query': data['query'][:100],
            'model': data['model'],
            'cached_at': data['cached_at']
        })

    return jsonify(queries)
```
Testing Checklist
Before deploying to production:
- Cache correctly identifies similar queries (>90% similarity)
- Cache correctly rejects dissimilar queries (<90% similarity)
- TTL works (entries expire after specified time)
- Cache hit rate is acceptable (>30%)
- Latency improvement is significant (>10x faster)
- Redis memory usage is reasonable
- Cache invalidation works
- Statistics tracking is accurate
- No stale data issues (TTL set appropriately)
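The first two checklist items can be exercised without a live Redis or an embedding API by stubbing both out. A sketch under stated assumptions: FakeRedis and fake_embed below are test doubles invented for this example, not part of any library.

```python
import json

import numpy as np

class FakeRedis:
    """Minimal in-memory stand-in for the Redis calls the cache uses."""
    def __init__(self):
        self.store = {}
    def setex(self, key, ttl, value):
        self.store[key] = value  # TTL ignored in this stub
    def get(self, key):
        return self.store.get(key)
    def keys(self, pattern="*"):
        return list(self.store.keys())

def fake_embed(text):
    """Toy 'embedding': return/refund queries point one way, everything else another."""
    if "return" in text.lower() or "refund" in text.lower():
        return [0.9, 0.1]
    return [0.1, 0.9]

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

THRESHOLD = 0.9

def cache_set(r, query, response):
    entry = {"query": query, "response": response, "embedding": fake_embed(query)}
    r.setex(f"semantic_cache:{query}", 3600, json.dumps(entry))

def cache_get(r, query):
    vec = fake_embed(query)
    for key in r.keys():
        entry = json.loads(r.get(key))
        if cosine(vec, entry["embedding"]) >= THRESHOLD:
            return entry["response"]
    return None

r = FakeRedis()
cache_set(r, "What is your return policy?", "30-day returns.")

assert cache_get(r, "What is your refund policy?") == "30-day returns."  # similar -> HIT
assert cache_get(r, "What is the capital of France?") is None            # dissimilar -> MISS
print("similarity checklist items pass")
```

The same pattern extends to the other items: swap the stub for a real Redis in integration tests to verify TTL expiry and invalidation.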
Production Deployment
Redis Setup for Production
Option 1: Redis Cloud (Recommended)
- Fully managed
- Automatic backups
- High availability
- Free tier: 30MB (good for testing)
- Paid: $0.12/GB/month
Option 2: Self-Hosted Redis
```yaml
# docker-compose.yml
version: '3.8'

services:
  redis:
    image: redis/redis-stack:latest
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru
    restart: unless-stopped

volumes:
  redis-data:
```
Memory Planning:
- Each cached entry: ~30KB when the 1,536-dimension embedding is stored as JSON text (the response itself is usually only 1-2KB)
- 10,000 cached queries: ~300MB
- 100,000 cached queries: ~3GB (at this scale, pack embeddings as float32 at ~6KB each, or use Redis vector search)
- Recommended: 1-2GB Redis for production
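Entry size is worth double-checking against your embedding model, since the embedding dominates. A back-of-envelope estimate for text-embedding-3-small (1,536 dimensions), assuming roughly 20 bytes per float when serialized as JSON text:

```python
dims = 1536                # text-embedding-3-small output dimensions
bytes_per_float_json = 20  # rough: '0.0123456789012345,' as text
embedding_json_kb = dims * bytes_per_float_json / 1024

response_kb = 2            # typical short answer
entry_kb = embedding_json_kb + response_kb

print(f"~{entry_kb:.0f}KB per cached entry")
print(f"10,000 entries: ~{entry_kb * 10_000 / 1024:.0f}MB")

# Packing the embedding as raw float32 instead of JSON text shrinks it ~7x:
print(f"float32-packed embedding: ~{dims * 4 / 1024:.0f}KB")
```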
High Availability
For critical applications:
```yaml
# Redis Sentinel for automatic failover
services:
  redis-master:
    image: redis:latest
  redis-replica:
    image: redis:latest
    command: redis-server --replicaof redis-master 6379
  redis-sentinel:
    image: redis:latest
    command: redis-sentinel /etc/redis/sentinel.conf
```
Expected Results
Real-World Performance
FAQ Bot (High repetition):
- Cache hit rate: 65%
- Cost savings: $3,200/month
- Latency: 500ms → 12ms (40x faster)
Customer Support (Medium repetition):
- Cache hit rate: 35%
- Cost savings: $1,400/month
- Latency: 600ms → 15ms (40x faster)
General Chat (Low repetition):
- Cache hit rate: 12%
- Cost savings: $400/month
- Latency: 550ms → 18ms (30x faster)
Cost Breakdown
Before Caching:
- 100,000 requests/month
- $0.01 average per request
- Total: $1,000/month
After Caching (40% hit rate):
- 40,000 cached (free): $0
- 60,000 API calls: $600
- Embedding costs: $5 (negligible)
- Redis hosting: $15/month
- Total: $620/month
- Savings: $380/month (38%)
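The same arithmetic as before, extended with the embedding and hosting line items:

```python
requests = 100_000
hit_rate = 0.40
cost_per_request = 0.01

api_cost = requests * (1 - hit_rate) * cost_per_request  # 60,000 calls
embedding_cost = 5    # per the breakdown above
redis_hosting = 15

total = api_cost + embedding_cost + redis_hosting
baseline = requests * cost_per_request
savings = baseline - total

print(f"Total:   ${total:,.0f}/month")
print(f"Savings: ${savings:,.0f}/month ({savings / baseline:.0%})")
```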
Troubleshooting
Issue: Low cache hit rate (<20%)
Causes:
- Similarity threshold too high
- Queries are truly unique
- Not enough traffic volume
Fix:
```python
# Lower the threshold
cache = SemanticCache(redis_client, similarity_threshold=0.85)

# Check query diversity
stats = semantic_cache.get_stats()
print(f"Unique queries: {stats['cache_size']}")
print(f"Total requests: {stats['total_requests']}")
# If unique queries ≈ total requests, queries are very diverse
```
Issue: Cache returning wrong answers
Cause: Similarity threshold too low
Fix:
```python
# Increase the threshold
cache = SemanticCache(redis_client, similarity_threshold=0.95)

# Or disable caching for ambiguous queries
def is_ambiguous(query):
    # Add logic to detect ambiguous queries
    return len(query.split()) < 3  # Very short queries

if not is_ambiguous(query):
    cached = cache.get(query, model)
```
Issue: Redis memory growing too large
Cause: No eviction policy or TTL too long
Fix:
```shell
# Set a Redis eviction policy
redis-cli CONFIG SET maxmemory 1gb
redis-cli CONFIG SET maxmemory-policy allkeys-lru
```

```python
# Or reduce the TTL
cache = SemanticCache(redis_client, ttl=1800)  # 30 minutes instead of 1 hour
```
Issue: Embeddings are expensive
Cause: Generating embeddings for every request
Fix:
```yaml
# Cache embeddings too (LiteLLM proxy config; already implemented
# in the SemanticCache class for Option B)
cache_config:
  cache_embeddings: true  # Reuse embeddings for the same query
```
Next Steps
After implementing semantic caching:
- Monitor cache hit rate for 1 week
- Tune similarity threshold based on accuracy
- Implement cache warming for common queries
- Add cache analytics dashboard
- Consider hybrid caching (exact + semantic)
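Hybrid caching means checking a cheap exact-match key (a hash of the normalized query) before paying for an embedding and a semantic lookup. A minimal sketch with a plain dict standing in for Redis; the function names here are illustrative, not an existing API:

```python
import hashlib

exact_cache = {}  # stand-in for a Redis exact-match lookup

def exact_key(query, model):
    """Hash of the normalized query - catches repeats that differ only in case/whitespace."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(f"{model}:{normalized}".encode()).hexdigest()

def hybrid_get(query, model, semantic_lookup):
    # Tier 1: exact match - no embedding call needed
    key = exact_key(query, model)
    if key in exact_cache:
        return exact_cache[key]
    # Tier 2: fall back to semantic lookup (costs one embedding call)
    return semantic_lookup(query, model)

def hybrid_set(query, model, response):
    exact_cache[exact_key(query, model)] = response
    # ...also store in the semantic cache here

# Usage: an identical repeat skips the embedding step entirely
hybrid_set("What is your return policy?", "gpt-4o-mini", "30-day returns.")
hit = hybrid_get("what is your RETURN policy?  ", "gpt-4o-mini", lambda q, m: None)
print(hit)  # -> 30-day returns.
```

The exact tier serves identical repeats for free, so the embedding cost only applies to genuinely new phrasings.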
Additional Resources
- GPTCache: https://github.com/zilliztech/GPTCache
- Redis Vector Search: https://redis.io/docs/stack/search/reference/vectors/
- OpenAI Embeddings: https://platform.openai.com/docs/guides/embeddings
- Semantic Search Guide: https://www.sbert.net/
Support
Questions about semantic caching?
- Onaro Support: support@onaro.io
- Book implementation call: https://onaro.io/support
Estimated Implementation Time: 1-3 hours
Difficulty: ★★☆☆☆ (2/5)
Impact: ★★★★☆ (4/5 - High ROI for repetitive queries)
Last Updated: January 26, 2026
Tested with: Redis 7.2, OpenAI SDK 1.12.0, text-embedding-3-small