Circuit Breaker Implementation Guide
Prevent Cascade Failures & Improve AI API Reliability
Difficulty: Intermediate
Time Required: 2-3 hours
Potential Savings: Prevents catastrophic cost overruns (up to $10K+ saved in outage scenarios)
Best For: All production applications using AI APIs
What is a Circuit Breaker?
A Circuit Breaker is a design pattern that prevents your application from repeatedly calling a failing service, protecting your system from:
- Cascade failures (one failing API causes your entire app to fail)
- Cost overruns (retrying failed requests thousands of times)
- Poor user experience (long timeouts instead of fast failures)
- Resource exhaustion (threads/connections stuck waiting)
The Problem Without Circuit Breakers:
Your App → [Call OpenAI] → 500 Error
Your App → [Retry] → 500 Error
Your App → [Retry] → 500 Error
Your App → [Retry] → 500 Error (repeats 1000x)
Result:
- App becomes unresponsive
- $5,000 in wasted API calls
- 30 minute recovery time
The Solution With Circuit Breakers:
Your App → [Call OpenAI] → 500 Error
Your App → [Retry] → 500 Error
Your App → [Retry] → 500 Error
Circuit Breaker: "Provider is down, stop trying!"
Circuit Status: OPEN (requests blocked)
Your App → [Fast fail] → Use fallback provider
Your App → [Fast fail] → Use cached response
Your App → [Fast fail] → Show user friendly error
After 30 seconds:
Circuit Status: HALF-OPEN (test if provider recovered)
Your App → [Test call] → Success! → Circuit CLOSED
Result:
- App remains responsive
- Only 3 failed calls instead of 1000
- Automatic recovery
- Nearly all of the $5,000 in wasted calls avoided
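The fast-fail path above (fallback provider, then cached response, then a friendly error) can be sketched in a few lines. This is a minimal illustration only: `answer_with_fallbacks`, the `providers` list, and the dictionary cache are hypothetical names, and a real system would put a circuit breaker in front of each provider call.

```python
def answer_with_fallbacks(prompt, providers, cache):
    """Try each provider in order, then the cache, then a friendly error.

    providers: list of (name, callable) pairs; each callable takes the prompt
    and returns text, or raises if the provider is unavailable.
    cache: a dict mapping prompts to the last good response (hypothetical).
    """
    for name, call in providers:
        try:
            content = call(prompt)
        except Exception:
            # With a circuit breaker in place this failure would be immediate
            # once the circuit is open, instead of waiting for a timeout.
            continue
        cache[prompt] = content  # remember the last good answer
        return {"source": name, "content": content}

    if prompt in cache:
        return {"source": "cache", "content": cache[prompt]}

    return {
        "source": "error",
        "content": "The assistant is temporarily unavailable. Please try again shortly.",
    }
```

The order matters: a live fallback provider is preferred over a possibly stale cached answer, and the error message is only shown when nothing else is available.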
How Circuit Breakers Work
Three States:
- CLOSED (Normal operation)
  - All requests pass through
  - Monitoring for failures
- OPEN (Provider is down)
  - All requests immediately fail
  - No calls to failing provider
  - Save time and money
- HALF-OPEN (Testing recovery)
  - Allow one test request
  - If success → CLOSED
  - If failure → OPEN again
State Transitions:
  [Normal]
     ↓
   CLOSED ←──────────────┐
     │                   │
 [Too many               │
  failures]          [Success]
     ↓                   │
    OPEN ──────────→ HALF-OPEN
        [Wait 30s]   [Test call]
                         ↓
                     [Failure]
                         ↓
                        OPEN
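The three states and their transitions can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not how any particular library implements it: `SimpleCircuitBreaker` and `CircuitOpenError` are hypothetical names, the class is not thread-safe, and the clock is injectable so the behaviour can be checked without sleeping.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class SimpleCircuitBreaker:
    """Illustrative three-state breaker (closed, open, half-open)."""

    def __init__(self, fail_max=3, reset_timeout=30.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests do not need to sleep
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Timeout elapsed: let exactly this call through as a test.
            self.state = "half-open"
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and clears the failure count.
        self.state = "closed"
        self.failures = 0
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed test call reopens immediately; otherwise wait for fail_max.
        if self.state == "half-open" or self.failures >= self.fail_max:
            self.state = "open"
            self.opened_at = self.clock()
```

Usage is `breaker.call(some_function, arg)`: while the circuit is open the wrapped function is never invoked, which is where the time and cost savings come from.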
Prerequisites
Before implementing:
- Edge Proxy deployed (see Edge Proxy Guide) - OR -
- Direct AI API integration in your application
- Python 3.8+ (for code examples) or your language of choice
- Basic understanding of error handling and retries
Implementation Steps
Step 1: Choose Your Implementation Approach
Option A: Use LiteLLM Proxy (Recommended if you have Edge Proxy)
Circuit breakers built-in, no code needed.
Option B: Application-Level Circuit Breaker (For direct API calls)
Add circuit breaker to your existing code using pybreaker library.
Option C: Service Mesh (Advanced/Enterprise)
Use Istio or Linkerd for circuit breakers at infrastructure level.
We'll cover Options A and B in this guide.
Option A: Circuit Breakers in LiteLLM Proxy
If you already have an Edge Proxy (from the Edge Proxy Implementation Guide), circuit breakers are built-in.
Step 1: Enable Circuit Breakers in Config
Update your litellm_config.yaml:
    model_list:
      - model_name: gpt-4o-mini
        litellm_params:
          model: openai/gpt-4o-mini
          api_key: os.environ/OPENAI_API_KEY
      - model_name: claude-3-5-sonnet
        litellm_params:
          model: anthropic/claude-3-5-sonnet-20241022
          api_key: os.environ/ANTHROPIC_API_KEY

    # Circuit breaker configuration
    # Key names are as given in this guide. Check them against the LiteLLM
    # version you run; its router documentation describes cooldowns through
    # settings such as allowed_fails and cooldown_time.
    router_settings:
      routing_strategy: least-cost

      # Circuit breaker settings
      circuit_breaker:
        enabled: true

        # Failure threshold
        failure_threshold: 5    # Open circuit after 5 failures

        # Time window for counting failures
        window_size: 60         # Count failures in last 60 seconds

        # Recovery timeout
        recovery_timeout: 30    # Wait 30 seconds before testing recovery

        # Success threshold for recovery
        success_threshold: 2    # Need 2 successes to close circuit

        # What counts as a failure?
        failure_conditions:
          - status_code: 500
          - status_code: 502
          - status_code: 503
          - status_code: 504
          - timeout: true
          - rate_limit: true    # Treat rate limits as failures

        # Fallback when circuit is open
        # (every fallback name must also be defined in model_list)
        fallbacks:
          - claude-3-5-sonnet   # Use Claude if OpenAI is down
          - gpt-4o-azure        # Then try Azure

        # Alert when circuit opens
        alerting:
          webhook_url: https://your-app.com/webhooks/circuit-breaker
          slack_webhook: https://hooks.slack.com/services/YOUR/WEBHOOK
Step 2: Restart Proxy
    docker restart litellm-proxy

    # Or if using docker-compose
    docker-compose restart
Step 3: Test Circuit Breaker
Simulate provider failure:
    # Send 6 requests with invalid API key (will fail)
    for i in {1..6}; do
      curl -X POST http://localhost:4000/chat/completions \
        -H "Authorization: Bearer sk-1234567890abcdef" \
        -H "Content-Type: application/json" \
        -d '{
          "model": "gpt-4o-mini",
          "messages": [{"role": "user", "content": "Hello"}]
        }'
      sleep 1
    done

    # Check circuit status
    curl http://localhost:4000/metrics | grep circuit_breaker
    # Should show: circuit_breaker{provider="openai"} = 1 (1 = OPEN)
Verify fallback:
    # This request should now go to Claude (fallback)
    # -i prints the response headers so the routing can be checked
    curl -i -X POST http://localhost:4000/chat/completions \
      -H "Authorization: Bearer sk-1234567890abcdef" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Hello"}]
      }'

    # Check response headers
    # X-LiteLLM-Provider: anthropic (routed to fallback)
Wait for recovery test:
    # Wait 30 seconds, then send request
    sleep 30

    # This will test if OpenAI recovered
    curl -X POST http://localhost:4000/chat/completions \
      -H "Authorization: Bearer sk-1234567890abcdef" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Hello"}]
      }'

    # If successful, circuit closes automatically
Option B: Application-Level Circuit Breaker
If you're making direct API calls (no proxy), implement circuit breakers in your application code.
Step 1: Install Circuit Breaker Library
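    # Run inside a virtual environment (python -m venv .venv) rather than
    # overriding system packages with --break-system-packages
    pip install pybreaker openai anthropic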
Step 2: Create Circuit Breaker Wrapper
Create ai_circuit_breaker.py:
    import logging
    import os

    import openai
    from anthropic import Anthropic
    from pybreaker import CircuitBreaker, CircuitBreakerError

    logger = logging.getLogger(__name__)

    # Create circuit breakers for each provider
    openai_breaker = CircuitBreaker(
        fail_max=5,          # Open after 5 consecutive failures
        reset_timeout=30,    # Wait 30 seconds before testing recovery
        exclude=[
            # Don't count these as failures:
            openai.RateLimitError,  # Rate limits are expected
        ],
        name="OpenAI",
    )
    anthropic_breaker = CircuitBreaker(fail_max=5, reset_timeout=30, exclude=[], name="Anthropic")
    azure_breaker = CircuitBreaker(fail_max=5, reset_timeout=30, exclude=[], name="Azure")


    class AIProviderWithCircuitBreaker:
        """Wrapper that adds circuit breaker to AI provider calls"""

        def __init__(self):
            self.openai_client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
            self.anthropic_client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

        def chat_completion(self, messages, model="gpt-4o-mini", **kwargs):
            """
            Make chat completion with automatic fallback if provider is down.

            Tries providers in order:
            1. OpenAI (primary)
            2. Anthropic (fallback)
            3. Azure (last resort)
            """
            # Always call through the breaker instead of checking current_state
            # first. An open breaker fails fast with CircuitBreakerError, and only
            # a call made after reset_timeout lets it go half-open and test recovery.

            # Try OpenAI first
            try:
                return self._openai_chat(messages, model, **kwargs)
            except CircuitBreakerError:
                logger.warning("OpenAI circuit breaker is OPEN, trying fallback")
            except Exception as e:
                # The breaker has already recorded this failure
                logger.error(f"OpenAI call failed: {e}")

            # Fallback to Anthropic
            try:
                return self._anthropic_chat(messages, **kwargs)
            except CircuitBreakerError:
                logger.warning("Anthropic circuit breaker is OPEN, trying last resort")
            except Exception as e:
                logger.error(f"Anthropic call failed: {e}")

            # Last resort: Azure
            try:
                return self._azure_chat(messages, model, **kwargs)
            except CircuitBreakerError:
                logger.error("All providers have open circuit breakers!")
                raise Exception("All AI providers are currently unavailable")
            except Exception as e:
                logger.error(f"Azure call failed: {e}")
                raise

        @openai_breaker
        def _openai_chat(self, messages, model, **kwargs):
            """OpenAI API call wrapped with circuit breaker"""
            logger.info(f"Calling OpenAI with model {model}")
            response = self.openai_client.chat.completions.create(
                model=model,
                messages=messages,
                **kwargs
            )
            return {
                'provider': 'openai',
                'model': model,
                'content': response.choices[0].message.content,
                'usage': {
                    'input_tokens': response.usage.prompt_tokens,
                    'output_tokens': response.usage.completion_tokens,
                }
            }

        @anthropic_breaker
        def _anthropic_chat(self, messages, **kwargs):
            """Anthropic API call wrapped with circuit breaker"""
            logger.info("Calling Anthropic with Claude 3.5 Sonnet")

            # Convert OpenAI message format to Anthropic format
            anthropic_messages = []
            system_prompt = None
            for msg in messages:
                if msg['role'] == 'system':
                    system_prompt = msg['content']
                else:
                    anthropic_messages.append({
                        'role': msg['role'],
                        'content': msg['content']
                    })

            request = {
                'model': "claude-3-5-sonnet-20241022",
                'max_tokens': kwargs.get('max_tokens', 4096),
                'messages': anthropic_messages,
            }
            if system_prompt is not None:
                # Send the system field only when a system message exists
                request['system'] = system_prompt

            response = self.anthropic_client.messages.create(**request)
            return {
                'provider': 'anthropic',
                'model': 'claude-3-5-sonnet-20241022',
                'content': response.content[0].text,
                'usage': {
                    'input_tokens': response.usage.input_tokens,
                    'output_tokens': response.usage.output_tokens,
                }
            }

        @azure_breaker
        def _azure_chat(self, messages, model, **kwargs):
            """Azure OpenAI API call wrapped with circuit breaker"""
            logger.info(f"Calling Azure OpenAI with model {model}")
            azure_client = openai.AzureOpenAI(
                api_key=os.environ["AZURE_API_KEY"],
                api_version="2024-02-15-preview",
                azure_endpoint=os.environ["AZURE_API_BASE"]
            )
            response = azure_client.chat.completions.create(
                model=model,
                messages=messages,
                **kwargs
            )
            return {
                'provider': 'azure',
                'model': model,
                'content': response.choices[0].message.content,
                'usage': {
                    'input_tokens': response.usage.prompt_tokens,
                    'output_tokens': response.usage.completion_tokens,
                }
            }

        def get_circuit_status(self):
            """Get current status of all circuit breakers"""
            return {
                'openai': openai_breaker.current_state,
                'anthropic': anthropic_breaker.current_state,
                'azure': azure_breaker.current_state,
            }


    # Singleton instance
    ai_provider = AIProviderWithCircuitBreaker()
Step 3: Update Your Application Code
Before (Direct calls with no protection):
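    import openai

    client = openai.OpenAI(api_key="sk-...")

    # No circuit breaker: during an outage every call waits out the SDK's own
    # timeout and retries before failing, and nothing stops the next call
    # from doing the same
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello"}]
    )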
After (With circuit breaker + fallback):
    from ai_circuit_breaker import ai_provider

    # Automatically fails fast and uses fallback if OpenAI is down
    response = ai_provider.chat_completion(
        messages=[{"role": "user", "content": "Hello"}],
        model="gpt-4o-mini"
    )

    print(f"Response from {response['provider']}: {response['content']}")
    print(f"Tokens used: {response['usage']}")
Step 4: Add Circuit Status Endpoint (Optional)
Monitor circuit breaker status in your application:
    from flask import Flask, jsonify

    from ai_circuit_breaker import ai_provider

    app = Flask(__name__)


    @app.route('/health/circuit-breakers')
    def circuit_breaker_status():
        """Endpoint to check circuit breaker status"""
        status = ai_provider.get_circuit_status()

        # Determine overall health
        all_open = all(state == 'open' for state in status.values())
        some_open = any(state == 'open' for state in status.values())

        return jsonify({
            'providers': status,
            'overall_health': 'critical' if all_open else 'degraded' if some_open else 'healthy'
        })
Step 5: Configure Alerting
Get notified when circuit breakers trip:
    # Add to ai_circuit_breaker.py
    import requests
    from pybreaker import CircuitBreakerListener

    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/YOUR/WEBHOOK"


    class SlackAlertListener(CircuitBreakerListener):
        """Sends a Slack message when a breaker opens or closes.

        pybreaker listeners are objects, not bare functions: state_change()
        is called on every state transition.
        """

        def state_change(self, cb, old_state, new_state):
            if new_state.name == "open":
                logger.critical(f"Circuit breaker OPENED for {cb.name}!")
                self._notify(
                    f"🚨 Circuit Breaker OPEN: {cb.name} provider is down",
                    "danger",
                    [
                        {"title": "Provider", "value": cb.name, "short": True},
                        {"title": "Failures", "value": str(cb.fail_counter), "short": True},
                    ],
                )
            elif new_state.name == "closed":
                logger.info(f"Circuit breaker CLOSED for {cb.name} - provider recovered")
                self._notify(
                    f"✅ Circuit Breaker CLOSED: {cb.name} provider recovered",
                    "good",
                    [
                        {"title": "Provider", "value": cb.name, "short": True},
                        {"title": "Status", "value": "Operational", "short": True},
                    ],
                )

        def _notify(self, text, color, fields):
            """Post to Slack; an alerting failure must never break the API call."""
            try:
                requests.post(
                    SLACK_WEBHOOK_URL,
                    json={"text": text, "attachments": [{"color": color, "fields": fields}]},
                    timeout=5,
                )
            except requests.RequestException as e:
                logger.error(f"Could not send Slack alert: {e}")


    # Add listeners to circuit breakers
    openai_breaker.add_listener(SlackAlertListener())
    anthropic_breaker.add_listener(SlackAlertListener())
    azure_breaker.add_listener(SlackAlertListener())
Testing Your Circuit Breaker
Test 1: Simulate Provider Failure
    # test_circuit_breaker.py
    import os

    os.environ["OPENAI_API_KEY"] = "sk-invalid"  # Use invalid key

    from ai_circuit_breaker import ai_provider

    # This should fail 5 times, then open circuit
    for i in range(6):
        try:
            response = ai_provider.chat_completion(
                messages=[{"role": "user", "content": "Hello"}],
                model="gpt-4o-mini"
            )
            print(f"Call {i+1}: Success from {response['provider']}")
        except Exception as e:
            print(f"Call {i+1}: Failed - {e}")

    # Check circuit status
    status = ai_provider.get_circuit_status()
    print(f"\nCircuit Status: {status}")
    # Should show: {'openai': 'open', 'anthropic': 'closed', 'azure': 'closed'}
Test 2: Verify Fallback
    # With OpenAI circuit open, requests should go to Anthropic
    response = ai_provider.chat_completion(
        messages=[{"role": "user", "content": "Hello"}],
        model="gpt-4o-mini"
    )
    print(f"Provider used: {response['provider']}")
    # Should print: Provider used: anthropic
Test 3: Verify Recovery
    import os
    import time

    import openai

    # Fix OpenAI API key. The client was created at import time, so changing
    # the environment variable alone has no effect; rebuild the client too.
    os.environ["OPENAI_API_KEY"] = "sk-correct-key"
    ai_provider.openai_client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    # Wait for recovery timeout (30 seconds)
    print("Waiting 30 seconds for circuit to enter half-open state...")
    time.sleep(30)

    # Next request will test recovery
    response = ai_provider.chat_completion(
        messages=[{"role": "user", "content": "Hello"}],
        model="gpt-4o-mini"
    )
    print(f"Provider used: {response['provider']}")
    # Should print: Provider used: openai (circuit recovered!)

    status = ai_provider.get_circuit_status()
    print(f"Circuit Status: {status}")
    # Should show: {'openai': 'closed', 'anthropic': 'closed', 'azure': 'closed'}
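The tests above call real provider APIs, which costs money and depends on network conditions. For repeatable local runs, one option is a scripted stand-in that fails and succeeds in a fixed order. The sketch below is an assumption about how you might structure this; `ScriptedProvider` is a hypothetical helper, not part of pybreaker or any SDK.

```python
class ScriptedProvider:
    """Callable test double that replays a fixed list of outcomes.

    Each outcome is either an Exception instance (raised when its turn comes)
    or any other value (returned). Once the script is exhausted, the default
    value is returned on every further call.
    """

    def __init__(self, outcomes, default="ok"):
        self._outcomes = list(outcomes)
        self._default = default
        self.calls = 0  # how many times the "provider" was actually invoked

    def __call__(self, *args, **kwargs):
        self.calls += 1
        outcome = self._outcomes.pop(0) if self._outcomes else self._default
        if isinstance(outcome, Exception):
            raise outcome
        return outcome
```

Wrapping an instance in a breaker and checking `calls` afterwards shows whether the breaker really stopped invoking the provider while open, which is the behaviour the three tests are meant to confirm.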
Advanced Configuration
Custom Failure Detection
Not all errors should trip the circuit breaker. Configure what counts as a failure:
    import openai
    from pybreaker import CircuitBreaker


    def is_not_server_error(exception):
        """Exclusion predicate: return True when the error should NOT trip the circuit.

        Only HTTP 5xx responses count as provider failures; other API status
        errors are ignored. Connection errors and timeouts still count.
        """
        if isinstance(exception, openai.APIStatusError):
            return not (500 <= exception.status_code < 600)
        return False


    openai_breaker = CircuitBreaker(
        fail_max=5,
        reset_timeout=30,
        # Exclude these exceptions (don't count as failures)
        exclude=[
            openai.RateLimitError,       # Expected during high traffic
            openai.AuthenticationError,  # Configuration issue, not provider issue
            ValueError,                  # App logic error, not provider issue
            # exclude also accepts callables, which allows custom logic
            is_not_server_error,
        ],
    )
Adaptive Thresholds
Adjust failure threshold based on traffic volume:
    import time

    from pybreaker import CircuitBreaker


    class AdaptiveCircuitBreaker:
        def __init__(self):
            self.base_fail_max = 5
            self.requests_per_minute = 0
            self.last_reset = time.time()
            self.breaker = CircuitBreaker(
                fail_max=self.base_fail_max,
                reset_timeout=30
            )

        def call(self, func, *args, **kwargs):
            # Track request rate, starting a new one-minute window when the old one expires
            if time.time() - self.last_reset > 60:
                self.requests_per_minute = 0
                self.last_reset = time.time()
            self.requests_per_minute += 1

            # Adjust threshold based on traffic
            # High traffic = more lenient (allow more failures)
            if self.requests_per_minute > 100:
                self.breaker.fail_max = 20
            elif self.requests_per_minute > 50:
                self.breaker.fail_max = 10
            else:
                self.breaker.fail_max = self.base_fail_max

            return self.breaker.call(func, *args, **kwargs)
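Scaling a fixed failure count is one way to adapt to traffic. A different technique, swapped in here purely for comparison, is to trip on the failure ratio once enough requests have been seen. The function name and both default values below are illustrative assumptions, not pybreaker settings.

```python
def should_trip(failures, total_requests, min_requests=20, max_failure_ratio=0.5):
    """Decide whether to open the circuit based on failure ratio.

    min_requests guards against tripping on tiny samples (for example,
    one failure out of one request is a 100% ratio but says very little).
    """
    if total_requests < min_requests:
        return False
    return failures / total_requests >= max_failure_ratio
```

A ratio adapts to traffic volume without tiered thresholds, at the cost of needing a count of total requests as well as failures over the same period.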
Per-Endpoint Circuit Breakers
Different endpoints have different reliability profiles:
    from pybreaker import CircuitBreaker

    # Separate circuit breakers for different operations
    chat_breaker = CircuitBreaker(fail_max=5, reset_timeout=30, name="OpenAI-Chat")
    embedding_breaker = CircuitBreaker(fail_max=10, reset_timeout=60, name="OpenAI-Embeddings")
    image_breaker = CircuitBreaker(fail_max=3, reset_timeout=120, name="OpenAI-Images")


    # Function bodies are left as placeholders
    @chat_breaker
    def chat_completion(*args, **kwargs):
        ...


    @embedding_breaker
    def create_embedding(*args, **kwargs):
        ...


    @image_breaker
    def generate_image(*args, **kwargs):
        ...
Monitoring & Dashboards
Metrics to Track
- Circuit State Changes:
  - When did circuit open?
  - How long was it open?
  - How many times per day?
- Failure Rate:
  - Failures per minute
  - Failure types (timeout, 500, etc)
  - Which provider fails most?
- Fallback Usage:
  - % of requests using fallback
  - Cost impact of fallbacks
- Recovery Time:
  - How quickly do circuits close?
  - Are recovery tests succeeding?
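The first group of metrics (how often the circuit opened and for how long) can be derived from a log of state changes. The sketch below assumes you record each transition as a `(timestamp, new_state)` pair; that event format and the function name are hypothetical.

```python
from datetime import datetime, timedelta


def summarize_open_periods(events, now):
    """Count outages and total time spent not closed.

    events: list of (datetime, state) tuples sorted by time, where state is
            "open", "half-open", or "closed".
    now:    current time, used to account for an outage still in progress.

    An outage starts at the first "open" and ends at the next "closed";
    half-open test calls that fail do not start a new outage.
    """
    open_count = 0
    total_open = timedelta()
    opened_at = None

    for timestamp, state in events:
        if state == "open" and opened_at is None:
            opened_at = timestamp
            open_count += 1
        elif state == "closed" and opened_at is not None:
            total_open += timestamp - opened_at
            opened_at = None

    if opened_at is not None:
        total_open += now - opened_at  # outage still ongoing

    return {
        "open_count": open_count,
        "total_open_seconds": total_open.total_seconds(),
    }
```

Dividing `total_open_seconds` by `open_count` gives a rough mean recovery time, which covers the last group of metrics as well.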
Prometheus Metrics
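    from prometheus_client import Counter, Gauge
    from pybreaker import CircuitBreakerListener

    # Circuit breaker metrics
    circuit_state = Gauge(
        'circuit_breaker_state',
        'Circuit breaker state (0 = closed, 0.5 = half-open, 1 = open)',
        ['provider']
    )
    circuit_failures = Counter('circuit_breaker_failures', 'Failures counted', ['provider'])
    circuit_state_changes = Counter(
        'circuit_breaker_state_changes',
        'State transitions',
        ['provider', 'from_state', 'to_state']
    )

    STATE_VALUES = {'closed': 0, 'half-open': 0.5, 'open': 1}


    class PrometheusListener(CircuitBreakerListener):
        """Updates the metrics above from pybreaker events."""

        def state_change(self, cb, old_state, new_state):
            # States are objects; their .name is 'closed', 'open' or 'half-open'
            circuit_state.labels(provider=cb.name).set(STATE_VALUES[new_state.name])
            circuit_state_changes.labels(
                provider=cb.name,
                from_state=old_state.name if old_state else 'none',
                to_state=new_state.name
            ).inc()

        def failure(self, cb, exc):
            circuit_failures.labels(provider=cb.name).inc()


    # Register on each breaker, for example:
    # openai_breaker.add_listener(PrometheusListener())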
Grafana Dashboard
Create a dashboard with:
- Circuit state over time (closed/open/half-open)
- Failure rate by provider
- Fallback usage percentage
- Recovery time histogram
Production Checklist
Before deploying circuit breakers to production:
- Circuit breaker thresholds tuned for your traffic
- Fallback providers configured and tested
- Alerting set up (Slack/PagerDuty)
- Monitoring dashboard created
- Load tested with simulated failures
- Recovery timeout is appropriate (not too short)
- Team trained on what to do when circuit opens
- Runbook created for manual intervention
- Circuit status exposed in health endpoint
- Tested with all possible error types
Expected Results
Without Circuit Breakers:
- Provider outage scenario:
- 10,000 requests retry indefinitely
- App becomes unresponsive
- $50,000 in wasted API calls
- 2 hour recovery time
- Customer complaints
With Circuit Breakers:
- Same outage scenario:
- 5 requests fail, circuit opens
- 9,995 requests fast-fail to fallback
- App remains responsive
- $250 in failed calls, rest goes to fallback
- 30 second automatic recovery
- Users barely notice
Cost Savings in Outages:
- Prevented: $49,750
- Time Saved: 1h 59m 30s
- User Impact: Minimal
Reliability Improvement:
- 99.9% → 99.99% uptime
- Mean Time To Recovery: 2 hours → 30 seconds
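To put the uptime percentages in concrete terms, an availability figure can be converted into a yearly downtime budget. This is plain arithmetic rather than a measurement of any system; the function name is illustrative.

```python
def downtime_minutes_per_year(availability_percent):
    """Convert an availability percentage into allowed downtime per 365-day year."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - availability_percent / 100)


for target in (99.9, 99.99):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

By this calculation 99.9% allows roughly 525.6 minutes (about 8.8 hours) per year and 99.99% roughly 52.6 minutes, so a single 2-hour outage would use more than twice the yearly budget at the higher target.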
Troubleshooting
Circuit breaker not opening
Possible causes:
- Failure threshold too high
- Errors being excluded
- Window size too large
Fix:
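    # Lower threshold for testing
    CircuitBreaker(
        fail_max=3,   # Was 5
        exclude=[],   # Don't exclude any errors during testing
    )
    # Note: pybreaker has no window_size option. It counts consecutive
    # failures, and any success resets the counter. A time window applies
    # only to the proxy configuration in Option A.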
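The "window size" cause refers to counting failures inside a time window, as in the Option A config (`window_size: 60`). If you want that behaviour at application level, a sliding-window counter is one way to sketch it. The class below is a hypothetical helper with an injectable clock, not a feature of any library.

```python
import time
from collections import deque


class SlidingWindowFailureCounter:
    """Tracks failure timestamps and reports when too many fall in the window."""

    def __init__(self, threshold=5, window_seconds=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests do not need to sleep
        self._failures = deque()

    def _evict_old(self):
        # Drop failures that happened before the start of the window.
        cutoff = self.clock() - self.window_seconds
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()

    def record_failure(self):
        self._failures.append(self.clock())
        self._evict_old()

    def should_open(self):
        self._evict_old()
        return len(self._failures) >= self.threshold
```

With a large window, old failures linger and the circuit trips on slow trickles of errors; with a small one, only bursts count. That trade-off is what the threshold and window settings control together.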
Circuit breaker opens too easily
Possible causes:
- Threshold too low for traffic volume
- Counting expected errors as failures
Fix:
    # Increase threshold or exclude expected errors
    CircuitBreaker(
        fail_max=10,  # Was 5
        exclude=[
            openai.RateLimitError,
            openai.APITimeoutError,  # If timeouts are common
        ],
    )
Recovery testing too aggressive
Symptom: Circuit repeatedly opens and closes
Fix:
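    # Increase recovery timeout
    CircuitBreaker(
        reset_timeout=60,     # Was 30
        # Need 3 successes to fully recover. Older pybreaker releases may not
        # accept success_threshold; confirm that your installed version does.
        success_threshold=3,
    )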
Next Steps
Once circuit breakers are working:
- Add Caching (see Caching Implementation Guide)
  - Serve cached responses when circuit is open
  - Reduce dependency on live providers
- Implement Retry with Backoff (see Retry Strategies Guide)
  - Intelligent retries before opening circuit
  - Exponential backoff
- Set Up Comprehensive Monitoring (see Monitoring Guide)
  - Track circuit state in real-time
  - Alert on concerning patterns
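As a preview of the retry step, exponential backoff with jitter can be sketched as follows. This is a minimal illustration, not the content of the Retry Strategies Guide: the function name, defaults, and the injectable `sleep` and `rng` parameters are assumptions made so the behaviour can be tested.

```python
import random
import time


def retry_with_backoff(func, max_attempts=3, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep, rng=random.random):
    """Call func, retrying on any exception with capped exponential backoff.

    The delay ceiling doubles after each failed attempt (0.5s, 1s, 2s, ...)
    up to max_delay, and the actual wait is a random fraction of that ceiling
    so that many clients do not retry at the same moment.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller (or breaker) see it
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(ceiling * rng())
```

Placing the retry inside the breaker-wrapped call means the breaker counts one failure per exhausted retry sequence; placing it outside means every attempt is counted. Which is appropriate depends on how quickly you want the circuit to open.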
Additional Resources
- PyBreaker Documentation: https://github.com/danielfm/pybreaker
- Circuit Breaker Pattern: https://martinfowler.com/bliki/CircuitBreaker.html
- Netflix Hystrix (Reference): https://github.com/Netflix/Hystrix/wiki
- Cloud Design Patterns: https://learn.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker
Support
Need help implementing circuit breakers?
- Onaro Support: support@onaro.io
- Book implementation call: https://onaro.io/support
- Community: https://community.onaro.io
Estimated Implementation Time: 2-3 hours
Difficulty: 3/5
Impact: 5/5 (Prevents catastrophic failures)
Last Updated: January 26, 2026
Tested with: pybreaker 1.0.1, OpenAI SDK 1.12.0, Anthropic SDK 0.18.0