
Circuit Breakers

Prevent cascading failures and reduce costs during outages

Time: 2-3 hours
Difficulty: Intermediate
Potential Savings: $200-1,000/month

Best For: Production systems with high availability requirements

Circuit Breaker Implementation Guide

Prevent Cascade Failures & Improve AI API Reliability

Difficulty: Intermediate
Time Required: 2-3 hours
Potential Savings: Prevents catastrophic cost overruns (up to $10K+ saved in outage scenarios)
Best For: All production applications using AI APIs


What is a Circuit Breaker?

A Circuit Breaker is a design pattern that prevents your application from repeatedly calling a failing service, protecting your system from:

  • Cascade failures (one failing API causes your entire app to fail)
  • Cost overruns (retrying failed requests thousands of times)
  • Poor user experience (long timeouts instead of fast failures)
  • Resource exhaustion (threads/connections stuck waiting)

The Problem Without Circuit Breakers:

Your App → [Call OpenAI] → 500 Error
Your App → [Retry] → 500 Error
Your App → [Retry] → 500 Error
Your App → [Retry] → 500 Error  (repeats 1000x)

Result: 
- App becomes unresponsive
- $5,000 in wasted API calls
- 30 minute recovery time

The Solution With Circuit Breakers:

Your App → [Call OpenAI] → 500 Error
Your App → [Retry] → 500 Error
Your App → [Retry] → 500 Error

Circuit Breaker: "Provider is down, stop trying!"
Circuit Status: OPEN (requests blocked)

Your App → [Fast fail] → Use fallback provider
Your App → [Fast fail] → Use cached response
Your App → [Fast fail] → Show user-friendly error

After 30 seconds:
Circuit Status: HALF-OPEN (test if provider recovered)
Your App → [Test call] → Success! → Circuit CLOSED

Result:

  • App remains responsive
  • Only 3 failed calls instead of 1000
  • Automatic recovery
  • $4,985 saved (3 failed calls at about $5 each, instead of 1,000)

How Circuit Breakers Work

Three States:

  1. CLOSED (Normal operation)

    • All requests pass through
    • Monitoring for failures
  2. OPEN (Provider is down)

    • All requests immediately fail
    • No calls to failing provider
    • Save time and money
  3. HALF-OPEN (Testing recovery)

    • Allow one test request
    • If success → CLOSED
    • If failure → OPEN again

State Transitions:

      [Normal]
         ↓
      CLOSED ←─────────────┐
         ↓                 │
   [Too many               │
    failures]          [Success]
         ↓                 │
       OPEN ────────→  HALF-OPEN
    [Wait 30s]         [Test call]
                           │
                       [Failure]
                           ↓
                         OPEN
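
The three states and their transitions can be sketched in a few lines of plain Python. This is a minimal illustration, not pybreaker's or LiteLLM's implementation: the class name, the CircuitOpenError exception, and the injectable clock parameter are all hypothetical, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class SimpleCircuitBreaker:
    """Illustrative three-state breaker (hypothetical names and defaults)."""

    def __init__(self, fail_max=3, reset_timeout=30.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can fake time
        self.state = "closed"
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            self.state = "half-open"  # timeout elapsed: allow one test call

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed test call, or too many consecutive failures, opens the circuit.
        if self.state == "half-open" or self.failures >= self.fail_max:
            self.state = "open"
            self.opened_at = self.clock()

    def _record_success(self):
        # Any success resets the count and closes the circuit.
        self.failures = 0
        self.state = "closed"
```

Passing a fake clock (for example `clock=lambda: now[0]`) lets a test step through the 30-second wait without sleeping.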

Prerequisites

Before implementing:

  • Edge Proxy deployed (see Edge Proxy Guide) - OR -
  • Direct AI API integration in your application
  • Python 3.8+ (for code examples) or your language of choice
  • Basic understanding of error handling and retries

Implementation Steps

Step 1: Choose Your Implementation Approach

Option A: Use LiteLLM Proxy (Recommended if you have Edge Proxy)

Circuit breakers built-in, no code needed.

Option B: Application-Level Circuit Breaker (For direct API calls)

Add a circuit breaker to your existing code using the pybreaker library.

Option C: Service Mesh (Advanced/Enterprise)

Use Istio or Linkerd for circuit breakers at infrastructure level.

We'll cover Options A and B in this guide.


Option A: Circuit Breakers in LiteLLM Proxy

If you already have an Edge Proxy (from the Edge Proxy Implementation Guide), circuit breakers are built-in.

Step 1: Enable Circuit Breakers in Config

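Update your litellm_config.yaml. Option names differ between LiteLLM releases (recent versions expose this cooldown behavior through allowed_fails and cooldown_time under router_settings), so confirm each key against the configuration reference for the version you run: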

    model_list:
      - model_name: gpt-4o-mini
        litellm_params:
          model: openai/gpt-4o-mini
          api_key: os.environ/OPENAI_API_KEY
      - model_name: claude-3-5-sonnet
        litellm_params:
          model: anthropic/claude-3-5-sonnet-20241022
          api_key: os.environ/ANTHROPIC_API_KEY

    # Circuit breaker configuration
    router_settings:
      routing_strategy: least-cost

      # Circuit breaker settings
      circuit_breaker:
        enabled: true

        # Failure threshold
        failure_threshold: 5      # Open circuit after 5 failures

        # Time window for counting failures
        window_size: 60           # Count failures in last 60 seconds

        # Recovery timeout
        recovery_timeout: 30      # Wait 30 seconds before testing recovery

        # Success threshold for recovery
        success_threshold: 2      # Need 2 successes to close circuit

        # What counts as a failure?
        failure_conditions:
          - status_code: 500
          - status_code: 502
          - status_code: 503
          - status_code: 504
          - timeout: true
          - rate_limit: true      # Treat rate limits as failures

        # Fallback when circuit is open
        fallbacks:
          - claude-3-5-sonnet     # Use Claude if OpenAI is down
          - gpt-4o-azure          # Then try Azure

        # Alert when circuit opens
        alerting:
          webhook_url: https://your-app.com/webhooks/circuit-breaker
          slack_webhook: https://hooks.slack.com/services/YOUR/WEBHOOK

Step 2: Restart Proxy

    docker restart litellm-proxy

    # Or if using docker-compose
    docker-compose restart

Step 3: Test Circuit Breaker

Simulate provider failure:

    # Send 6 requests with invalid API key (will fail)
    for i in {1..6}; do
      curl -X POST http://localhost:4000/chat/completions \
        -H "Authorization: Bearer sk-1234567890abcdef" \
        -H "Content-Type: application/json" \
        -d '{
          "model": "gpt-4o-mini",
          "messages": [{"role": "user", "content": "Hello"}]
        }'
      sleep 1
    done

    # Check circuit status
    curl http://localhost:4000/metrics | grep circuit_breaker
    # Should show: circuit_breaker{provider="openai"} = 1 (1 = OPEN)

Verify fallback:

    # This request should now go to Claude (fallback)
    curl -X POST http://localhost:4000/chat/completions \
      -H "Authorization: Bearer sk-1234567890abcdef" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Hello"}]
      }'

    # Check response headers
    # X-LiteLLM-Provider: anthropic (routed to fallback)

Wait for recovery test:

    # Wait 30 seconds, then send request
    sleep 30

    # This will test if OpenAI recovered
    curl -X POST http://localhost:4000/chat/completions \
      -H "Authorization: Bearer sk-1234567890abcdef" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Hello"}]
      }'

    # If successful, circuit closes automatically
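
The circuit-status check from Step 3 can also be scripted rather than read by eye. The sketch below parses Prometheus-style text and reports each provider's state. It assumes the metric is named circuit_breaker with a provider label, as in the expected output above, and that it is emitted in the standard exposition form (name, labels, then the value); adjust the pattern if your proxy reports it differently.

```python
import re

# Matches lines such as: circuit_breaker{provider="openai"} 1
# (metric and label names are assumptions taken from this guide)
_LINE = re.compile(r'^circuit_breaker\{provider="([^"]+)"\}\s+([0-9.]+)\s*$')


def parse_circuit_states(metrics_text):
    """Return {provider: 'open' | 'closed'} from Prometheus text output."""
    states = {}
    for line in metrics_text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            provider, value = match.groups()
            states[provider] = "open" if float(value) == 1.0 else "closed"
    return states


sample = """# HELP circuit_breaker Circuit state per provider
# TYPE circuit_breaker gauge
circuit_breaker{provider="openai"} 1
circuit_breaker{provider="anthropic"} 0
"""
print(parse_circuit_states(sample))
# {'openai': 'open', 'anthropic': 'closed'}
```

In a real check, the text would come from the proxy's /metrics endpoint, for example via urllib.request.urlopen, instead of the inline sample.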

Option B: Application-Level Circuit Breaker

If you're making direct API calls (no proxy), implement circuit breakers in your application code.

Step 1: Install Circuit Breaker Library

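    # Install into a virtual environment rather than the system Python
    python -m venv .venv
    source .venv/bin/activate
    pip install pybreaker openai anthropic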

Step 2: Create Circuit Breaker Wrapper

Create ai_circuit_breaker.py:

    import logging
    import os

    import openai
    from anthropic import Anthropic
    from pybreaker import CircuitBreaker, CircuitBreakerError

    logger = logging.getLogger(__name__)

    # Create circuit breakers for each provider
    openai_breaker = CircuitBreaker(
        fail_max=5,          # Open after 5 consecutive failures
        reset_timeout=30,    # Wait 30 seconds before testing recovery
        exclude=[            # Don't count these as failures:
            openai.RateLimitError,  # Rate limits are expected
        ],
        name="OpenAI",
    )

    anthropic_breaker = CircuitBreaker(fail_max=5, reset_timeout=30, exclude=[], name="Anthropic")

    azure_breaker = CircuitBreaker(fail_max=5, reset_timeout=30, exclude=[], name="Azure")


    class AIProviderWithCircuitBreaker:
        """Wrapper that adds circuit breaker to AI provider calls"""

        def __init__(self):
            self.openai_client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
            self.anthropic_client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

        def chat_completion(self, messages, model="gpt-4o-mini", **kwargs):
            """
            Make chat completion with automatic fallback if provider is down.

            Tries providers in order:
            1. OpenAI (primary)
            2. Anthropic (fallback)
            3. Azure (last resort)
            """
            # Always call through the breaker instead of checking current_state
            # first: pybreaker only moves an open circuit to half-open when a
            # call is attempted after reset_timeout, so skipping the call would
            # keep the circuit open forever.

            # Try OpenAI first
            try:
                return self._openai_chat(messages, model, **kwargs)
            except CircuitBreakerError:
                logger.warning("OpenAI circuit breaker is OPEN, trying fallback")
            except Exception as e:
                # The breaker has already counted this failure
                logger.error(f"OpenAI call failed: {e}")

            # Fallback to Anthropic
            try:
                return self._anthropic_chat(messages, **kwargs)
            except CircuitBreakerError:
                logger.warning("Anthropic circuit breaker is OPEN, trying last resort")
            except Exception as e:
                logger.error(f"Anthropic call failed: {e}")

            # Last resort: Azure
            try:
                return self._azure_chat(messages, model, **kwargs)
            except CircuitBreakerError:
                logger.error("All providers have open circuit breakers!")
                raise RuntimeError("All AI providers are currently unavailable")
            except Exception as e:
                logger.error(f"Azure call failed: {e}")
                raise

        @openai_breaker
        def _openai_chat(self, messages, model, **kwargs):
            """OpenAI API call wrapped with circuit breaker"""
            logger.info(f"Calling OpenAI with model {model}")

            response = self.openai_client.chat.completions.create(
                model=model,
                messages=messages,
                **kwargs
            )

            return {
                'provider': 'openai',
                'model': model,
                'content': response.choices[0].message.content,
                'usage': {
                    'input_tokens': response.usage.prompt_tokens,
                    'output_tokens': response.usage.completion_tokens,
                }
            }

        @anthropic_breaker
        def _anthropic_chat(self, messages, **kwargs):
            """Anthropic API call wrapped with circuit breaker"""
            logger.info("Calling Anthropic with Claude 3.5 Sonnet")

            # Convert OpenAI message format to Anthropic format
            anthropic_messages = []
            system_prompt = None

            for msg in messages:
                if msg['role'] == 'system':
                    system_prompt = msg['content']
                else:
                    anthropic_messages.append({
                        'role': msg['role'],
                        'content': msg['content']
                    })

            request = {
                'model': "claude-3-5-sonnet-20241022",
                'max_tokens': kwargs.get('max_tokens', 4096),
                'messages': anthropic_messages,
            }
            # Only send a system prompt when the caller supplied one
            if system_prompt is not None:
                request['system'] = system_prompt

            response = self.anthropic_client.messages.create(**request)

            return {
                'provider': 'anthropic',
                'model': 'claude-3-5-sonnet-20241022',
                'content': response.content[0].text,
                'usage': {
                    'input_tokens': response.usage.input_tokens,
                    'output_tokens': response.usage.output_tokens,
                }
            }

        @azure_breaker
        def _azure_chat(self, messages, model, **kwargs):
            """Azure OpenAI API call wrapped with circuit breaker"""
            logger.info(f"Calling Azure OpenAI with model {model}")

            azure_client = openai.AzureOpenAI(
                api_key=os.environ["AZURE_API_KEY"],
                api_version="2024-02-15-preview",
                azure_endpoint=os.environ["AZURE_API_BASE"]
            )

            response = azure_client.chat.completions.create(
                model=model,
                messages=messages,
                **kwargs
            )

            return {
                'provider': 'azure',
                'model': model,
                'content': response.choices[0].message.content,
                'usage': {
                    'input_tokens': response.usage.prompt_tokens,
                    'output_tokens': response.usage.completion_tokens,
                }
            }

        def get_circuit_status(self):
            """Get current status of all circuit breakers"""
            return {
                'openai': openai_breaker.current_state,
                'anthropic': anthropic_breaker.current_state,
                'azure': azure_breaker.current_state,
            }


    # Singleton instance
    ai_provider = AIProviderWithCircuitBreaker()

Step 3: Update Your Application Code

Before (Direct calls with no protection):

    import openai

    client = openai.OpenAI(api_key="sk-...")

    # If OpenAI is down, every call still waits through the SDK's own retries
    # and timeout before failing, and nothing stops the next call from trying again
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello"}]
    )

After (With circuit breaker + fallback):

    from ai_circuit_breaker import ai_provider

    # Automatically fails fast and uses fallback if OpenAI is down
    response = ai_provider.chat_completion(
        messages=[{"role": "user", "content": "Hello"}],
        model="gpt-4o-mini"
    )

    print(f"Response from {response['provider']}: {response['content']}")
    print(f"Tokens used: {response['usage']}")
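
When every provider is unavailable, the call raises instead of returning. One way to handle that case is to serve the last cached answer, or a user-friendly message if nothing is cached, as described in the overview. The sketch below is an illustration under assumptions: AllProvidersUnavailable is a hypothetical exception standing in for whatever your wrapper raises, the cache is a plain dict, and the cache key is deliberately simplistic.

```python
class AllProvidersUnavailable(Exception):
    """Hypothetical stand-in for the wrapper's 'all providers down' error."""


FRIENDLY_ERROR = "The assistant is temporarily unavailable. Please try again shortly."


def complete_with_cache(call_provider, cache, messages):
    """Try the live provider chain; fall back to a cached answer if it is down."""
    key = repr(messages)  # simple illustrative cache key
    try:
        response = call_provider(messages)
    except AllProvidersUnavailable:
        if key in cache:
            cached = dict(cache[key])     # copy so the cache entry is untouched
            cached["provider"] = "cache"  # make the source visible to callers
            return cached
        # Nothing cached: return a clearly marked, user-friendly payload.
        return {"provider": "none", "content": FRIENDLY_ERROR}
    cache[key] = response  # remember the latest good answer
    return response
```

To use it with the wrapper from Step 2, pass a function that calls ai_provider.chat_completion and translate its failure into the exception type you catch here.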

Step 4: Add Circuit Status Endpoint (Optional)

Monitor circuit breaker status in your application:

    from flask import Flask, jsonify

    from ai_circuit_breaker import ai_provider

    app = Flask(__name__)


    @app.route('/health/circuit-breakers')
    def circuit_breaker_status():
        """Endpoint to check circuit breaker status"""
        status = ai_provider.get_circuit_status()

        # Determine overall health
        all_open = all(state == 'open' for state in status.values())
        some_open = any(state == 'open' for state in status.values())

        return jsonify({
            'providers': status,
            'overall_health': 'critical' if all_open else 'degraded' if some_open else 'healthy'
        })
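
The endpoint above reports health only in the JSON body and responds with HTTP 200 in every case, while load balancers and uptime checks usually look at the status code. A possible refinement, sketched here as a pure function so it is easy to test, maps the provider states to both a label and a status code. The function name and the choice of 503 for the critical case are assumptions; it also treats a half-open circuit as degraded.

```python
def summarize_health(status):
    """Map {provider: state} to (overall_health, http_status_code)."""
    states = list(status.values())
    if states and all(state == "open" for state in states):
        return "critical", 503   # nothing can serve traffic
    if any(state != "closed" for state in states):
        return "degraded", 200   # open or half-open somewhere, still serving
    return "healthy", 200


print(summarize_health({"openai": "open", "anthropic": "closed", "azure": "closed"}))
# ('degraded', 200)
```

In the Flask view, the second element can be returned alongside the JSON body, for example `return jsonify(body), code`.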

Step 5: Configure Alerting

Get notified when circuit breakers trip:

    # Add to ai_circuit_breaker.py
    import pybreaker
    import requests

    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/YOUR/WEBHOOK"


    def _post_to_slack(payload):
        """Send an alert without letting a Slack outage break the AI call path"""
        try:
            requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=5)
        except requests.RequestException as e:
            logger.error(f"Failed to send Slack alert: {e}")


    def on_circuit_open(breaker):
        """Called when circuit breaker opens"""
        logger.critical(f"Circuit breaker OPENED for {breaker.name}!")

        # Send alert to Slack
        _post_to_slack({
            "text": f"🚨 Circuit Breaker OPEN: {breaker.name} provider is down",
            "attachments": [{
                "color": "danger",
                "fields": [
                    {"title": "Provider", "value": breaker.name, "short": True},
                    {"title": "Failures", "value": str(breaker.fail_counter), "short": True}
                ]
            }]
        })


    def on_circuit_close(breaker):
        """Called when circuit breaker closes (recovery)"""
        logger.info(f"Circuit breaker CLOSED for {breaker.name} - provider recovered")

        # Send recovery notification
        _post_to_slack({
            "text": f"✅ Circuit Breaker CLOSED: {breaker.name} provider recovered",
            "attachments": [{
                "color": "good",
                "fields": [
                    {"title": "Provider", "value": breaker.name, "short": True},
                    {"title": "Status", "value": "Operational", "short": True}
                ]
            }]
        })


    class SlackAlertListener(pybreaker.CircuitBreakerListener):
        """pybreaker listeners are objects, so route state changes to the functions above"""

        def state_change(self, cb, old_state, new_state):
            if new_state.name == "open":
                on_circuit_open(cb)
            elif new_state.name == "closed":
                on_circuit_close(cb)


    # Add listeners to circuit breakers
    alert_listener = SlackAlertListener()
    openai_breaker.add_listener(alert_listener)
    anthropic_breaker.add_listener(alert_listener)
    azure_breaker.add_listener(alert_listener)

Testing Your Circuit Breaker

Test 1: Simulate Provider Failure

    # test_circuit_breaker.py
    import os

    os.environ["OPENAI_API_KEY"] = "sk-invalid"  # Use invalid key

    from ai_circuit_breaker import ai_provider

    # This should fail 5 times, then open circuit
    for i in range(6):
        try:
            response = ai_provider.chat_completion(
                messages=[{"role": "user", "content": "Hello"}],
                model="gpt-4o-mini"
            )
            print(f"Call {i+1}: Success from {response['provider']}")
        except Exception as e:
            print(f"Call {i+1}: Failed - {e}")

    # Check circuit status
    status = ai_provider.get_circuit_status()
    print(f"\nCircuit Status: {status}")
    # Should show: {'openai': 'open', 'anthropic': 'closed', 'azure': 'closed'}

Test 2: Verify Fallback

    # With OpenAI circuit open, requests should go to Anthropic
    response = ai_provider.chat_completion(
        messages=[{"role": "user", "content": "Hello"}],
        model="gpt-4o-mini"
    )

    print(f"Provider used: {response['provider']}")
    # Should print: Provider used: anthropic

Test 3: Verify Recovery

    import os
    import time

    import openai

    from ai_circuit_breaker import ai_provider

    # Fix OpenAI API key. The existing client captured the old key when it was
    # created, so rebuild the client rather than only changing the environment.
    os.environ["OPENAI_API_KEY"] = "sk-correct-key"
    ai_provider.openai_client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    # Wait for recovery timeout (30 seconds)
    print("Waiting 30 seconds for circuit to enter half-open state...")
    time.sleep(30)

    # Next request will test recovery
    response = ai_provider.chat_completion(
        messages=[{"role": "user", "content": "Hello"}],
        model="gpt-4o-mini"
    )

    print(f"Provider used: {response['provider']}")
    # Should print: Provider used: openai (circuit recovered!)

    status = ai_provider.get_circuit_status()
    print(f"Circuit Status: {status}")
    # Should show: {'openai': 'closed', 'anthropic': 'closed', 'azure': 'closed'}

Advanced Configuration

Custom Failure Detection

Not all errors should trip the circuit breaker. Configure what counts as a failure:

    import openai
    from pybreaker import CircuitBreaker


    def is_client_side_error(exception):
        """Return True for HTTP errors below 500, which should NOT trip the circuit"""
        return isinstance(exception, openai.APIStatusError) and exception.status_code < 500


    openai_breaker = CircuitBreaker(
        fail_max=5,
        reset_timeout=30,
        # Exclude these exceptions (don't count as failures)
        exclude=[
            openai.RateLimitError,       # Expected during high traffic
            openai.AuthenticationError,  # Configuration issue, not provider issue
            ValueError,                  # App logic error, not provider issue
            # exclude also accepts callables: return True to ignore the exception.
            # With this predicate only 5xx responses, timeouts, and connection
            # errors are left to count as failures.
            is_client_side_error,
        ],
    )

Adaptive Thresholds

Adjust failure threshold based on traffic volume:

    import time

    from pybreaker import CircuitBreaker


    class AdaptiveCircuitBreaker:
        def __init__(self):
            self.base_fail_max = 5
            self.requests_per_minute = 0
            self.last_reset = time.time()
            self.breaker = CircuitBreaker(
                fail_max=self.base_fail_max,
                reset_timeout=30
            )

        def call(self, func, *args, **kwargs):
            # Track request rate (start a new one-minute window when needed)
            if time.time() - self.last_reset > 60:
                self.requests_per_minute = 0
                self.last_reset = time.time()
            self.requests_per_minute += 1

            # Adjust threshold based on traffic through the public fail_max property
            # High traffic = more lenient (allow more failures)
            if self.requests_per_minute > 100:
                self.breaker.fail_max = 20
            elif self.requests_per_minute > 50:
                self.breaker.fail_max = 10
            else:
                self.breaker.fail_max = self.base_fail_max

            return self.breaker.call(func, *args, **kwargs)
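
A related way to make the trip decision scale with traffic is to look at the failure rate over a recent time window instead of a raw failure count. The sketch below is a standalone illustration and not a pybreaker feature: the class name, the 60-second window, and the 50% / 10-call thresholds are all assumptions to tune for your workload.

```python
import time
from collections import deque


class SlidingWindowFailureRate:
    """Tracks call outcomes over the last `window_seconds` (hypothetical helper)."""

    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock      # injectable so tests can fake time
        self.events = deque()   # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded):
        self.events.append((self.clock(), succeeded))
        self._evict()

    def _evict(self):
        # Drop outcomes that have aged out of the window.
        cutoff = self.clock() - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def failure_rate(self):
        self._evict()
        if not self.events:
            return 0.0
        failures = sum(1 for _, succeeded in self.events if not succeeded)
        return failures / len(self.events)

    def should_trip(self, max_rate=0.5, min_calls=10):
        # Require a minimum sample so one early failure cannot trip the circuit.
        self._evict()
        return len(self.events) >= min_calls and self.failure_rate() > max_rate
```

Because the rule is a ratio, 20 failures out of 1,000 calls is tolerated while 6 out of 11 is not, without hand-tuning thresholds per traffic tier.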

Per-Endpoint Circuit Breakers

Different endpoints have different reliability profiles:

    from pybreaker import CircuitBreaker

    # Separate circuit breakers for different operations
    chat_breaker = CircuitBreaker(fail_max=5, reset_timeout=30, name="OpenAI-Chat")
    embedding_breaker = CircuitBreaker(fail_max=10, reset_timeout=60, name="OpenAI-Embeddings")
    image_breaker = CircuitBreaker(fail_max=3, reset_timeout=120, name="OpenAI-Images")


    @chat_breaker
    def chat_completion(*args, **kwargs):
        ...  # Call the chat endpoint here


    @embedding_breaker
    def create_embedding(*args, **kwargs):
        ...  # Call the embeddings endpoint here


    @image_breaker
    def generate_image(*args, **kwargs):
        ...  # Call the image endpoint here

Monitoring & Dashboards

Metrics to Track

  1. Circuit State Changes:

    • When did circuit open?
    • How long was it open?
    • How many times per day?
  2. Failure Rate:

    • Failures per minute
    • Failure types (timeout, 500, etc)
    • Which provider fails most?
  3. Fallback Usage:

    • % of requests using fallback
    • Cost impact of fallbacks
  4. Recovery Time:

    • How quickly do circuits close?
    • Are recovery tests succeeding?
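
Several of these questions (how often the circuit opened, how long it stayed open) can be answered from a simple log of state changes. The sketch below assumes you record each transition as a (timestamp in seconds, new state) pair; the function name and event format are hypothetical. A failed recovery test that reopens the circuit is counted as part of the same outage.

```python
def summarize_open_periods(events, now):
    """Summarize outages from chronological (timestamp, new_state) events."""
    open_count = 0
    seconds_open = 0.0
    opened_at = None  # start of the current outage, if any

    for timestamp, state in events:
        if state == "open" and opened_at is None:
            opened_at = timestamp      # a new outage begins
            open_count += 1
        elif state == "closed" and opened_at is not None:
            seconds_open += timestamp - opened_at
            opened_at = None           # outage over
        # "half-open", or "open" during an outage, extends the same outage

    if opened_at is not None:          # still not recovered at `now`
        seconds_open += now - opened_at
    return {"open_count": open_count, "seconds_open": seconds_open}


events = [(0, "open"), (30, "half-open"), (30.5, "open"),
          (60.5, "half-open"), (61, "closed"), (100, "open")]
print(summarize_open_periods(events, now=110))
# {'open_count': 2, 'seconds_open': 71.0}
```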

Prometheus Metrics

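    import pybreaker
    from prometheus_client import Counter, Gauge

    # Circuit breaker metrics
    circuit_state = Gauge('circuit_breaker_state', 'Circuit breaker state', ['provider'])
    circuit_failures = Counter('circuit_breaker_failures', 'Failures counted', ['provider'])
    circuit_state_changes = Counter(
        'circuit_breaker_state_changes', 'State transitions',
        ['provider', 'from_state', 'to_state']
    )

    # pybreaker state names are 'closed', 'half-open', and 'open'
    STATE_VALUES = {'closed': 0, 'half-open': 0.5, 'open': 1}


    class PrometheusListener(pybreaker.CircuitBreakerListener):
        """Records failures and state transitions as Prometheus metrics"""

        def failure(self, cb, exc):
            circuit_failures.labels(provider=cb.name).inc()

        def state_change(self, cb, old_state, new_state):
            # old_state and new_state are state objects; compare by .name
            circuit_state.labels(provider=cb.name).set(STATE_VALUES[new_state.name])
            circuit_state_changes.labels(
                provider=cb.name,
                from_state=old_state.name if old_state else 'none',
                to_state=new_state.name
            ).inc()


    # Attach to each breaker, e.g. openai_breaker.add_listener(PrometheusListener())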

Grafana Dashboard

Create a dashboard with:

  • Circuit state over time (closed/open/half-open)
  • Failure rate by provider
  • Fallback usage percentage
  • Recovery time histogram

Production Checklist

Before deploying circuit breakers to production:

  • Circuit breaker thresholds tuned for your traffic
  • Fallback providers configured and tested
  • Alerting set up (Slack/PagerDuty)
  • Monitoring dashboard created
  • Load tested with simulated failures
  • Recovery timeout is appropriate (not too short)
  • Team trained on what to do when circuit opens
  • Runbook created for manual intervention
  • Circuit status exposed in health endpoint
  • Tested with all possible error types

Expected Results

Without Circuit Breakers:

  • Provider outage scenario:
    • 10,000 requests retry indefinitely
    • App becomes unresponsive
    • $50,000 in wasted API calls
    • 2 hour recovery time
    • Customer complaints

With Circuit Breakers:

  • Same outage scenario:
    • 5 requests fail, circuit opens
    • 9,995 requests fast-fail to fallback
    • App remains responsive
    • $250 in failed calls, rest goes to fallback
    • 30 second automatic recovery
    • Users barely notice

Cost Savings in Outages:

  • Prevented: $49,750
  • Time Saved: 1h 59m 30s
  • User Impact: Minimal

Reliability Improvement:

  • 99.9% → 99.99% uptime
  • Mean Time To Recovery: 2 hours → 30 seconds
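
To put the uptime figures in concrete terms, an availability percentage translates directly into allowed downtime. The arithmetic below is a small sketch; the 30-day month is an assumption, and the function name is illustrative.

```python
def allowed_downtime_minutes(availability_percent, days=30):
    """Minutes of downtime permitted per period at a given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


print(round(allowed_downtime_minutes(99.9), 2))   # 43.2
print(round(allowed_downtime_minutes(99.99), 2))  # 4.32
```

Moving from 99.9% to 99.99% therefore shrinks the monthly downtime budget from about 43 minutes to about 4, which a single two-hour outage would exceed many times over.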

Troubleshooting

Circuit breaker not opening

Possible causes:

  1. Failure threshold too high
  2. Errors being excluded
  3. Successful calls between failures (pybreaker counts consecutive failures and resets the count on any success; it has no time window)

Fix:

    # Lower threshold for testing
    CircuitBreaker(
        fail_max=3,   # Was 5
        exclude=[],   # Don't exclude any errors during testing
    )

Circuit breaker opens too easily

Possible causes:

  1. Threshold too low for traffic volume
  2. Counting expected errors as failures

Fix:

    # Increase threshold or exclude expected errors
    CircuitBreaker(
        fail_max=10,  # Was 5
        exclude=[
            openai.RateLimitError,
            openai.APITimeoutError,  # If timeouts are common
        ],
    )

Recovery testing too aggressive

Symptom: Circuit repeatedly opens and closes

Fix:

    # Increase recovery timeout
    CircuitBreaker(
        reset_timeout=60,  # Was 30
        # success_threshold=3,  # Need 3 successes to fully recover; enable only if your installed pybreaker release accepts this argument
    )

Next Steps

Once circuit breakers are working:

  1. Add Caching (see Caching Implementation Guide)

    • Serve cached responses when circuit is open
    • Reduce dependency on live providers
  2. Implement Retry with Backoff (see Retry Strategies Guide)

    • Intelligent retries before opening circuit
    • Exponential backoff
  3. Set Up Comprehensive Monitoring (see Monitoring Guide)

    • Track circuit state in real-time
    • Alert on concerning patterns
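
As a preview of the retry step, the sketch below retries a call with exponential backoff and full jitter, and re-raises the final error so that a surrounding circuit breaker can count it as one failure. The function name, defaults, and injectable sleep and rng parameters are assumptions for illustration; in practice you would retry only on errors you consider transient.

```python
import random
import time


def retry_with_backoff(func, max_attempts=3, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep, rng=random.random):
    """Call func, retrying failures with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller (or breaker) see the failure
            # Delay doubles each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * rng())  # full jitter: wait a random fraction of the delay
```

Wrapping the retried call inside the breaker-protected function means several quick retries count as a single failure toward the threshold, which keeps brief blips from opening the circuit.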

Estimated Implementation Time: 2-3 hours
Difficulty: ⭐⭐⭐☆☆ (3/5)
Impact: 🚀🚀🚀🚀🚀 (5/5 - Prevents catastrophic failures)


Last Updated: January 26, 2026
Tested with: pybreaker 1.0.1, OpenAI SDK 1.12.0, Anthropic SDK 0.18.0