
Edge Proxy Implementation Guide

Implement Request Routing & Load Balancing for AI APIs

Difficulty: Intermediate
Time Required: 2-4 hours
Potential Savings: $500-2,000/month (depending on volume)
Best For: Organizations with >100K API calls/month


What is an Edge Proxy?

An Edge Proxy sits between your application and AI provider APIs, intelligently routing requests to optimize for:

  • Cost (cheapest provider for the task)
  • Latency (fastest provider for your region)
  • Availability (failover if primary provider is down)
  • Rate limits (distribute load across multiple providers)

How It Works:

Your App β†’ Edge Proxy β†’ [Route Decision] β†’ Best AI Provider
                              ↓
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    ↓         ↓         ↓
                 OpenAI   Anthropic   Azure

The proxy analyzes each request and routes to the optimal provider based on your rules.
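To make the idea concrete, here is a minimal sketch of a cost-based routing decision in Python. This is illustrative only, not LiteLLM's actual code; the provider names and per-1K-token costs are placeholder assumptions:

```python
# Illustrative sketch of a cost-based routing decision (not LiteLLM's code).
# Provider names and per-1K-token input costs below are placeholder assumptions.

PROVIDERS = {
    "openai/gpt-4o-mini": {"cost_per_1k_input": 0.00015, "healthy": True},
    "anthropic/claude-3-5-sonnet": {"cost_per_1k_input": 0.003, "healthy": True},
    "azure/gpt-4o": {"cost_per_1k_input": 0.005, "healthy": True},
}

def route(providers: dict) -> str:
    """Pick the cheapest provider that is currently healthy."""
    candidates = {name: p for name, p in providers.items() if p["healthy"]}
    if not candidates:
        raise RuntimeError("No healthy providers available")
    return min(candidates, key=lambda name: candidates[name]["cost_per_1k_input"])

print(route(PROVIDERS))  # the cheapest healthy provider
```

A real router also weighs latency, rate-limit headroom, and recent error rates before picking a provider; this sketch shows only the cost dimension.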


Prerequisites

Before starting, ensure you have:

  • Multiple AI provider API keys (at least 2 providers)
  • Docker installed (or a server to run the proxy)
  • Basic knowledge of HTTP APIs and environment variables
  • Access to your application's AI API call code

Recommended Providers to Set Up:

  • OpenAI (primary)
  • Anthropic Claude (fallback/cost-effective)
  • Azure OpenAI (enterprise/compliance)

Implementation Steps

Step 1: Choose Your Edge Proxy Solution

Option A: LiteLLM Proxy (Recommended for Most)

Best for: Easy setup, supports 100+ providers, built-in load balancing

Option B: Custom Proxy (Advanced)

Best for: Full control, custom routing logic, specific requirements

Option C: Portkey.ai (Managed Service)

Best for: No infrastructure management, enterprise support

We'll use LiteLLM for this guide; it's the most widely used open-source option.


Step 2: Install LiteLLM Proxy

Using Docker (Recommended):

```shell
# Pull the latest LiteLLM image
docker pull ghcr.io/berriai/litellm:main-latest

# Create configuration directory
mkdir -p ~/litellm-config
cd ~/litellm-config

# Create config file (we'll populate this next)
touch litellm_config.yaml
```

Alternative: Install via pip:

```shell
pip install 'litellm[proxy]' --break-system-packages

# Verify installation
litellm --version
```

Step 3: Configure Provider Routing

Create litellm_config.yaml:

```yaml
model_list:
  # Fast & cheap: GPT-4o-mini for simple tasks
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      cost_per_1k_input_tokens: 0.00015
      cost_per_1k_output_tokens: 0.0006

  # Balanced: Claude 3.5 Sonnet for complex tasks
  - model_name: claude-3-5-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      cost_per_1k_input_tokens: 0.003
      cost_per_1k_output_tokens: 0.015

  # Fallback: Azure OpenAI (enterprise compliance)
  - model_name: gpt-4o-azure
    litellm_params:
      model: azure/gpt-4o
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2024-02-15-preview"

# Load balancing strategy
router_settings:
  routing_strategy: least-cost  # Options: least-cost, least-latency, simple-shuffle
  num_retries: 2
  timeout: 60
  fallbacks:
    - gpt-4o-mini
    - claude-3-5-sonnet
    - gpt-4o-azure

# Rate limiting (optional)
litellm_settings:
  max_parallel_requests: 100

# Logging (for debugging)
general_settings:
  master_key: your-secret-key-here              # Change this!
  database_url: postgresql://localhost/litellm  # Optional: for request logging
```

Key Configuration Decisions:

Routing Strategy:

  • least-cost: Always route to cheapest provider (recommended for batch processing)
  • least-latency: Route to fastest provider (recommended for user-facing apps)
  • simple-shuffle: Distribute load evenly (recommended for rate limit management)

Fallback Chain:

  • Primary fails β†’ Try secondary β†’ Try tertiary
  • Prevents total outage if one provider is down
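In code terms, a fallback chain is just "try each provider in order until one succeeds." A minimal sketch, where `call_provider` is a hypothetical stand-in for the real provider call:

```python
# Minimal fallback-chain sketch (illustrative; call_provider is hypothetical).

def call_with_fallbacks(prompt, providers, call_provider):
    """Try each provider in order; return the first successful response."""
    errors = []
    for name in providers:
        try:
            return call_provider(name, prompt)
        except Exception as exc:  # in practice: timeouts, rate limits, 5xx
            errors.append((name, exc))
    raise RuntimeError(f"All providers failed: {errors}")

# Demo with fake providers where the primary is "down":
def fake_call(name, prompt):
    if name == "openai":
        raise TimeoutError("provider down")
    return f"{name}: response to {prompt!r}"

result = call_with_fallbacks("Hello", ["openai", "anthropic", "azure"], fake_call)
print(result)  # anthropic: response to 'Hello'
```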

Step 4: Set Up Environment Variables

Create .env file:

```shell
# OpenAI
OPENAI_API_KEY=sk-proj-...

# Anthropic
ANTHROPIC_API_KEY=sk-ant-...

# Azure OpenAI (if using)
AZURE_API_KEY=...
AZURE_API_BASE=https://your-resource.openai.azure.com
AZURE_DEPLOYMENT_NAME=gpt-4o

# Proxy master key (for authentication)
LITELLM_MASTER_KEY=sk-1234567890abcdef
```

Security Note: Never commit .env to git. Add to .gitignore.


Step 5: Start the Edge Proxy

Using Docker:

```shell
docker run -d \
  --name litellm-proxy \
  -p 4000:4000 \
  -v $(pwd)/litellm_config.yaml:/app/config.yaml \
  --env-file .env \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml
```

Using pip:

```shell
litellm --config litellm_config.yaml
```

Verify it's running:

```shell
curl http://localhost:4000/health
# Should return: {"status": "healthy"}
```

Step 6: Update Your Application Code

Before (Direct OpenAI calls):

```python
import openai

client = openai.OpenAI(api_key="sk-...")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

After (Via Edge Proxy):

```python
import openai

# Point to your proxy instead of OpenAI directly
client = openai.OpenAI(
    api_key="sk-1234567890abcdef",    # Your proxy master key
    base_url="http://localhost:4000"  # Your proxy URL
)

# Same code as before - no other changes needed!
response = client.chat.completions.create(
    model="gpt-4o-mini",  # Proxy will route intelligently
    messages=[{"role": "user", "content": "Hello!"}]
)
```

That's it! Your app now routes through the proxy; the only code change is the client's api_key and base_url.
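One optional refinement: read the proxy URL and key from environment variables so you can switch between the proxy and a direct provider per environment. The variable names `PROXY_BASE_URL` and `PROXY_API_KEY` below are our own convention, not a standard:

```python
import os

# PROXY_BASE_URL / PROXY_API_KEY are our own variable names (not a standard);
# the defaults fall back to the local proxy from this guide.
def proxy_settings() -> dict:
    return {
        "base_url": os.environ.get("PROXY_BASE_URL", "http://localhost:4000"),
        "api_key": os.environ.get("PROXY_API_KEY", "sk-1234567890abcdef"),
    }

# Then construct the client from the resolved settings:
# client = openai.OpenAI(**proxy_settings())
```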


Step 7: Test the Routing

Test cost-based routing:

```shell
# This should route to the cheapest provider (gpt-4o-mini)
curl -X POST http://localhost:4000/chat/completions \
  -H "Authorization: Bearer sk-1234567890abcdef" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

Test fallback:

```shell
# Temporarily disable OpenAI key to test fallback
# Request should automatically retry with Anthropic
```

Check logs:

```shell
docker logs litellm-proxy

# Look for routing decisions:
# [INFO] Routing request to openai/gpt-4o-mini (cost: $0.0001)
# [INFO] Request successful (latency: 234ms)
```

Step 8: Advanced Configuration (Optional)

A) Custom Routing Rules

Route based on input characteristics:

```yaml
# In litellm_config.yaml
router_settings:
  routing_strategy: custom
  custom_routing:
    # Short prompts (<100 tokens) β†’ cheapest
    - condition: input_tokens < 100
      route_to: gpt-4o-mini
    # Long prompts (>1000 tokens) β†’ Claude (better context)
    - condition: input_tokens > 1000
      route_to: claude-3-5-sonnet
    # Default β†’ balanced option
    - default: gpt-4o-mini
```

B) A/B Testing

Test new models on a percentage of traffic:

```yaml
router_settings:
  routing_strategy: weighted
  model_weights:
    gpt-4o-mini: 0.8       # 80% of traffic
    claude-3-5-haiku: 0.2  # 20% of traffic (testing)
```
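Under the hood, a weighted split is just a weighted random choice per request. A quick sketch (illustrative, not LiteLLM's implementation) that also checks the split empirically:

```python
import random

# Weights mirror the A/B config above; this is a sketch, not LiteLLM's code.
WEIGHTS = {"gpt-4o-mini": 0.8, "claude-3-5-haiku": 0.2}

def pick_model() -> str:
    """Choose a model for one request, proportionally to its weight."""
    models = list(WEIGHTS)
    return random.choices(models, weights=[WEIGHTS[m] for m in models], k=1)[0]

# Empirical check of the split over 10,000 simulated requests
random.seed(0)
counts = {m: 0 for m in WEIGHTS}
for _ in range(10_000):
    counts[pick_model()] += 1
print(counts)  # roughly 8,000 vs 2,000
```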

C) Regional Routing

Route to closest provider for latency:

```yaml
regional_routing:
  us-east: azure/gpt-4o-us-east
  eu-west: azure/gpt-4o-eu-west
  asia: anthropic/claude-3-5-sonnet  # Anthropic has good APAC latency
```

D) Rate Limit Management

Distribute load to avoid hitting limits:

```yaml
rate_limit_settings:
  openai:
    max_requests_per_minute: 3500
    max_tokens_per_minute: 90000
  anthropic:
    max_requests_per_minute: 4000
    max_tokens_per_minute: 400000

# Proxy automatically switches providers when nearing limits
```

Step 9: Monitor & Optimize

View Real-Time Metrics:

LiteLLM provides a built-in dashboard:

```shell
# Access at http://localhost:4000/ui
# Shows:
# - Requests per provider
# - Cost breakdown
# - Latency by provider
# - Error rates
```

Key Metrics to Track:

  1. Cost Savings:

    • Before: All requests to GPT-4o
    • After: 70% routed to GPT-4o-mini (β‰ˆ33x cheaper on input tokens)
    • Savings: ~$1,500/month on 100K calls
  2. Latency:

    • Average response time by provider
    • P95 latency (95th percentile)
    • Timeout rate
  3. Reliability:

    • Fallback success rate
    • Provider uptime
    • Error rate by provider
  4. Cost per Request:

    • Track over time
    • Should decrease as routing optimizes
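If you export raw latencies, the P95 figure (item 2) is easy to sanity-check yourself. A small nearest-rank percentile helper, with made-up sample data:

```python
# Compute P95 latency from a list of response times (sample data is made up).

def percentile(values, pct):
    """Nearest-rank percentile: smallest value >= pct% of the samples."""
    ordered = sorted(values)
    rank = max(1, -(-len(ordered) * pct // 100))  # ceiling division
    return ordered[int(rank) - 1]

latencies_ms = [120, 180, 150, 900, 140, 160, 170, 130, 155, 145,
                135, 165, 175, 125, 185, 142, 158, 148, 152, 1200]

p95 = percentile(latencies_ms, 95)
print(f"P95 latency: {p95} ms")  # P95 latency: 900 ms
```

Note how the two slow outliers dominate P95 while barely moving the average; that is why tracking P95 alongside the mean matters.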

Step 10: Deploy to Production

Option A: Docker Compose (Simple)

```yaml
# docker-compose.yml
version: '3.8'

services:
  litellm-proxy:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    env_file:
      - .env
    restart: unless-stopped
    command: --config /app/config.yaml
```

Deploy:

```shell
docker-compose up -d
```

Option B: Kubernetes (Enterprise)

```yaml
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
spec:
  replicas: 3  # For high availability
  selector:
    matchLabels:
      app: litellm-proxy
  template:
    metadata:
      labels:
        app: litellm-proxy
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest
          ports:
            - containerPort: 4000
          volumeMounts:
            - name: config
              mountPath: /app/config.yaml
              subPath: litellm_config.yaml
          envFrom:
            - secretRef:
                name: litellm-secrets
      volumes:
        - name: config
          configMap:
            name: litellm-config
---
apiVersion: v1
kind: Service
metadata:
  name: litellm-proxy
spec:
  type: LoadBalancer
  ports:
    - port: 4000
      targetPort: 4000
  selector:
    app: litellm-proxy
```

Deploy:

```shell
kubectl apply -f k8s/
```

Option C: Managed Service (No Infrastructure)

Use Portkey.ai or LiteLLM Cloud:

  • No server management
  • Built-in observability
  • Enterprise support
  • Higher cost but zero ops

Expected Results

Before Edge Proxy:

  • 100% of requests to OpenAI GPT-4o
  • Cost: $0.005/1K input tokens
  • Monthly cost (1B tokens): $5,000
  • Single point of failure

After Edge Proxy:

  • 70% routed to GPT-4o-mini: $0.00015/1K tokens
  • 20% routed to Claude Sonnet: $0.003/1K tokens
  • 10% remain on GPT-4o: $0.005/1K tokens
  • Monthly cost: ~$1,200 (76% savings!)
  • Multi-provider redundancy
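You can verify the blended-cost math yourself. The sketch below assumes 1B input tokens/month and uses only the input-token prices quoted above (output-token costs are ignored for simplicity):

```python
# Recompute the blended monthly cost from the routing split above.
# Assumes 1B input tokens/month; input-token prices only, as quoted in this guide.

TOKENS_PER_MONTH = 1_000_000_000  # 1B tokens

split = {
    "gpt-4o-mini":   (0.70, 0.00015),  # (traffic share, $/1K input tokens)
    "claude-sonnet": (0.20, 0.003),
    "gpt-4o":        (0.10, 0.005),
}

before = TOKENS_PER_MONTH / 1000 * 0.005  # everything on GPT-4o
after = sum(share * TOKENS_PER_MONTH / 1000 * price
            for share, price in split.values())

print(f"Before: ${before:,.0f}/month")         # Before: $5,000/month
print(f"After:  ${after:,.0f}/month")          # After:  $1,205/month
print(f"Savings: {(1 - after / before):.0%}")  # Savings: 76%
```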

Additional Benefits:

  • No downtime if OpenAI has issues
  • Automatic rate limit management
  • Easy A/B testing of new models
  • Centralized logging and monitoring

Troubleshooting

Issue: Proxy returns 401 Unauthorized

Cause: Incorrect master key

Fix:

```shell
# Check your .env file
cat .env | grep LITELLM_MASTER_KEY
```

Make sure your app uses this key:

```python
client = openai.OpenAI(api_key="your-master-key-here")
```

Issue: Requests still going directly to OpenAI

Cause: base_url not set correctly

Fix:

```python
# Ensure base_url points to the proxy
client = openai.OpenAI(
    base_url="http://localhost:4000",  # Must be set
    api_key="your-master-key"
)
```

Issue: Fallback not working

Cause: Timeout too high or fallback chain not configured

Fix:

```yaml
# In litellm_config.yaml
router_settings:
  timeout: 30     # Lower timeout triggers fallback faster
  num_retries: 2
  fallbacks:
    - gpt-4o-mini
    - claude-3-5-sonnet  # Must list all fallback options
```

Issue: High latency after adding proxy

Cause: Proxy not co-located with app

Fix:

  • Deploy proxy close to your application (same region)
  • Use connection pooling
  • Enable keep-alive connections
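To confirm the fix worked, measure the proxy's added latency directly: time the same request sent straight to the provider and via the proxy, then compare. A tiny timing helper (the commented client calls are placeholders for your own code):

```python
import time

def time_call(fn, *args, **kwargs):
    """Return (result, elapsed_ms) for a single call to fn."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

# Compare the same request sent directly vs. via the proxy, e.g.:
# _, direct_ms = time_call(direct_client.chat.completions.create, ...)
# _, proxy_ms  = time_call(proxy_client.chat.completions.create, ...)
# Added latency β‰ˆ proxy_ms - direct_ms; it should be small when co-located.
```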

Issue: Cost not decreasing

Cause: Wrong routing strategy

Fix:

```yaml
# Change to cost-optimized routing
router_settings:
  routing_strategy: least-cost  # Prioritize cost
```

Testing Checklist

Before deploying to production:

  • Proxy responds to health checks
  • Can authenticate with master key
  • Requests route to primary provider
  • Fallback triggers when primary fails
  • Latency is acceptable (<500ms added)
  • Logging captures request metadata
  • Dashboard shows metrics correctly
  • Cost tracking is accurate
  • All API keys are in environment variables (not hardcoded)
  • Proxy restarts automatically if it crashes

Maintenance

Weekly:

  • Review cost savings vs last week
  • Check error rates by provider
  • Verify fallback success rate

Monthly:

  • Update LiteLLM to latest version
  • Review and optimize routing rules
  • Test new models in A/B config
  • Analyze latency trends

Quarterly:

  • Re-evaluate provider pricing (providers change prices)
  • Audit API key rotation
  • Review capacity planning

Next Steps

Once your Edge Proxy is running:

  1. Add Circuit Breakers (see Circuit Breaker Implementation Guide)

    • Prevent cascade failures
    • Automatic provider disabling
  2. Implement Caching (see Caching Implementation Guide)

    • Cache repeated requests
    • Save 30-50% on duplicate queries
  3. Set Up Monitoring (see Monitoring Setup Guide)

    • Grafana dashboards
    • Alert on high error rates

Estimated Implementation Time: 2-4 hours
Difficulty: β­β­β­β˜†β˜† (3/5)
Impact: πŸš€πŸš€πŸš€πŸš€πŸš€ (5/5)


Last Updated: January 26, 2026
Tested with: LiteLLM v1.25.0, OpenAI SDK 1.12.0, Anthropic SDK 0.18.0