Edge Proxy Implementation Guide
Implement Request Routing & Load Balancing for AI APIs
Difficulty: Intermediate
Time Required: 2-4 hours
Potential Savings: $500-2,000/month (depending on volume)
Best For: Organizations with >100K API calls/month
What is an Edge Proxy?
An Edge Proxy sits between your application and AI provider APIs, intelligently routing requests to optimize for:
- Cost (cheapest provider for the task)
- Latency (fastest provider for your region)
- Availability (failover if primary provider is down)
- Rate limits (distribute load across multiple providers)
How It Works:
```
Your App → Edge Proxy → [Route Decision] → Best AI Provider
                              │
                 ┌────────────┼────────────┐
                 │            │            │
              OpenAI      Anthropic      Azure
```
The proxy analyzes each request and routes to the optimal provider based on your rules.
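As a mental model, least-cost routing boils down to picking the cheapest healthy provider for each request. Here is a minimal, illustrative sketch (the prices mirror the config used later in this guide; LiteLLM's real router also weighs latency, retries, and rate limits):

```python
# Minimal sketch of least-cost routing (illustrative only).
# Prices are per 1K input tokens and mirror the config later in this guide.
PROVIDERS = {
    "openai/gpt-4o-mini": 0.00015,
    "anthropic/claude-3-5-sonnet": 0.003,
    "azure/gpt-4o": 0.005,
}

def route_least_cost(available: set[str]) -> str:
    """Pick the cheapest provider that is currently available."""
    candidates = {p: cost for p, cost in PROVIDERS.items() if p in available}
    if not candidates:
        raise RuntimeError("no providers available")
    return min(candidates, key=candidates.get)

print(route_least_cost(set(PROVIDERS)))  # openai/gpt-4o-mini (cheapest overall)
print(route_least_cost({"azure/gpt-4o", "anthropic/claude-3-5-sonnet"}))  # falls back to Claude
```

If the cheapest provider drops out of the available set, the next-cheapest wins automatically, which is the same intuition behind the fallback chain configured below.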
Prerequisites
Before starting, ensure you have:
- Multiple AI provider API keys (at least 2 providers)
- Docker installed (or a server to run the proxy)
- Basic knowledge of HTTP APIs and environment variables
- Access to your application's AI API call code
Recommended Providers to Set Up:
- OpenAI (primary)
- Anthropic Claude (fallback/cost-effective)
- Azure OpenAI (enterprise/compliance)
Implementation Steps
Step 1: Choose Your Edge Proxy Solution
Option A: LiteLLM Proxy (Recommended for Most)
Best for: Easy setup, supports 100+ providers, built-in load balancing
Option B: Custom Proxy (Advanced)
Best for: Full control, custom routing logic, specific requirements
Option C: Portkey.ai (Managed Service)
Best for: No infrastructure management, enterprise support
We'll use LiteLLM for this guide, as it is the most popular open-source option.
Step 2: Install LiteLLM Proxy
Using Docker (Recommended):
```bash
# Pull the latest LiteLLM image
docker pull ghcr.io/berriai/litellm:main-latest

# Create configuration directory
mkdir -p ~/litellm-config
cd ~/litellm-config

# Create config file (we'll populate this next)
touch litellm_config.yaml
```
Alternative: Install via pip:
```bash
pip install 'litellm[proxy]' --break-system-packages

# Verify installation
litellm --version
```
Step 3: Configure Provider Routing
Create litellm_config.yaml:
```yaml
model_list:
  # Fast & cheap: GPT-4o-mini for simple tasks
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      cost_per_1k_input_tokens: 0.00015
      cost_per_1k_output_tokens: 0.0006

  # Balanced: Claude 3.5 Sonnet for complex tasks
  - model_name: claude-3-5-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      cost_per_1k_input_tokens: 0.003
      cost_per_1k_output_tokens: 0.015

  # Fallback: Azure OpenAI (enterprise compliance)
  - model_name: gpt-4o-azure
    litellm_params:
      model: azure/gpt-4o
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2024-02-15-preview"

# Load balancing strategy
router_settings:
  routing_strategy: least-cost  # Options: least-cost, least-latency, simple-shuffle
  num_retries: 2
  timeout: 60
  fallbacks:
    - gpt-4o-mini
    - claude-3-5-sonnet
    - gpt-4o-azure

# Rate limiting (optional)
litellm_settings:
  max_parallel_requests: 100

# Logging (for debugging)
general_settings:
  master_key: your-secret-key-here              # Change this!
  database_url: postgresql://localhost/litellm  # Optional: for request logging
```
Key Configuration Decisions:
Routing Strategy:
- `least-cost`: Always route to the cheapest provider (recommended for batch processing)
- `least-latency`: Route to the fastest provider (recommended for user-facing apps)
- `simple-shuffle`: Distribute load evenly (recommended for rate limit management)
Fallback Chain:
- Primary fails → Try secondary → Try tertiary
- Prevents total outage if one provider is down
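The chain above amounts to a simple try-in-order loop. A hedged sketch of the idea (`call_provider` is a hypothetical stand-in for a real API call; here the primary is simulated as being down):

```python
# Sketch of a fallback chain: try each provider in order, moving on
# whenever a call raises. `call_provider` is a hypothetical stand-in.
def call_provider(model: str, prompt: str) -> str:
    if model == "gpt-4o-mini":
        raise ConnectionError("provider down")  # simulate a primary outage
    return f"{model}: response to {prompt!r}"

def complete_with_fallback(
    prompt: str,
    chain=("gpt-4o-mini", "claude-3-5-sonnet", "gpt-4o-azure"),
) -> str:
    last_err = None
    for model in chain:
        try:
            return call_provider(model, prompt)
        except Exception as err:  # failed -> fall through to the next model
            last_err = err
    raise RuntimeError("all providers failed") from last_err

print(complete_with_fallback("Hello"))  # the secondary answers after the primary fails
```

The proxy does this for you per request; you only declare the chain in the `fallbacks` list.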
Step 4: Set Up Environment Variables
Create .env file:
```bash
# OpenAI
OPENAI_API_KEY=sk-proj-...

# Anthropic
ANTHROPIC_API_KEY=sk-ant-...

# Azure OpenAI (if using)
AZURE_API_KEY=...
AZURE_API_BASE=https://your-resource.openai.azure.com
AZURE_DEPLOYMENT_NAME=gpt-4o

# Proxy master key (for authentication)
LITELLM_MASTER_KEY=sk-1234567890abcdef
```
Security Note: Never commit .env to git. Add to .gitignore.
Step 5: Start the Edge Proxy
Using Docker:
```bash
docker run -d \
  --name litellm-proxy \
  -p 4000:4000 \
  -v $(pwd)/litellm_config.yaml:/app/config.yaml \
  --env-file .env \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml
```
Using pip:
```bash
litellm --config litellm_config.yaml
```
Verify it's running:
```bash
curl http://localhost:4000/health
# Should return: {"status": "healthy"}
```
Step 6: Update Your Application Code
Before (Direct OpenAI calls):
```python
import openai

client = openai.OpenAI(api_key="sk-...")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}]
)
```
After (Via Edge Proxy):
```python
import openai

# Point to your proxy instead of OpenAI directly
client = openai.OpenAI(
    api_key="sk-1234567890abcdef",    # Your proxy master key
    base_url="http://localhost:4000"  # Your proxy URL
)

# Same code as before - no other changes needed!
response = client.chat.completions.create(
    model="gpt-4o-mini",  # Proxy will route intelligently
    messages=[{"role": "user", "content": "Hello!"}]
)
```
That's it! Your app now routes through the proxy after changing just two client parameters; the rest of your code is untouched.
Step 7: Test the Routing
Test cost-based routing:
```bash
# This should route to the cheapest provider (gpt-4o-mini)
curl -X POST http://localhost:4000/chat/completions \
  -H "Authorization: Bearer sk-1234567890abcdef" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
Test fallback:
```bash
# Temporarily disable your OpenAI key to test fallback
# The request should automatically retry with Anthropic
```
Check logs:
```bash
docker logs litellm-proxy

# Look for routing decisions:
# [INFO] Routing request to openai/gpt-4o-mini (cost: $0.0001)
# [INFO] Request successful (latency: 234ms)
```
Step 8: Advanced Configuration (Optional)
A) Custom Routing Rules
Route based on input characteristics:
```yaml
# In litellm_config.yaml
router_settings:
  routing_strategy: custom
  custom_routing:
    # Short prompts (<100 tokens) → cheapest
    - condition: input_tokens < 100
      route_to: gpt-4o-mini
    # Long prompts (>1000 tokens) → Claude (better context)
    - condition: input_tokens > 1000
      route_to: claude-3-5-sonnet
    # Default → balanced option
    - default: gpt-4o-mini
```
B) A/B Testing
Test new models on a percentage of traffic:
```yaml
router_settings:
  routing_strategy: weighted
  model_weights:
    gpt-4o-mini: 0.8       # 80% of traffic
    claude-3-5-haiku: 0.2  # 20% of traffic (testing)
```
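Under the hood, weighted splitting is just a weighted random choice per request. A quick sketch of the idea (model names and weights taken from the config above):

```python
import random

# Sketch of weighted traffic splitting: 80% to the incumbent model,
# 20% to the model under test. Weights mirror the YAML above.
WEIGHTS = {"gpt-4o-mini": 0.8, "claude-3-5-haiku": 0.2}

def pick_model() -> str:
    models, weights = zip(*WEIGHTS.items())
    return random.choices(models, weights=weights, k=1)[0]

random.seed(0)  # deterministic for the example
sample = [pick_model() for _ in range(10_000)]
share = sample.count("claude-3-5-haiku") / len(sample)
print(f"test-model share: {share:.1%}")  # close to 20%
```

Over enough requests the observed split converges on the configured weights, which is what makes the 20% slice a fair sample for comparing quality and cost.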
C) Regional Routing
Route to closest provider for latency:
```yaml
regional_routing:
  us-east: azure/gpt-4o-us-east
  eu-west: azure/gpt-4o-eu-west
  asia: anthropic/claude-3-5-sonnet  # Anthropic has good APAC latency
```
D) Rate Limit Management
Distribute load to avoid hitting limits:
```yaml
rate_limit_settings:
  openai:
    max_requests_per_minute: 3500
    max_tokens_per_minute: 90000
  anthropic:
    max_requests_per_minute: 4000
    max_tokens_per_minute: 400000

# Proxy automatically switches providers when nearing limits
```
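The switching behavior can be pictured as: track each provider's per-minute usage and skip any provider near its cap. A simplified sketch (request counts only; the limits mirror the YAML above, and the 90% headroom threshold is an assumption for illustration):

```python
# Simplified sketch of rate-limit-aware switching: skip any provider
# whose request count is near its per-minute cap.
LIMITS = {"openai": 3500, "anthropic": 4000}  # max requests per minute
usage = {"openai": 0, "anthropic": 0}         # requests seen this minute

def pick_provider(threshold: float = 0.9) -> str:
    """Return the first provider still below `threshold` of its cap."""
    for name, cap in LIMITS.items():
        if usage[name] < cap * threshold:  # still has headroom
            usage[name] += 1
            return name
    raise RuntimeError("all providers near their rate limits")

usage["openai"] = 3400       # OpenAI is close to its 3,500/min cap
print(pick_provider())       # anthropic
```

In the real proxy these counters reset every minute and also track tokens, not just requests, but the skip-when-near-cap logic is the core of it.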
Step 9: Monitor & Optimize
View Real-Time Metrics:
LiteLLM provides a built-in dashboard:
```bash
# Access at http://localhost:4000/ui
# Shows:
# - Requests per provider
# - Cost breakdown
# - Latency by provider
# - Error rates
```
Key Metrics to Track:
1. Cost Savings:
   - Before: All requests to GPT-4o
   - After: 70% routed to GPT-4o-mini (over 30x cheaper on input tokens)
   - Savings: ~$1,500/month on 100K calls
2. Latency:
   - Average response time by provider
   - P95 latency (95th percentile)
   - Timeout rate
3. Reliability:
   - Fallback success rate
   - Provider uptime
   - Error rate by provider
4. Cost per Request:
   - Track over time
   - Should decrease as routing optimizes
Step 10: Deploy to Production
Option A: Docker Compose (Simple)
```yaml
# docker-compose.yml
version: '3.8'

services:
  litellm-proxy:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    env_file:
      - .env
    restart: unless-stopped
    command: --config /app/config.yaml
```
Deploy:
```bash
docker-compose up -d
```
Option B: Kubernetes (Enterprise)
```yaml
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
spec:
  replicas: 3  # For high availability
  selector:
    matchLabels:
      app: litellm-proxy
  template:
    metadata:
      labels:
        app: litellm-proxy
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest
          ports:
            - containerPort: 4000
          volumeMounts:
            - name: config
              mountPath: /app/config.yaml
              subPath: litellm_config.yaml
          envFrom:
            - secretRef:
                name: litellm-secrets
      volumes:
        - name: config
          configMap:
            name: litellm-config
---
apiVersion: v1
kind: Service
metadata:
  name: litellm-proxy
spec:
  type: LoadBalancer
  ports:
    - port: 4000
      targetPort: 4000
  selector:
    app: litellm-proxy
```
Deploy:
```bash
kubectl apply -f k8s/
```
Option C: Managed Service (No Infrastructure)
Use Portkey.ai or LiteLLM Cloud:
- No server management
- Built-in observability
- Enterprise support
- Higher cost but zero ops
Expected Results
Before Edge Proxy:
- 100% of requests to OpenAI GPT-4o
- Cost: $0.005/1K input tokens
- Monthly cost (~1B input tokens): $5,000
- Single point of failure
After Edge Proxy:
- 70% routed to GPT-4o-mini: $0.00015/1K tokens
- 20% routed to Claude Sonnet: $0.003/1K tokens
- 10% remain on GPT-4o: $0.005/1K tokens
- Monthly cost: ~$1,200 (76% savings!)
- Multi-provider redundancy
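These numbers check out arithmetically: a $5,000/month GPT-4o bill at $0.005 per 1K input tokens implies roughly 1B input tokens per month, and the 70/20/10 split blends down to about $1,200:

```python
# Per-1K-input-token prices and traffic split from the figures above.
PRICE_PER_1K = {"gpt-4o": 0.005, "gpt-4o-mini": 0.00015, "claude-3-5-sonnet": 0.003}
SPLIT = {"gpt-4o-mini": 0.70, "claude-3-5-sonnet": 0.20, "gpt-4o": 0.10}

tokens = 1_000_000_000  # ~1B input tokens/month, implied by the $5,000 bill

before = tokens / 1000 * PRICE_PER_1K["gpt-4o"]
after = sum(tokens * share / 1000 * PRICE_PER_1K[model]
            for model, share in SPLIT.items())
savings = (before - after) / before

print(f"before=${before:,.0f}  after=${after:,.0f}  savings={savings:.0%}")
```

This covers input tokens only; output-token pricing shifts the absolute dollars but not the shape of the savings.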
Additional Benefits:
- No downtime if OpenAI has issues
- Automatic rate limit management
- Easy A/B testing of new models
- Centralized logging and monitoring
Troubleshooting
Issue: Proxy returns 401 Unauthorized
Cause: Incorrect master key
Fix:
```bash
# Check your .env file
cat .env | grep LITELLM_MASTER_KEY
```

Make sure your app uses this key:

```python
client = openai.OpenAI(api_key="your-master-key-here")
```
Issue: Requests still going directly to OpenAI
Cause: base_url not set correctly
Fix:
```python
# Ensure base_url points to the proxy
client = openai.OpenAI(
    base_url="http://localhost:4000",  # ← Must be set
    api_key="your-master-key"
)
```
Issue: Fallback not working
Cause: Timeout too high or fallback chain not configured
Fix:
```yaml
# In litellm_config.yaml
router_settings:
  timeout: 30      # Lower timeout triggers fallback faster
  num_retries: 2
  fallbacks:
    - gpt-4o-mini
    - claude-3-5-sonnet  # ← Must list all fallback options
```
Issue: High latency after adding proxy
Cause: Proxy not co-located with app
Fix:
- Deploy proxy close to your application (same region)
- Use connection pooling
- Enable keep-alive connections
Issue: Cost not decreasing
Cause: Wrong routing strategy
Fix:
```yaml
# Change to cost-optimized routing
router_settings:
  routing_strategy: least-cost  # ← Prioritize cost
```
Testing Checklist
Before deploying to production:
- Proxy responds to health checks
- Can authenticate with master key
- Requests route to primary provider
- Fallback triggers when primary fails
- Latency is acceptable (<500ms added)
- Logging captures request metadata
- Dashboard shows metrics correctly
- Cost tracking is accurate
- All API keys are in environment variables (not hardcoded)
- Proxy restarts automatically if it crashes
Maintenance
Weekly:
- Review cost savings vs last week
- Check error rates by provider
- Verify fallback success rate
Monthly:
- Update LiteLLM to latest version
- Review and optimize routing rules
- Test new models in A/B config
- Analyze latency trends
Quarterly:
- Re-evaluate provider pricing (providers change prices)
- Audit API key rotation
- Review capacity planning
Next Steps
Once your Edge Proxy is running:
1. Add Circuit Breakers (see Circuit Breaker Implementation Guide)
   - Prevent cascade failures
   - Automatic provider disabling
2. Implement Caching (see Caching Implementation Guide)
   - Cache repeated requests
   - Save 30-50% on duplicate queries
3. Set Up Monitoring (see Monitoring Setup Guide)
   - Grafana dashboards
   - Alert on high error rates
Additional Resources
- LiteLLM Documentation: https://docs.litellm.ai/docs/
- Proxy Architecture: https://docs.litellm.ai/docs/proxy/architecture
- Supported Providers: https://docs.litellm.ai/docs/providers
- Cost Calculator: https://litellm.ai/cost-calculator
Support
Questions about this implementation?
- Onaro Support: support@onaro.io
- LiteLLM Community: https://discord.gg/wuPM9dRgDw
- Book implementation call: https://onaro.io/support
Estimated Implementation Time: 2-4 hours
Difficulty: ★★★☆☆ (3/5)
Impact: ★★★★★ (5/5)
Last Updated: January 26, 2026
Tested with: LiteLLM v1.25.0, OpenAI SDK 1.12.0, Anthropic SDK 0.18.0