Edge Proxy Implementation Guide
Implement Request Routing & Load Balancing for AI APIs
Difficulty: Intermediate
Time Required: 2-4 hours
Potential Savings: $500-2,000/month (depending on volume)
Best For: Organizations with >100K API calls/month
What is an Edge Proxy?
An Edge Proxy sits between your application and AI provider APIs, intelligently routing requests to optimize for:
- Cost (cheapest provider for the task)
- Latency (fastest provider for your region)
- Availability (failover if primary provider is down)
- Rate limits (distribute load across multiple providers)
How It Works:
```
Your App → Edge Proxy → [Route Decision] → Best AI Provider
                              │
                 ┌────────────┼────────────┐
                 │            │            │
              OpenAI      Anthropic      Azure
```
The proxy analyzes each request and routes to the optimal provider based on your rules.
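As a mental model, least-cost routing boils down to picking the cheapest healthy provider for each request. Here is a minimal, illustrative sketch (the prices mirror the config used later in this guide; LiteLLM's real router also weighs latency, retries, and rate limits):

```python
# Minimal sketch of least-cost routing (illustrative only).
# Prices are per 1K input tokens and mirror the config later in this guide.
PROVIDERS = {
    "openai/gpt-4o-mini": 0.00015,
    "anthropic/claude-3-5-sonnet": 0.003,
    "azure/gpt-4o": 0.005,
}

def route_least_cost(available: set[str]) -> str:
    """Pick the cheapest provider that is currently available."""
    candidates = {p: cost for p, cost in PROVIDERS.items() if p in available}
    if not candidates:
        raise RuntimeError("no providers available")
    return min(candidates, key=candidates.get)

print(route_least_cost(set(PROVIDERS)))  # openai/gpt-4o-mini (cheapest overall)
print(route_least_cost({"azure/gpt-4o", "anthropic/claude-3-5-sonnet"}))  # falls back to Claude
```

If the cheapest provider drops out of the available set, the next-cheapest wins automatically, which is the same intuition behind the fallback chain configured below.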
Prerequisites
Before starting, ensure you have:
- Multiple AI provider API keys (at least 2 providers)
- Docker installed (or a server to run the proxy)
- Basic knowledge of HTTP APIs and environment variables
- Access to your application's AI API call code
Recommended Providers to Set Up:
- OpenAI (primary)
- Anthropic Claude (fallback/cost-effective)
- Azure OpenAI (enterprise/compliance)
Implementation Steps
Step 1: Choose Your Edge Proxy Solution
Option A: LiteLLM Proxy (Recommended for Most)
Best for: Easy setup, supports 100+ providers, built-in load balancing
Option B: Custom Proxy (Advanced)
Best for: Full control, custom routing logic, specific requirements
Option C: Portkey.ai (Managed Service)
Best for: No infrastructure management, enterprise support
We'll use LiteLLM for this guide, as it is the most popular open-source option.
Step 2: Install LiteLLM Proxy
Using Docker (Recommended):
```bash
# Pull the latest LiteLLM image
docker pull ghcr.io/berriai/litellm:main-latest

# Create configuration directory
mkdir -p ~/litellm-config
cd ~/litellm-config

# Create config file (we'll populate this next)
touch litellm_config.yaml
```
Alternative: Install via pip:
```bash
pip install 'litellm[proxy]' --break-system-packages

# Verify installation
litellm --version
```
Step 3: Configure Provider Routing
Create litellm_config.yaml:
```yaml
model_list:
  # Fast & cheap: GPT-4o-mini for simple tasks
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      cost_per_1k_input_tokens: 0.00015
      cost_per_1k_output_tokens: 0.0006

  # Balanced: Claude 3.5 Sonnet for complex tasks
  - model_name: claude-3-5-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
    model_info:
      cost_per_1k_input_tokens: 0.003
      cost_per_1k_output_tokens: 0.015

  # Fallback: Azure OpenAI (enterprise compliance)
  - model_name: gpt-4o-azure
    litellm_params:
      model: azure/gpt-4o
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2024-02-15-preview"

# Load balancing strategy
router_settings:
  routing_strategy: least-cost  # Options: least-cost, least-latency, simple-shuffle
  num_retries: 2
  timeout: 60
  fallbacks:
    - gpt-4o-mini
    - claude-3-5-sonnet
    - gpt-4o-azure

# Rate limiting (optional)
litellm_settings:
  max_parallel_requests: 100

# Logging (for debugging)
general_settings:
  master_key: your-secret-key-here              # Change this!
  database_url: postgresql://localhost/litellm  # Optional: for request logging
```
Key Configuration Decisions:
Routing Strategy:
- `least-cost`: Always route to the cheapest provider (recommended for batch processing)
- `least-latency`: Route to the fastest provider (recommended for user-facing apps)
- `simple-shuffle`: Distribute load evenly (recommended for rate limit management)
Fallback Chain:
- Primary fails → Try secondary → Try tertiary
- Prevents total outage if one provider is down
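The chain above amounts to a simple try-in-order loop. A hedged sketch of the idea (`call_provider` is a hypothetical stand-in for a real API call; here the primary is simulated as being down):

```python
# Sketch of a fallback chain: try each provider in order, moving on
# whenever a call raises. `call_provider` is a hypothetical stand-in.
def call_provider(model: str, prompt: str) -> str:
    if model == "gpt-4o-mini":
        raise ConnectionError("provider down")  # simulate a primary outage
    return f"{model}: response to {prompt!r}"

def complete_with_fallback(
    prompt: str,
    chain=("gpt-4o-mini", "claude-3-5-sonnet", "gpt-4o-azure"),
) -> str:
    last_err = None
    for model in chain:
        try:
            return call_provider(model, prompt)
        except Exception as err:  # failed -> fall through to the next model
            last_err = err
    raise RuntimeError("all providers failed") from last_err

print(complete_with_fallback("Hello"))  # the secondary answers after the primary fails
```

The proxy does this for you per request; you only declare the chain in the `fallbacks` list.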
Step 4: Set Up Environment Variables
Create .env file:
```bash
# OpenAI
OPENAI_API_KEY=sk-proj-...

# Anthropic
ANTHROPIC_API_KEY=sk-ant-...

# Azure OpenAI (if using)
AZURE_API_KEY=...
AZURE_API_BASE=https://your-resource.openai.azure.com
AZURE_DEPLOYMENT_NAME=gpt-4o

# Proxy master key (for authentication)
LITELLM_MASTER_KEY=sk-1234567890abcdef
```
Security Note: Never commit .env to git. Add to .gitignore.
Step 5: Start the Edge Proxy
Using Docker:
```bash
docker run -d \
  --name litellm-proxy \
  -p 4000:4000 \
  -v $(pwd)/litellm_config.yaml:/app/config.yaml \
  --env-file .env \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml
```
Using pip:
```bash
litellm --config litellm_config.yaml
```
Verify it's running:
```bash
curl http://localhost:4000/health
# Should return: {"status": "healthy"}
```
Step 6: Update Your Application Code
Before (Direct OpenAI calls):
```python
import openai

client = openai.OpenAI(api_key="sk-...")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}]
)
```
After (Via Edge Proxy):
```python
import openai

# Point to your proxy instead of OpenAI directly
client = openai.OpenAI(
    api_key="sk-1234567890abcdef",    # Your proxy master key
    base_url="http://localhost:4000"  # Your proxy URL
)

# Same code as before - no other changes needed!
response = client.chat.completions.create(
    model="gpt-4o-mini",  # Proxy will route intelligently
    messages=[{"role": "user", "content": "Hello!"}]
)
```
That's it! Your app now routes through the proxy after changing just two client parameters; the rest of your code is untouched.
Step 7: Test the Routing
Test cost-based routing:
```bash
# This should route to the cheapest provider (gpt-4o-mini)
curl -X POST http://localhost:4000/chat/completions \
  -H "Authorization: Bearer sk-1234567890abcdef" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
Test fallback:
```bash
# Temporarily disable your OpenAI key to test fallback
# The request should automatically retry with Anthropic
```
Check logs:
```bash
docker logs litellm-proxy

# Look for routing decisions:
# [INFO] Routing request to openai/gpt-4o-mini (cost: $0.0001)
# [INFO] Request successful (latency: 234ms)
```
Step 8: Advanced Configuration (Optional)
A) Custom Routing Rules
Route based on input characteristics:
```yaml
# In litellm_config.yaml
router_settings:
  routing_strategy: custom
  custom_routing:
    # Short prompts (<100 tokens) → cheapest
    - condition: input_tokens < 100
      route_to: gpt-4o-mini
    # Long prompts (>1000 tokens) → Claude (better context)
    - condition: input_tokens > 1000
      route_to: claude-3-5-sonnet
    # Default → balanced option
    - default: gpt-4o-mini
```
B) A/B Testing
Test new models on a percentage of traffic:
```yaml
router_settings:
  routing_strategy: weighted
  model_weights:
    gpt-4o-mini: 0.8       # 80% of traffic
    claude-3-5-haiku: 0.2  # 20% of traffic (testing)
```
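Under the hood, weighted splitting is just a weighted random choice per request. A quick sketch of the idea (model names and weights taken from the config above):

```python
import random

# Sketch of weighted traffic splitting: 80% to the incumbent model,
# 20% to the model under test. Weights mirror the YAML above.
WEIGHTS = {"gpt-4o-mini": 0.8, "claude-3-5-haiku": 0.2}

def pick_model() -> str:
    models, weights = zip(*WEIGHTS.items())
    return random.choices(models, weights=weights, k=1)[0]

random.seed(0)  # deterministic for the example
sample = [pick_model() for _ in range(10_000)]
share = sample.count("claude-3-5-haiku") / len(sample)
print(f"test-model share: {share:.1%}")  # close to 20%
```

Over enough requests the observed split converges on the configured weights, which is what makes the 20% slice a fair sample for comparing quality and cost.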
C) Regional Routing
Route to closest provider for latency:
```yaml
regional_routing:
  us-east: azure/gpt-4o-us-east
  eu-west: azure/gpt-4o-eu-west
  asia: anthropic/claude-3-5-sonnet  # Anthropic has good APAC latency
```
D) Rate Limit Management
Distribute load to avoid hitting limits:
```yaml
rate_limit_settings:
  openai:
    max_requests_per_minute: 3500
    max_tokens_per_minute: 90000
  anthropic:
    max_requests_per_minute: 4000
    max_tokens_per_minute: 400000

# Proxy automatically switches providers when nearing limits
```
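The switching behavior can be pictured as: track each provider's per-minute usage and skip any provider near its cap. A simplified sketch (request counts only; the limits mirror the YAML above, and the 90% headroom threshold is an assumption for illustration):

```python
# Simplified sketch of rate-limit-aware switching: skip any provider
# whose request count is near its per-minute cap.
LIMITS = {"openai": 3500, "anthropic": 4000}  # max requests per minute
usage = {"openai": 0, "anthropic": 0}         # requests seen this minute

def pick_provider(threshold: float = 0.9) -> str:
    """Return the first provider still below `threshold` of its cap."""
    for name, cap in LIMITS.items():
        if usage[name] < cap * threshold:  # still has headroom
            usage[name] += 1
            return name
    raise RuntimeError("all providers near their rate limits")

usage["openai"] = 3400       # OpenAI is close to its 3,500/min cap
print(pick_provider())       # anthropic
```

In the real proxy these counters reset every minute and also track tokens, not just requests, but the skip-when-near-cap logic is the core of it.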
Step 9: Monitor & Optimize
View Real-Time Metrics:
LiteLLM provides a built-in dashboard:
```bash
# Access at http://localhost:4000/ui
# Shows:
# - Requests per provider
# - Cost breakdown
# - Latency by provider
# - Error rates
```
Key Metrics to Track:
1. Cost Savings:
   - Before: All requests to GPT-4o
   - After: 70% routed to GPT-4o-mini (over 30x cheaper on input tokens)
   - Savings: ~$1,500/month on 100K calls
2. Latency:
   - Average response time by provider
   - P95 latency (95th percentile)
   - Timeout rate
3. Reliability:
   - Fallback success rate
   - Provider uptime
   - Error rate by provider
4. Cost per Request:
   - Track over time
   - Should decrease as routing optimizes
Step 10: Deploy to Production
Option A: Docker Compose (Simple)
```yaml
# docker-compose.yml
version: '3.8'

services:
  litellm-proxy:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    env_file:
      - .env
    restart: unless-stopped
    command: --config /app/config.yaml
```
Deploy:
```bash
docker-compose up -d
```
Option B: Kubernetes (Enterprise)
```yaml
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
spec:
  replicas: 3  # For high availability
  selector:
    matchLabels:
      app: litellm-proxy
  template:
    metadata:
      labels:
        app: litellm-proxy
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest
          ports:
            - containerPort: 4000
          volumeMounts:
            - name: config
              mountPath: /app/config.yaml
              subPath: litellm_config.yaml
          envFrom:
            - secretRef:
                name: litellm-secrets
      volumes:
        - name: config
          configMap:
            name: litellm-config
---
apiVersion: v1
kind: Service
metadata:
  name: litellm-proxy
spec:
  type: LoadBalancer
  ports:
    - port: 4000
      targetPort: 4000
  selector:
    app: litellm-proxy
```
Deploy:
```bash
kubectl apply -f k8s/
```
Option C: Managed Service (No Infrastructure)
Use Portkey.ai or LiteLLM Cloud:
- No server management
- Built-in observability
- Enterprise support
- Higher cost but zero ops
Expected Results
Before Edge Proxy:
- 100% of requests to OpenAI GPT-4o
- Cost: $0.005/1K input tokens
- Monthly cost (~1B input tokens): $5,000
- Single point of failure
After Edge Proxy:
- 70% routed to GPT-4o-mini: $0.00015/1K tokens
- 20% routed to Claude Sonnet: $0.003/1K tokens
- 10% remain on GPT-4o: $0.005/1K tokens
- Monthly cost: ~$1,200 (76% savings!)
- Multi-provider redundancy
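These numbers check out arithmetically: a $5,000/month GPT-4o bill at $0.005 per 1K input tokens implies roughly 1B input tokens per month, and the 70/20/10 split blends down to about $1,200:

```python
# Per-1K-input-token prices and traffic split from the figures above.
PRICE_PER_1K = {"gpt-4o": 0.005, "gpt-4o-mini": 0.00015, "claude-3-5-sonnet": 0.003}
SPLIT = {"gpt-4o-mini": 0.70, "claude-3-5-sonnet": 0.20, "gpt-4o": 0.10}

tokens = 1_000_000_000  # ~1B input tokens/month, implied by the $5,000 bill

before = tokens / 1000 * PRICE_PER_1K["gpt-4o"]
after = sum(tokens * share / 1000 * PRICE_PER_1K[model]
            for model, share in SPLIT.items())
savings = (before - after) / before

print(f"before=${before:,.0f}  after=${after:,.0f}  savings={savings:.0%}")
```

This covers input tokens only; output-token pricing shifts the absolute dollars but not the shape of the savings.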
Additional Benefits:
- No downtime if OpenAI has issues
- Automatic rate limit management
- Easy A/B testing of new models
- Centralized logging and monitoring
Troubleshooting
Issue: Proxy returns 401 Unauthorized
Cause: Incorrect master key
Fix:
```bash
# Check your .env file
cat .env | grep LITELLM_MASTER_KEY
```

Make sure your app uses this key:

```python
client = openai.OpenAI(api_key="your-master-key-here")
```
Issue: Requests still going directly to OpenAI
Cause: base_url not set correctly
Fix:
```python
# Ensure base_url points to the proxy
client = openai.OpenAI(
    base_url="http://localhost:4000",  # ← Must be set
    api_key="your-master-key"
)
```
Issue: Fallback not working
Cause: Timeout too high or fallback chain not configured
Fix:
```yaml
# In litellm_config.yaml
router_settings:
  timeout: 30      # Lower timeout triggers fallback faster
  num_retries: 2
  fallbacks:
    - gpt-4o-mini
    - claude-3-5-sonnet  # ← Must list all fallback options
```
Issue: High latency after adding proxy
Cause: Proxy not co-located with app
Fix:
- Deploy proxy close to your application (same region)
- Use connection pooling
- Enable keep-alive connections
Issue: Cost not decreasing
Cause: Wrong routing strategy
Fix:
```yaml
# Change to cost-optimized routing
router_settings:
  routing_strategy: least-cost  # ← Prioritize cost
```
Testing Checklist
Before deploying to production:
- Proxy responds to health checks
- Can authenticate with master key
- Requests route to primary provider
- Fallback triggers when primary fails
- Latency is acceptable (<500ms added)
- Logging captures request metadata
- Dashboard shows metrics correctly
- Cost tracking is accurate
- All API keys are in environment variables (not hardcoded)
- Proxy restarts automatically if it crashes
Maintenance
Weekly:
- Review cost savings vs last week
- Check error rates by provider
- Verify fallback success rate
Monthly:
- Update LiteLLM to latest version
- Review and optimize routing rules
- Test new models in A/B config
- Analyze latency trends
Quarterly:
- Re-evaluate provider pricing (providers change prices)
- Audit API key rotation
- Review capacity planning
Next Steps
Once your Edge Proxy is running:
1. Add Circuit Breakers (see Circuit Breaker Implementation Guide)
   - Prevent cascade failures
   - Automatic provider disabling
2. Implement Caching (see Caching Implementation Guide)
   - Cache repeated requests
   - Save 30-50% on duplicate queries
3. Set Up Monitoring (see Monitoring Setup Guide)
   - Grafana dashboards
   - Alert on high error rates
Additional Resources
- LiteLLM Documentation: https://docs.litellm.ai/docs/
- Proxy Architecture: https://docs.litellm.ai/docs/proxy/architecture
- Supported Providers: https://docs.litellm.ai/docs/providers
- Cost Calculator: https://litellm.ai/cost-calculator
Support
Questions about this implementation?
- Onaro Support: support@onaro.io
- LiteLLM Community: https://discord.gg/wuPM9dRgDw
- Book implementation call: https://onaro.io/support
Estimated Implementation Time: 2-4 hours
Difficulty: ★★★☆☆ (3/5)
Impact: ★★★★★ (5/5)
Last Updated: January 26, 2026
Tested with: LiteLLM v1.25.0, OpenAI SDK 1.12.0, Anthropic SDK 0.18.0