# LLM Provider Setup Guide
This guide covers how to configure and use the different LLM providers supported by Evidence-Bound.
## Quick Start

Set the `LLM_PROVIDER` environment variable to choose your provider:

```bash
# Options: azure_openai (default), anthropic, gemini, ollama
LLM_PROVIDER=azure_openai
```

## Provider Comparison
| Provider | Latency | Cost | Quality | Air-Gap | Best For |
|---|---|---|---|---|---|
| Azure OpenAI | Low | $$$ | Excellent | No | Enterprise with existing Azure |
| Anthropic Claude | Low | $$$ | Excellent | No | Best reasoning, legal analysis |
| Google Gemini | Very Low | $$ | Very Good | No | Cost-effective, high volume |
| Ollama (local) | Medium | Free | Good | Yes | On-prem, data sovereignty |
## 1. Azure OpenAI (Default)

Best for: Enterprise deployments with Azure infrastructure.

### Setup

- Create an Azure OpenAI resource in the Azure Portal
- Deploy a model (e.g., `gpt-5-mini`)
- Get your endpoint and API key
### Configuration

```bash
LLM_PROVIDER=azure_openai

# Required
AZURE_OPENAI_CHAT_ENDPOINT=https://your-resource.openai.azure.com
AZURE_OPENAI_CHAT_API_KEY=your-api-key
MODEL_ID=gpt-5-mini  # Your deployment name

# Optional
AZURE_OPENAI_CHAT_API_VERSION=2024-02-15-preview
```

### Notes

- `MODEL_ID` is your deployment name, not the model name
- Supports GPT-5-mini, GPT-4 Turbo, and GPT-3.5 Turbo
- Enterprise SLA and compliance certifications available
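Under the hood, a chat completion against an Azure OpenAI deployment is a single REST call to the deployment URL with an `api-key` header. The sketch below uses only the standard library; the helper name `build_chat_request` is ours, not part of Evidence-Bound:

```python
import json
import os
import urllib.request

def build_chat_request(endpoint, deployment, api_version, api_key, question):
    """Build a chat-completions request for the Azure OpenAI REST API."""
    url = (f"{endpoint.rstrip('/')}/openai/deployments/{deployment}"
           f"/chat/completions?api-version={api_version}")
    body = json.dumps({"messages": [{"role": "user", "content": question}]})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )

# Only attempt a live call when credentials are present in the environment
if os.environ.get("AZURE_OPENAI_CHAT_API_KEY"):
    req = build_chat_request(
        os.environ["AZURE_OPENAI_CHAT_ENDPOINT"],
        os.environ["MODEL_ID"],
        os.environ.get("AZURE_OPENAI_CHAT_API_VERSION", "2024-02-15-preview"),
        os.environ["AZURE_OPENAI_CHAT_API_KEY"],
        "What is 2+2?",
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

Note that the model segment of the URL is the deployment name, which is why `MODEL_ID` must match what you named the deployment in the Azure Portal.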
## 2. Anthropic Claude

Best for: Complex legal reasoning, nuanced analysis.

### Setup

- Create an account at console.anthropic.com
- Generate an API key
- Add billing information
### Configuration

```bash
LLM_PROVIDER=anthropic

# Required
ANTHROPIC_API_KEY=sk-ant-api03-xxxxx

# Optional (defaults shown)
ANTHROPIC_MODEL=claude-sonnet-4-20250514
```

### Available Models

| Model | Speed | Quality | Cost | Notes |
|---|---|---|---|---|
| claude-sonnet-4-20250514 | Fast | Excellent | $$ | Recommended - best balance |
| claude-opus-4-20250514 | Slower | Best | $$$ | Highest capability |
| claude-3-5-sonnet-20241022 | Fast | Excellent | $$ | Previous generation |
| claude-3-5-haiku-20241022 | Very Fast | Good | $ | Cost-effective |
### Notes

- Excellent at following complex instructions
- Strong performance on legal document analysis
- 200K context window on most models
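Requests to Claude go through Anthropic's Messages API, which requires an `anthropic-version` header and an explicit `max_tokens` field. A stdlib-only sketch (the helper name `build_messages_request` is illustrative; the default model matches this guide's default):

```python
import json
import os
import urllib.request

ANTHROPIC_VERSION = "2023-06-01"  # required anthropic-version header value

def build_messages_request(api_key, question, model="claude-sonnet-4-20250514"):
    """Build a request for Anthropic's Messages API."""
    payload = {
        "model": model,
        "max_tokens": 1024,  # max_tokens is a required field on this API
        "messages": [{"role": "user", "content": question}],
    }
    return urllib.request.Request(
        "https://api.anthropic.com/v1/messages",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,
            "anthropic-version": ANTHROPIC_VERSION,
        },
        method="POST",
    )

# Only attempt a live call when a key is present in the environment
if os.environ.get("ANTHROPIC_API_KEY"):
    req = build_messages_request(os.environ["ANTHROPIC_API_KEY"], "What is 2+2?")
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["content"][0]["text"])
```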
## 3. Google Gemini

Best for: Cost-effective deployments, fast response times.

### Setup

- Go to Google AI Studio
- Create an API key
- Enable the Generative AI API
### Configuration

```bash
LLM_PROVIDER=gemini

# Required
GEMINI_API_KEY=your-api-key

# Optional (defaults shown)
GEMINI_MODEL=gemini-2.0-flash
```

### Available Models

| Model | Speed | Quality | Cost | Notes |
|---|---|---|---|---|
| gemini-2.0-flash | Very Fast | Very Good | $ | Recommended - best value |
| gemini-1.5-pro | Fast | Excellent | $$ | Longer context (1M tokens) |
| gemini-1.5-flash | Very Fast | Good | $ | Balance of speed/quality |
### Notes

- Very competitive pricing
- Fast response times
- Good for high-volume workloads
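Gemini requests hit the `generateContent` REST endpoint, with the key passed as an `x-goog-api-key` header (keeping it out of URLs and server logs). A stdlib-only sketch; the helper name is ours and the default model matches this guide's default:

```python
import json
import os
import urllib.request

def build_generate_request(api_key, question, model="gemini-2.0-flash"):
    """Build a request for the Gemini generateContent REST endpoint."""
    url = (f"https://generativelanguage.googleapis.com/v1beta/models/"
           f"{model}:generateContent")
    payload = {"contents": [{"parts": [{"text": question}]}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )

# Only attempt a live call when a key is present in the environment
if os.environ.get("GEMINI_API_KEY"):
    req = build_generate_request(os.environ["GEMINI_API_KEY"], "What is 2+2?")
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
        print(data["candidates"][0]["content"]["parts"][0]["text"])
```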
## 4. Ollama (Local / On-Prem)

Best for: Air-gapped environments, data sovereignty, development.

### Setup

1. Install Ollama

   ```bash
   # macOS / Linux
   curl -fsSL https://ollama.ai/install.sh | sh

   # Windows: download from https://ollama.ai/download
   ```

2. Start the Ollama server

   ```bash
   ollama serve
   # Server runs on http://localhost:11434
   ```

3. Pull a model

   ```bash
   # Recommended for most cases (16GB RAM)
   ollama pull llama3.2:8b

   # Alternative models
   ollama pull mistral:7b    # Fast, good reasoning
   ollama pull qwen2.5:7b    # Good for structured tasks
   ollama pull llama3.3:70b  # Best quality (needs 40GB+ VRAM)
   ```
### Configuration

```bash
LLM_PROVIDER=ollama

# Optional (defaults shown)
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.2:8b
```

### Recommended Models for Legal/RAG

| Model | RAM Required | Quality | Speed | Use Case |
|---|---|---|---|---|
| llama3.2:8b | 16GB | Good | Fast | General use - best balance |
| llama3.3:70b | 40GB+ VRAM | Excellent | Slow | Complex legal reasoning |
| mistral:7b | 16GB | Good | Very Fast | Quick queries |
| qwen2.5:7b | 16GB | Good | Fast | Structured extraction |
### Remote Ollama Server

To use Ollama on a different machine:

```bash
# On the Ollama server, allow external connections
OLLAMA_HOST=0.0.0.0 ollama serve

# In your .env
OLLAMA_BASE_URL=http://ollama-server.internal:11434
```

### GPU Acceleration
Ollama automatically uses the GPU if one is available:

- NVIDIA: install CUDA drivers
- Apple Silicon: Metal acceleration is automatic
- AMD: ROCm support (Linux only)

Check GPU usage:

```bash
ollama ps  # Shows running models and memory usage
```

### Notes
- No API costs - runs entirely locally
- Data never leaves your network
- Longer response times than cloud providers
- Quality varies by model size
- First request may be slow (model loading)
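Because the first request can be slow and a stopped server simply refuses connections, it is useful to probe the server before sending real traffic. The sketch below queries Ollama's real `/api/tags` endpoint (the one `ollama list` uses); the helper name `ollama_available` is ours:

```python
import json
import urllib.request

def ollama_available(base_url="http://localhost:11434", timeout=2.0):
    """Return the names of pulled models, or None if the server is unreachable."""
    try:
        url = f"{base_url.rstrip('/')}/api/tags"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            data = json.load(resp)
    except OSError:  # covers connection refused, DNS failure, and timeouts
        return None
    return [m["name"] for m in data.get("models", [])]
```

A `None` result means "run `ollama serve`"; an empty list means the server is up but no model has been pulled yet.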
## Switching Providers

Switching providers requires only configuration changes:

```bash
# Development: Use Ollama (free, local)
LLM_PROVIDER=ollama
OLLAMA_MODEL=llama3.2:8b

# Production: Use Azure OpenAI (enterprise SLA)
LLM_PROVIDER=azure_openai
AZURE_OPENAI_CHAT_ENDPOINT=https://...
AZURE_OPENAI_CHAT_API_KEY=...
MODEL_ID=gpt-5-mini
```

No code changes are required: the `get_llm_client()` factory function returns the appropriate client based on configuration.
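The dispatch behind such a factory can be pictured as a small registry keyed by `LLM_PROVIDER`. A minimal sketch, with placeholder client classes standing in for the real implementations in `app.llm`:

```python
import os

# Placeholder clients: the real classes wrap each provider's API
class AzureOpenAIClient:
    provider = "azure_openai"

class AnthropicClient:
    provider = "anthropic"

class GeminiClient:
    provider = "gemini"

class OllamaClient:
    provider = "ollama"

_CLIENTS = {
    "azure_openai": AzureOpenAIClient,
    "anthropic": AnthropicClient,
    "gemini": GeminiClient,
    "ollama": OllamaClient,
}

def get_llm_client():
    """Instantiate the client selected by LLM_PROVIDER (default: azure_openai)."""
    provider = os.environ.get("LLM_PROVIDER", "azure_openai")
    try:
        return _CLIENTS[provider]()
    except KeyError:
        raise ValueError(f"Unknown LLM_PROVIDER: {provider!r}") from None
```

Failing loudly on an unknown provider name catches typos like `LLM_PROVIDER=openai` at startup rather than mid-request.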
## Testing Your Configuration

### Verify Provider Works

```bash
# Run the LLM provider tests
pytest tests/test_llm_providers.py -v

# Test a specific provider (requires credentials)
python -c "
from app.llm import get_llm_client
client = get_llm_client()
print(f'Provider: {client.provider}')
print(f'Model: {client.model}')
"
```

### Test with a Simple Query
```bash
# Start the API server
cd apps/api && uvicorn app.main:app --reload

# Make a test request
curl -X POST http://localhost:8000/v1/ask \
  -H "Content-Type: application/json" \
  -H "X-Tenant-Id: test-tenant" \
  -H "X-Matter-Id: test-matter" \
  -d '{"question": "What is 2+2?"}'
```

## Troubleshooting
### Azure OpenAI

| Error | Cause | Fix |
|---|---|---|
| 401 Unauthorized | Invalid API key | Check `AZURE_OPENAI_CHAT_API_KEY` |
| 404 Not Found | Wrong endpoint or deployment | Verify `AZURE_OPENAI_CHAT_ENDPOINT` and `MODEL_ID` |
| 429 Too Many Requests | Rate limit exceeded | Implement backoff or upgrade quota |
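For the 429 case, a generic exponential-backoff wrapper is usually enough. This is our own sketch, not part of Evidence-Bound; it assumes the wrapped call raises exceptions carrying a `status` attribute on HTTP errors:

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0, retryable=(429, 503)):
    """Retry `call` with exponential backoff plus jitter on retryable statuses."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "status", None)
            if status not in retryable or attempt == max_retries - 1:
                raise  # non-retryable error, or retries exhausted
            # 1s, 2s, 4s, ... plus jitter to avoid retrying in lockstep
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

The same wrapper works for the Anthropic and Gemini 429s listed below, since all three providers signal rate limits with the same status code.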
### Anthropic

| Error | Cause | Fix |
|---|---|---|
| 401 Invalid API key | Bad or expired key | Regenerate key at console.anthropic.com |
| 429 Rate limit | Too many requests | Add delay between requests |
| 400 Bad request | Invalid parameters | Check model name and parameters |
### Gemini

| Error | Cause | Fix |
|---|---|---|
| 403 API key invalid | Bad key or API not enabled | Enable Generative AI API in Google Cloud |
| 429 Rate limit | Quota exceeded | Check quota at Google Cloud Console |
| 400 Bad request | Invalid model or parameters | Verify model name |
### Ollama

| Error | Cause | Fix |
|---|---|---|
| Connection refused | Ollama not running | Run `ollama serve` |
| Model not found | Model not pulled | Run `ollama pull <model>` |
| Timeout | Model too large / slow hardware | Use smaller model or increase timeout |
| Out of memory | Insufficient RAM/VRAM | Use smaller model or add memory |
## Security Considerations

### API Key Management

- Never commit API keys to version control
- Use environment variables or a secrets manager
- Rotate keys periodically
- Use separate keys for dev/staging/production
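One practical companion to these rules is a startup check that fails fast when a required variable is unset, so a misconfigured deployment never makes a half-authenticated request. The mapping below mirrors the required variables listed in this guide; the helper name `missing_credentials` is illustrative:

```python
import os

# Required environment variables per provider, per this guide
REQUIRED_KEYS = {
    "azure_openai": ["AZURE_OPENAI_CHAT_ENDPOINT",
                     "AZURE_OPENAI_CHAT_API_KEY", "MODEL_ID"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "gemini": ["GEMINI_API_KEY"],
    "ollama": [],  # no key needed; the base URL has a default
}

def missing_credentials(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    provider = env.get("LLM_PROVIDER", "azure_openai")
    return [k for k in REQUIRED_KEYS.get(provider, []) if not env.get(k)]
```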
### Ollama Security

- Ollama has no built-in authentication
- Don't expose Ollama to the public internet
- Use firewall rules to restrict access
- Consider a VPN for remote access
### Error Message Sanitization

All providers sanitize error messages to prevent API key leakage:
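A minimal sketch of this kind of redaction (the patterns and function name are illustrative, not the project's actual implementation; the first two match the common shapes of Google-style and Anthropic keys):

```python
import re

_KEY_PATTERNS = [
    re.compile(r"AIza[0-9A-Za-z_\-]{35}"),  # Google-style API keys
    re.compile(r"sk-ant-[0-9A-Za-z\-]+"),   # Anthropic keys
    re.compile(r"key=[^&\s]+"),             # keys echoed back in query strings
]

def sanitize(message: str) -> str:
    """Replace anything that looks like an API key with a redaction marker."""
    for pattern in _KEY_PATTERNS:
        message = pattern.sub("[REDACTED]", message)
    return message
```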
```text
# Gemini: API keys in error responses are redacted
"Gemini HTTP 400: [REDACTED]..."

# Ollama: Internal URLs are not exposed
"Cannot connect to Ollama server. Is Ollama running?"
```

## Environment Variable Reference
```bash
# === Provider Selection ===
LLM_PROVIDER=azure_openai  # azure_openai | anthropic | gemini | ollama

# === Azure OpenAI ===
AZURE_OPENAI_CHAT_ENDPOINT=https://your-resource.openai.azure.com
AZURE_OPENAI_CHAT_API_KEY=your-key
AZURE_OPENAI_CHAT_API_VERSION=2024-02-15-preview
MODEL_ID=gpt-5-mini

# === Anthropic ===
ANTHROPIC_API_KEY=sk-ant-xxx
ANTHROPIC_MODEL=claude-sonnet-4-20250514

# === Gemini ===
GEMINI_API_KEY=your-key
GEMINI_MODEL=gemini-2.0-flash

# === Ollama ===
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.2:8b
```