Evidence-Bound: Technical Architecture
For Technical Investors | Last Updated: April 2026
System Overview
┌─────────────────────────────────────────────────────────────────────────┐
│ CLIENTS │
│ Next.js Web App (Vercel) │ Mobile (Future) │ API Integrations │
└─────────────────────────────────┬───────────────────────────────────────┘
│ HTTPS/REST
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ API GATEWAY LAYER │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ JWT Auth │ │ Rate Limit │ │ Tenant │ │ Audit │ │
│ │ Middleware │ │ Middleware │ │ Resolution │ │ Logging │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────┬───────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ FASTAPI APPLICATION │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ SERVICE LAYER │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Ask Service │ │ Doc Service │ │ Session Svc │ │ │
│ │ │ (Q&A Orch.) │ │ (Upload) │ │ (History) │ │ │
│ │ └──────┬──────┘ └──────┬──────┘ └─────────────┘ │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ CORE PIPELINE COMPONENTS │ │ │
│ │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │
│ │ │ │Retrieval │ │Evidence │ │Policy │ │Verifier │ │ │ │
│ │ │ │(Search) │ │(Citation)│ │(Gates) │ │(LLM QA) │ │ │ │
│ │ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────┬───────────────────────────────────────┘
│
┌───────────────────────┼───────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ POSTGRESQL │ │ SEARCH INDEX │ │ LLM PROVIDER │
│ (Data Store) │ │ (Azure/pgvector)│ │ (Multi-provider)│
└─────────────────┘ └─────────────────┘ └─────────────────┘
Technology Stack
| Layer | Technology | Rationale |
|---|---|---|
| Frontend | Next.js 16 + TypeScript | SSR, React ecosystem, Vercel deployment |
| API | FastAPI + Python 3.12 | Async performance, type hints, OpenAPI |
| Database | PostgreSQL 15 | ACID compliance, JSON support, pgvector ready |
| Search | Azure AI Search / pgvector | Hybrid BM25+vector, configurable provider |
| Embeddings | Azure OpenAI / Local | text-embedding-3-large (3072D) or hash-based |
| LLM | Azure OpenAI / Anthropic / Gemini / Ollama | Multi-provider support via config |
| Document Parsing | Marker / LlamaParse / PyPDF | Configurable parser (NFR-036) |
| Observability | OpenTelemetry + Langfuse | Distributed tracing, LLM-specific metrics |
| Auth | JWT + OIDC (Microsoft/Google) | Refresh tokens, SSO, account lockout |
Core Components
1. Retrieval Pipeline (retrieval.py)
Hybrid Search Strategy:
Query → Embed → [BM25 Search] + [Vector Search] → Rerank → Top-K Chunks
| Stage | Purpose | Configuration |
|---|---|---|
| BM25 | Keyword matching, exact terms | Weight: 0.3 |
| Vector | Semantic similarity | Weight: 0.7, k=50 |
| Reranker | Cross-encoder reordering | Semantic configuration |
| Top-K | Final chunk selection | k=10 (configurable) |
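The weighted fusion in the table above can be sketched as follows. This is an illustrative sketch, not the actual retrieval.py implementation: the `Chunk` fields and `fuse_and_select` name are assumptions, and it presumes both score columns are already normalized to [0, 1].

```python
# Hypothetical sketch of weighted score fusion; names are illustrative.
from dataclasses import dataclass

BM25_WEIGHT = 0.3    # keyword weight from the table above
VECTOR_WEIGHT = 0.7  # semantic weight from the table above
TOP_K = 10           # final chunk count (configurable)

@dataclass
class Chunk:
    chunk_id: str
    bm25_score: float    # assumed normalized to [0, 1]
    vector_score: float  # assumed cosine similarity in [0, 1]

def fuse_and_select(chunks: list[Chunk], k: int = TOP_K) -> list[Chunk]:
    """Combine keyword and semantic scores, then keep the top-k chunks."""
    ranked = sorted(
        chunks,
        key=lambda c: BM25_WEIGHT * c.bm25_score + VECTOR_WEIGHT * c.vector_score,
        reverse=True,
    )
    return ranked[:k]
```

In a production path the reranker would reorder this fused list before the final top-k cut; the sketch stops at fusion for brevity.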
Tenant Isolation:
# Every search includes mandatory filters
filter_expression = f"tenant_id eq '{tenant_id}' and matter_id eq '{matter_id}'"
2. Evidence Validation (evidence.py)
Post-LLM Citation Verification:
- Extract citation spans from LLM response
- Verify each cited chunk exists in retrieval results
- Validate page numbers match document metadata
- Reject response if any citation is invalid
Refusal Conditions:
- No citations in response → Refuse
- Citation references non-existent chunk → Refuse
- Page number doesn’t exist in document → Refuse
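The refusal rules above can be sketched as a single validation pass. This is a hedged sketch, not the actual evidence.py API: the citation dict shape, `retrieved_ids`, and `page_counts` parameters are assumptions.

```python
# Illustrative sketch of the refusal conditions; names are assumptions.
def validate_citations(
    citations: list[dict],        # assumed shape: {"chunk_id": "...", "page": 3}
    retrieved_ids: set[str],      # chunk IDs actually returned by retrieval
    page_counts: dict[str, int],  # chunk_id -> page count of its source document
) -> bool:
    """Return True only if every citation is grounded; otherwise refuse."""
    if not citations:
        return False  # no citations in response -> refuse
    for c in citations:
        if c["chunk_id"] not in retrieved_ids:
            return False  # cites a chunk that was never retrieved -> refuse
        if not 1 <= c["page"] <= page_counts.get(c["chunk_id"], 0):
            return False  # page number outside the document -> refuse
    return True
```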
3. Policy Engine (policy.py)
Pre-LLM Gates:
- Query length validation
- Tenant/matter authorization
- Rate limiting check
Post-LLM Gates:
- Confidence threshold (< 0.70 → refuse)
- Citation validation (via evidence.py)
- Content policy compliance
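The post-LLM gates can be sketched as a small decision function. The 0.70 threshold comes from the text; the function name and return values are illustrative, not the actual policy.py API.

```python
# Sketch of the post-LLM confidence gate; names are hypothetical.
CONFIDENCE_THRESHOLD = 0.70  # from the gate described above

def post_llm_gate(confidence: float, citations_valid: bool) -> str:
    """Return 'answer' only when both the confidence and citation gates pass."""
    if confidence < CONFIDENCE_THRESHOLD:
        return "refuse"  # below threshold -> refuse
    if not citations_valid:
        return "refuse"  # ungrounded citations (via evidence.py) -> refuse
    return "answer"
```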
4. LLM Verification (verification.py)
Secondary LLM Check:
# Verifies answer relevance to retrieved chunks
relevance_score = verify_relevance(
    question=query,
    chunk_text=retrieved_context,
    answer=llm_response,
)
Returns confidence score (0.0-1.0) used by policy engine.
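A minimal sketch of such a secondary check is below. The real verification.py prompt and parsing may differ; here the LLM is an injected callable so the scoring logic can be exercised without a provider, and the prompt text is an assumption.

```python
# Hedged sketch of a secondary-LLM relevance check; `llm_call` is an
# injected callable standing in for a real provider client.
from typing import Callable

PROMPT = (
    "On a scale of 0.0 to 1.0, how well is the ANSWER supported by the "
    "CONTEXT for the QUESTION? Reply with a single number.\n"
    "QUESTION: {question}\nCONTEXT: {chunk_text}\nANSWER: {answer}"
)

def verify_relevance(
    question: str,
    chunk_text: str,
    answer: str,
    llm_call: Callable[[str], str],
) -> float:
    """Ask a second model to score groundedness; clamp the result to [0.0, 1.0]."""
    raw = llm_call(PROMPT.format(question=question, chunk_text=chunk_text, answer=answer))
    try:
        score = float(raw.strip())
    except ValueError:
        return 0.0  # unparseable reply -> treat as unverified
    return max(0.0, min(1.0, score))
```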
Data Model
11 tables, each with an indexed tenant_id column. Every query enforces tenant/matter scope.
Key tables: documents → chunks → index_records (ingestion pipeline), qa_sessions → qa_messages (conversations), users → matter_assignments (RBAC), telemetry (per-request metrics), audit_events (immutable log).
See Architecture Diagrams for the ER diagram and data-model.md for complete SQL schemas.
Provider Abstraction
Status: ✅ Fully Implemented (NFR-032, NFR-034, NFR-035, NFR-036)
The architecture supports pluggable providers for deployment flexibility:
Implemented Abstractions
# config.py - Provider selection (change via env vars only)
LLM_PROVIDER = "azure_openai" # azure_openai | anthropic | gemini | ollama
SEARCH_PROVIDER = "local" # local (pgvector) | azure
EMBEDDINGS_MODE = "remote" # remote (Azure) | local (hash-based)
PARSER_PROVIDER = "marker" # marker | llamaparse | pypdf
All four abstractions use Python Protocol interfaces. See interfaces.md for full definitions, method signatures, and per-provider configuration.
No code changes needed — swap providers via environment variables only.
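The Protocol-based pattern can be sketched as below. The method signature and provider class are illustrative only; the real definitions live in interfaces.md.

```python
# Illustrative Protocol-based provider interface; signatures are assumptions.
from typing import Protocol

class LLMProvider(Protocol):
    def complete(self, prompt: str, *, max_tokens: int = 1024) -> str: ...

class OllamaProvider:
    """Satisfies LLMProvider structurally -- no inheritance needed."""
    def complete(self, prompt: str, *, max_tokens: int = 1024) -> str:
        return f"(ollama) {prompt[:20]}"  # placeholder for a real model call

def get_llm_provider(name: str) -> LLMProvider:
    """Resolve a provider by name, e.g. from the LLM_PROVIDER env var."""
    providers: dict[str, LLMProvider] = {"ollama": OllamaProvider()}
    return providers[name]
```

Structural typing is what makes the env-var swap possible: any class with a matching `complete` signature type-checks against the Protocol without touching caller code.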
Deployment Architecture
Cloud Deployment (Current)
┌─────────────────────────────────────────────────────────────────┐
│ AZURE │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Container Apps │ │ Flexible Server │ │ AI Search │ │
│ │ (FastAPI) │ │ (PostgreSQL) │ │ (Hybrid Index) │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │ │
│ └────────────────────┼────────────────────┘ │
│ │ │
│ ┌─────────────────┐ ┌───────┴───────┐ ┌─────────────────┐ │
│ │ Azure OpenAI │ │ Blob Storage │ │ Key Vault │ │
│ │(GPT-5-mini+Emb) │ │ (Documents) │ │ (Secrets) │ │
│ └─────────────────┘ └───────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ VERCEL │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Next.js Frontend │ │
│ │ (SSR, Static Assets, Edge) │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Deployment Tiers
| Tier | Components | Monthly Cost Estimate |
|---|---|---|
| Development | Container Apps (B1), PostgreSQL (Burstable B1), AI Search (Free) | ~$50 |
| Production | Container Apps (P1v3), PostgreSQL (GP D2s), AI Search (Standard S1) | ~$500-800 |
| Enterprise | Dedicated VNet, Premium PostgreSQL, Reserved capacity | ~$2,000+ |
On-Premises Deployment
Customer Data Center
├── Kubernetes Cluster
│ ├── API Pods (FastAPI)
│ ├── Worker Pods (Document Processing)
│ └── Ingress Controller
├── PostgreSQL (+ pgvector extension)
├── MinIO (S3-compatible storage)
└── Local LLM (Ollama) OR Anthropic/Gemini API via VPN
Configuration:
# On-prem deployment tier
LLM_PROVIDER=ollama # Local Llama 3.2
SEARCH_PROVIDER=local # PostgreSQL + pgvector
EMBEDDINGS_MODE=local # Hash-based (or remote via VPN)
PARSER_PROVIDER=marker # Marker (offline PDF parsing)
Security Architecture
Data Protection
| Layer | Mechanism |
|---|---|
| Transport | TLS 1.3 (enforced) |
| Storage | AES-256 encryption at rest |
| Secrets | Azure Key Vault / env injection |
| Logs | PII redaction before write |
| Tokens | SHA256 hashes only (no plaintext) |
Authentication Flow (FR-050, FR-051)
See Architecture Diagrams — Authentication Flow for the full sequence diagram.
Security Features (FR-050):
- Password hashing: Argon2id (OWASP recommended)
- Account lockout: 5 failed attempts → 30 min lock
- Refresh token rotation: New token on each refresh
- Token revocation: All tokens revoked on password change
- PKCE for SSO: Protects against authorization code interception
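The lockout policy above (5 failures, 30-minute lock) can be sketched as a small tracker. This is a sketch under assumed names and in-memory storage; the actual implementation and its persistence layer may differ.

```python
# Sketch of the account-lockout policy; storage and names are illustrative.
from datetime import datetime, timedelta, timezone

MAX_FAILURES = 5               # from the policy above
LOCKOUT = timedelta(minutes=30)

class LockoutTracker:
    def __init__(self) -> None:
        self._failures: dict[str, int] = {}
        self._locked_until: dict[str, datetime] = {}

    def record_failure(self, user_id: str, now: datetime) -> None:
        self._failures[user_id] = self._failures.get(user_id, 0) + 1
        if self._failures[user_id] >= MAX_FAILURES:
            self._locked_until[user_id] = now + LOCKOUT

    def is_locked(self, user_id: str, now: datetime) -> bool:
        until = self._locked_until.get(user_id)
        return until is not None and now < until

    def record_success(self, user_id: str) -> None:
        # Successful login clears both counters.
        self._failures.pop(user_id, None)
        self._locked_until.pop(user_id, None)
```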
Tenant Isolation Enforcement
# FastAPI dependency injects tenant context on every request
def get_tenant_context(request: Request) -> TenantContext:
    # Extract from JWT claims (AUTH_MODE=jwt) or headers (dev mode)
    token = validate_jwt(request)
    tenant_id = token["tenant_id"]
    user_id = token["sub"]
    user_role = Role(token["role"])
    return TenantContext(
        tenant_id=tenant_id,
        user_id=user_id,
        user_role=user_role,
    )

# Used on every endpoint via Depends()
@router.get("/v1/matters")
async def list_matters(ctx: TenantContext = Depends(get_tenant_context)):
    # ctx.tenant_id is guaranteed present (enforced at extraction)
    matters = list_matters_for_tenant(
        tenant_id=ctx.tenant_id,
        user_id=ctx.user_id,
        user_role=ctx.user_role.value,
    )
    return matters
Observability Stack
Metrics & Tracing
Application
│
├── OpenTelemetry SDK
│ ├── Traces → Azure Monitor / Jaeger
│ ├── Metrics → Prometheus / Azure Monitor
│ └── Logs → stdout → Azure Log Analytics
│
└── Langfuse Integration (NFR-045)
├── LLM Call Traces (@observe decorators)
├── Token Usage (gen_ai.usage.* metrics)
├── Latency Distributions (llm.latency_ms)
└── Model Performance (by provider)
Key Metrics Tracked
| Metric | Purpose | OTEL Semantic Convention |
|---|---|---|
| llm.latency_ms | LLM response time | Custom |
| gen_ai.usage.prompt_tokens | Input token count | GenAI |
| gen_ai.usage.completion_tokens | Output token count | GenAI |
| gen_ai.request.model | Model identifier | GenAI |
| gen_ai.system | Provider name | GenAI |
| retrieval.latency_ms | Search latency | Custom |
| refusal_rate | Percentage of refused queries | Custom |
| cache_hit_rate | Embedding cache efficiency | Custom |
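Each LLM call also produces one row in the telemetry table. A minimal sketch of shaping that row is below; the helper name and the per-token rates are illustrative assumptions, not real pricing or the actual logging code.

```python
# Hypothetical helper that shapes one telemetry row per LLM call.
from datetime import datetime, timezone

# Illustrative $/1K-token (input, output) rates -- NOT real pricing.
RATES = {"gpt-5-mini": (0.00015, 0.0006)}

def telemetry_row(tenant_id: str, model_id: str,
                  tokens_in: int, tokens_out: int) -> dict:
    """Build a row matching the telemetry table's token and cost columns."""
    rate_in, rate_out = RATES.get(model_id, (0.0, 0.0))
    return {
        "tenant_id": tenant_id,
        "model_id": model_id,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cost_est": tokens_in / 1000 * rate_in + tokens_out / 1000 * rate_out,
        "timestamp_utc": datetime.now(timezone.utc),
    }
```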
Database Telemetry Table:
-- All LLM calls logged to telemetry table (NFR-030)
SELECT tenant_id, model_id,
       SUM(tokens_in) AS total_prompt_tokens,
       SUM(tokens_out) AS total_completion_tokens,
       SUM(cost_est) AS total_cost_usd
FROM telemetry
WHERE timestamp_utc >= NOW() - INTERVAL '30 days'
GROUP BY tenant_id, model_id;
Scalability Considerations
Current Capacity
| Resource | Limit | Scaling Path |
|---|---|---|
| Concurrent requests | ~10-15 req/s per instance | Horizontal pod scaling |
| Document processing | ~10 docs/min | Worker queue + async |
| Search index | 1M chunks | Index partitioning |
| Database | 100 GB | Vertical scaling, read replicas |
Scaling Strategy
- Stateless API: Horizontal scaling via container replicas
- Async Processing: Document ingestion queued (Azure Queue / Redis)
- Caching: Embedding cache cuts repeated embedding calls by ~40%
- Index Partitioning: Per-tenant indexes for large deployments
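The caching point above can be sketched as a minimal embedding cache keyed by content hash. The class name and hit/miss counters are illustrative, and the embedder is an injected callable so the sketch stays provider-agnostic.

```python
# Minimal sketch of an embedding cache; names are illustrative.
import hashlib
from typing import Callable

class EmbeddingCache:
    def __init__(self, embed: Callable[[str], list[float]]) -> None:
        self._embed = embed
        self._store: dict[str, list[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> list[float]:
        """Return a cached embedding, computing it only on first sight."""
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed(text)
        return self._store[key]
```

Hashing the text (rather than using it directly as the key) keeps cache keys fixed-size, which matters when chunks run to thousands of characters.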
Development Practices
Quality Gates (CI/CD)
# All must pass before merge
ruff check apps/ # Linting
mypy apps/api/app --strict # Type checking (NFR-040)
pytest tests/ -v # Unit + integration
pytest evals/ -v # Golden query evals (>95% pass required)
Test Coverage
| Category | Coverage Target | Current |
|---|---|---|
| Unit tests (624 tests) | >80% | ~85% |
| Integration tests | Critical paths | ✅ |
| Golden query evals | >95% pass rate | ✅ |
| LLM behavior tests | Adversarial prompts | ✅ |
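A golden-query eval of the kind gated in CI can be sketched in pytest style. The dataset shape, `run_pipeline` stub, and pass criterion wiring are assumptions, not the actual evals/ layout; only the >95% threshold comes from the text.

```python
# Sketch of a golden-query eval; dataset and names are illustrative.
GOLDEN = [
    {"query": "What is the notice period?", "must_cite": "contract.pdf"},
]

def run_pipeline(query: str) -> dict:
    # Stand-in for the real Q&A pipeline; returns answer + citations.
    return {"answer": "30 days [contract.pdf p.4]", "citations": ["contract.pdf"]}

def test_golden_queries() -> None:
    """Fail the build if the grounded-citation pass rate drops to 95% or below."""
    passed = sum(
        1 for case in GOLDEN
        if case["must_cite"] in run_pipeline(case["query"])["citations"]
    )
    assert passed / len(GOLDEN) > 0.95  # CI gate from the section above
```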
Test-Driven Development (TDD)
Enforced via CLAUDE.md:
RED → Write test that fails (proves test works)
GREEN → Write minimum code to pass
REFACTOR → Clean up, maintain passing tests
COMMIT → Only after green
Technical Risks & Mitigations
| Risk | Mitigation |
|---|---|
| LLM hallucination | Post-LLM citation validation, confidence gating |
| Search relevance drift | Golden query evals in CI, reranker tuning |
| Vendor lock-in | Provider abstraction interfaces (implemented) |
| Cost overrun | Token tracking, caching, query limits, telemetry table |
| Data breach | Tenant isolation at DB layer, encryption, audit log |
| Account compromise | Account lockout, refresh token rotation, MFA (planned) |
Roadmap (Technical)
| Phase | Focus | Status |
|---|---|---|
| Phase 2 | Production hardening | ✅ Complete (SSO, export, observability) |
| Phase 3 | Multi-tenancy | ✅ Complete (RBAC, matter-level permissions) |
| Phase 4 | Deployment flexibility | ✅ Complete (Provider abstraction) |
| Phase 5 | Advanced features | 🚧 In Progress (Fine-tuned models, custom retrievers) |
Implementation Status
| Feature | Status | FRs |
|---|---|---|
| Tenant isolation | ✅ | FR-001 |
| Matter isolation | ✅ | FR-002 |
| RBAC | ✅ | FR-003 |
| Matter-level permissions | ✅ | FR-004 |
| JWT authentication | ✅ | FR-050 |
| OIDC SSO | ✅ | FR-051 |
| Audit logging | ✅ | FR-040 |
| Data retention policies | ✅ | FR-042 |
| Provider abstraction (LLM) | ✅ | NFR-032 |
| Provider abstraction (Search) | ✅ | NFR-034 |
| Provider abstraction (Embedding) | ✅ | NFR-035 |
| Provider abstraction (Parser) | ✅ | NFR-036 |
| Type safety (mypy --strict) | ✅ | NFR-040 |
| LLM telemetry | ✅ | NFR-030, NFR-045 |
For detailed schemas, see data-model.md. For provider interfaces, see interfaces.md. For feature descriptions, see FEATURES.md.