Architecture Overview

Evidence-Bound: Technical Architecture

For Technical Investors | Last Updated: April 2026


System Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                                 CLIENTS                                 │
│     Next.js Web App (Vercel)  │  Mobile (Future)  │  API Integrations   │
└─────────────────────────────────────┬───────────────────────────────────┘
                                      │ HTTPS/REST
┌─────────────────────────────────────▼───────────────────────────────────┐
│                            API GATEWAY LAYER                            │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │  JWT Auth   │  │ Rate Limit  │  │   Tenant    │  │    Audit    │     │
│  │ Middleware  │  │ Middleware  │  │ Resolution  │  │   Logging   │     │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘     │
└─────────────────────────────────────┬───────────────────────────────────┘
┌─────────────────────────────────────▼───────────────────────────────────┐
│                           FASTAPI APPLICATION                           │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                           SERVICE LAYER                           │  │
│  │   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐             │  │
│  │   │ Ask Service │   │ Doc Service │   │ Session Svc │             │  │
│  │   │ (Q&A Orch.) │   │  (Upload)   │   │  (History)  │             │  │
│  │   └──────┬──────┘   └──────┬──────┘   └─────────────┘             │  │
│  │          ▼                 ▼                                      │  │
│  │  ┌─────────────────────────────────────────────────────────────┐ │  │
│  │  │                  CORE PIPELINE COMPONENTS                   │ │  │
│  │  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐       │ │  │
│  │  │  │Retrieval │ │Evidence  │ │Policy    │ │Verifier  │       │ │  │
│  │  │  │(Search)  │ │(Citation)│ │(Gates)   │ │(LLM QA)  │       │ │  │
│  │  │  └──────────┘ └──────────┘ └──────────┘ └──────────┘       │ │  │
│  │  └─────────────────────────────────────────────────────────────┘ │  │
│  └───────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────┬───────────────────────────────────┘
                  ┌───────────────────┼───────────────────┐
                  ▼                   ▼                   ▼
        ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
        │   POSTGRESQL    │ │  SEARCH INDEX   │ │  LLM PROVIDER   │
        │  (Data Store)   │ │ (Azure/pgvector)│ │(Multi-provider) │
        └─────────────────┘ └─────────────────┘ └─────────────────┘

Technology Stack

| Layer | Technology | Rationale |
|---|---|---|
| Frontend | Next.js 16 + TypeScript | SSR, React ecosystem, Vercel deployment |
| API | FastAPI + Python 3.12 | Async performance, type hints, OpenAPI |
| Database | PostgreSQL 15 | ACID compliance, JSON support, pgvector ready |
| Search | Azure AI Search / pgvector | Hybrid BM25+vector, configurable provider |
| Embeddings | Azure OpenAI / Local | text-embedding-3-large (3072D) or hash-based |
| LLM | Azure OpenAI / Anthropic / Gemini / Ollama | Multi-provider support via config |
| Document Parsing | Marker / LlamaParse / PyPDF | Configurable parser (NFR-036) |
| Observability | OpenTelemetry + Langfuse | Distributed tracing, LLM-specific metrics |
| Auth | JWT + OIDC (Microsoft/Google) | Refresh tokens, SSO, account lockout |

Core Components

1. Retrieval Pipeline (retrieval.py)

Hybrid Search Strategy:

Query → Embed → [BM25 Search] + [Vector Search] → Rerank → Top-K Chunks
| Stage | Purpose | Configuration |
|---|---|---|
| BM25 | Keyword matching, exact terms | Weight: 0.3 |
| Vector | Semantic similarity | Weight: 0.7, k=50 |
| Reranker | Cross-encoder reordering | Semantic configuration |
| Top-K | Final chunk selection | k=10 (configurable) |
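The weighted fusion described in the table can be sketched in a few lines. This is illustrative only: the 0.3/0.7 weights and k mirror the configuration above, but the function name and score dictionaries are hypothetical, not the actual retrieval.py API.

```python
def fuse_scores(bm25_scores: dict[str, float],
                vector_scores: dict[str, float],
                w_bm25: float = 0.3,
                w_vector: float = 0.7,
                k: int = 10) -> list[tuple[str, float]]:
    """Combine max-normalized BM25 and vector scores; return top-k chunk IDs."""
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        hi = max(scores.values())
        return {cid: s / hi for cid, s in scores.items()} if hi > 0 else scores

    b, v = normalize(bm25_scores), normalize(vector_scores)
    # A chunk missing from one ranker simply contributes 0 from that side.
    fused = {cid: w_bm25 * b.get(cid, 0.0) + w_vector * v.get(cid, 0.0)
             for cid in set(b) | set(v)}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

With these weights, a chunk ranked first by vector search dominates one ranked first only by BM25, matching the semantic-first bias of the configuration.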

Tenant Isolation:

# Every search includes mandatory filters
filter_expression = f"tenant_id eq '{tenant_id}' and matter_id eq '{matter_id}'"

2. Evidence Validation (evidence.py)

Post-LLM Citation Verification:

  1. Extract citation spans from LLM response
  2. Verify each cited chunk exists in retrieval results
  3. Validate page numbers match document metadata
  4. Reject response if any citation is invalid

Refusal Conditions:

  • No citations in response → Refuse
  • Citation references non-existent chunk → Refuse
  • Page number doesn’t exist in document → Refuse
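Steps 1 through 4 and the refusal conditions above can be sketched as a single check. The field names (chunk_id, page, pages) are assumptions for illustration, not the actual evidence.py schema:

```python
def validate_citations(citations: list[dict],
                       retrieved: dict[str, dict]) -> tuple[bool, str]:
    """Return (ok, reason); refuse on the first citation that fails a check."""
    if not citations:
        return False, "no citations in response"
    for c in citations:
        chunk = retrieved.get(c["chunk_id"])
        if chunk is None:
            # Citation references a chunk that was never retrieved
            return False, f"unknown chunk {c['chunk_id']}"
        if c["page"] not in chunk["pages"]:
            # Page number does not exist in the cited document
            return False, f"page {c['page']} not in document"
    return True, "ok"
```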

3. Policy Engine (policy.py)

Pre-LLM Gates:

  • Query length validation
  • Tenant/matter authorization
  • Rate limiting check

Post-LLM Gates:

  • Confidence threshold (< 0.70 → refuse)
  • Citation validation (via evidence.py)
  • Content policy compliance

4. LLM Verification (verification.py)

Secondary LLM Check:

# Verifies answer relevance to retrieved chunks
relevance_score = verify_relevance(
    question=query,
    chunk_text=retrieved_context,
    answer=llm_response,
)

Returns a confidence score (0.0-1.0) that the policy engine uses for gating.
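Because the verifier's raw output itself comes from an LLM, a defensive parse that clamps to [0.0, 1.0] and defaults to refusal is a reasonable sketch (this assumes the verifier prompt asks for a bare numeric score; it is not the actual verification.py code):

```python
import re

def parse_confidence(raw: str) -> float:
    """Extract the first number from verifier output; clamp to [0.0, 1.0].
    Unparseable output maps to 0.0 so the policy engine refuses by default."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        return 0.0
    return min(max(float(match.group()), 0.0), 1.0)
```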


Data Model

11 tables, each with an indexed tenant_id column. Every query enforces tenant/matter scope.

Key tables: documents, chunks, index_records (ingestion pipeline); qa_sessions, qa_messages (conversations); users, matter_assignments (RBAC); telemetry (per-request metrics); audit_events (immutable log).

See Architecture Diagrams for the ER diagram and data-model.md for complete SQL schemas.


Provider Abstraction

Status: ✅ Fully Implemented (NFR-032, NFR-034, NFR-035, NFR-036)

The architecture supports pluggable providers for deployment flexibility:

Implemented Abstractions

# config.py - Provider selection (change via env vars only)
LLM_PROVIDER = "azure_openai"   # azure_openai | anthropic | gemini | ollama
SEARCH_PROVIDER = "local"       # local (pgvector) | azure
EMBEDDINGS_MODE = "remote"      # remote (Azure) | local (hash-based)
PARSER_PROVIDER = "marker"      # marker | llamaparse | pypdf

All four abstractions use Python Protocol interfaces. See interfaces.md for full definitions, method signatures, and per-provider configuration.

No code changes needed — swap providers via environment variables only.
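A minimal sketch of what one such Protocol interface could look like (method names here are assumptions; see interfaces.md for the real definitions):

```python
from typing import Protocol

class LLMProvider(Protocol):
    """Structural interface: any class with this shape satisfies it,
    with no inheritance required."""
    def complete(self, prompt: str, *, model: str) -> str: ...

class StubOllamaProvider:
    """Stand-in provider; a real one would call the Ollama HTTP API."""
    def complete(self, prompt: str, *, model: str) -> str:
        return f"[{model}] stubbed completion"

def answer(llm: LLMProvider, prompt: str) -> str:
    # Callers depend only on the Protocol, never on a concrete provider.
    return llm.complete(prompt, model="llama3.2")
```

Because Protocol checks are structural, swapping providers is a matter of constructing a different class from config, which is what makes env-var-only switching possible.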


Deployment Architecture

Cloud Deployment (Current)

┌─────────────────────────────────────────────────────────────────┐
│                             AZURE                               │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │ Container Apps  │  │ Flexible Server │  │    AI Search    │  │
│  │    (FastAPI)    │  │  (PostgreSQL)   │  │ (Hybrid Index)  │  │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘  │
│           │                    │                    │           │
│           └────────────────────┼────────────────────┘           │
│                                │                                │
│  ┌─────────────────┐  ┌────────┴──────┐  ┌─────────────────┐    │
│  │  Azure OpenAI   │  │ Blob Storage  │  │    Key Vault    │    │
│  │(GPT-5-mini+Emb) │  │  (Documents)  │  │    (Secrets)    │    │
│  └─────────────────┘  └───────────────┘  └─────────────────┘    │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                             VERCEL                              │
│         Next.js Frontend (SSR, Static Assets, Edge)             │
└─────────────────────────────────────────────────────────────────┘

Deployment Tiers

| Tier | Components | Monthly Cost Estimate |
|---|---|---|
| Development | Container Apps (B1), PostgreSQL (Burstable B1), AI Search (Free) | ~$50 |
| Production | Container Apps (P1v3), PostgreSQL (GP D2s), AI Search (Standard S1) | ~$500-800 |
| Enterprise | Dedicated VNet, Premium PostgreSQL, Reserved capacity | ~$2,000+ |

On-Premises Deployment

Customer Data Center
├── Kubernetes Cluster
│   ├── API Pods (FastAPI)
│   ├── Worker Pods (Document Processing)
│   └── Ingress Controller
├── PostgreSQL (+ pgvector extension)
├── MinIO (S3-compatible storage)
└── Local LLM (Ollama) OR Anthropic/Gemini API via VPN

Configuration:

# On-prem deployment tier
LLM_PROVIDER=ollama       # Local Llama 3.2
SEARCH_PROVIDER=local     # PostgreSQL + pgvector
EMBEDDINGS_MODE=local     # Hash-based (or remote via VPN)
PARSER_PROVIDER=marker    # Marker (offline PDF parsing)

Security Architecture

Data Protection

| Layer | Mechanism |
|---|---|
| Transport | TLS 1.3 (enforced) |
| Storage | AES-256 encryption at rest |
| Secrets | Azure Key Vault / env injection |
| Logs | PII redaction before write |
| Tokens | SHA-256 hashes only (no plaintext) |
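The token rule in the table can be sketched as follows: only a SHA-256 digest is persisted, and lookup uses a constant-time comparison (function names here are illustrative, not the actual auth module):

```python
import hashlib
import hmac

def token_digest(refresh_token: str) -> str:
    """Store only this hex digest; the plaintext token never touches the DB."""
    return hashlib.sha256(refresh_token.encode("utf-8")).hexdigest()

def token_matches(presented: str, stored_digest: str) -> bool:
    # compare_digest avoids leaking match position via timing
    return hmac.compare_digest(token_digest(presented), stored_digest)
```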

Authentication Flow (FR-050, FR-051)

See Architecture Diagrams — Authentication Flow for the full sequence diagram.

Security Features (FR-050):

  • Password hashing: Argon2id (OWASP recommended)
  • Account lockout: 5 failed attempts → 30-minute lock
  • Refresh token rotation: New token on each refresh
  • Token revocation: All tokens revoked on password change
  • PKCE for SSO: Protects against authorization code interception
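The PKCE step can be sketched per RFC 7636's S256 method: generate a random code_verifier, send its SHA-256 challenge with the authorization request, and reveal the verifier only at token exchange (the helper name is an assumption, not the actual auth module):

```python
import base64
import hashlib
import secrets

def make_pkce_pair() -> tuple[str, str]:
    """Generate a PKCE code_verifier and its S256 code_challenge (RFC 7636)."""
    # 32 random bytes -> 43-char URL-safe verifier (padding stripped)
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return verifier, challenge
```

An intercepted authorization code is useless without the verifier, which never leaves the client until the token request.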

Tenant Isolation Enforcement

# FastAPI dependency injects tenant context on every request
def get_tenant_context(request: Request) -> TenantContext:
    # Extract from JWT claims (AUTH_MODE=jwt) or headers (dev mode)
    token = validate_jwt(request)
    tenant_id = token["tenant_id"]
    user_id = token["sub"]
    user_role = Role(token["role"])
    return TenantContext(
        tenant_id=tenant_id,
        user_id=user_id,
        user_role=user_role,
    )

# Used on every endpoint via Depends()
@router.get("/v1/matters")
async def list_matters(ctx: TenantContext = Depends(get_tenant_context)):
    # ctx.tenant_id is guaranteed present (enforced at extraction)
    matters = list_matters_for_tenant(
        tenant_id=ctx.tenant_id,
        user_id=ctx.user_id,
        user_role=ctx.user_role.value,
    )

Observability Stack

Metrics & Tracing

Application
├── OpenTelemetry SDK
│   ├── Traces  → Azure Monitor / Jaeger
│   ├── Metrics → Prometheus / Azure Monitor
│   └── Logs    → stdout → Azure Log Analytics
└── Langfuse Integration (NFR-045)
    ├── LLM Call Traces (@observe decorators)
    ├── Token Usage (gen_ai.usage.* metrics)
    ├── Latency Distributions (llm.latency_ms)
    └── Model Performance (by provider)

Key Metrics Tracked

| Metric | Purpose | OTEL Semantic Convention |
|---|---|---|
| llm.latency_ms | LLM response time | Custom |
| gen_ai.usage.prompt_tokens | Input token count | GenAI |
| gen_ai.usage.completion_tokens | Output token count | GenAI |
| gen_ai.request.model | Model identifier | GenAI |
| gen_ai.system | Provider name | GenAI |
| retrieval.latency_ms | Search latency | Custom |
| refusal_rate | Percentage of refused queries | Custom |
| cache_hit_rate | Embedding cache efficiency | Custom |

Database Telemetry Table:

-- All LLM calls logged to telemetry table (NFR-030)
SELECT
    tenant_id,
    model_id,
    SUM(tokens_in)  AS total_prompt_tokens,
    SUM(tokens_out) AS total_completion_tokens,
    SUM(cost_est)   AS total_cost_usd
FROM telemetry
WHERE timestamp_utc >= NOW() - INTERVAL '30 days'
GROUP BY tenant_id, model_id;
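The cost_est column can be populated at request time from the token counts already being tracked. A sketch with hypothetical per-1K-token rates (real rates vary by provider and model and are not stated in this document):

```python
# Hypothetical (input, output) USD rates per 1K tokens; illustrative only.
RATES = {"gpt-5-mini": (0.00025, 0.002)}

def estimate_cost(model_id: str, tokens_in: int, tokens_out: int) -> float:
    """Per-request cost estimate, mirroring the telemetry cost_est column."""
    rate_in, rate_out = RATES[model_id]
    return tokens_in / 1000 * rate_in + tokens_out / 1000 * rate_out
```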

Scalability Considerations

Current Capacity

| Resource | Limit | Scaling Path |
|---|---|---|
| Concurrent requests | ~10-15 req/s per instance | Horizontal pod scaling |
| Document processing | ~10 docs/min | Worker queue + async |
| Search index | 1M chunks | Index partitioning |
| Database | 100 GB | Vertical scaling, read replicas |

Scaling Strategy

  1. Stateless API: Horizontal scaling via container replicas
  2. Async Processing: Document ingestion queued (Azure Queue / Redis)
  3. Caching: Embedding cache reduces LLM calls by ~40%
  4. Index Partitioning: Per-tenant indexes for large deployments
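The embedding cache in step 3 can be sketched as a content-addressed store keyed by a SHA-256 of the input text (in-memory here for illustration; a production version would sit in Redis or Postgres, and the class name is hypothetical):

```python
import hashlib
from typing import Callable

class EmbeddingCache:
    """Content-addressed embedding cache; identical text never re-embeds."""
    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self._embed = embed_fn
        self._store: dict[str, list[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed(text)
        return self._store[key]
```

Keying on a digest rather than the raw text keeps cache keys fixed-length and avoids storing document content in the cache index.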

Development Practices

Quality Gates (CI/CD)

# All must pass before merge
ruff check apps/               # Linting
mypy apps/api/app --strict     # Type checking (NFR-040)
pytest tests/ -v               # Unit + integration
pytest evals/ -v               # Golden query evals (>95% pass required)

Test Coverage

| Category | Coverage Target | Current |
|---|---|---|
| Unit tests (624 tests) | >80% | ~85% |
| Integration tests | Critical paths | |
| Golden query evals | >95% pass rate | |
| LLM behavior tests | Adversarial prompts | |

Test-Driven Development (TDD)

Enforced via CLAUDE.md:

RED      → Write a test that fails (proves the test works)
GREEN    → Write minimum code to pass
REFACTOR → Clean up, maintain passing tests
COMMIT   → Only after green

Technical Risks & Mitigations

| Risk | Mitigation |
|---|---|
| LLM hallucination | Post-LLM citation validation, confidence gating |
| Search relevance drift | Golden query evals in CI, reranker tuning |
| Vendor lock-in | Provider abstraction interfaces (implemented) |
| Cost overrun | Token tracking, caching, query limits, telemetry table |
| Data breach | Tenant isolation at DB layer, encryption, audit log |
| Account compromise | Account lockout, refresh token rotation, MFA (planned) |

Roadmap (Technical)

| Phase | Focus | Status |
|---|---|---|
| Phase 2 | Production hardening | ✅ Complete (SSO, export, observability) |
| Phase 3 | Multi-tenancy | ✅ Complete (RBAC, matter-level permissions) |
| Phase 4 | Deployment flexibility | ✅ Complete (Provider abstraction) |
| Phase 5 | Advanced features | 🚧 In Progress (Fine-tuned models, custom retrievers) |

Implementation Status

| Feature | Status | FRs |
|---|---|---|
| Tenant isolation | ✅ | FR-001 |
| Matter isolation | ✅ | FR-002 |
| RBAC | ✅ | FR-003 |
| Matter-level permissions | ✅ | FR-004 |
| JWT authentication | ✅ | FR-050 |
| OIDC SSO | ✅ | FR-051 |
| Audit logging | ✅ | FR-040 |
| Data retention policies | ✅ | FR-042 |
| Provider abstraction (LLM) | ✅ | NFR-032 |
| Provider abstraction (Search) | ✅ | NFR-034 |
| Provider abstraction (Embedding) | ✅ | NFR-035 |
| Provider abstraction (Parser) | ✅ | NFR-036 |
| Type safety (mypy --strict) | ✅ | NFR-040 |
| LLM telemetry | ✅ | NFR-030, NFR-045 |

For detailed schemas, see data-model.md.
For provider interfaces, see interfaces.md.
For feature descriptions, see FEATURES.md.