Architecture Review: Is This Solving the Right Problem?
Date: 2026-03-31
Question: Does this architecture actually solve the problem of retrieving documents for lawyers by matter?
Status note: Several foundation items identified here shipped later on
2026-03-31: Azure Search timeout, DB QueuePool, safer auth defaults, matter
creation/access hardening, per-user/per-matter session isolation, zero-doc
matter visibility, the matter dashboard follow-up fixes, and the httpx
migration for ask-path Azure calls. The remaining sections below should be
read as the open roadmap after those changes.
The Short Answer
The retrieval works. The architecture around it doesn’t match how lawyers actually work.
The system correctly answers “given a question and a matter, find the relevant chunk and cite it.” That’s the core RAG loop, and it’s implemented solidly: hybrid search, verification, citation validation, evidence grading. The technical pipeline is sound.
What’s missing is everything around that loop — the parts that make it a product lawyers would actually use instead of a technical demo.
What the Architecture Gets Right
1. Evidence-First Design
The fundamental design decision — every answer requires a retrievable, verifiable citation or the system refuses — is correct for legal use. Lawyers cannot use a system that hallucinates. The confidence gating (policy.py), evidence grading (evidence.py), and citation validation (evidence.py:139-188) create a trust foundation that most legal AI products lack.
2. Matter-Level Isolation
Legal work is organized by matter (case). The decision to make matter_id a first-class filter on every query, every table, and every search request is architecturally correct. This maps to how law firms organize: documents belong to cases, not to users.
3. Tenant Isolation
The tenant_id on every row prevents cross-firm data leakage. This is non-negotiable for legal SaaS and it’s implemented at the database level, not just the API level. Azure Search filters include tenant_id in every query.
4. Audit Trail
Immutable audit events with user_id, tenant_id, matter_id, and timestamp satisfy basic legal compliance requirements. The retention policy system (configurable per tenant, per resource type) is a thoughtful addition.
What the Architecture Gets Wrong
1. The Verification Step Is Architecturally Redundant
The pipeline does three forms of relevance assessment:
Step 1: Azure Semantic Reranker (cross-encoder)
"Given this query and this document chunk, how relevant is the chunk?"
Output: Score 0-4
Step 2: LLM Verification (GPT-5-mini)
"Given this question and this chunk, does the chunk contain the answer?"
Output: YES/NO with span extraction
Step 3: Evidence Grading (rule-based)
"Given reranker score, verification status, overlap, and RRF score, grade A/B/C"
Output: Grade

Steps 1 and 2 are asking the same question in different ways. The Azure semantic reranker IS a cross-encoder transformer model trained on relevance. The LLM verification is a general-purpose language model re-evaluating relevance. You’re paying 2-6 seconds of latency and $0.001-0.003 per query for a second opinion that agrees with the first 90%+ of the time.
Recommendation: Use the LLM for what it’s uniquely good at — span extraction and answer synthesis — not binary relevance classification. The reranker already handles relevance. Restructure:
Azure Search (BM25 + vector + reranker) --> relevance ranking
--> Confidence filter (reranker score threshold)
--> LLM Answer Synthesis (one call, not verification)
"Given this question and these top 3 chunks, write an answer with citations"
--> Citation validation (verify spans exist in source)

This eliminates the verification loop entirely, replaces it with a single LLM call that produces the final answer, and uses citation validation as the safety net (which already exists in evidence.py).
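A minimal sketch of the restructured flow. The threshold value, the `Chunk` shape, and the helper names are assumptions for illustration — the real relevance scoring and Azure calls live in retrieval.py, and the span logic lives in evidence.py:

```python
from dataclasses import dataclass
from typing import List

RERANKER_THRESHOLD = 2.0  # Azure reranker scores 0-4; the cutoff here is an assumption

@dataclass
class Chunk:
    doc_id: str
    text: str
    reranker_score: float

def filter_by_confidence(chunks: List[Chunk],
                         threshold: float = RERANKER_THRESHOLD) -> List[Chunk]:
    """Confidence filter: trust the reranker's relevance judgment -- no LLM call."""
    kept = [c for c in chunks if c.reranker_score >= threshold]
    kept.sort(key=lambda c: c.reranker_score, reverse=True)
    return kept[:3]  # top 3 go to the single synthesis call

def validate_citations(cited_spans: List[str], chunks: List[Chunk]) -> bool:
    """Safety net: every span cited in the answer must exist verbatim in a source chunk."""
    return all(any(span in c.text for c in chunks) for span in cited_spans)

# The one LLM call (answer synthesis) sits between these two steps:
#   answer = llm.complete("Given this question and these chunks, answer with citations: ...")
# kept abstract here because provider wiring is out of scope for the sketch.
```

The point of the shape: relevance is decided once (reranker plus threshold), and the LLM output is checked mechanically rather than re-asked.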
2. No Conversational Memory Within a Matter
The system answers one question at a time. Each /v1/ask request starts fresh. But legal document review is inherently conversational:
Attorney: "What is the indemnification cap in the merger agreement?"
System: "Section 8.2 states the cap is $15M [1]"
Attorney: "Does that include environmental liabilities?"
System: ??? (no context from previous question)

The qa_sessions and qa_messages tables exist for storage, but the retrieval pipeline doesn’t use conversation history. The LLM prompt doesn’t include prior Q&A. This means every question must be fully self-contained, which is not how lawyers think.
What’s needed:
- Pass last 3-5 QA messages as context to the LLM
- Use conversation history to expand ambiguous queries (“it” -> “the indemnification cap”)
- This is a prompt engineering change, not an architecture change — the data model already supports it
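As a prompt-engineering change, this could be as small as the sketch below — a hypothetical helper that folds the last few qa_messages rows into the prompt (the function name and tuple shape are assumptions, not the existing API):

```python
from typing import List, Tuple

def build_prompt_with_history(question: str,
                              history: List[Tuple[str, str]],
                              max_turns: int = 3) -> str:
    """Prepend the last few (question, answer) turns so follow-ups like
    'does that include X?' resolve against prior answers. The data already
    exists in qa_messages; this only changes what the LLM sees."""
    lines = []
    for q, a in history[-max_turns:]:
        lines.append(f"Previous question: {q}")
        lines.append(f"Previous answer: {a}")
    lines.append(f"Current question: {question}")
    return "\n".join(lines)
```

Query expansion ("it" -> "the indemnification cap") could then be a second LLM pass over the same history, or folded into the same prompt.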
3. No Document-Level Understanding
The system chunks documents and retrieves chunks. It never builds a holistic understanding of a document. This matters because lawyers ask questions like:
- “Summarize the key terms of this agreement” (requires full document, not a chunk)
- “Compare the indemnification provisions across these three contracts” (requires multiple documents)
- “What’s unusual about this lease compared to standard commercial leases?” (requires domain knowledge + full document)
These are the high-value questions. The current chunk-level retrieval can’t answer them.
What’s needed:
- Document-level embeddings in addition to chunk-level
- A “summarize” endpoint that operates on full documents, not just chunks
- Cross-document comparison capabilities
- This is a significant feature addition, not a fix
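One low-cost starting point — an assumption, not a full design — is deriving a document-level vector by mean-pooling the chunk embeddings that already exist, so document-level retrieval can be prototyped before committing to a long-context embedding model:

```python
from typing import List

def document_embedding(chunk_embeddings: List[List[float]]) -> List[float]:
    """Mean-pool existing chunk vectors into one document-level vector.
    A cheap approximation for prototyping document retrieval; a dedicated
    long-context embedding model would likely do better for 'summarize
    this contract' and cross-document comparison queries."""
    if not chunk_embeddings:
        raise ValueError("document has no chunk embeddings")
    dim = len(chunk_embeddings[0])
    n = len(chunk_embeddings)
    return [sum(vec[i] for vec in chunk_embeddings) / n for i in range(dim)]
```

Whether pooled vectors are good enough for the comparison use cases is an open question; they are mainly a way to de-risk the Phase 9 work.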
4. No Privileged Document Handling
Legal documents have privilege designations (attorney-client privilege, work product doctrine). The data model has metadata_json on documents but no first-class privilege flag. A paralegal searching across a matter might surface privileged documents that shouldn’t appear in certain exports or client-facing reports.
The metadata_json field could contain privilege flags, but there’s no enforcement in the retrieval pipeline. retrieval.py doesn’t filter by privilege status. evidence.py doesn’t flag privileged content.
What’s needed:
- privilege_status column on documents (or in metadata with index)
- Retrieval filter: exclude privileged documents from non-privileged searches
- Export filter: strip privileged content from client-facing exports
- Audit: log when privileged documents are accessed
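The retrieval filter piece is small once the column exists. A sketch of what the filter construction could look like, assuming an OData-style filter string like the tenant/matter filters retrieval.py already sends to Azure Search (the clause values are assumptions):

```python
def build_search_filter(tenant_id: str, matter_id: str,
                        include_privileged: bool) -> str:
    """Compose the Azure Search filter. tenant_id and matter_id clauses
    mirror the existing isolation filters; the privilege clause is the
    proposed addition, defaulting to exclusion."""
    clauses = [
        f"tenant_id eq '{tenant_id}'",
        f"matter_id eq '{matter_id}'",
    ]
    if not include_privileged:
        # Assumed enum values: 'none', 'attorney_client', 'work_product'
        clauses.append("privilege_status eq 'none'")
    return " and ".join(clauses)
```

Export filtering and audit logging would key off the same column, so the enforcement lives in one place rather than in metadata_json conventions.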
5. No Document Versioning
Legal documents go through drafts. A merger agreement might have v1 (initial draft), v2 (redline), v3 (execution copy). The current system deduplicates by SHA256 (doc_sha256), which means each version is a separate document. But there’s no way to:
- Track that v1, v2, v3 are versions of the same document
- Ask “what changed between v2 and v3?”
- Pin a question to a specific version vs. “latest”
The docs_snapshot_id concept is close but serves a different purpose (deduplication, not versioning).
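The missing piece is a stable identity that survives across hashes. A sketch of the shape, assuming a new family_id field (the names are illustrative, not existing schema):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DocumentVersion:
    doc_sha256: str   # existing dedup key, unchanged -- each draft keeps its own hash
    family_id: str    # proposed: stable id shared by all versions of one document
    version: int
    label: str        # e.g. "initial draft", "redline", "execution copy"

def latest(versions: List[DocumentVersion], family_id: str) -> DocumentVersion:
    """'Pin to latest' resolution: highest version number within a family.
    'What changed between v2 and v3?' becomes a diff of two rows in the family."""
    return max((v for v in versions if v.family_id == family_id),
               key=lambda v: v.version)
```

This keeps SHA256 dedup intact (each draft is still a distinct document) while letting queries address "the merger agreement" rather than a specific hash.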
6. The “Which Documents Can I See” Problem
You identified this as confusing. Here’s why:
The current access model is:
User signs in --> JWT has tenant_id + user_id + role
User picks a matter --> X-Matter-Id header
System checks --> Does user have MatterAssignment for this matter?
If yes --> User sees ALL documents in that matter
If admin --> User sees ALL documents in ALL matters

The confusion is that there’s no middle ground:
- You either see everything in a matter or nothing
- You can’t share a specific document with someone without giving them the entire matter
- There’s no “view-only on this document but full access on that one”
For small firms with 2-3 attorneys per matter, this is fine. For large firms with 50+ attorneys and complex case teams, this is too coarse.
What’s missing:
- Document-level permissions (optional, for firms that need it)
- Sharing a document or set of documents without sharing the entire matter
- “Review sets” — curated collections of documents within a matter for specific reviewers
This is a Phase 3+ feature, not a bug. But it’s worth knowing the access model is coarse-grained before selling to large firms.
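If document-level permissions do land in Phase 3+, one layering that preserves today's model is: matter assignment grants everything (unchanged), and per-document grants sit alongside it as an additive path. A sketch under those assumptions (all names hypothetical):

```python
from typing import Dict, Set, Tuple

def can_view_document(user_id: str,
                      doc_id: str,
                      doc_matter: Dict[str, str],               # doc_id -> matter_id
                      matter_assignments: Set[Tuple[str, str]],  # (user_id, matter_id)
                      doc_grants: Set[Tuple[str, str]]           # (user_id, doc_id)
                      ) -> bool:
    """Today's coarse check first (full matter access via MatterAssignment),
    then the proposed finer grain: a per-document grant, e.g. via a review set,
    that does NOT imply access to the rest of the matter."""
    matter_id = doc_matter[doc_id]
    if (user_id, matter_id) in matter_assignments:
        return True
    return (user_id, doc_id) in doc_grants
```

Because the grant path is additive, existing small-firm deployments see no behavior change; only firms that create review sets pay the complexity.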
Architecture Comparison: What Best-in-Class Legal AI Does (2026)
| Capability | Evidence-Bound | Best-in-Class (Relativity, Casetext, Harvey) |
|---|---|---|
| Chunk-level retrieval | Yes | Yes |
| Document-level understanding | No | Yes (full-doc embeddings, summaries) |
| Cross-document analysis | No | Yes (comparative analysis) |
| Conversation memory | Storage only | Active context (multi-turn) |
| Privilege handling | No | Yes (privilege log, auto-detection) |
| Document versioning | No | Yes (version chains, redline comparison) |
| Batch review | No | Yes (review sets, bulk Q&A) |
| Citation verification | Yes (strong) | Varies (some trust LLM output) |
| Multi-tenant isolation | Yes | Yes |
| On-prem deployment | Yes (Ollama tier) | Rare (Harvey does, most don’t) |
Evidence-Bound’s competitive advantages:
- On-prem option (Ollama + pgvector) — most competitors are cloud-only
- Evidence-first refusal policy — stronger citation validation than most
- Provider abstraction — swap LLM/search/parser without code changes
- Open-source friendly stack — no vendor lock-in to proprietary vector DBs
Evidence-Bound’s gaps:
- No document-level intelligence (the “summarize this contract” use case)
- No conversational context (multi-turn Q&A)
- No privilege handling
- Single-question, single-answer UX (not batch review)
Architectural Debt Inventory
| Debt Item | Location | Impact | Effort to Fix |
|---|---|---|---|
| Sync pipeline | ask_service.py | Blocks scaling | Large (async migration) |
| 700-line execute_ask() | ask_service.py | Hard to modify/test | Medium (decompose) |
| In-memory caches | retrieval.py, cache.py | Don’t scale horizontally | Medium (Redis) |
| No request deadline | ask_service.py | No latency guarantee | Small (add timeout) |
| Global mutable state | _BM25_CACHE, _PROMPT_TEXT | Thread safety risk | Small (use lru_cache) |
| config.py flat list | config.py | Error-prone, no validation | Medium (pydantic Settings) |
| Unbounded BM25 cache | retrieval.py:509-515 | Memory leak | Small (add TTL/max size) |
| Azure Search API 2023-11-01 | config.py:93 | Missing vector compression | Small (bump version) |
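The unbounded-cache item is small enough to sketch here. A bounded LRU-plus-TTL cache of roughly the shape _BM25_CACHE would need (the class itself is an assumption, not existing code; thread safety would still need a lock on top, which is the separate global-mutable-state item):

```python
import time
from collections import OrderedDict
from typing import Any, Optional

class BoundedTTLCache:
    """Evicts by max size (least-recently-used) and by age (TTL),
    so the cache can no longer grow without bound."""

    def __init__(self, max_size: int = 128, ttl_seconds: float = 300.0):
        self._data: OrderedDict = OrderedDict()  # key -> (expires_at, value)
        self.max_size = max_size
        self.ttl = ttl_seconds

    def get(self, key: str) -> Optional[Any]:
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if time.monotonic() > expires_at:
            del self._data[key]   # lazily expire stale entries
            return None
        self._data.move_to_end(key)  # mark as recently used
        return value

    def put(self, key: str, value: Any) -> None:
        self._data[key] = (time.monotonic() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used
```

The same shape covers the "No request deadline" item's cousin problem: stale entries cost a dict lookup, not memory, until Redis replaces the in-process caches in Phase 10.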
Recommended Architecture Evolution
Phase 8a: Remaining Foundation Work (post 03-31, 1-2 weeks)
- Finish the remaining latency fix from LATENCY_FIXES.md (Redis follow-up)
- Add request deadline/timeout to execute_ask()
- Add load tests for NFR-011 and NFR-012
Phase 8b: Restructure the Pipeline (2-3 weeks)
- Decompose execute_ask() into: retrieve(), verify(), synthesize(), cite()
- Replace verification loop with single LLM answer synthesis call
- Add conversation context to LLM prompt (pass last 3 messages)
Phase 9: Document Intelligence (4-6 weeks)
- Document-level embeddings and summaries
- “Summarize this document” endpoint
- Cross-document comparison endpoint
- Privilege flag on documents with retrieval filtering
Phase 10: Production Hardening (2-4 weeks)
- Async pipeline migration
- Redis cache layer
- Circuit breakers on external services
- Structured logging
- Horizontal scaling validation
The Fundamental Question
Is this actually solving the problem of retrieving documents for lawyers by matter?
Yes, for the narrow use case of “ask a question about documents in this case.” The retrieval pipeline works. The citation validation is strong. The matter isolation is correct.
No, for the broader use case of “help lawyers understand their cases.” Lawyers don’t just ask questions — they review, compare, summarize, and analyze. The system does one of those things well. The other three are missing entirely.
The architecture is a solid foundation for a question-answering product. Making it a document intelligence product requires the additions outlined in Phase 9 above. The priority question is: do you need all of that for a beta pilot, or can you sell the Q&A capability on its own?
For a beta pilot with 5-10 attorneys: Fix the latency, add conversation context, and ship Q&A only. That’s a viable product for firms who already know which questions they want to ask.
For a paid product competing with Harvey/Casetext: You need document-level intelligence, batch review, and privilege handling. That’s 2-3 more phases of work.