RAG Harness: Extracting a Trusted Foundation from Evidence-Bound
Date: 2026-03-31 Goal: Extract a repeatable, tested, production-grade RAG harness from the Evidence-Bound codebase that can serve as a starting point for any domain.
1. Scorecard: Evidence-Bound vs Enterprise RAG Blueprint
Scoring our current system against each principle from the enterprise RAG blueprint.
| # | Blueprint Principle | Evidence-Bound Today | Score | Gap |
|---|---|---|---|---|
| 1 | Document Quality Router — route by doc quality, not assumption | Parser abstraction (3 providers), OCR detection, min-text-chars validation. But: same pipeline for all quality levels. No routing by scan quality or structure. | 4/10 | No quality classifier. Scanned 1995 invoices and clean digital PDFs hit the same chunking pipeline. |
| 2 | Metadata > Vectors — structured filters beat fancier embeddings | Tenant/matter filtering in Azure Search OData. Doc-level metadata extraction (title, author, pages). But: no metadata used in retrieval scoring. Filters are identity-based, not content-based. | 5/10 | No doc_type, date, or tag filtering in search. Metadata extracted but not indexed as filterable fields. |
| 3 | Tables as Structured Objects — dual embedding for tables | Zero table handling. Chunking treats tables as text. Financial tables get shredded into token soup. | 1/10 | No table detection, no structured extraction, no dual embedding. This is the biggest gap for legal contracts with indemnification schedules. |
| 4 | Hybrid Retrieval — BM25 + Dense + Graph | BM25 + vector with RRF fusion. Azure semantic reranker as cross-encoder. Local reranker fallback. | 7/10 | No GraphRAG for entity relationships or cross-document multi-hop. Legal: “find all clauses referencing Party B across 50 documents” fails. |
| 5 | Hierarchy / TreeRAG — respect document structure | Flat chunking with page offsets and char positions. No document hierarchy (section, subsection, clause). | 2/10 | Legal documents are deeply hierarchical (Article > Section > Clause > Sub-clause). Flat chunking loses this entirely. |
| 6 | Agentic Loop — hypothesize, retrieve, verify, refine | Retrieve + verify (parallel LLM verification). Auto-verify fast path. But: single-shot. No refinement if first retrieval misses. No query rewriting. | 5/10 | No re-retrieval, no query decomposition, no “the first answer wasn’t good enough, let me try differently.” |
| 7 | Retrieval as Security Boundary — chunks are untrusted | Injection gate pre-LLM. Chunk marked <chunk> (untrusted) in verifier prompt. Span blocklist. Homoglyph normalization. | 8/10 | Strongest area. Missing: content hash verification (confirm chunk wasn’t tampered between index and retrieval). |
| 8 | Observability + Citations from Day 1 | Langfuse full pipeline tracing. OTEL custom metrics. Per-request cost. Citation validation with 90% similarity threshold. Negation flip detection. | 9/10 | This is where Evidence-Bound shines. Built in from the start, not retrofitted. |
Overall: 5.1/10 — Strong on observability, citations, and security. Weak on document intelligence (tables, hierarchy, quality routing).
2. What’s Already Reusable (the 70% that’s generic)
Evidence-Bound has five clean abstraction layers that are immediately extractable:
Provider Interfaces (all have ABC base + factory)
| Interface | Implementations | Status |
|---|---|---|
SearchClient | Azure AI Search, Local (BM25+vector) | Production-tested |
LLMClient | Azure OpenAI, Anthropic, Gemini, Ollama | 4 providers shipped |
EmbeddingClient | Azure OpenAI, Local (hash) | Production-tested |
ParserClient | PyPDF, Marker (OCR), LlamaParse (cloud) | 3 providers shipped |
RerankerClient | Local (term+phrase analysis) | Extensible |
Generic Infrastructure
| Component | File | Reusable? |
|---|---|---|
| BM25 scoring engine | retrieval.py | 100% — textbook BM25 with configurable k1, b |
| RRF fusion | retrieval.py | 100% — standard reciprocal rank fusion |
| Embedding cache (LRU) | cache.py | 100% — thread-safe, stats tracking |
| Query result cache (TTL) | cache.py | 100% — tenant-scoped, configurable |
| Injection detection | policy.py | 100% — 22 regex patterns + homoglyph normalization |
| Citation validation | evidence.py | 90% — similarity check + negation detection |
| httpx connection pool | http_client.py | 95% — singleton, HTTP/2, configurable limits |
| Cost tracking | cost.py | 100% — token-based, per-component breakdown |
| OTEL + Langfuse setup | otel.py | 95% — GenAI semantic conventions, PII-safe |
| Parallel verification | ask_service.py | 80% — ThreadPoolExecutor pattern |
| Chunking with offsets | ingestion.py | 70% — page/char offset preservation |
3. The Harness Architecture
Core Idea
The harness is a configured pipeline, not a framework. You compose it from interchangeable providers and plug in domain-specific logic at defined extension points.
┌──────────────────────────────────────────────────────────────────────┐
│ RAG HARNESS CORE │
│ │
│ ┌──────────┐ ┌──────────┐ ┌───────────┐ ┌────────────────┐ │
│ │ Ingest │──>│ Store │──>│ Retrieve │──>│ Generate │ │
│ │ │ │ │ │ │ │ │ │
│ │ Parse │ │ Chunks │ │ Hybrid │ │ LLM + Verify │ │
│ │ Chunk │ │ Vectors │ │ Search │ │ Grade + Cite │ │
│ │ Embed │ │ Metadata │ │ Rerank │ │ Validate │ │
│ └──────────┘ └──────────┘ └───────────┘ └────────────────┘ │
│ │ │ │ │ │
│ ┌────┴────┐ ┌─────┴────┐ ┌─────┴─────┐ ┌──────┴──────┐ │
│ │ Parser │ │ Search │ │ Embedding │ │ LLM Client │ │
│ │ Client │ │ Client │ │ Client │ │ (4 provdrs) │ │
│ │ (3 imp) │ │ (2 imp) │ │ (2 imp) │ │ + Reranker │ │
│ └─────────┘ └──────────┘ └───────────┘ └─────────────┘ │
│ │
│ CROSS-CUTTING: │
│ ┌────────────┐ ┌────────────┐ ┌──────────┐ ┌───────────────────┐ │
│ │ Security │ │ Observ- │ │ Caching │ │ Cost Tracking │ │
│ │ (Injection │ │ ability │ │ (Embed + │ │ (per-component, │ │
│ │ + Blocklst│ │ (OTEL + │ │ Query │ │ per-request) │ │
│ │ + Homglyph│ │ Langfuse) │ │ LRU+TTL)│ │ │ │
│ └────────────┘ └────────────┘ └──────────┘ └───────────────────┘ │
│ │
│ EXTENSION POINTS (domain-specific): │
│ ┌────────────┐ ┌────────────┐ ┌──────────┐ ┌───────────────────┐ │
│ │ Data │ │ Evidence │ │ Verifier │ │ Response │ │
│ │ Isolation │ │ Grader │ │ Prompt │ │ Schema │ │
│ │ (tenant/ │ │ (A/B/C or │ │ (legal, │ │ (citations, │ │
│ │ workspace)│ │ custom) │ │ medical)│ │ refusals, etc) │ │
│ └────────────┘ └────────────┘ └──────────┘ └───────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘Extension Points (where domain logic plugs in)
| Extension Point | Legal (Evidence-Bound) | Medical | Financial |
|---|---|---|---|
| Data Isolation | tenant_id + matter_id | patient_id + study_id | client_id + portfolio_id |
| Evidence Grader | A/B/C (verification + reranker + overlap) | Clinical confidence levels | Regulatory confidence |
| Verifier Prompt | ”Does chunk contain exact legal evidence?" | "Does chunk contain clinical finding?" | "Does chunk cite regulatory source?” |
| Response Schema | Citation (doc, page, span) + refusal codes | Finding (study, section, conclusion) | Reference (regulation, clause, date) |
| RBAC Roles | Admin/Attorney/Paralegal/Viewer | Admin/Doctor/Nurse/Patient | Admin/Analyst/Compliance/Auditor |
| Quality Router | Contract vs pleading vs memo | Lab report vs clinical note vs imaging | 10-K vs invoice vs contract |
4. What’s Missing for a World-Class Harness
Gap 1: Document Quality Router (Blueprint Principle 1)
What it does: Classifies incoming documents by type and quality before parsing, then routes to the appropriate pipeline.
Why it matters: A scanned 1995 fax should not go through the same 512-token chunking as a clean digital PDF. The fax needs aggressive OCR, table detection, and smaller chunks. The PDF needs metadata extraction and structural parsing.
Harness design:
class DocumentQualityRouter:
"""Route documents to appropriate processing pipeline."""
def classify(self, file_path: str, metadata: dict) -> DocumentProfile:
"""Return quality profile: digital/scanned/mixed, structure level, table density."""
def route(self, profile: DocumentProfile) -> ProcessingPipeline:
"""Select parser, chunking strategy, and embedding approach based on profile."""What Evidence-Bound has today: Parser selection via PARSER_PROVIDER env var (global, not per-document). OCR fallback in Marker. Min-text-chars detection for OCR warning.
What to build: Per-document classification based on: text extractability (digital vs scanned), structure detection (headers, tables, lists), page count, file size. Route to different chunk_size + parser + OCR settings.
Gap 2: Table-Aware Processing (Blueprint Principle 3)
What it does: Detects tables in documents, extracts them as structured objects (CSV/markdown), and creates dual embeddings: one for the structured data, one for a natural language summary.
Why it matters: Legal contracts have indemnification schedules, fee tables, payment terms. Financial docs have balance sheets, P&L statements. Standard chunking destroys these.
Harness design:
class TableProcessor:
"""Extract and dual-embed tables from documents."""
def detect_tables(self, page_content: str) -> list[TableRegion]:
"""Identify table boundaries in page text."""
def extract_structured(self, table: TableRegion) -> StructuredTable:
"""Convert to CSV/markdown preserving headers and values."""
def dual_embed(self, table: StructuredTable) -> tuple[list[float], list[float]]:
"""Return (structural_embedding, summary_embedding)."""What Evidence-Bound has today: Nothing. Tables are tokenized as flat text.
What to build: Marker already detects tables during OCR. Expose that detection, extract as markdown, generate summary via LLM, embed both representations. Store with chunk_type: "table" metadata for retrieval filtering.
Gap 3: Metadata-Enriched Retrieval (Blueprint Principle 2)
What it does: Uses structured metadata (document type, date, author, section headers) as first-class retrieval filters, not just vector similarity.
Why it matters: “What were the Q4 2024 revenue figures?” should filter by doc_type=10K AND date=2024Q4 before doing vector search, not rely on embeddings to figure out the date.
Harness design:
class MetadataFilter:
"""Build search filters from query analysis and document metadata."""
def extract_query_filters(self, question: str) -> dict[str, Any]:
"""Parse date ranges, doc types, entity names from question."""
def apply_to_search(self, base_query: SearchQuery, filters: dict) -> SearchQuery:
"""Add metadata filters to search request."""What Evidence-Bound has today: Metadata extraction (title, author, page_count) at ingestion. Tenant/matter filtering in Azure Search. But: metadata not indexed as filterable fields, not used in retrieval scoring.
What to build: Index doc_type, doc_date, author, custom tags as filterable + facetable fields in Azure Search. Add query analysis to extract metadata hints. Boost results matching metadata filters.
Gap 4: Hierarchical Document Structure (Blueprint Principle 5)
What it does: Parses document structure (sections, subsections, clauses) and builds a tree. Broad questions retrieve from high-level summaries; precise questions drill down to specific clauses.
Why it matters: “What does Article 7 say about termination?” should go directly to Article 7, not scan all 500 chunks looking for “termination.”
Harness design:
class DocumentTree:
"""Hierarchical document representation."""
def build_tree(self, parsed_doc: ParseResult) -> TreeNode:
"""Build section/subsection tree from parsed document."""
def multi_level_embed(self, tree: TreeNode) -> list[TreeChunk]:
"""Embed at multiple granularity levels: section summary + leaf chunks."""
def route_query(self, question: str, tree: TreeNode) -> list[TreeChunk]:
"""Broad question -> high-level nodes. Precise question -> leaf nodes."""What Evidence-Bound has today: Flat chunks with page numbers. No section detection.
What to build: Section header detection during parsing (regex + LLM). Parent-child chunk relationships. Multi-level embedding (section summary + individual paragraphs). Query routing based on specificity.
Gap 5: Agentic Retrieval Loop (Blueprint Principle 6)
What it does: If first retrieval doesn’t find a confident answer, the system reformulates the query and tries again. Hypothesize -> Retrieve -> Verify -> Refine.
Why it matters: “What is the total exposure across all agreements?” requires: (1) find all agreements, (2) find exposure clauses in each, (3) aggregate. Single-shot retrieval can’t do this.
Harness design:
class AgenticRetriever:
"""Multi-turn retrieval with query refinement."""
def retrieve_with_refinement(
self,
question: str,
max_iterations: int = 3,
) -> AgenticResult:
"""
Loop:
1. Analyze question -> decompose if complex
2. Retrieve candidates
3. Verify relevance
4. If confidence < threshold: reformulate query, try again
5. If max iterations: return best result or refuse
"""What Evidence-Bound has today: Single-shot retrieve + verify. Follow-up question detection for conversational context. Auto-verify fast path. But: no query decomposition, no re-retrieval, no refinement.
What to build: Query complexity classifier (simple vs multi-hop vs aggregation). Query decomposition into sub-queries. Iterative retrieval with semantic caching to avoid redundant searches. Confidence-gated loop exit.
5. Harness Package Structure
rag-harness/
├── pyproject.toml # Package config
├── README.md
│
├── rag_harness/
│ ├── __init__.py
│ ├── pipeline.py # RAGPipeline orchestrator
│ ├── config.py # Pydantic Settings (not flat env vars)
│ │
│ ├── ingest/
│ │ ├── __init__.py
│ │ ├── chunker.py # Chunking with offsets (from ingestion.py)
│ │ ├── quality_router.py # NEW: Document quality classification
│ │ └── table_processor.py # NEW: Table detection + dual embedding
│ │
│ ├── providers/
│ │ ├── __init__.py
│ │ ├── search/
│ │ │ ├── base.py # SearchClient ABC (from search/base.py)
│ │ │ ├── azure.py # Azure AI Search (from search/azure.py)
│ │ │ └── local.py # BM25 + Vector + RRF (from search/local.py)
│ │ ├── llm/
│ │ │ ├── base.py # LLMClient ABC (from llm/base.py)
│ │ │ ├── azure_openai.py
│ │ │ ├── anthropic.py
│ │ │ ├── gemini.py
│ │ │ └── ollama.py
│ │ ├── embedding/
│ │ │ ├── base.py # EmbeddingClient ABC
│ │ │ ├── azure_openai.py
│ │ │ └── local.py
│ │ ├── parser/
│ │ │ ├── base.py # ParserClient ABC (from parsers/base.py)
│ │ │ ├── pypdf.py
│ │ │ ├── marker.py
│ │ │ └── llamaparse.py
│ │ └── reranker/
│ │ ├── base.py # RerankerClient ABC
│ │ └── local.py
│ │
│ ├── retrieval/
│ │ ├── __init__.py
│ │ ├── bm25.py # BM25 engine (from retrieval.py)
│ │ ├── fusion.py # RRF fusion (from retrieval.py)
│ │ ├── hybrid.py # Hybrid search orchestration
│ │ └── agentic.py # NEW: Multi-turn retrieval loop
│ │
│ ├── verification/
│ │ ├── __init__.py
│ │ ├── verifier.py # LLM verification (from verification.py)
│ │ ├── parallel.py # Parallel verification (from ask_service.py)
│ │ └── auto_verify.py # Fast path for high-confidence (from ask_service.py)
│ │
│ ├── citation/
│ │ ├── __init__.py
│ │ ├── validator.py # Citation validation (from evidence.py)
│ │ ├── grader.py # Evidence grading (from evidence.py, configurable)
│ │ └── negation.py # Adversarial negation detection (from evidence.py)
│ │
│ ├── security/
│ │ ├── __init__.py
│ │ ├── injection.py # Prompt injection detection (from policy.py)
│ │ ├── blocklist.py # Span content blocklist (from verification.py)
│ │ └── isolation.py # Data isolation interface (NEW)
│ │
│ ├── observability/
│ │ ├── __init__.py
│ │ ├── otel.py # OpenTelemetry setup (from otel.py)
│ │ ├── langfuse.py # Langfuse integration (from otel.py)
│ │ ├── metrics.py # Custom metrics (from otel.py)
│ │ └── cost.py # Cost tracking (from cost.py)
│ │
│ ├── cache/
│ │ ├── __init__.py
│ │ ├── embedding_cache.py # LRU embedding cache (from cache.py)
│ │ └── query_cache.py # TTL query cache (from cache.py)
│ │
│ └── http/
│ ├── __init__.py
│ └── client.py # httpx pool manager (from http_client.py)
│
├── tests/
│ ├── test_bm25.py
│ ├── test_fusion.py
│ ├── test_injection.py
│ ├── test_citation.py
│ ├── test_verification.py
│ ├── test_cache.py
│ ├── test_cost.py
│ └── test_pipeline.py # End-to-end pipeline test
│
└── examples/
├── legal/ # Evidence-Bound domain config
│ ├── grader.py # Legal evidence grading
│ ├── verifier_prompt.txt # Legal verification prompt
│ ├── roles.py # Attorney/Paralegal/Viewer
│ └── schemas.py # Citation + EvidenceSupport
├── medical/ # Example: clinical RAG
│ ├── grader.py
│ ├── verifier_prompt.txt
│ └── schemas.py
└── quickstart.py # Minimal working example6. Quickstart Example (what using the harness looks like)
from rag_harness import RAGPipeline
from rag_harness.providers.llm import AzureOpenAIClient
from rag_harness.providers.embedding import AzureOpenAIEmbeddingClient
from rag_harness.providers.search import LocalSearchClient
from rag_harness.providers.parser import MarkerParser
from rag_harness.citation import EvidenceGrader
from rag_harness.security import InjectionDetector
# Configure the pipeline
pipeline = RAGPipeline(
llm=AzureOpenAIClient(endpoint="...", api_key="...", model="gpt-5-mini"),
embedding=AzureOpenAIEmbeddingClient(endpoint="...", deployment="text-embedding-3-large"),
search=LocalSearchClient(top_k=5, rrf_k=60),
parser=MarkerParser(force_ocr=False),
# Domain-specific extension points
grader=EvidenceGrader(
thresholds={"A": 2.5, "B": 1.5, "C": 0.3},
require_verification=True,
),
security=InjectionDetector(patterns="default"),
# Infrastructure
cache_embeddings=True,
cache_queries=True,
enable_langfuse=True,
enable_otel=True,
)
# Ingest a document
pipeline.ingest("contract.pdf", metadata={"doc_type": "contract", "date": "2024-01-15"})
# Ask a question
result = pipeline.ask(
question="What is the indemnification cap?",
filters={"tenant_id": "acme-corp", "workspace_id": "case-42"},
)
# Result includes: answer, citations, evidence grade, cost, latency breakdown
print(result.answer) # "According to Section 8.2 (page 14)..."
print(result.citations) # [Citation(doc="contract.pdf", page=14, span="...")]
print(result.evidence.grade) # "A"
print(result.cost.total_usd) # 0.0034
print(result.latency.total_ms)# 21007. What Makes This Harness Different
Most RAG frameworks give you retrieval. This gives you trust.
| Feature | LangChain / LlamaIndex | This Harness |
|---|---|---|
| Citation validation | None (LLM generates citations) | 90% similarity check + negation detection |
| Evidence grading | None | Configurable A/B/C with multiple signals |
| Refusal policy | None (always answers) | Configurable confidence gate — refuse when unsure |
| Injection detection | Basic | 22 patterns + homoglyph normalization + span blocklist |
| Chunk security | None (chunks are trusted) | Chunks marked untrusted in LLM context |
| Cost tracking | None or basic | Per-component, per-request, with breakdown |
| Observability | Optional add-on | Built-in: OTEL spans + Langfuse traces + PII-safe |
| Multi-tenant | Not supported | First-class data isolation at every layer |
| LLM verification | None | Parallel verification with auto-verify fast path |
| Provider abstraction | Framework-locked | 4 LLM, 3 parser, 2 search, 2 embedding providers |
The pitch: “Every answer is either cited and graded, or the system refuses. You can’t get a hallucinated citation. You can’t get a confident-sounding wrong answer. You get evidence or you get nothing.”
8. Extraction Roadmap
Phase 1: Core Package (2 weeks)
Extract the generic components into rag-harness/:
- Provider interfaces (SearchClient, LLMClient, EmbeddingClient, ParserClient, RerankerClient)
- BM25 engine + RRF fusion
- Citation validator + negation detector
- Injection detector + blocklist
- Embedding cache + query cache
- httpx client pool
- Cost tracker
- OTEL + Langfuse setup
- Pydantic Settings config (replace flat env vars)
- Tests for all extracted components
Phase 2: Pipeline Orchestrator (1 week)
Build RAGPipeline that composes the providers:
pipeline.ingest()— parse + chunk + embed + indexpipeline.ask()— search + verify + grade + cite- Extension points for domain-specific grading, verification prompts, response schemas
- End-to-end test with local providers (no Azure dependency)
Phase 3: Missing Capabilities (3-4 weeks)
Build the gaps identified in the enterprise blueprint:
- Document quality router
- Table-aware processing (detect + extract + dual embed)
- Metadata-enriched retrieval
- Agentic retrieval loop (multi-turn with refinement)
- Hierarchical document structure (TreeRAG-lite)
Phase 4: Domain Templates (1 week per domain)
Create example configurations:
examples/legal/— Evidence-Bound config (already exists)examples/medical/— Clinical RAG configexamples/financial/— Regulatory compliance configexamples/quickstart.py— Minimal working example
9. Answer: Do We Have a Repeatable Process?
Yes, about 70% of one.
Evidence-Bound accidentally built a good RAG harness while building a legal product. The provider abstractions are excellent. The security, observability, and citation validation are ahead of most open-source RAG frameworks. The caching, cost tracking, and parallel verification are production-grade.
What’s missing is the document intelligence layer (quality routing, table processing, hierarchical parsing) and the agentic loop (multi-turn refinement). These are the pieces that separate a demo from a system that works on real enterprise data.
The extraction is worth doing. The codebase is clean enough to refactor without rewriting. And the resulting harness would be genuinely differentiated: not another “wrapper around LangChain,” but a trust-first RAG foundation where every answer is either evidence-backed or refused.