RAG Harness Spec

RAG Harness: Extracting a Trusted Foundation from Evidence-Bound

Date: 2026-03-31
Goal: Extract a repeatable, tested, production-grade RAG harness from the Evidence-Bound codebase that can serve as a starting point for any domain.


1. Scorecard: Evidence-Bound vs Enterprise RAG Blueprint

Scoring our current system against each principle from the enterprise RAG blueprint.

| # | Blueprint Principle | Evidence-Bound Today | Score | Gap |
|---|---|---|---|---|
| 1 | Document Quality Router — route by doc quality, not assumption | Parser abstraction (3 providers), OCR detection, min-text-chars validation. But: same pipeline for all quality levels. No routing by scan quality or structure. | 4/10 | No quality classifier. Scanned 1995 invoices and clean digital PDFs hit the same chunking pipeline. |
| 2 | Metadata > Vectors — structured filters beat fancier embeddings | Tenant/matter filtering in Azure Search OData. Doc-level metadata extraction (title, author, pages). But: no metadata used in retrieval scoring. Filters are identity-based, not content-based. | 5/10 | No doc_type, date, or tag filtering in search. Metadata extracted but not indexed as filterable fields. |
| 3 | Tables as Structured Objects — dual embedding for tables | Zero table handling. Chunking treats tables as text. Financial tables get shredded into token soup. | 1/10 | No table detection, no structured extraction, no dual embedding. This is the biggest gap for legal contracts with indemnification schedules. |
| 4 | Hybrid Retrieval — BM25 + Dense + Graph | BM25 + vector with RRF fusion. Azure semantic reranker as cross-encoder. Local reranker fallback. | 7/10 | No GraphRAG for entity relationships or cross-document multi-hop. Legal: “find all clauses referencing Party B across 50 documents” fails. |
| 5 | Hierarchy / TreeRAG — respect document structure | Flat chunking with page offsets and char positions. No document hierarchy (section, subsection, clause). | 2/10 | Legal documents are deeply hierarchical (Article > Section > Clause > Sub-clause). Flat chunking loses this entirely. |
| 6 | Agentic Loop — hypothesize, retrieve, verify, refine | Retrieve + verify (parallel LLM verification). Auto-verify fast path. But: single-shot. No refinement if first retrieval misses. No query rewriting. | 5/10 | No re-retrieval, no query decomposition, no “the first answer wasn’t good enough, let me try differently.” |
| 7 | Retrieval as Security Boundary — chunks are untrusted | Injection gate pre-LLM. Chunk marked `<chunk>` (untrusted) in verifier prompt. Span blocklist. Homoglyph normalization. | 8/10 | Strongest area. Missing: content hash verification (confirm chunk wasn’t tampered between index and retrieval). |
| 8 | Observability + Citations from Day 1 | Langfuse full pipeline tracing. OTEL custom metrics. Per-request cost. Citation validation with 90% similarity threshold. Negation flip detection. | 9/10 | This is where Evidence-Bound shines. Built in from the start, not retrofitted. |

Overall: 5.1/10 — Strong on observability, citations, and security. Weak on document intelligence (tables, hierarchy, quality routing).


2. What’s Already Reusable (the 70% that’s generic)

Evidence-Bound has five clean abstraction layers that are immediately extractable:

Provider Interfaces (all have ABC base + factory)

| Interface | Implementations | Status |
|---|---|---|
| SearchClient | Azure AI Search, Local (BM25+vector) | Production-tested |
| LLMClient | Azure OpenAI, Anthropic, Gemini, Ollama | 4 providers shipped |
| EmbeddingClient | Azure OpenAI, Local (hash) | Production-tested |
| ParserClient | PyPDF, Marker (OCR), LlamaParse (cloud) | 3 providers shipped |
| RerankerClient | Local (term+phrase analysis) | Extensible |

Generic Infrastructure

| Component | File | Reusable? |
|---|---|---|
| BM25 scoring engine | retrieval.py | 100% — textbook BM25 with configurable k1, b |
| RRF fusion | retrieval.py | 100% — standard reciprocal rank fusion |
| Embedding cache (LRU) | cache.py | 100% — thread-safe, stats tracking |
| Query result cache (TTL) | cache.py | 100% — tenant-scoped, configurable |
| Injection detection | policy.py | 100% — 22 regex patterns + homoglyph normalization |
| Citation validation | evidence.py | 90% — similarity check + negation detection |
| httpx connection pool | http_client.py | 95% — singleton, HTTP/2, configurable limits |
| Cost tracking | cost.py | 100% — token-based, per-component breakdown |
| OTEL + Langfuse setup | otel.py | 95% — GenAI semantic conventions, PII-safe |
| Parallel verification | ask_service.py | 80% — ThreadPoolExecutor pattern |
| Chunking with offsets | ingestion.py | 70% — page/char offset preservation |
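Of the components above, RRF fusion is the smallest fully generic piece. A minimal sketch of standard reciprocal rank fusion — the `k=60` default is the conventional constant from the literature, not necessarily the value configured in `retrieval.py`:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) across every list that
    contains it; higher fused score ranks first, rewarding consensus.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# BM25 and vector search disagree on ordering; RRF rewards the docs
# that appear near the top of both lists.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_a", "doc_d"]
fused = rrf_fuse([bm25_hits, vector_hits])
# doc_a and doc_b (present in both lists) outrank doc_c and doc_d
```

Because RRF works on ranks rather than raw scores, it needs no score normalization between the BM25 and vector retrievers — which is why it fuses heterogeneous backends cleanly.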

3. The Harness Architecture

Core Idea

The harness is a configured pipeline, not a framework. You compose it from interchangeable providers and plug in domain-specific logic at defined extension points.

```
RAG HARNESS CORE

  Ingest ──> Store ──> Retrieve ──> Generate
  Parse      Chunks    Hybrid       LLM + Verify
  Chunk      Vectors   Search       Grade + Cite
  Embed      Metadata  Rerank       Validate
    │          │          │             │
  Parser     Search    Embedding    LLM Client (4 providers)
  Client     Client    Client      + Reranker
  (3 impl)   (2 impl)  (2 impl)

CROSS-CUTTING:
  Security       (injection gate + span blocklist + homoglyph normalization)
  Observability  (OTEL + Langfuse)
  Caching        (embedding LRU + query TTL)
  Cost tracking  (per-component, per-request)

EXTENSION POINTS (domain-specific):
  Data isolation  (tenant/workspace)
  Evidence grader (A/B/C or custom)
  Verifier prompt (legal, medical, ...)
  Response schema (citations, refusals, etc.)
```

Extension Points (where domain logic plugs in)

| Extension Point | Legal (Evidence-Bound) | Medical | Financial |
|---|---|---|---|
| Data Isolation | tenant_id + matter_id | patient_id + study_id | client_id + portfolio_id |
| Evidence Grader | A/B/C (verification + reranker + overlap) | Clinical confidence levels | Regulatory confidence |
| Verifier Prompt | "Does chunk contain exact legal evidence?" | "Does chunk contain clinical finding?" | "Does chunk cite regulatory source?" |
| Response Schema | Citation (doc, page, span) + refusal codes | Finding (study, section, conclusion) | Reference (regulation, clause, date) |
| RBAC Roles | Admin/Attorney/Paralegal/Viewer | Admin/Doctor/Nurse/Patient | Admin/Analyst/Compliance/Auditor |
| Quality Router | Contract vs pleading vs memo | Lab report vs clinical note vs imaging | 10-K vs invoice vs contract |
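The Evidence Grader row is the clearest plug-in seam. One way a domain grader could satisfy a shared protocol — the `GradeSignals` fields and the threshold arithmetic below are illustrative assumptions, not the shipped Evidence-Bound API:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class GradeSignals:
    verified: bool        # did parallel LLM verification pass?
    rerank_score: float   # cross-encoder / reranker relevance
    overlap: float        # citation-span overlap with the chunk


class EvidenceGrader(Protocol):
    """Extension point: each domain supplies its own grading logic."""
    def grade(self, signals: GradeSignals) -> str: ...


class LegalGrader:
    """A/B/C grading from verification + reranker + overlap signals."""
    def grade(self, signals: GradeSignals) -> str:
        score = (2.0 if signals.verified else 0.0) + signals.rerank_score + signals.overlap
        if score >= 2.5:
            return "A"
        if score >= 1.5:
            return "B"
        return "C"


grader: EvidenceGrader = LegalGrader()
grade = grader.grade(GradeSignals(verified=True, rerank_score=0.8, overlap=0.9))  # "A"
```

A medical or financial grader swaps in its own signal weighting behind the same `Protocol`, so the pipeline orchestrator never needs to know which domain it is serving.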

4. What’s Missing for a World-Class Harness

Gap 1: Document Quality Router (Blueprint Principle 1)

What it does: Classifies incoming documents by type and quality before parsing, then routes to the appropriate pipeline.

Why it matters: A scanned 1995 fax should not go through the same 512-token chunking as a clean digital PDF. The fax needs aggressive OCR, table detection, and smaller chunks. The PDF needs metadata extraction and structural parsing.

Harness design:

```python
class DocumentQualityRouter:
    """Route documents to appropriate processing pipeline."""

    def classify(self, file_path: str, metadata: dict) -> DocumentProfile:
        """Return quality profile: digital/scanned/mixed, structure level, table density."""

    def route(self, profile: DocumentProfile) -> ProcessingPipeline:
        """Select parser, chunking strategy, and embedding approach based on profile."""
```

What Evidence-Bound has today: Parser selection via PARSER_PROVIDER env var (global, not per-document). OCR fallback in Marker. Min-text-chars detection for OCR warning.

What to build: Per-document classification based on: text extractability (digital vs scanned), structure detection (headers, tables, lists), page count, file size. Route to different chunk_size + parser + OCR settings.
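A first-cut classifier can be a pure heuristic over per-page extractable text, echoing the existing min-text-chars check. The `min_chars_per_page` cutoff, ratio thresholds, and chunk sizes below are illustrative, not a spec:

```python
from dataclasses import dataclass


@dataclass
class DocumentProfile:
    quality: str      # "digital" | "scanned" | "mixed"
    chunk_size: int   # smaller chunks for noisier OCR text
    force_ocr: bool


def classify_page_texts(page_texts: list[str],
                        min_chars_per_page: int = 200) -> DocumentProfile:
    """Heuristic: pages yielding little extractable text are probably scanned."""
    sparse = sum(1 for t in page_texts if len(t.strip()) < min_chars_per_page)
    ratio = sparse / max(len(page_texts), 1)
    if ratio > 0.8:
        return DocumentProfile("scanned", chunk_size=256, force_ocr=True)
    if ratio > 0.2:
        return DocumentProfile("mixed", chunk_size=384, force_ocr=True)
    return DocumentProfile("digital", chunk_size=512, force_ocr=False)


# Two text-rich pages plus one empty (image-only) page -> "mixed"
profile = classify_page_texts(["x" * 500, "x" * 450, ""])
```

The router then feeds `profile.chunk_size` and `profile.force_ocr` into parser selection, replacing the global PARSER_PROVIDER decision with a per-document one.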


Gap 2: Table-Aware Processing (Blueprint Principle 3)

What it does: Detects tables in documents, extracts them as structured objects (CSV/markdown), and creates dual embeddings: one for the structured data, one for a natural language summary.

Why it matters: Legal contracts have indemnification schedules, fee tables, payment terms. Financial docs have balance sheets, P&L statements. Standard chunking destroys these.

Harness design:

```python
class TableProcessor:
    """Extract and dual-embed tables from documents."""

    def detect_tables(self, page_content: str) -> list[TableRegion]:
        """Identify table boundaries in page text."""

    def extract_structured(self, table: TableRegion) -> StructuredTable:
        """Convert to CSV/markdown preserving headers and values."""

    def dual_embed(self, table: StructuredTable) -> tuple[list[float], list[float]]:
        """Return (structural_embedding, summary_embedding)."""
```

What Evidence-Bound has today: Nothing. Tables are tokenized as flat text.

What to build: Marker already detects tables during OCR. Expose that detection, extract as markdown, generate summary via LLM, embed both representations. Store with chunk_type: "table" metadata for retrieval filtering.
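The storage step of that plan could look like the sketch below — `embed` stands in for a real embedding client, and the chunk field names are assumptions, not the actual index schema:

```python
def make_table_chunks(table_markdown: str, summary: str, embed) -> list[dict]:
    """Index a detected table twice: once as structured markdown, once as
    a natural-language summary, both tagged chunk_type='table' so
    retrieval can filter or boost them."""
    return [
        {"content": table_markdown, "embedding": embed(table_markdown),
         "chunk_type": "table", "representation": "structured"},
        {"content": summary, "embedding": embed(summary),
         "chunk_type": "table", "representation": "summary"},
    ]


# Stand-in embedder; a real EmbeddingClient implementation goes here
fake_embed = lambda text: [float(len(text))]

chunks = make_table_chunks(
    "| Fee | Cap |\n| --- | --- |\n| 5k | 1M |",
    "Fee schedule capping liability at 1M.",
    fake_embed,
)
```

The summary representation catches natural-language queries ("what caps liability?") while the structured one preserves exact cell values for extraction — that is the point of embedding both.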


Gap 3: Metadata-Enriched Retrieval (Blueprint Principle 2)

What it does: Uses structured metadata (document type, date, author, section headers) as first-class retrieval filters, not just vector similarity.

Why it matters: “What were the Q4 2024 revenue figures?” should filter by doc_type=10K AND date=2024Q4 before doing vector search, not rely on embeddings to figure out the date.

Harness design:

```python
class MetadataFilter:
    """Build search filters from query analysis and document metadata."""

    def extract_query_filters(self, question: str) -> dict[str, Any]:
        """Parse date ranges, doc types, entity names from question."""

    def apply_to_search(self, base_query: SearchQuery, filters: dict) -> SearchQuery:
        """Add metadata filters to search request."""
```

What Evidence-Bound has today: Metadata extraction (title, author, page_count) at ingestion. Tenant/matter filtering in Azure Search. But: metadata not indexed as filterable fields, not used in retrieval scoring.

What to build: Index doc_type, doc_date, author, custom tags as filterable + facetable fields in Azure Search. Add query analysis to extract metadata hints. Boost results matching metadata filters.
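Query-side filter extraction can start as plain regex before graduating to LLM-based query analysis. A sketch, where the hint table and output field names are illustrative:

```python
import re

# Illustrative keyword -> doc_type mapping; a real system would load this per domain
DOC_TYPE_HINTS = {"10-k": "10K", "10k": "10K", "invoice": "invoice", "contract": "contract"}


def extract_query_filters(question: str) -> dict:
    """Pull structured filter hints (fiscal quarter, doc type) out of the
    question before any vector search runs."""
    filters: dict = {}
    q = question.lower()
    m = re.search(r"q([1-4])\s*(\d{4})", q)   # e.g. "Q4 2024"
    if m:
        filters["doc_date"] = f"{m.group(2)}Q{m.group(1)}"
    for hint, doc_type in DOC_TYPE_HINTS.items():
        if hint in q:
            filters["doc_type"] = doc_type
            break
    return filters


extract_query_filters("What were the Q4 2024 revenue figures in the 10-K?")
# -> {"doc_date": "2024Q4", "doc_type": "10K"}
```

The resulting dict maps directly onto Azure Search OData filters once `doc_type` and `doc_date` are indexed as filterable fields.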


Gap 4: Hierarchical Document Structure (Blueprint Principle 5)

What it does: Parses document structure (sections, subsections, clauses) and builds a tree. Broad questions retrieve from high-level summaries; precise questions drill down to specific clauses.

Why it matters: “What does Article 7 say about termination?” should go directly to Article 7, not scan all 500 chunks looking for “termination.”

Harness design:

```python
class DocumentTree:
    """Hierarchical document representation."""

    def build_tree(self, parsed_doc: ParseResult) -> TreeNode:
        """Build section/subsection tree from parsed document."""

    def multi_level_embed(self, tree: TreeNode) -> list[TreeChunk]:
        """Embed at multiple granularity levels: section summary + leaf chunks."""

    def route_query(self, question: str, tree: TreeNode) -> list[TreeChunk]:
        """Broad question -> high-level nodes. Precise question -> leaf nodes."""
```

What Evidence-Bound has today: Flat chunks with page numbers. No section detection.

What to build: Section header detection during parsing (regex + LLM). Parent-child chunk relationships. Multi-level embedding (section summary + individual paragraphs). Query routing based on specificity.


Gap 5: Agentic Retrieval Loop (Blueprint Principle 6)

What it does: If first retrieval doesn’t find a confident answer, the system reformulates the query and tries again. Hypothesize -> Retrieve -> Verify -> Refine.

Why it matters: “What is the total exposure across all agreements?” requires: (1) find all agreements, (2) find exposure clauses in each, (3) aggregate. Single-shot retrieval can’t do this.

Harness design:

```python
class AgenticRetriever:
    """Multi-turn retrieval with query refinement."""

    def retrieve_with_refinement(
        self,
        question: str,
        max_iterations: int = 3,
    ) -> AgenticResult:
        """
        Loop:
          1. Analyze question -> decompose if complex
          2. Retrieve candidates
          3. Verify relevance
          4. If confidence < threshold: reformulate query, try again
          5. If max iterations: return best result or refuse
        """
```

What Evidence-Bound has today: Single-shot retrieve + verify. Follow-up question detection for conversational context. Auto-verify fast path. But: no query decomposition, no re-retrieval, no refinement.

What to build: Query complexity classifier (simple vs multi-hop vs aggregation). Query decomposition into sub-queries. Iterative retrieval with semantic caching to avoid redundant searches. Confidence-gated loop exit.
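The confidence-gated loop itself is small once retrieval, verification, and reformulation are injectable. A sketch with stub callables standing in for the real components:

```python
def retrieve_with_refinement(question, retrieve, verify_confidence, reformulate,
                             max_iterations=3, threshold=0.7):
    """Retry retrieval with a reformulated query until a confident
    result is found or the iteration budget runs out."""
    best, best_conf = None, 0.0
    query = question
    for _ in range(max_iterations):
        candidates = retrieve(query)
        conf = verify_confidence(question, candidates)
        if conf > best_conf:
            best, best_conf = candidates, conf
        if best_conf >= threshold:
            return best, best_conf            # confident -> exit early
        query = reformulate(question, query)  # refine and retry
    return best, best_conf  # budget exhausted: caller may refuse on low confidence


# Stub components: the first query misses, the reformulated one hits
attempts = []

def retrieve(query):
    attempts.append(query)
    return [query]

def verify_confidence(question, candidates):
    return 0.9 if "rewritten" in candidates[0] else 0.2

def reformulate(question, prev_query):
    return prev_query + " rewritten"

result, conf = retrieve_with_refinement("total exposure?", retrieve,
                                        verify_confidence, reformulate)
# conf is 0.9 after two retrieval attempts
```

Keeping the best-so-far result means the loop degrades gracefully: when the budget runs out, the caller gets the strongest candidates plus a confidence score it can gate refusal on.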


5. Harness Package Structure

```
rag-harness/
├── pyproject.toml              # Package config
├── README.md
├── rag_harness/
│   ├── __init__.py
│   ├── pipeline.py             # RAGPipeline orchestrator
│   ├── config.py               # Pydantic Settings (not flat env vars)
│   │
│   ├── ingest/
│   │   ├── __init__.py
│   │   ├── chunker.py          # Chunking with offsets (from ingestion.py)
│   │   ├── quality_router.py   # NEW: Document quality classification
│   │   └── table_processor.py  # NEW: Table detection + dual embedding
│   │
│   ├── providers/
│   │   ├── __init__.py
│   │   ├── search/
│   │   │   ├── base.py         # SearchClient ABC (from search/base.py)
│   │   │   ├── azure.py        # Azure AI Search (from search/azure.py)
│   │   │   └── local.py        # BM25 + Vector + RRF (from search/local.py)
│   │   ├── llm/
│   │   │   ├── base.py         # LLMClient ABC (from llm/base.py)
│   │   │   ├── azure_openai.py
│   │   │   ├── anthropic.py
│   │   │   ├── gemini.py
│   │   │   └── ollama.py
│   │   ├── embedding/
│   │   │   ├── base.py         # EmbeddingClient ABC
│   │   │   ├── azure_openai.py
│   │   │   └── local.py
│   │   ├── parser/
│   │   │   ├── base.py         # ParserClient ABC (from parsers/base.py)
│   │   │   ├── pypdf.py
│   │   │   ├── marker.py
│   │   │   └── llamaparse.py
│   │   └── reranker/
│   │       ├── base.py         # RerankerClient ABC
│   │       └── local.py
│   │
│   ├── retrieval/
│   │   ├── __init__.py
│   │   ├── bm25.py             # BM25 engine (from retrieval.py)
│   │   ├── fusion.py           # RRF fusion (from retrieval.py)
│   │   ├── hybrid.py           # Hybrid search orchestration
│   │   └── agentic.py          # NEW: Multi-turn retrieval loop
│   │
│   ├── verification/
│   │   ├── __init__.py
│   │   ├── verifier.py         # LLM verification (from verification.py)
│   │   ├── parallel.py         # Parallel verification (from ask_service.py)
│   │   └── auto_verify.py      # Fast path for high-confidence (from ask_service.py)
│   │
│   ├── citation/
│   │   ├── __init__.py
│   │   ├── validator.py        # Citation validation (from evidence.py)
│   │   ├── grader.py           # Evidence grading (from evidence.py, configurable)
│   │   └── negation.py         # Adversarial negation detection (from evidence.py)
│   │
│   ├── security/
│   │   ├── __init__.py
│   │   ├── injection.py        # Prompt injection detection (from policy.py)
│   │   ├── blocklist.py        # Span content blocklist (from verification.py)
│   │   └── isolation.py        # Data isolation interface (NEW)
│   │
│   ├── observability/
│   │   ├── __init__.py
│   │   ├── otel.py             # OpenTelemetry setup (from otel.py)
│   │   ├── langfuse.py         # Langfuse integration (from otel.py)
│   │   ├── metrics.py          # Custom metrics (from otel.py)
│   │   └── cost.py             # Cost tracking (from cost.py)
│   │
│   ├── cache/
│   │   ├── __init__.py
│   │   ├── embedding_cache.py  # LRU embedding cache (from cache.py)
│   │   └── query_cache.py      # TTL query cache (from cache.py)
│   │
│   └── http/
│       ├── __init__.py
│       └── client.py           # httpx pool manager (from http_client.py)
├── tests/
│   ├── test_bm25.py
│   ├── test_fusion.py
│   ├── test_injection.py
│   ├── test_citation.py
│   ├── test_verification.py
│   ├── test_cache.py
│   ├── test_cost.py
│   └── test_pipeline.py        # End-to-end pipeline test
└── examples/
    ├── legal/                  # Evidence-Bound domain config
    │   ├── grader.py           # Legal evidence grading
    │   ├── verifier_prompt.txt # Legal verification prompt
    │   ├── roles.py            # Attorney/Paralegal/Viewer
    │   └── schemas.py          # Citation + EvidenceSupport
    ├── medical/                # Example: clinical RAG
    │   ├── grader.py
    │   ├── verifier_prompt.txt
    │   └── schemas.py
    └── quickstart.py           # Minimal working example
```

6. Quickstart Example (what using the harness looks like)

```python
from rag_harness import RAGPipeline
from rag_harness.providers.llm import AzureOpenAIClient
from rag_harness.providers.embedding import AzureOpenAIEmbeddingClient
from rag_harness.providers.search import LocalSearchClient
from rag_harness.providers.parser import MarkerParser
from rag_harness.citation import EvidenceGrader
from rag_harness.security import InjectionDetector

# Configure the pipeline
pipeline = RAGPipeline(
    llm=AzureOpenAIClient(endpoint="...", api_key="...", model="gpt-5-mini"),
    embedding=AzureOpenAIEmbeddingClient(endpoint="...", deployment="text-embedding-3-large"),
    search=LocalSearchClient(top_k=5, rrf_k=60),
    parser=MarkerParser(force_ocr=False),
    # Domain-specific extension points
    grader=EvidenceGrader(
        thresholds={"A": 2.5, "B": 1.5, "C": 0.3},
        require_verification=True,
    ),
    security=InjectionDetector(patterns="default"),
    # Infrastructure
    cache_embeddings=True,
    cache_queries=True,
    enable_langfuse=True,
    enable_otel=True,
)

# Ingest a document
pipeline.ingest("contract.pdf", metadata={"doc_type": "contract", "date": "2024-01-15"})

# Ask a question
result = pipeline.ask(
    question="What is the indemnification cap?",
    filters={"tenant_id": "acme-corp", "workspace_id": "case-42"},
)

# Result includes: answer, citations, evidence grade, cost, latency breakdown
print(result.answer)            # "According to Section 8.2 (page 14)..."
print(result.citations)         # [Citation(doc="contract.pdf", page=14, span="...")]
print(result.evidence.grade)    # "A"
print(result.cost.total_usd)    # 0.0034
print(result.latency.total_ms)  # 2100
```

7. What Makes This Harness Different

Most RAG frameworks give you retrieval. This gives you trust.

| Feature | LangChain / LlamaIndex | This Harness |
|---|---|---|
| Citation validation | None (LLM generates citations) | 90% similarity check + negation detection |
| Evidence grading | None | Configurable A/B/C with multiple signals |
| Refusal policy | None (always answers) | Configurable confidence gate — refuse when unsure |
| Injection detection | Basic | 22 patterns + homoglyph normalization + span blocklist |
| Chunk security | None (chunks are trusted) | Chunks marked untrusted in LLM context |
| Cost tracking | None or basic | Per-component, per-request, with breakdown |
| Observability | Optional add-on | Built-in: OTEL spans + Langfuse traces + PII-safe |
| Multi-tenant | Not supported | First-class data isolation at every layer |
| LLM verification | None | Parallel verification with auto-verify fast path |
| Provider abstraction | Framework-locked | 4 LLM, 3 parser, 2 search, 2 embedding providers |

The pitch: “Every answer is either cited and graded, or the system refuses. You can’t get a hallucinated citation. You can’t get a confident-sounding wrong answer. You get evidence or you get nothing.”


8. Extraction Roadmap

Phase 1: Core Package (2 weeks)

Extract the generic components into rag-harness/:

  • Provider interfaces (SearchClient, LLMClient, EmbeddingClient, ParserClient, RerankerClient)
  • BM25 engine + RRF fusion
  • Citation validator + negation detector
  • Injection detector + blocklist
  • Embedding cache + query cache
  • httpx client pool
  • Cost tracker
  • OTEL + Langfuse setup
  • Pydantic Settings config (replace flat env vars)
  • Tests for all extracted components

Phase 2: Pipeline Orchestrator (1 week)

Build RAGPipeline that composes the providers:

  • pipeline.ingest() — parse + chunk + embed + index
  • pipeline.ask() — search + verify + grade + cite
  • Extension points for domain-specific grading, verification prompts, response schemas
  • End-to-end test with local providers (no Azure dependency)

Phase 3: Missing Capabilities (3-4 weeks)

Build the gaps identified in the enterprise blueprint:

  • Document quality router
  • Table-aware processing (detect + extract + dual embed)
  • Metadata-enriched retrieval
  • Agentic retrieval loop (multi-turn with refinement)
  • Hierarchical document structure (TreeRAG-lite)

Phase 4: Domain Templates (1 week per domain)

Create example configurations:

  • examples/legal/ — Evidence-Bound config (already exists)
  • examples/medical/ — Clinical RAG config
  • examples/financial/ — Regulatory compliance config
  • examples/quickstart.py — Minimal working example

9. Answer: Do We Have a Repeatable Process?

Yes, about 70% of one.

Evidence-Bound accidentally built a good RAG harness while building a legal product. The provider abstractions are excellent. The security, observability, and citation validation are ahead of most open-source RAG frameworks. The caching, cost tracking, and parallel verification are production-grade.

What’s missing is the document intelligence layer (quality routing, table processing, hierarchical parsing) and the agentic loop (multi-turn refinement). These are the pieces that separate a demo from a system that works on real enterprise data.

The extraction is worth doing. The codebase is clean enough to refactor without rewriting. And the resulting harness would be genuinely differentiated: not another “wrapper around LangChain,” but a trust-first RAG foundation where every answer is either evidence-backed or refused.