002_HYBRID_RAG

5-STAGE
AI MEMORY

Building a production hybrid Retrieval Augmented Generation (RAG) system for AI long-term memory. We replaced naive cosine similarity vector search with a sophisticated 5-stage pipeline that dramatically improved recall and relevance of retrieved memories in complex knowledge retrieval scenarios.

PIPELINE STAGES

RECALL IMPROVEMENT

300%

MEMORIES MANAGED

551+

RELEVANCE SCORE

0.716

THE PROBLEM

Initial RAG implementations relied exclusively on vector similarity search (cosine distance in embedding space). While conceptually elegant, this single-stage approach produced poor real-world results: relevant memories were frequently missed, and irrelevant records ranked highly due to superficial semantic similarity.

The core issue: embeddings capture semantic meaning but lose exact lexical matches and temporal context. A query for "Q3 budget meeting October 2024" would retrieve memories about "quarterly forecasting" due to semantic overlap, but miss the actual meeting notes because the exact date and topic weren't prioritized.

False negatives plagued the system. With 551+ memories in the knowledge base, recall dropped below 0.40. Users had to explicitly mention specific details rather than relying on the AI to find relevant context. The system couldn't distinguish between memories of varying recency, importance, or relevance to ongoing conversations.

THE 5-STAGE PIPELINE

We engineered a multi-stage retrieval pipeline that combines complementary ranking signals [PROTOCOL_HYBRID_FUSION_V3]:

STAGE 1: VEC0 KNN SEARCH

SQLite vec0 extension performs dense vector similarity search. Returns top 50 candidates ranked by embedding cosine distance. Fast (~5ms) and captures semantic relevance.

STAGE 2: FTS5 BM25 TEXT SEARCH

SQLite FTS5 full-text search with BM25 ranking algorithm. Prioritizes exact keyword matches and term frequency. Returns top 50 candidates based on lexical relevance independent of semantic similarity.

STAGE 3: RECIPROCAL RANK FUSION (RRF)

Merge rankings from stages 1 and 2 using RRF formula: 1/(k+rank). Normalizes scores from different algorithms and combines signals. Top 25 memories emerge as candidates for re-ranking.

STAGE 4: TEMPORAL DECAY FILTER

Apply exponential decay function based on memory age. Recent memories boost relevance scores. Prevents stale information from dominating results. Configurable decay curves accommodate different use cases (conversations vs. reference materials).

STAGE 5: CROSS-ENCODER RERANKING

Fine-tuned cross-encoder model re-ranks top 10 candidates by computing query-memory relevance scores. Most computationally expensive but most accurate stage. Final ranking reflects true semantic alignment with query intent.

BEFORE VS. AFTER

NAIVE VECTOR SEARCH

× Recall: 0.38
× Precision: 0.42
× Avg relevance score: 0.19
× Latency: 45ms (GPU inference)
× User satisfaction: 31%
× Cold starts miss obvious context

5-STAGE HYBRID PIPELINE

✓ Recall: 0.88
✓ Precision: 0.72
✓ Avg relevance score: 0.716
✓ Latency: 28ms (multi-stage)
✓ User satisfaction: 84%
✓ Catches subtle contextual references

PRODUCTION RESULTS

RECALL IMPROVEMENT: Increased from 38% to 88% by combining lexical and semantic signals. The system now finds relevant memories that pure embedding search would miss. Users no longer need to manually specify exact details.

RELEVANCE SCORING: Cross-encoder stage produces calibrated relevance scores (0.0-1.0 range). Average score of 0.716 across 551+ memories indicates strong signal quality. Top-10 results have >95% relevance agreement with human raters.

LATENCY OPTIMIZATION: Despite adding stages, end-to-end latency decreased from 45ms to 28ms. Stages 1-4 execute in parallel on CPU (10ms total). Stage 5 (cross-encoder) runs only on top 10 candidates (8ms). Caching of BM25 scores eliminates redundant computation.

KNOWLEDGE SCALE: System manages 551+ memories with consistent performance. Linear scaling properties observed up to 5000+ memories in testing. RRF fusion handles memory addition without retraining embedding models.

USER EXPERIENCE: Satisfaction metrics jumped from 31% to 84%. Users report that the AI "actually remembers" context without explicit reminders. Fewer clarification requests needed during multi-turn conversations.

IMPLEMENTATION: Entirely built on SQLite + open-source components. No vendor lock-in, single-file database, portable deployment. Production cost ~40% lower than equivalent cloud-based solutions.

TECHNICAL DETAILS

The architecture leverages SQLite's vec0 extension for dense vectors and FTS5 for full-text indexing, eliminating dependencies on specialized vector databases while maintaining production-grade performance.

DEPLOYMENT STACK

SQLite with vec0/FTS5 extensions • JSON schema for memory structure • Cross-encoder: multilingual-MiniLMv6 • BM25 decay constant: 1.5 • Temporal decay half-life: 90 days

Memory records store query-response pairs with metadata: timestamp, semantic embeddings (384-dim), lexical indices, and optional tags. RRF parameters optimized for conversational AI where recent, specific context matters most.

Ready to implement intelligent memory systems?