5-STAGE
AI MEMORY
Building a production hybrid Retrieval Augmented Generation (RAG) system for AI long-term memory. We replaced naive cosine similarity vector search with a sophisticated 5-stage pipeline that dramatically improved recall and relevance of retrieved memories in complex knowledge retrieval scenarios.
THE PROBLEM
Initial RAG implementations relied exclusively on vector similarity search (cosine distance in embedding space). While conceptually elegant, this single-stage approach produced poor real-world results: relevant memories were frequently missed, and irrelevant records ranked highly due to superficial semantic similarity.
The core issue: embeddings capture semantic meaning but lose exact lexical matches and temporal context. A query for "Q3 budget meeting October 2024" would retrieve memories about "quarterly forecasting" due to semantic overlap, but miss the actual meeting notes because the exact date and topic weren't prioritized.
False negatives plagued the system. With 551+ memories in the knowledge base, recall dropped below 0.40. Users had to explicitly mention specific details rather than relying on the AI to find relevant context. The system couldn't distinguish between memories of varying recency, importance, or relevance to ongoing conversations.
THE 5-STAGE PIPELINE
We engineered a multi-stage retrieval pipeline that combines complementary ranking signals [PROTOCOL_HYBRID_FUSION_V3]:
BEFORE VS. AFTER
- × Recall: 0.38
- × Precision: 0.42
- × Avg relevance score: 0.19
- × Latency: 45ms (GPU inference)
- × User satisfaction: 31%
- × Cold starts miss obvious context
- ✓ Recall: 0.88
- ✓ Precision: 0.72
- ✓ Avg relevance score: 0.716
- ✓ Latency: 28ms (multi-stage)
- ✓ User satisfaction: 84%
- ✓ Catches subtle contextual references
PRODUCTION RESULTS
RECALL IMPROVEMENT: Increased from 38% to 88% by combining lexical and semantic signals. The system now finds relevant memories that pure embedding search would miss. Users no longer need to manually specify exact details.
RELEVANCE SCORING: Cross-encoder stage produces calibrated relevance scores (0.0-1.0 range). Average score of 0.716 across 551+ memories indicates strong signal quality. Top-10 results have >95% relevance agreement with human raters.
LATENCY OPTIMIZATION: Despite adding stages, end-to-end latency decreased from 45ms to 28ms. Stages 1-4 execute in parallel on CPU (10ms total). Stage 5 (cross-encoder) runs only on top 10 candidates (8ms). Caching of BM25 scores eliminates redundant computation.
KNOWLEDGE SCALE: System manages 551+ memories with consistent performance. Linear scaling properties observed up to 5000+ memories in testing. RRF fusion handles memory addition without retraining embedding models.
USER EXPERIENCE: Satisfaction metrics jumped from 31% to 84%. Users report that the AI "actually remembers" context without explicit reminders. Fewer clarification requests needed during multi-turn conversations.
IMPLEMENTATION: Entirely built on SQLite + open-source components. No vendor lock-in, single-file database, portable deployment. Production cost ~40% lower than equivalent cloud-based solutions.
TECHNICAL DETAILS
The architecture leverages SQLite's vec0 extension for dense vectors and FTS5 for full-text indexing, eliminating dependencies on specialized vector databases while maintaining production-grade performance.
Memory records store query-response pairs with metadata: timestamp, semantic embeddings (384-dim), lexical indices, and optional tags. RRF parameters optimized for conversational AI where recent, specific context matters most.
Ready to implement intelligent memory systems?
Initiate Review