
How We Built a 5-Stage Hybrid RAG System for AI Long-Term Memory

February 27, 2026

[Figure: Hybrid RAG memory system architecture diagram]

Most AI assistants have the memory of a goldfish. You tell them something important on Monday, and by Wednesday they have forgotten it entirely. We decided to fix that.

This is the production story of upgrading our AI assistant's memory from a naive cosine similarity scan to a 5-stage hybrid retrieval pipeline. The system manages 551+ memories in a single SQLite file and serves as the long-term recall engine for an AI operating across Telegram, Discord, and web interfaces.

Query → vec0 KNN + FTS5 BM25 → RRF Fusion → Temporal Decay → Jina Rerank

The Problem: Semantic Search Alone Is Not Enough

Our original implementation was simple. Store everything as embeddings, run cosine similarity on every query, return the closest matches. It worked for 100 memories. Then it stopped working at 500+.

Three specific failures pushed us to rebuild:

Failure 1: Exact matches got lost. Searching for "TKo@a Fortune 500 client.com" would not reliably surface the memory containing that exact email address, because embeddings encode meaning, not strings. The semantic embedding for "a Fortune 500 client contact email" is close, but it is not the same thing.

Failure 2: Everything was equally old. A meeting scheduled for tomorrow ranked the same as a note from three weeks ago about the same topic. Zero recency signal. The AI treated a three-week-old status update with the same urgency as tomorrow's deadline.

Failure 3: Bi-encoders are weak rankers. In a bi-encoder setup, the query and document get encoded independently. The model never sees them together. Two documents can have similar embeddings to a query but wildly different actual relevance. We needed something that could evaluate relevance by looking at both the query and the document simultaneously.

The Architecture: 5 Stages, Each Solving a Different Problem

Instead of replacing one retriever with a better one, we built a pipeline where each stage addresses a specific weakness. If any stage fails, the pipeline degrades gracefully. Results are still good, just less refined.

Stage 1: Dual Retrieval (Semantic + Keyword)

We run two independent searches against the same query:

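vec0 KNN (via sqlite-vec): Finds the 40 nearest semantic neighbors using L2 distance. The search runs as a single SQL query inside SQLite's native code instead of a Python loop that loads and scores every vector. Note that vec0 currently does an exhaustive (brute-force) scan, so the cost still grows with the corpus; at a few hundred memories it is fast enough to treat as constant.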
FTS5 BM25 (built into SQLite): Finds the 40 best keyword matches using Okapi BM25 scoring. This catches exact strings, email addresses, proper nouns, and technical terms that embeddings miss.

Both retrievers contribute to the same candidate pool. A query for "TKo@a Fortune 500 client.com email" now gets semantic matches about a Fortune 500 client contacts AND the exact memory containing that email address.
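To make the keyword half of Stage 1 concrete, here is a minimal sketch of an FTS5 BM25 lookup using only Python's standard sqlite3 module. The helper names (to_fts5_query, keyword_search) and the choice to OR the quoted terms together are assumptions for illustration, not the production code. Quoting each term matters because characters such as @ and . are otherwise FTS5 query syntax errors.

```python
import sqlite3


def to_fts5_query(text: str) -> str:
    """Quote every whitespace-separated term so punctuation is treated as text.

    Each term becomes an FTS5 phrase; embedded double quotes are doubled.
    Terms are joined with OR to favor recall (an assumption of this sketch).
    """
    terms = [t.replace('"', '""') for t in text.split()]
    return " OR ".join(f'"{t}"' for t in terms)


def keyword_search(conn: sqlite3.Connection, query: str, limit: int = 40) -> list[int]:
    """Return rowids of the best BM25 matches, best first."""
    fts_query = to_fts5_query(query)
    if not fts_query:
        return []
    rows = conn.execute(
        # bm25() returns lower values for better matches, so ascending order
        # puts the strongest keyword hits first.
        "SELECT rowid FROM memories_fts "
        "WHERE memories_fts MATCH ? "
        "ORDER BY bm25(memories_fts) LIMIT ?",
        (fts_query, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

The default unicode61 tokenizer splits an address like jane@example.com into three tokens, and the quoted phrase requires them to appear next to each other in that order, which is what makes the exact-address lookup reliable.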

Stage 2: Reciprocal Rank Fusion (RRF)

With two ranked lists from different retrievers, we need to combine them. RRF is an unsupervised fusion algorithm that works on ranks, not scores. This is important because vector distances and BM25 scores are on completely different scales.

RRF_score(d) = SUM( 1 / (k + rank_i(d)) )  for each retriever i

The key property: documents appearing in both retrievers get naturally boosted. A document ranked #1 in both vec0 and FTS5 scores 2x a document ranked #1 in only one. We use k=60 (the standard from the original Cormack et al. paper). A constant that large flattens the gap between adjacent ranks (1/61 vs. 1/62 for ranks 1 and 2), so no single retriever's top hit dominates the fused list.
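The formula above fits in a few lines of Python. This is a sketch rather than the production implementation; the function name rrf_fuse and the example ID lists are made up for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[int]], k: int = 60) -> list[tuple[int, float]]:
    """Fuse several ranked ID lists with Reciprocal Rank Fusion.

    Each list is ordered best first. A document's score is the sum of
    1 / (k + rank) over every list it appears in (ranks start at 1).
    """
    scores: dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical candidate IDs from the two retrievers.
vec_hits = [7, 3, 9]
fts_hits = [7, 5, 3]
fused = rrf_fuse([vec_hits, fts_hits])
# Document 7 is ranked #1 in both lists, so it scores 2/61 and comes first.
```

Because only ranks enter the sum, the L2 distances and BM25 scores never need to be normalized against each other.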

Stage 3: Temporal Decay

Memories naturally lose urgency over time. A meeting note from today matters more than the same topic from three weeks ago. We apply exponential decay with a 14-day half-life:

score *= (0.7 + 0.3 * 0.5^(age_days / 14))

The 70/30 blend is the critical design choice. The 70% floor ensures that permanent facts (like contact information, preferences, key decisions) are never buried by recency alone. The 30% recency component is the tiebreaker when relevance is close.

Age	Multiplier	Effect
Today	1.000x	Full score
7 days	0.912x	Slight reduction
14 days	0.850x	15% reduction
28 days	0.775x	22.5% reduction
56 days	0.719x	28% reduction (approaching floor)
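The multiplier column can be reproduced directly from the formula. The sketch below is illustrative; the function name and the clamping of negative ages to zero are assumptions.

```python
def decay_multiplier(age_days: float, half_life_days: float = 14.0, floor: float = 0.7) -> float:
    """Return floor + (1 - floor) * 0.5 ** (age / half_life).

    With the defaults this is 0.7 + 0.3 * 0.5 ** (age_days / 14).
    Negative ages (for example, clock skew) are clamped to zero.
    """
    age = max(age_days, 0.0)
    return floor + (1.0 - floor) * 0.5 ** (age / half_life_days)


# Recompute the table rows: 0, 7, 14, 28, and 56 days.
multipliers = {days: round(decay_multiplier(days), 3) for days in (0, 7, 14, 28, 56)}
```

The multiplier approaches the 0.7 floor as age grows but never drops below it, which is why old but highly relevant facts can still outrank recent noise.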

Stage 4: Cross-Encoder Reranking (Jina)

This is where the quality jump happens. Bi-encoders encode query and document independently. Cross-encoders process them together with full attention. The quality difference is dramatic.

We send the top candidates to Jina Reranker v2 (multilingual cross-encoder), which returns calibrated 0-to-1 relevance scores. The reranker correctly identifies that "Tracey Cranz meeting scheduled Feb 19 at 4pm" is far more relevant to the query "a Fortune 500 client meeting" than a file path reference, even though the file path had stronger keyword overlap.

Before (cosine only)

Top result: a file path reference
[0.682] a Fortune 500 client-mdm-proposal-2026-02-25.md

After (full pipeline)

Top result: actual meeting details
[0.716] Tracey Cranz meeting Feb 19 at 4pm...
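A sketch of how reranker output might be mapped back onto the candidate list. It assumes the response is a dict with a "results" list whose items carry an "index" into the submitted documents and a "relevance_score"; that shape is an assumption here and should be checked against the Jina API reference. The function name and the sample scores are invented for illustration.

```python
def apply_rerank(candidates: list[str], response: dict) -> list[tuple[str, float]]:
    """Reorder candidates by reranker relevance, highest first.

    Items with an out-of-range index are ignored rather than raising,
    so a malformed response cannot crash the search path.
    """
    ranked = []
    for item in response.get("results", []):
        idx = item["index"]
        if 0 <= idx < len(candidates):
            ranked.append((candidates[idx], float(item["relevance_score"])))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Illustrative values only; these are not measured scores.
candidates = ["file path reference", "meeting details"]
response = {
    "results": [
        {"index": 0, "relevance_score": 0.10},
        {"index": 1, "relevance_score": 0.70},
    ]
}
reranked = apply_rerank(candidates, response)
```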

Stage 5: Circuit Breaker

The Jina reranker is an external API. The system must never fail because of it. A circuit breaker pattern ensures resilience:

429 rate limit: 60-second backoff
5xx server error: 30-second backoff
Timeout (>10s): 15-second backoff
Connection error: 30-second backoff

During backoff, reranking is skipped entirely. The RRF + temporal decay results are returned directly. The system degrades through four levels, all the way down to basic keyword search if everything else fails. Each stage has independent error handling, so a failure in any one stage never cascades to others.
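The backoff rules above can be sketched as a small breaker object. The class and exception names are hypothetical, and the rerank callable is injected so the sketch makes no assumptions about the HTTP client. The injectable clock exists only to make the behavior easy to test.

```python
import time


class RerankError(Exception):
    """Raised by the rerank callable; kind selects the backoff window."""

    def __init__(self, kind: str):
        super().__init__(kind)
        self.kind = kind


class RerankBreaker:
    # Backoff windows in seconds, taken from the list above.
    BACKOFF_SECONDS = {
        "rate_limit": 60,    # HTTP 429
        "server_error": 30,  # HTTP 5xx
        "timeout": 15,       # request exceeded 10s
        "connection": 30,    # connection error
    }

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._open_until = 0.0

    def is_open(self) -> bool:
        """True while reranking should be skipped."""
        return self._clock() < self._open_until

    def rerank_or_fallback(self, rerank, candidates):
        """Return reranked candidates, or the input order on failure or backoff."""
        if self.is_open():
            return candidates
        try:
            return rerank(candidates)
        except RerankError as exc:
            self._open_until = self._clock() + self.BACKOFF_SECONDS[exc.kind]
            return candidates
```

While the breaker is open the caller gets back exactly what it passed in, which in this pipeline would be the RRF + temporal decay ordering.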

The Write Pipeline: Triple-Store with Single-Query Dedup

Every new memory gets written to three stores simultaneously:

Primary relational store (SQLite memories table) for metadata, filtering, and source of truth
vec0 KNN index (memories_vec) for fast semantic retrieval
FTS5 index (auto-synced via SQLite triggers on INSERT, UPDATE, DELETE)

Before writing, we check for near-duplicates using a single vec0 KNN query for the nearest neighbor. If cosine similarity exceeds 0.95, we skip the write and update the timestamp instead. This replaced an O(n) full-scan dedup that previously loaded every single vector into memory.

Dedup Before

Load ALL 551 vectors into memory. Iterate each one. Compute cosine similarity. O(n) per write.

Dedup After

Single vec0 KNN query. Top-1 nearest neighbor. Check distance threshold. One query per write, with nothing loaded into Python.
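Since the vec0 index reports L2 distance while the dedup threshold is stated as cosine similarity, the two need to be related somehow. One way, assuming the embeddings are normalized to unit length (the post does not say whether they are), is the identity d^2 = 2 * (1 - cos). The function names below are illustrative.

```python
import math


def l2_threshold_for_cosine(min_cosine: float) -> float:
    """L2 distance equivalent to a cosine similarity, for unit-length vectors.

    For unit vectors a and b: |a - b|^2 = 2 - 2 * cos(a, b).
    """
    return math.sqrt(2.0 * (1.0 - min_cosine))


def is_near_duplicate(nearest_l2_distance: float, min_cosine: float = 0.95) -> bool:
    """True when the nearest neighbor's cosine similarity exceeds min_cosine."""
    return nearest_l2_distance < l2_threshold_for_cosine(min_cosine)


# Cosine 0.95 corresponds to an L2 distance of sqrt(0.1), about 0.316.
threshold = l2_threshold_for_cosine(0.95)
```

If the stored vectors are not normalized, this conversion does not hold and the cosine would have to be computed from the returned neighbor's vector instead.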

FTS5 Auto-Sync: Zero-Maintenance Keyword Index

The FTS5 table is an external-content table (the setup we call content-sync mode here), which means it does not store its own copy of the text. It reads from the primary memories table on demand. Three triggers keep the inverted index synchronized automatically:

-- On INSERT: index new memory
CREATE TRIGGER memories_fts_ai AFTER INSERT ON memories BEGIN
    INSERT INTO memories_fts(rowid, content)
    VALUES (new.id, new.content);
END;

-- On DELETE: remove from index
CREATE TRIGGER memories_fts_ad AFTER DELETE ON memories BEGIN
    INSERT INTO memories_fts(memories_fts, rowid, content)
    VALUES('delete', old.id, old.content);
END;

-- On UPDATE: re-index
CREATE TRIGGER memories_fts_au AFTER UPDATE ON memories BEGIN
    INSERT INTO memories_fts(memories_fts, rowid, content)
    VALUES('delete', old.id, old.content);
    INSERT INTO memories_fts(rowid, content)
    VALUES (new.id, new.content);
END;

On first run, all existing memories get backfilled into the FTS index automatically. The process is idempotent, so you cannot accidentally double-index.
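One way to get an idempotent backfill is FTS5's built-in 'rebuild' command, which discards the current index and regenerates it from the content table. Whether the production code uses this command is not stated, so treat the sketch below as an assumption. The table and column names follow the triggers above; the function name and the exact schema are illustrative.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS memories (
    id INTEGER PRIMARY KEY,
    content TEXT NOT NULL
);
-- External-content FTS5 table: the index refers back to memories.content.
CREATE VIRTUAL TABLE IF NOT EXISTS memories_fts
    USING fts5(content, content='memories', content_rowid='id');
"""


def backfill_fts(conn: sqlite3.Connection) -> None:
    """Rebuild the FTS index from the memories table.

    Running this repeatedly leaves the index in the same state, because
    'rebuild' starts from an empty index each time.
    """
    conn.execute("INSERT INTO memories_fts(memories_fts) VALUES ('rebuild')")
    conn.commit()


def count_matches(conn: sqlite3.Connection, term: str) -> int:
    """Count indexed rows matching a single plain keyword."""
    row = conn.execute(
        "SELECT count(*) FROM memories_fts WHERE memories_fts MATCH ?", (term,)
    ).fetchone()
    return row[0]
```

A rebuild rescans the whole content table, so it suits first-run setup or repair; the triggers handle day-to-day changes.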

The Measured Results

551 memories indexed across all three stores (relational + vec0 + FTS5), fully synchronized

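Search quality on one real query ("a Fortune 500 client meeting") illustrates the change. The Score column is not comparable across rows: cosine similarity, the RRF sum, and the reranker's relevance score are on different scales.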

Pipeline	Top Result	Score	Useful?
Cosine only (before)	File path reference	0.682	No
Hybrid RRF (no reranking)	File path + some meetings	0.030	Mixed
Full pipeline (with Jina)	Actual meeting with people, time, action items	0.716	Yes

The reranker correctly surfaced "Tracey Cranz meeting scheduled Feb 19 at 4pm" over file path metadata. That is the difference between an AI that retrieves documents and an AI that actually remembers things.

Performance summary:

Search: one vec0 KNN query in native code (was a Python-side O(n) cosine scan)
Dedup: one nearest-neighbor query per write (was an O(n) all-vectors scan in Python)
Keyword recall: FTS5 catches exact matches embeddings miss entirely
Resilience: Circuit breaker means the system never blocks on API failures

The Tech Stack

Component	Technology	Why
Database	SQLite 3.x	Zero-config, single file, embedded
Vector index	sqlite-vec (vec0)	KNN via virtual table, L2 distance
Full-text	SQLite FTS5	Built-in BM25, content-sync triggers
Embeddings	qwen3-embedding-8b	1024 dims, 32K context, via OpenRouter
Reranker	Jina Reranker v2	Cross-encoder, multilingual, 0-1 scores
Language	Python 3.12	Standard library + requests

The entire system runs in a single SQLite file at roughly 15MB for 551 memories. No Postgres. No Redis. No Elasticsearch. No managed vector database service. Just SQLite with the right extensions.

Tuning Constants That Actually Matter

Every number in this system was chosen for a specific reason:

RRF k=60: Standard from the original Cormack et al. paper. We tested k=1 (too aggressive, top results dominate) and k=1000 (too flat, ranks become meaningless).
14-day half-life: Matches the typical business context window. Most actionable memories are within two weeks. Aligns with the OpenClaw framework's built-in temporal decay.
70/30 decay blend: The 70% floor ensures permanent facts are never buried. The 30% recency component breaks ties. We tried 50/50 and it was too aggressive on older memories.
0.95 dedup threshold: At cosine similarity 0.95 or above, content is near-identical. This catches rephrased duplicates while allowing genuinely different memories about the same topic.
40 candidates per retriever: With 551 total memories, 40 covers ~7% of the corpus per retriever, or up to ~14.5% (80 candidates) across both before overlap. Wide enough for high recall, narrow enough to keep fusion fast.

What We Would Do Differently

If we were building this from scratch:

Start with FTS5 from day one. The keyword index catches so many cases that pure semantic search misses. It is essentially free to add with SQLite and the content-sync triggers make maintenance automatic.
Build the circuit breaker before adding the reranker. External API dependencies need a fallback plan from the start. We built them in the right order, but we have seen systems that add an external dependency without any degradation path.
Use content-sync FTS5 from the start. We initially considered a standard FTS5 table that stores its own copy of the text. Content-sync mode is better: no data duplication, triggers handle everything, and the primary table remains the single source of truth.

The Takeaway

Hybrid retrieval is not optional anymore. Pure semantic search misses exact matches. Pure keyword search misses meaning. Combining them with RRF fusion, temporal awareness, and cross-encoder reranking produces results that are qualitatively different from any single retriever.

The full pipeline is under 400 lines of Python. Every stage is independently valuable and independently fault-tolerant. If the reranker goes down, you still get great results from RRF + decay. If embeddings fail, FTS5 carries the load. If both fail, the system returns empty gracefully instead of crashing.

The goldfish memory era for AI assistants is over.

Edin Campara is the founder of BASAWE LLC, where we help enterprises build data infrastructure, AI systems, and digital transformation initiatives. This system was built as part of the OpenClaw AI agent framework.
