When one search engine isn't enough: vector, BM25, and a coordinator

"RAG = vector DB." That sentence cost me two weeks. The query that broke Akopia wasn't semantic — it was searching for the literal error code E_FOLIO_EXPIRED. Cosine similarity has no idea what to do with that.

I want to write this one because the lesson took me longer than it should have, and I keep meeting teams who are about to make the same mistake. The shorthand "RAG equals vector database" has propagated so widely that the actual question — which kinds of queries does my corpus get, and what engines do those queries need — almost never gets asked. Akopia (public repo at gitea.rlabs.cl/rlabs-cl/akopia) runs three backends, and the reason is empirical rather than theoretical.

The Day Semantic Search Failed in Front of the Client

The scene is uncomfortable to recall. Internal demo, real audience. The query was "find me everything related to E_FOLIO_EXPIRED" — a specific error code from a service one of the engineers was debugging. The expectation was straightforward: surface every doc, every code comment, every chat log that mentioned that exact identifier.

Qdrant returned material vaguely similar to "folio" and "expired" as concepts: documentation about expiration policies in general, and notes about folio management that didn't contain the error string anywhere. Conceptually adjacent, semantically reasonable, completely useless for the actual task. The error code itself appeared in maybe three of the top fifty results, and only because some of those docs happened to discuss similar things.

The lesson, sitting in that room, was the kind you don't forget. The system did exactly what a vector database is supposed to do — return semantic neighbors. The problem was that semantic neighborhood is the wrong distance metric for an exact identifier. What I needed was a literal match, not a conceptual cousin. And that's when I understood, properly, that vector ≠ search. Vector is one kind of search. It's powerful within its zone. It collapses when asked to operate outside it.

Three Backends, Three Different Jobs

The response, once the framing shifted, was to stop treating retrieval as one engine and start treating it as a coordinated trio. Each backend exists for queries the others handle badly.

Qdrant (vector) wins when the query is conceptual. "Where do we handle retries with backoff?" — that question doesn't depend on any specific token being present. The answer might be in a doc that talks about "resilience policies" or "exponential delay" or "jitter" without ever using the word "retry" or "backoff". Vector embeddings collapse those surface differences and surface the conceptually right material. This is the zone where vector earns its reputation.

Meilisearch (lexical keyword search with typo tolerance; I call it the BM25 slot as shorthand, though Meilisearch ranks with its own ranking-rule pipeline rather than textbook BM25) wins when the query is an identifier: a function name, a class name, a file path, an error string, a config key. E_FOLIO_EXPIRED. BillingService.process_refund. kubectl apply. These tokens have no useful conceptual neighborhood. They mean exactly themselves and nothing else, and what you want back is every place that exact string appears. A lexical engine handles this trivially. Vector handles it worse than a grep.

Redis isn't a search backend, and I want to be precise about that because the naming gets confusing. Redis is the coordinator. It runs the ingest job stream (the embeddings worker pulls from it), the hot-query cache (so the same query from the same session doesn't recompute), the deduplication layer, and the pub/sub that invalidates cache keys when a namespace gets reindexed. Without Redis, I'd be reimplementing each of those coordination mechanisms in Python, badly. With Redis, all of them are someone else's well-tested problem.

How the Coordinator Decides Who to Ask

The routing logic is deliberately simple, because complex routing logic is the kind of thing that drifts silently. Here's the gist:

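import re
from typing import List

# Assumed heuristic: underscores, internal capitals, or dotted names between word characters
IDENTIFIER_SHAPE = re.compile(r"\w_\w|[a-z][A-Z]|\w\.\w")

def route(query: str) -> List[str]:
    """Return backend names in priority order; results are fused afterwards."""
    if IDENTIFIER_SHAPE.search(query):
        # identifier-shaped, e.g. error codes like E_FOLIO_EXPIRED: lexical first
        return ["meilisearch", "qdrant"]
    if len(query.split()) > 4:
        # natural prose with no identifier-like tokens: vector first
        return ["qdrant", "meilisearch"]
    return ["meilisearch", "qdrant"]  # default: no priority, call both and fuse
# fuse with reciprocal rank fusion: absolute scores are incomparable across backends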

Three heuristics. If the query has underscores, internal capitals, or otherwise looks identifier-shaped, BM25 goes first. If it's natural prose longer than a few words with no identifier-like tokens, vector goes first. Anything else, both get called and the results merge. The order matters because in tight latency budgets you can short-circuit on a high-confidence first result, but in the normal case both engines run in parallel.
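The "both engines run in parallel" case can be sketched with asyncio. This is a minimal sketch under stated assumptions, not Akopia's actual code: `search_meilisearch` and `search_qdrant` are hypothetical stand-ins that return canned document IDs where real client calls would go.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# Hypothetical stand-ins for real backend clients; each returns ranked doc IDs.
async def search_meilisearch(query: str) -> List[str]:
    await asyncio.sleep(0.01)  # simulate network latency
    return ["errors.md", "billing.py"]

async def search_qdrant(query: str) -> List[str]:
    await asyncio.sleep(0.02)  # simulate a slower backend
    return ["expiration-policy.md", "errors.md"]

BACKENDS: Dict[str, Callable[[str], Awaitable[List[str]]]] = {
    "meilisearch": search_meilisearch,
    "qdrant": search_qdrant,
}

async def fan_out(query: str, order: List[str]) -> Dict[str, List[str]]:
    """Query every backend in `order` concurrently; key the results by backend name."""
    results = await asyncio.gather(*(BACKENDS[name](query) for name in order))
    return dict(zip(order, results))

ranked = asyncio.run(fan_out("E_FOLIO_EXPIRED", ["meilisearch", "qdrant"]))
```

Total latency here is bounded by the slower backend rather than the sum of both, which is why the priority order only matters when you choose to short-circuit.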

The fusion step is where most hybrid setups quietly go wrong. The temptation is to fuse on raw scores — "this result scored 0.83 from Qdrant, that one scored 0.71 from Meilisearch, so Qdrant's wins". That logic is broken. The scores from two different engines are not on the same scale and not comparable. Reciprocal rank fusion sidesteps the problem entirely: it fuses on rank, not score, so the only thing that matters is "how high did each backend put this result in its own list". The math is two lines. The robustness is enormous.
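For reference, reciprocal rank fusion can be sketched as follows. The constant k = 60 is the value commonly used with this technique; whether Akopia uses that value is an assumption, and the document IDs are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of doc IDs; each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties break on doc ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical ranked lists from the two backends (no raw scores involved).
meili = ["errors.md", "billing.py", "changelog.md"]
qdrant = ["expiration-policy.md", "errors.md", "folio-notes.md"]
fused = reciprocal_rank_fusion([meili, qdrant])
```

In this example errors.md lands first because both backends ranked it near the top, even though neither backend's raw score is ever consulted.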

Redis caches the fused result by query hash, with a deliberately short TTL because the corpus reindexes frequently and a stale cache is worse than a fresh recompute. The cache exists for the burst pattern — the same agent asking the same question three times in a row during a debug loop — not for long-term storage.

What I Learned From Running All Three in Production

Numbers from a month of real agent traffic. Vector only: 18% no-result-or-irrelevant rate. BM25 only: 22%, with a different failure distribution — it missed the conceptual queries the way vector missed the identifier queries. Coordinated hybrid: 7%, and most of the remaining 7% are queries that are genuinely ambiguous, the kind where a human would also need to ask a follow-up.

The operational cost of running all three is roughly 30% more infrastructure than running pure vector, and that's the honest number: it isn't free. What that cost buys, beyond the quality numbers, is reindex flexibility. Meilisearch reindexes in under a minute. Qdrant reindexes in around six minutes. When you need to push a corpus update, you can update one engine at a time, watch the quality shift, and roll that engine back if something breaks while the other keeps serving. With a single backend, reindex is an all-or-nothing event.

The quality delta — from 18% failure down to 7% — is the kind of jump that changes what the system feels like to use. An agent that misses one in five queries trains its caller to distrust it. An agent that misses one in fourteen, and where most of those misses are genuinely hard queries, gets used reflexively. The trust curve is non-linear in a way that justifies the 30% infra cost easily.

There's a second-order effect I didn't anticipate. Once the hybrid is in place, you can start measuring which backend contributed to each successful answer, and the distribution tells you something about your corpus. In Akopia's case, identifier-heavy queries (the ones BM25 dominates) are roughly 40% of agent traffic. Pure prose queries (vector territory) are about 35%. The remaining 25% is genuinely mixed, where the fusion is doing real work. That distribution would have been invisible with a single backend. It tells you, concretely, that a vector-only system would have been the wrong tool for roughly 40% of the actual workload, and only a partial fit for another 25%.
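One way to collect that kind of attribution is sketched below. It assumes you log, for each successful answer, the document that answered it plus each backend's ranked list; the log format, the `top_k` cutoff, and the function name are all hypothetical.

```python
from collections import Counter
from typing import Dict, List

def contributing_backends(answer_id: str, ranked: Dict[str, List[str]], top_k: int = 5) -> str:
    """Label a successful answer by which backends ranked it inside their top_k."""
    hits = sorted(name for name, docs in ranked.items() if answer_id in docs[:top_k])
    return "+".join(hits) if hits else "none"

# Hypothetical log entries: (doc that answered the query, per-backend rankings).
log = [
    ("errors.md", {"meilisearch": ["errors.md"], "qdrant": ["policy.md"]}),
    ("retries.md", {"meilisearch": ["cron.md"], "qdrant": ["retries.md"]}),
    ("billing.md", {"meilisearch": ["billing.md"], "qdrant": ["billing.md"]}),
]
tally = Counter(contributing_backends(doc, ranked) for doc, ranked in log)
```

Dividing each count by the total gives a traffic split of the same shape as the percentages above.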

Why Redis Isn't Optional Even Though It Looks Trivial

I keep getting the question "why not just use a queue library and an in-memory cache". The honest answer is that you could, and after a quarter of running it you'd have rebuilt Redis. Worse.

Streams for embedding jobs: the worker pulls, doesn't push. That inversion matters because the embedder has a clamped batch size, and a push-based pipeline would have to enforce backpressure somewhere. With Redis streams, the worker pulls when it's ready, and the producer simply enqueues. Failure on the worker side is non-fatal: an unacknowledged message stays pending in the stream, can be claimed again and retried, and doesn't block the API. The whole backpressure problem dissolves into Redis's built-in semantics.
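The pull side can be sketched roughly like this. It assumes a client that behaves like redis-py's `redis.Redis` (only `xreadgroup` and `xack` are used) and a consumer group that was already created with XGROUP CREATE; the stream name, group name, and batch size are hypothetical.

```python
from typing import Any, Callable, Dict, List

STREAM = "embed:jobs"   # hypothetical stream name
GROUP = "embedders"     # hypothetical consumer group
BATCH_SIZE = 16         # stands in for the embedder's clamped batch size

def pull_batch(client: Any, consumer: str, embed: Callable[[List[Dict]], None]) -> int:
    """Pull up to BATCH_SIZE new jobs, embed them, and acknowledge on success.

    If `embed` raises, nothing is acknowledged, so the entries stay pending
    in the consumer group and can be claimed again later.
    """
    reply = client.xreadgroup(GROUP, consumer, {STREAM: ">"}, count=BATCH_SIZE, block=1000)
    if not reply:
        return 0  # nothing arrived within the block timeout
    _stream, entries = reply[0]
    ids = [entry_id for entry_id, _fields in entries]
    embed([fields for _entry_id, fields in entries])
    client.xack(STREAM, GROUP, *ids)
    return len(ids)
```

The worker only asks for what it can embed in one go, so the producer never needs to know how fast the embedder is.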

Fusion cache: the same query from the same agent in the same session is more common than people expect. Debug loops, conversational refinements, the agent re-asking after a partial answer. Caching the fused result for a short window cuts compute and reduces latency for the common case. The TTL is short by design — the corpus reindexes, the answers shift, stale results would be a quality regression worse than the saved CPU.
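A minimal sketch of that cache, assuming a client with redis-py style `get` and `setex` methods. The key layout (`fusion:<namespace>:<hash>`) and the 60-second TTL are assumptions; the text above only says the TTL is short.

```python
import hashlib
import json
from typing import Any, Callable, List

CACHE_TTL_SECONDS = 60  # assumed value; the article only says "short"

def cache_key(namespace: str, query: str) -> str:
    """Key the fused result by namespace plus a hash of the query (case preserved)."""
    digest = hashlib.sha256(query.strip().encode("utf-8")).hexdigest()
    return f"fusion:{namespace}:{digest}"

def cached_search(client: Any, namespace: str, query: str,
                  search: Callable[[str], List[str]]) -> List[str]:
    """Return the cached fused result if present; otherwise compute and store it."""
    key = cache_key(namespace, query)
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)
    result = search(query)
    # setex stores the value and its expiry together, so entries age out on their own.
    client.setex(key, CACHE_TTL_SECONDS, json.dumps(result))
    return result
```

The query is hashed without lowercasing because identifier queries such as E_FOLIO_EXPIRED are case-sensitive.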

Pub/sub for invalidation: when a repo gets reindexed, the cache keys for that namespace get invalidated atomically. Without pub/sub, you'd have either stale cache (bad) or no cache (worse). With pub/sub, the invalidation is fan-out and almost free.
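The invalidation path might look something like the sketch below. It assumes cached results live under keys shaped like `fusion:<namespace>:<hash>` and that the client exposes redis-py style `publish`, `scan_iter`, and `delete`; the channel name and both function names are hypothetical.

```python
from typing import Any, Dict

INVALIDATION_CHANNEL = "reindex:done"  # hypothetical channel name

def announce_reindex(client: Any, namespace: str) -> None:
    """Called by the indexer once a namespace has finished reindexing."""
    client.publish(INVALIDATION_CHANNEL, namespace)

def handle_invalidation(client: Any, message: Dict[str, Any]) -> int:
    """Drop every cached fusion result for the namespace named in `message`.

    `message` follows the dict shape that redis-py's PubSub yields:
    {"type": "message", "channel": ..., "data": ...}.
    """
    if message.get("type") != "message":
        return 0  # skip subscribe/unsubscribe confirmations
    data = message["data"]
    namespace = data.decode("utf-8") if isinstance(data, bytes) else data
    keys = list(client.scan_iter(match=f"fusion:{namespace}:*"))
    if keys:
        client.delete(*keys)
    return len(keys)
```

Each subscriber would call `handle_invalidation` for every message it receives on the channel, so one publish fans out to every process holding cached results.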

The three jobs — coordination, cache, invalidation — are different in shape but the same in pattern: they all need an out-of-process, durable, fast-to-talk-to coordination point. Redis is the right size for all three, and using one piece of infrastructure to do three jobs is dramatically less complex than running three.

Transferable Principle

The shorthand "RAG equals vector DB" sells well and breaks fast. It sells well because it fits on a slide and matches the pitch of every vector-DB vendor. It breaks fast because real corpora get real queries, and real queries are heterogeneous.

The right question to ask, before picking a backend, isn't "which search engine do I use". It's two questions, in order: what kinds of queries will my system get, and how many engines do I need so that none of them is being asked to operate outside its zone? Answer those honestly and the architecture follows. Skip them and the architecture follows the loudest vendor.

If your RAG today is vector-only, the test takes thirty seconds. Pick an exact-identifier query from your corpus — an error code, a function name, an SKU, whatever your domain has. Run it. If the system gives you back conceptually-adjacent garbage instead of the literal hit, you already know what's missing. You don't need anyone's permission to add a lexical backend, and the marginal infrastructure cost is a fraction of what you'll get back in query quality.
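That thirty-second test can be scripted. The sketch below assumes you can get the text of your system's top results as a list of strings; the sample passages are invented for illustration, not real Akopia output.

```python
from typing import List

def literal_hit_rate(results: List[str], identifier: str) -> float:
    """Fraction of returned passages that contain the identifier verbatim."""
    if not results:
        return 0.0
    hits = sum(1 for text in results if identifier in text)
    return hits / len(results)

# Hypothetical top results from a vector-only system for "E_FOLIO_EXPIRED".
top_results = [
    "Expiration policies for customer folios",
    "raise FolioError('E_FOLIO_EXPIRED')  # folio past its validity window",
    "How folio management works",
    "Session tokens expire after 30 minutes",
]
rate = literal_hit_rate(top_results, "E_FOLIO_EXPIRED")
```

A low rate on a query where you know the identifier exists in the corpus is the signal that a lexical backend is missing.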

What was the query that broke your RAG and forced you to rethink the architecture? Mine was an error code. Tell me yours — send me a DM or reach out via the contact channels at rlabs.cl.

#AI #Engineering #Architecture #OpenSource #PythonDev