Agentic RAG in Practice: From Single-Pass Retrieval to a Self-Correcting Pipeline

Basic RAG breaks in predictable ways. Here’s how we replaced a pile of conditional patches with an agentic graph pipeline, and what it actually cost in latency, complexity, and accuracy.

Every problem in basic RAG can be patched: add a condition for this case, add a check for that one. The problem is that patches accumulate until you end up with a pile of conditional logic that nobody wants to touch.

This post describes the transition from basic RAG to agentic RAG in a real chatbot: why we made the switch, how the architecture was implemented, and the issues we ran into along the way.

Basic RAG and where it breaks down

Basic RAG is straightforward:

Receive the user’s question
Embed the question
KNN search in vector DB, retrieve top-k chunks
Stuff chunks into the system prompt
Stream the response
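
The first four steps can be sketched in a few lines. This is a minimal illustration, not the production code: the in-memory `index`, the `embed` callable, and `top_k` are hypothetical stand-ins for a real embedding API and vector DB, and streaming (step 5) is left out.

```python
import math
from typing import Callable

Vector = list[float]

def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; lower means closer."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def basic_rag_prompt(
    question: str,
    embed: Callable[[str], Vector],      # hypothetical embedding function
    index: list[tuple[Vector, str]],     # hypothetical in-memory "vector DB"
    top_k: int = 3,
) -> str:
    q_vec = embed(question)                                   # step 2: embed
    ranked = sorted(index, key=lambda item: cosine_distance(q_vec, item[0]))
    chunks = [text for _, text in ranked[:top_k]]             # step 3: top-k
    context = "\n---\n".join(chunks)
    return f"Answer using only this context:\n{context}"      # step 4: stuff
```

Note that nothing in this flow asks whether the search was needed or whether the chunks are any good, which is where the problems below come from.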

This works well in most cases. But when deployed on a chatbot handling a wide variety of queries, some problems show up:

Running RAG on greetings. “Hello!”, “Thank you!” still get embedded and searched against the vector DB. Unnecessary API calls.

Multi-topic questions. “Compare ERP pricing and implementation timeline” needs two types of information from two different parts of the KB. A single embedding over a broad multi-topic query often leads to diluted retrieval results.

No recovery from bad retrieval. KNN returns irrelevant chunks → the LLM uses them anyway. No mechanism to say “these results are no good, try again.”

Ticket ID mismatch. User asks about ticket #1234 → vector search returns #1233, #1235 (same project, close distance, but wrong tickets). Vector similarity alone cannot reliably detect this kind of mismatch.
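
A ticket ID check of the kind this post mentions as a patch might look like the sketch below. It assumes each chunk carries a `ticket_id` metadata field stored as a string and that users write IDs as `#1234`; both are illustrative assumptions, not details from the original system.

```python
import re

TICKET_RE = re.compile(r"#(\d+)")  # assumed ID format: "#1234"

def filter_by_ticket_id(question: str, chunks: list[dict]) -> list[dict]:
    """Keep only chunks whose ticket ID is explicitly mentioned in the question."""
    wanted = set(TICKET_RE.findall(question))
    if not wanted:
        return chunks  # no ticket mentioned: leave results untouched
    return [c for c in chunks if c.get("ticket_id") in wanted]
```

The point is that the ID is matched exactly, as a string, instead of trusting vector distance to separate #1233 from #1234.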

We patched basic RAG one problem at a time: add live fallback, add ticket ID checking, add query rewriting. Eventually the code became fragile. That’s when we switched to an agentic approach.

Core idea

“Agentic” here does not mean a fully autonomous multi-tool agent. It’s simpler than that: instead of hardcoding “search on every question,” let the LLM decide:

Do we need to search? (Don’t search for greetings)
What to search for? (Decompose multi-topic questions into multiple queries)
Are the results good enough? (Grade before using)
If not, try differently. (Rewrite and retry)

The pipeline shifts from a linear flow to a graph-based loop:

[Decide: search or respond directly?]
         │
    [Search (parallel, multiple queries)]
         │
    [Grade: are chunks relevant?]
         │
    ┌────┴────────────────────────────────────┐
    │ Yes                                     │ No, attempts remaining
    ▼                                         ▼
[Stream answer]                    [Rewrite query] → [Search again]

Graph pipeline: no heavy framework needed

The graph is implemented as a small custom runner: each processing step is an independent node that receives shared state and returns the name of the next node, or signals completion. There’s a safety cap to prevent infinite loops.
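
A runner of this shape can be very small. The sketch below is one possible reading of the description above; the names `run_graph`, `max_steps`, and the `aborted` flag are hypothetical.

```python
from typing import Callable, Optional

State = dict
# A node mutates the shared state and returns the next node's name, or None when done.
Node = Callable[[State], Optional[str]]

def run_graph(nodes: dict[str, Node], start: str, state: State, max_steps: int = 10) -> State:
    current: Optional[str] = start
    steps = 0
    while current is not None:
        if steps >= max_steps:
            state["aborted"] = True  # safety cap: stop instead of looping forever
            break
        current = nodes[current](state)
        steps += 1
    return state

# Usage with two toy nodes.
def decide(state: State) -> Optional[str]:
    state["trace"].append("decide")
    return "search"

def search(state: State) -> Optional[str]:
    state["trace"].append("search")
    return None  # signal completion

result = run_graph({"decide": decide, "search": search}, "decide", {"trace": []})
```

Because nodes only share a plain dictionary, any node can read signals written by an earlier one without going through a framework abstraction.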

The reason for not using LangGraph or similar frameworks isn’t that the pipeline is too small; it’s that the pipeline needs deep customization around its own internal logic. Per-query grading needs to know which chunk came from which query. Live fallback only triggers on specific signals per source. Context signals from the grading node need to be read correctly by the answer streamer. These couplings are very specific, and expressing these interactions through a framework’s abstraction layer would be more complex than implementing the runner directly.

Node 1: LLM decides whether to search or respond directly

This node uses the LLM’s tool use capability. The LLM is given a search tool and instructed:

Call search when the question requires factual information from the knowledge base
Respond directly for greetings, casual conversation, or questions that general knowledge can handle

Parallel tool calls: The LLM can call the tool multiple times in a single response. “Compare ERP pricing and implementation timeline” → two parallel search queries, each targeting one topic.
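
Routing on the model’s tool calls could be sketched as follows. The shape of `tool_calls` (a list of dicts with `name` and `arguments`) is a hypothetical, provider-neutral simplification; real LLM APIs each have their own response format.

```python
def route_decision(tool_calls: list[dict], state: dict) -> str:
    """Return the next node: 'search' if the model asked to search, else 'respond'."""
    queries = [
        call["arguments"]["query"].strip()
        for call in tool_calls
        if call.get("name") == "search"
    ]
    # Drop empty strings and exact duplicates while keeping the original order.
    state["queries"] = [q for q in dict.fromkeys(queries) if q]
    return "search" if state["queries"] else "respond"
```

With the comparison example above, the model would emit two `search` calls and the state would end up holding two standalone queries.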

Problem encountered: the LLM sometimes generates overly generic queries, or gets influenced by conversation history (using “this project” instead of the actual name). Solutions:

Inject the page URL into the system prompt → the LLM resolves vague context from the URL
Expand abbreviations before sending: abbreviated terms are expanded to their full forms, to avoid noisy embeddings
Emphasize that queries must be standalone: include specific names and topics from the conversation, no vague pronouns

This preparation step directly affects retrieval quality in the next node.
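
Abbreviation expansion, for instance, can be a plain dictionary lookup applied before embedding. The table below is purely illustrative; the real abbreviation list is domain-specific and not given in this post.

```python
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {
    "KB": "knowledge base",
    "ERP": "enterprise resource planning",
}

def expand_abbreviations(query: str, table: dict[str, str] = ABBREVIATIONS) -> str:
    """Replace whole-word, case-sensitive abbreviations with their full forms."""
    if not table:
        return query
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in table) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], query)
```

Matching on word boundaries keeps the expansion from touching longer words that merely contain an abbreviation.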

Node 2: Parallel retrieval and deduplication

For N queries from Node 1, embedding and vector search run in parallel; one query failing doesn’t cancel the whole batch.
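
One way to get this behavior in Python is `asyncio.gather` with `return_exceptions=True`, so a failure comes back as a value instead of cancelling the batch. The `search` coroutine here is a hypothetical stand-in for "embed, then query the vector DB".

```python
import asyncio
from typing import Awaitable, Callable

async def retrieve_all(
    queries: list[str],
    search: Callable[[str], Awaitable[list]],  # hypothetical: embed + vector search
) -> tuple[dict[str, list], list[str]]:
    """Run all searches concurrently; collect failures instead of raising."""
    results = await asyncio.gather(*(search(q) for q in queries), return_exceptions=True)
    per_query: dict[str, list] = {}
    failed: list[str] = []
    for query, result in zip(queries, results):
        if isinstance(result, BaseException):
            failed.append(query)       # this query failed; the others are unaffected
        else:
            per_query[query] = result
    return per_query, failed
```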

Results are merged: deduped by content (keeping the lowest-distance copy when there are duplicates), sorted by distance, and capped at a total limit. The chunk budget doesn’t grow linearly with the number of queries.
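
The merge step can be sketched as below. The `Chunk` type and the default `limit` are assumptions made for the example; the `query` field is how each chunk remembers which query retrieved it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    content: str
    distance: float  # lower is closer
    query: str       # which query retrieved this chunk

def merge_results(per_query: dict[str, list[Chunk]], limit: int = 8) -> list[Chunk]:
    """Dedupe by content (keep the lowest-distance copy), sort, and cap."""
    best: dict[str, Chunk] = {}
    for chunks in per_query.values():
        for chunk in chunks:
            kept = best.get(chunk.content)
            if kept is None or chunk.distance < kept.distance:
                best[chunk.content] = chunk
    # A fixed cap means more queries compete for the same budget.
    return sorted(best.values(), key=lambda c: c.distance)[:limit]
```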

Fast-path: If there’s only 1 query and all chunks have very high confidence (similarity scores comfortably above the confidence threshold), grading is skipped and the pipeline goes straight to generating the answer. Saves ~150ms for clear-cut questions.
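
The fast-path condition itself is a one-liner. The `threshold` and `margin` values below are placeholders, since the post does not state the real ones.

```python
def can_skip_grading(
    queries: list[str],
    similarities: list[float],
    threshold: float = 0.8,  # placeholder confidence threshold
    margin: float = 0.05,    # placeholder for "comfortably above"
) -> bool:
    """True only for a single query whose chunks all score well above the threshold."""
    if len(queries) != 1 or not similarities:
        return False
    return all(score >= threshold + margin for score in similarities)
```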

After merging, each chunk retains information about which query retrieved it, which is necessary for Node 3 to grade each chunk against the right query and know exactly which queries need supplementing.

Node 3: Per-query grading and live fallback

The most complex node. It doesn’t just grade; it’s also the decision point for when to call live data sources.

Per-query grading instead of global grading: Grade each query’s chunks against that specific query. If query A has relevant chunks but query B doesn’t → only trigger supplementary search for query B, don’t abort everything.

This is the most important design decision in Node 3. Global grading can’t tell you which query needs supplementing; it just says “not enough” without pointing to the right place.
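
A minimal sketch of per-query grading, assuming chunks arrive as `(query, content)` pairs and that `is_relevant` wraps the grader LLM call (both hypothetical):

```python
from collections import defaultdict
from typing import Callable

def grade_per_query(
    queries: list[str],
    chunks: list[tuple[str, str]],            # (query that retrieved it, content)
    is_relevant: Callable[[str, str], bool],  # hypothetical wrapper around the grader LLM
) -> tuple[dict[str, list[str]], list[str]]:
    """Grade each chunk against its own query; report which queries came up short."""
    by_query: dict[str, list[str]] = defaultdict(list)
    for query, content in chunks:
        by_query[query].append(content)

    relevant: dict[str, list[str]] = {}
    short: list[str] = []
    for query in queries:
        kept = [c for c in by_query.get(query, []) if is_relevant(query, c)]
        relevant[query] = kept
        if not kept:
            short.append(query)  # only this query needs supplementing
    return relevant, short
```

The `short` list is the piece a global grade cannot provide: it names the specific queries that need a supplementary search.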

Signal-gated live fallback: Not every failed query triggers external search. Each live data source is only activated when the query contains the appropriate domain signal: ticket search only runs when the question mentions a specific ticket, module search only runs when the question is about Odoo modules. This significantly reduces unnecessary external calls.
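
Signal gating can be as simple as one pattern per live source. The source names and regular expressions below are illustrative guesses at what such signals might look like, not the ones used in the real system.

```python
import re

# Hypothetical signals: one pattern per live data source.
SIGNALS = {
    "ticket_search": re.compile(r"#\d+|\bticket\s+\d+", re.IGNORECASE),
    "module_search": re.compile(
        r"\bodoo\b.*\bmodules?\b|\bmodules?\b.*\bodoo\b", re.IGNORECASE
    ),
}

def live_sources_for(query: str) -> list[str]:
    """Return the live sources whose domain signal appears in the query."""
    return [name for name, pattern in SIGNALS.items() if pattern.search(query)]
```

A failed query with no matching signal returns an empty list, so no external call is made for it.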

Fail-open grading: If the grader LLM call errors, treat the document as relevant. Prefer not missing useful information over filtering too aggressively.
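
Fail-open behavior amounts to a small wrapper around the grader call. Here `grader` is a hypothetical callable standing in for the LLM request.

```python
from typing import Callable

def grade_fail_open(grader: Callable[[str, str], bool], query: str, chunk: str) -> bool:
    """Return the grader's verdict, or True (relevant) if the grader call errors."""
    try:
        return bool(grader(query, chunk))
    except Exception:
        # Fail open: a broken grader should not hide potentially useful chunks.
        return True
```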

After Node 3 finishes, the pipeline knows exactly: is there enough context, which queries came up short, and whether a rewrite is needed.

Node 4: Rewrite from a different angle

When all attempts fail (local grading comes up empty and live search returns nothing), the query is rewritten for another try.

The common problem with naïve rewriting: it produces a synonym of the previous query, which doesn’t help. The rewriter is instructed to reformulate the query from a substantially different angle, without repeating words or concepts already tried, and to start from the original user question rather than the failed query.

Tried queries are tracked across attempts to avoid repetition. After the maximum number of attempts, the pipeline proceeds with a “no context” signal so the LLM knows to acknowledge the gap instead of guessing.
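
Tracking tried queries and capping attempts might look like this sketch. The `rewrite` callable stands in for the rewriter LLM, and `max_rewrites` is a placeholder value. Note that this only catches exact repeats after normalization; near-duplicates still get through.

```python
from typing import Callable, Optional

def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(query.lower().split())

def next_rewrite(
    question: str,
    tried: set[str],                          # normalized queries already attempted
    rewrites_used: int,
    rewrite: Callable[[str, list[str]], str], # hypothetical rewriter LLM call
    max_rewrites: int = 2,                    # placeholder cap
) -> Optional[str]:
    """Return a fresh query, or None to signal 'proceed with no context'."""
    if rewrites_used >= max_rewrites:
        return None
    # Rewrite from the original question, telling the rewriter what was tried.
    candidate = rewrite(question, sorted(tried))
    if normalize(candidate) in tried:
        return None  # exact repeat: treated as exhausted (a simplification)
    tried.add(normalize(candidate))
    return candidate
```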

In practice: the rewriter still occasionally produces queries that are too similar. There’s no way to hard-enforce “completely different”; it relies on the LLM following the instruction.

Answer streamer: consolidate and generate

No matter how many attempts the pipeline goes through, the final step is always the same: consolidate everything retrieved and generate the answer.

After the graph completes, the system prompt is built based on the full pipeline result:

If relevant chunks exist: inject them into context
If no context: the LLM is instructed not to speculate and to acknowledge the gap
If only partial context: the LLM acknowledges what’s missing
If Node 1 decided no search was needed: respond directly without RAG context

Regardless of state, the LLM is always reminded: if the context is about a completely different topic than the question, don’t force an answer from it. LLMs have a strong tendency to try to answer even when the context is irrelevant; this acts as the final safeguard against that behavior.
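
The four cases plus the always-on reminder could be assembled as in the sketch below. The state names and all prompt wording are invented for illustration; the actual prompts are not shown in this post.

```python
from enum import Enum
from typing import Sequence

class ContextState(Enum):
    FULL = "full"            # relevant chunks for every query
    PARTIAL = "partial"      # some queries came up short
    NONE = "none"            # nothing relevant found
    NO_SEARCH = "no_search"  # Node 1 decided to respond directly

# Hypothetical wording of the always-present safeguard.
REMINDER = (
    "If the context is about a completely different topic than the question, "
    "do not force an answer from it."
)

def build_system_prompt(
    state: ContextState, chunks: Sequence[str], missing: Sequence[str] = ()
) -> str:
    parts = ["You are a helpful assistant."]
    if state in (ContextState.FULL, ContextState.PARTIAL):
        parts.append("Context:\n" + "\n---\n".join(chunks))
    if state is ContextState.PARTIAL:
        parts.append("Nothing relevant was found for: " + "; ".join(missing)
                     + ". Tell the user what is missing.")
    if state is ContextState.NONE:
        parts.append("No relevant context was found. Do not speculate; acknowledge the gap.")
    parts.append(REMINDER)  # appended regardless of state
    return "\n\n".join(parts)
```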

Practical issues that aren’t obvious from design

The architecture above works exactly as designed. But some things only show up when running against real data.

Chunking quality determines retrieval quality. Before embedding, content needs to be cleaned: strip HTML, remove base64 blobs, resolve template shortcodes, filter out logo/icon images (short alt text with no caption → skip). If chunks contain HTML artifacts or base64 strings, embeddings become noisy and the grader rejects most of them → unnecessary LLM calls.
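
A rough cleaning pass for the first two items (HTML and base64) is sketched below using only the standard library. The regular expressions are deliberately crude assumptions; shortcode resolution and image filtering depend on the CMS and are not covered here.

```python
import html
import re

# Inline data URIs such as data:image/png;base64,AAAA...
BASE64_RE = re.compile(r"data:[\w/+.-]+;base64,[A-Za-z0-9+/=]+")
# Crude tag matcher; a real HTML parser is more robust for messy markup.
TAG_RE = re.compile(r"<[^>]+>")

def clean_for_embedding(raw: str) -> str:
    """Strip base64 blobs and HTML tags, decode entities, collapse whitespace."""
    text = BASE64_RE.sub(" ", raw)
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    return " ".join(text.split())
```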

External connection lifecycle. Connections to external data sources are kept persistent (not recreated per request) because startup can cost a few seconds. But connections can crash or disconnect. Solution: lazy init, auto-invalidate and reconnect on exception, limit concurrent calls to avoid overload.
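
The three parts of that solution fit in one small wrapper. This is a synchronous, thread-based sketch under assumed names (`LazyConnection`, `connect`, `max_concurrent`); an async service would use the `asyncio` equivalents.

```python
import threading
from typing import Any, Callable

class LazyConnection:
    """Lazy init, invalidate-and-reconnect on error, bounded concurrency."""

    def __init__(self, connect: Callable[[], Any], max_concurrent: int = 4):
        self._connect = connect          # hypothetical factory for the external client
        self._conn: Any = None
        self._lock = threading.Lock()
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def _get(self) -> Any:
        with self._lock:
            if self._conn is None:       # lazy init: connect on first use
                self._conn = self._connect()
            return self._conn

    def _invalidate(self, conn: Any) -> None:
        with self._lock:
            if self._conn is conn:       # do not drop a newer connection
                self._conn = None

    def call(self, fn: Callable[[Any], Any], retries: int = 1) -> Any:
        with self._slots:                # limit concurrent calls
            for attempt in range(retries + 1):
                conn = self._get()
                try:
                    return fn(conn)
                except Exception:
                    self._invalidate(conn)  # next attempt reconnects
                    if attempt == retries:
                        raise
```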

Latency trade-off. The agentic pipeline adds significant latency compared to basic RAG (the figures below are rough design estimates, not production benchmarks):

Node	Estimated latency
Node 1 (decision)	~200ms
Node 2 (retrieve)	~80ms
Node 3 (grade, parallel)	~150ms
Node 4 (rewrite, if triggered)	~180ms

Worst case with one rewrite loop: ~840ms before the first token. That is one full pass through Nodes 1 to 4 (~610ms) plus a second retrieve and grade (~230ms). This is time traded for accuracy: the pipeline checks and corrects before generating, instead of immediately answering from bad chunks. Fast-path and a typing indicator help reduce perceived latency for straightforward questions.
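
The arithmetic behind these figures, using the rough estimates from the table (live fallback calls and the answer model’s own time to first token are not included):

```python
# Rough per-node estimates from the table above, in milliseconds.
LATENCY_MS = {"decide": 200, "retrieve": 80, "grade": 150, "rewrite": 180}

def time_to_first_token(rewrites: int = 0, fast_path: bool = False) -> int:
    """Estimated pipeline overhead before the answer starts streaming."""
    if fast_path:
        # Fast-path skips grading entirely.
        return LATENCY_MS["decide"] + LATENCY_MS["retrieve"]
    first_pass = LATENCY_MS["decide"] + LATENCY_MS["retrieve"] + LATENCY_MS["grade"]
    # Each rewrite loop adds a rewrite, a new retrieval, and a new grading pass.
    per_rewrite = LATENCY_MS["rewrite"] + LATENCY_MS["retrieve"] + LATENCY_MS["grade"]
    return first_pass + rewrites * per_rewrite

print(time_to_first_token())                # 430
print(time_to_first_token(rewrites=1))      # 840
print(time_to_first_token(fast_path=True))  # 280
```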

What agentic RAG doesn’t solve

The issues above can all be addressed by tuning the implementation. But there are more fundamental limitations: these are architectural tradeoffs, not bugs to fix.

Stale data in the vector DB. The grader can still accept outdated chunks if the topic matches; the information is just stale. Live fallback mitigates this but only triggers when the query carries the right signals.

Retrieval recall is bounded by the embedding model. If the KB uses different terminology than the user, cosine distance will be high even when the topic matches. Without hybrid search (BM25 + vector), keyword-based recall still has gaps.

The grader can hallucinate relevance. With long chunks and vague questions, the grader sometimes accepts chunks that only overlap on superficial keywords. Truncating content before sending it to the grader helps focus on the topic, but doesn’t fully solve it.

Cost increases significantly. Agentic RAG uses more LLM calls per turn: 1 decision call + N grading calls + 1 rewriter call if it fails. With a cheap model it’s acceptable. With a more expensive model, the math changes.

These limitations aren’t just rough edges to sand down; they’re structural properties of how knowledge bases are built today. Stale data, vocabulary mismatch, and hallucinated relevance all trace back to the same root: the knowledge base is static, indexed once, and never updated based on what users actually ask. That’s also the premise behind a new class of solutions that rethink knowledge base construction entirely. Hindsight is one example worth looking at. (More on this in the next post.)

Summary

Knowing these limitations upfront helps set the right expectations; agentic RAG isn’t a silver bullet, it only solves what it was designed to solve.

Agentic RAG solves exactly three core problems with basic RAG:

Not every question needs a search → decision node
Retrieval doesn’t guarantee relevance → grading node
One attempt isn’t enough → rewrite loop

Live fallback, abbreviation expansion, and context signals are the engineering work that makes it function in real conditions, not just on benchmarks. Benchmarks don’t have ticket ID mismatches, domain-specific abbreviations, or data freshness requirements.

If starting over, I’d still start with basic RAG and only switch to agentic when there’s concrete evidence that basic RAG isn’t enough. Start simple. Add agentic behavior only when the failure modes justify the added complexity.