Engineering by star (Featured)

Agentic RAG in Practice: From Single-Pass Retrieval to a Self-Correcting Pipeline

Basic RAG breaks in predictable ways. Here's how we replaced a pile of conditional patches with an agentic graph pipeline, and what it actually cost in latency, complexity, and accuracy.

Every problem in basic RAG can be patched: add a condition for this case, add a check for that one. The problem is that patches accumulate until you end up with a pile of conditional logic that nobody wants to touch.

This post describes the transition from basic RAG to agentic RAG in a real chatbot: why we made the switch, how the architecture was implemented, and the issues we ran into along the way.


Basic RAG and where it breaks down

Basic RAG is straightforward:

  1. Receive the user’s question
  2. Embed the question
  3. KNN search in vector DB, retrieve top-k chunks
  4. Stuff chunks into the system prompt
  5. Stream the response
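
The five steps above can be sketched end to end as follows. This is a toy illustration only: `embed` is a stand-in bag-of-words embedding, the in-memory `KB` list replaces a vector DB, and the final streaming LLM call is left out. None of these names come from the actual system.

```python
import math
from collections import Counter

# Hypothetical in-memory "knowledge base"; a real system would use a vector DB.
KB = [
    "ERP pricing starts with a per-user license fee.",
    "Implementation timeline is typically three to six months.",
    "Support tickets are answered within one business day.",
]

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (stand-in for a real embedding model)."""
    return Counter(text.lower().replace(".", "").replace("?", "").split())

def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Steps 2-3: embed the question, then return the k nearest chunks."""
    q = embed(question)
    return sorted(KB, key=lambda chunk: cosine_distance(q, embed(chunk)))[:k]

def build_system_prompt(chunks: list[str]) -> str:
    """Step 4: stuff the retrieved chunks into the system prompt."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}"

# Step 5 (streaming the LLM response) is omitted here.
prompt = build_system_prompt(retrieve("What is the ERP pricing?"))
```

Note that nothing in this flow checks whether a search is needed or whether the returned chunks are any good; every input takes the same path.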

Works well in most cases. But when deployed on a chatbot handling a wide variety of queries, some problems show up:

Running RAG on greetings. “Hello!”, “Thank you!” still get embedded and searched against the vector DB. Unnecessary API calls.

Multi-topic questions. “Compare ERP pricing and implementation timeline” needs two types of information from two different parts of the KB. A single embedding over a broad multi-topic query often leads to diluted retrieval results.

No recovery from bad retrieval. KNN returns irrelevant chunks → the LLM uses them anyway. No mechanism to say “these results are no good, try again.”

Ticket ID mismatch. User asks about ticket #1234 → vector search returns #1233, #1235 (same project, close distance, but wrong tickets). Vector similarity alone cannot reliably detect this kind of mismatch.

We patched basic RAG one problem at a time: add live fallback, add ticket ID checking, add query rewriting. Eventually the code became fragile. That’s when we switched to an agentic approach.


Core idea

“Agentic” here does not mean a fully autonomous multi-tool agent. It’s simpler than that: instead of hardcoding “search on every question,” let the LLM decide:

  • Do we need to search? (Don’t search for greetings)
  • What to search for? (Decompose multi-topic questions into multiple queries)
  • Are the results good enough? (Grade before using)
  • If not, try differently. (Rewrite and retry)

The pipeline shifts from a linear flow to a graph-based loop:

[Decide: search or respond directly?]
         │
    [Search (parallel, multiple queries)]
         │
    [Grade: are chunks relevant?]
         │
    ┌────┴────────────────────────────────────┐
    │ Yes                                     │ No, attempts remaining
    ▼                                         ▼
[Stream answer]                    [Rewrite query] → [Search again]

Graph pipeline: no heavy framework needed

The graph is implemented as a small custom runner: each processing step is an independent node that receives shared state and returns the name of the next node, or signals completion. There’s a safety cap to prevent infinite loops.
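
A runner of this kind can be very small. The sketch below is one possible shape, not the production code: the node names, the `hit_step_cap` flag, and the default cap of 10 steps are all assumptions made for illustration.

```python
from typing import Callable, Optional

State = dict
# A node reads/mutates the shared state and returns the next node's name,
# or None to signal completion.
Node = Callable[[State], Optional[str]]

def run_graph(nodes: dict[str, Node], start: str, state: State, max_steps: int = 10) -> State:
    """Run nodes until one signals completion or the safety cap is reached."""
    current: Optional[str] = start
    steps = 0
    while current is not None:
        if steps >= max_steps:
            state["hit_step_cap"] = True  # hypothetical flag: loop was cut off
            break
        current = nodes[current](state)
        steps += 1
    return state

# Toy nodes (hypothetical) wired in the decide -> search -> grade -> rewrite shape.
def decide(state: State) -> Optional[str]:
    return "search" if state["needs_search"] else None

def search(state: State) -> Optional[str]:
    state["attempts"] = state.get("attempts", 0) + 1
    return "grade"

def grade(state: State) -> Optional[str]:
    # Toy rule: pretend results only become relevant on the second attempt.
    state["relevant"] = state["attempts"] >= 2
    return None if state["relevant"] else "rewrite"

def rewrite(state: State) -> Optional[str]:
    return "search"

NODES = {"decide": decide, "search": search, "grade": grade, "rewrite": rewrite}
result = run_graph(NODES, "decide", {"needs_search": True})
```

Because every node only sees the shared state and returns a name, adding a node or changing an edge means touching one function rather than a chain of conditionals.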

The reason for not using LangGraph or similar frameworks isn’t that the pipeline is too small; it’s that the pipeline needs deep customization around its own internal logic. Per-query grading needs to know which chunk came from which query. Live fallback only triggers on specific signals per source. Context signals from the grading node need to be read correctly by the answer streamer. These couplings are very specific, and expressing these interactions through a framework’s abstraction layer would be more complex than implementing the runner directly.


Node 1: LLM decides whether to search or respond directly

This node uses the LLM’s tool use capability. The LLM is given a search tool and instructed:

  • Call search when the question requires factual information from the knowledge base
  • Respond directly for greetings, casual conversation, or questions that general knowledge can handle

Parallel tool calls: The LLM can call the tool multiple times in a single response. “Compare ERP pricing and implementation timeline” → two parallel search queries, each targeting one topic.

Problem encountered: the LLM sometimes generates overly generic queries, or gets influenced by conversation history (using “this project” instead of the actual name). Solutions:

  • Inject the page URL into the system prompt → the LLM resolves vague context from the URL
  • Expand abbreviations before sending: abbreviated terms are expanded to their full forms, to avoid noisy embeddings
  • Emphasize that queries must be standalone: include specific names and topics from the conversation, no vague pronouns
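
Of the three fixes above, abbreviation expansion is the easiest to show in code. A minimal sketch, assuming a simple lookup table: the `ABBREVIATIONS` entries and the function name are hypothetical, and the real mapping is domain-specific.

```python
import re

# Hypothetical abbreviation table; the real mapping depends on the domain.
ABBREVIATIONS = {"KB": "knowledge base", "PO": "purchase order"}

def expand_abbreviations(query: str, table: dict[str, str] = ABBREVIATIONS) -> str:
    """Replace whole-word abbreviations with their full forms before embedding."""
    if not table:
        return query
    # \b keeps "PO" from matching inside longer words such as "POS".
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], query)
```

Matching is case-sensitive and whole-word here, so an abbreviation that is also an ordinary word would need more careful handling.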

This preparation step directly affects retrieval quality in the next node.


Node 2: Parallel retrieval and deduplication

For N queries from Node 1, embeddingA dense numerical vector representation of text (or other data) that captures semantic meaning. Semantically similar texts have embeddings that are geometrically close. Embeddings power semantic… and vector search run in parallel; one query failing doesn’t cancel the whole batch.
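
One way to get this "one failure doesn’t cancel the batch" behavior in Python is `asyncio.gather` with `return_exceptions=True`. In the sketch below, `search_one` is a hypothetical stand-in for the real embed-and-search call, and the chunk fields are assumptions.

```python
import asyncio

async def search_one(query: str) -> list[dict]:
    """Hypothetical stand-in for 'embed the query, then run vector search'."""
    await asyncio.sleep(0)  # placeholder for real I/O
    if "fail" in query:
        raise RuntimeError("simulated embedding/search error")
    return [{"content": f"chunk for {query}", "distance": 0.2, "query": query}]

async def search_all(queries: list[str]) -> list[dict]:
    """Run all searches concurrently; a failed query simply contributes no chunks."""
    results = await asyncio.gather(
        *(search_one(q) for q in queries), return_exceptions=True
    )
    chunks: list[dict] = []
    for result in results:
        if isinstance(result, BaseException):
            continue  # a real system would log this; the other queries still succeed
        chunks.extend(result)
    return chunks

chunks = asyncio.run(search_all(["ERP pricing", "fail me", "implementation timeline"]))
```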

Results are merged: deduped by content (keeping the lowest-distance copy when there are duplicates), sorted by distance, and capped at a total limit. The chunk budget doesn’t grow linearly with the number of queries.
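
The merge step can be sketched in a few lines. The chunk fields (`content`, `distance`, `query`) and the default limit of 8 are assumptions for illustration; the `query` field is carried through so that later steps can tell which query produced each chunk.

```python
def merge_chunks(chunks: list[dict], limit: int = 8) -> list[dict]:
    """Dedupe by content (keep the closest copy), sort by distance, cap the total."""
    best: dict[str, dict] = {}
    for chunk in chunks:
        seen = best.get(chunk["content"])
        if seen is None or chunk["distance"] < seen["distance"]:
            best[chunk["content"]] = chunk
    return sorted(best.values(), key=lambda c: c["distance"])[:limit]

raw = [
    {"content": "A", "distance": 0.30, "query": "q1"},
    {"content": "A", "distance": 0.10, "query": "q2"},  # duplicate, closer copy
    {"content": "B", "distance": 0.20, "query": "q1"},
    {"content": "C", "distance": 0.40, "query": "q2"},
]
merged = merge_chunks(raw, limit=2)
```

Because the cap applies to the merged list, three queries do not mean three times as many chunks in the prompt.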

Fast-path: If there’s only 1 query and all chunks have very high confidence (similarity scores comfortably above the confidence threshold), grading is skipped and the pipeline goes straight to generating the answer. Saves ~150ms for clear-cut questions.
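
The fast-path condition is a simple predicate. The threshold and margin values below are placeholders, not the values used in production, and the `similarity` field name is an assumption.

```python
def can_skip_grading(
    queries: list[str],
    chunks: list[dict],
    threshold: float = 0.75,  # hypothetical confidence threshold
    margin: float = 0.10,     # hypothetical "comfortably above" margin
) -> bool:
    """Skip grading only for a single query whose chunks all score well above threshold."""
    if len(queries) != 1 or not chunks:
        return False
    return all(c["similarity"] >= threshold + margin for c in chunks)
```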

After merging, each chunk retains information about which query retrieved it, which is necessary for Node 3 to grade each chunk against the right query and know exactly which queries need supplementing.


Node 3: Per-query grading and live fallback

The most complex node. It doesn’t just grade; it’s also the decision point for when to call live data sources.

Per-query grading instead of global grading: Grade each query’s chunks against that specific query. If query A has relevant chunks but query B doesn’t → only trigger supplementary search for query B, don’t abort everything.

This is the most important design decision in Node 3. Global grading can’t tell you which query needs supplementing; it just says “not enough” without pointing to the right place.
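
A minimal sketch of per-query grading, assuming each chunk carries the query that retrieved it. The `is_relevant` callable stands in for the grader LLM call; `keyword_grader` is a toy replacement used only to make the example runnable.

```python
from collections import defaultdict
from typing import Callable

def grade_per_query(
    queries: list[str],
    chunks: list[dict],
    is_relevant: Callable[[str, str], bool],
) -> tuple[list[dict], list[str]]:
    """Return (relevant_chunks, queries_that_need_supplementing)."""
    by_query: dict[str, list[dict]] = defaultdict(list)
    for chunk in chunks:
        by_query[chunk["query"]].append(chunk)

    relevant: list[dict] = []
    short: list[str] = []
    for query in queries:
        kept = [c for c in by_query[query] if is_relevant(query, c["content"])]
        relevant.extend(kept)
        if not kept:
            short.append(query)  # only this query needs a supplementary search
    return relevant, short

def keyword_grader(query: str, content: str) -> bool:
    """Toy grader: any query word appears in the chunk (a real grader is an LLM call)."""
    return any(word in content.lower() for word in query.lower().split())

queries = ["erp pricing", "implementation timeline"]
chunks = [
    {"content": "ERP pricing is per user", "query": "erp pricing"},
    {"content": "Office opening hours", "query": "implementation timeline"},
]
relevant, short = grade_per_query(queries, chunks, keyword_grader)
```

Here `short` names the one query that came up empty, which is exactly the information a single global verdict would lose.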

Signal-gated live fallback: Not every failed query triggers external search. Each live data source is only activated when the query contains the appropriate domain signal: ticket search only runs when the question mentions a specific ticket, module search only runs when the question is about Odoo modules. This significantly reduces unnecessary external calls.
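
Signal gating can be as simple as a pattern check per source. The patterns and source names below are assumptions chosen to match the two examples in the text, not the actual rules.

```python
import re

# Hypothetical signals: "#1234"-style ticket references and the word "module(s)".
TICKET_RE = re.compile(r"#\d+")
MODULE_RE = re.compile(r"\bmodules?\b", re.IGNORECASE)

def live_sources_for(query: str) -> list[str]:
    """Return the live data sources whose domain signal appears in the query."""
    sources: list[str] = []
    if TICKET_RE.search(query):
        sources.append("tickets")
    if MODULE_RE.search(query):
        sources.append("modules")
    return sources
```

A query with no signal returns an empty list, so a failed grade on a general question never reaches an external system.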

Fail-open grading: If the grader LLM call errors, treat the document as relevant. Prefer not missing useful information over filtering too aggressively.
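
In code, failing open is a try/except around the grader call. The `grader` callable here is a hypothetical wrapper around the grading LLM request.

```python
from typing import Callable

def grade_fail_open(grader: Callable[[str, str], bool], query: str, content: str) -> bool:
    """Return the grader's verdict, or True if the grader call itself fails."""
    try:
        return grader(query, content)
    except Exception:
        # Fail open: keep the chunk rather than risk dropping useful context.
        return True
```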

After Node 3 finishes, the pipeline knows exactly: is there enough context, which queries came up short, and whether a rewrite is needed.


Node 4: Rewrite from a different angle

When all attempts fail (local grading comes up empty and live search returns nothing), the query is rewritten for another try.

The common problem with naïve rewriting: it produces a synonym of the previous query, which doesn’t help. The rewriter is instructed to reformulate the query from a substantially different angle, without repeating words or concepts already tried, starting from the original user question rather than the failed query.

Tried queries are tracked across attempts to avoid repetition. After the maximum number of attempts, the pipeline proceeds with a “no context” signal so the LLM knows to acknowledge the gap instead of guessing.
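
The retry logic can be sketched as a bounded loop. Everything here is illustrative: `search` and `rewrite` are hypothetical callables standing in for retrieval-plus-grading and the rewriter LLM call, and the limit of 3 attempts is an assumption. The exact-match check only catches literal repeats, not near-synonyms.

```python
from typing import Callable

def retry_with_rewrites(
    question: str,
    search: Callable[[str], list[str]],
    rewrite: Callable[[str, list[str]], str],
    max_attempts: int = 3,  # hypothetical cap
) -> tuple[list[str], list[str]]:
    """Return (chunks, tried_queries); empty chunks is the 'no context' signal."""
    tried: list[str] = []
    query = question
    for attempt in range(max_attempts):
        tried.append(query)
        chunks = search(query)
        if chunks:
            return chunks, tried
        if attempt == max_attempts - 1:
            break  # out of attempts; do not spend another rewriter call
        # Always rewrite from the original question, passing what was already tried.
        query = rewrite(question, tried)
        if query in tried:
            break  # literal repeat; re-running the same search would not help
    return [], tried
```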

In practice: the rewriter still occasionally produces queries that are too similar. There’s no way to hard-enforce “completely different”; it relies on the LLM following the instruction.


Answer streamer: consolidate and generate

No matter how many attempts the pipeline goes through, the final step is always the same: consolidate everything retrieved and generate the answer.

After the graph completes, the system prompt is built based on the full pipeline result:

  • If relevant chunks exist: inject them into context
  • If no context: the LLM is instructed not to speculate and to acknowledge the gap
  • If only partial context: the LLM acknowledges what’s missing
  • If Node 1 decided no search was needed: respond directly without RAG context

Regardless of state, the LLM is always reminded: if the context is about a completely different topic than the question, don’t force an answer from it. LLMs have a strong tendency to try to answer even when the context is irrelevant; this acts as the final safeguard against that behavior.
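
The four states and the always-on reminder map onto a small prompt builder. The prompt wording, the function name, and its parameters are all hypothetical; only the branching mirrors the description above.

```python
BASE = "You are a helpful assistant."  # hypothetical base instruction
OFF_TOPIC_REMINDER = (
    "If the context is about a different topic than the question, "
    "do not force an answer from it."
)

def build_answer_prompt(searched: bool, chunks: list[str], missing: list[str]) -> str:
    """Assemble the system prompt from the pipeline result."""
    parts = [BASE]
    if searched and chunks:
        parts.append("Context:\n" + "\n".join(f"- {c}" for c in chunks))
        if missing:  # partial context: name what could not be found
            parts.append(
                "No information was found for: " + "; ".join(missing)
                + ". Say so explicitly instead of guessing."
            )
    elif searched:
        parts.append("No relevant context was found. Do not speculate; acknowledge the gap.")
    # If Node 1 chose not to search, no RAG context is added at all.
    parts.append(OFF_TOPIC_REMINDER)  # added regardless of state
    return "\n\n".join(parts)
```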


Practical issues that aren’t obvious from design

The architecture above works exactly as designed. But some things only show up when running against real data.

Chunking quality determines retrieval quality. Before embedding, content needs to be cleaned: strip HTML, remove base64 blobs, resolve template shortcodes, filter out logo/icon images (short alt text with no caption → skip). If chunks contain HTML artifacts or base64 strings, embeddings become noisy and the grader rejects most of them → unnecessary LLM calls.

External connection lifecycle. Connections to external data sources are kept persistent (not recreated per request) because startup can cost a few seconds. But connections can crash or disconnect. Solution: lazy init, auto-invalidate and reconnect on exception, limit concurrent calls to avoid overload.
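
One way to combine lazy init, invalidation on error, and a concurrency limit is a small async wrapper. This is a sketch under assumptions: the `LiveSource` class, the `connect` factory, the limit of 4 concurrent calls, and the single retry after reconnecting are all illustrative choices rather than the actual implementation.

```python
import asyncio
from typing import Awaitable, Callable, Optional

class LiveSource:
    """Lazy, self-healing wrapper around a persistent external connection."""

    def __init__(self, connect: Callable[[], Awaitable[object]], max_concurrent: int = 4):
        self._connect = connect                   # hypothetical factory that opens a connection
        self._conn: Optional[object] = None       # nothing is opened until the first call
        self._lock = asyncio.Lock()
        self._sem = asyncio.Semaphore(max_concurrent)

    async def _get(self) -> object:
        async with self._lock:                    # avoid opening several connections at once
            if self._conn is None:
                self._conn = await self._connect()
            return self._conn

    async def call(self, fn: Callable[[object], Awaitable[object]]) -> object:
        async with self._sem:                     # limit concurrent calls to the source
            try:
                return await fn(await self._get())
            except Exception:
                self._conn = None                 # invalidate the broken connection
                return await fn(await self._get())  # reconnect and retry once
```

Retrying every exception once is a blunt choice; a real wrapper would probably retry only on connection-level errors and let application errors propagate immediately.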

Latency trade-off. The agentic pipeline adds significant latency compared to basic RAG (the figures below are rough design estimates, not production benchmarks):

Node                              Estimated latency
Node 1 (decision)                 ~200ms
Node 2 (retrieve)                 ~80ms
Node 3 (grade, parallel)          ~150ms
Node 4 (rewrite, if triggered)    ~180ms

Worst case with one rewrite loop: ~840ms before the first token. This is time traded for accuracy: the pipeline checks and corrects before generating, instead of immediately answering from bad chunks. Fast-path and a typing indicator help reduce perceived latency for straightforward questions.
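
The ~840ms figure follows from the per-node estimates if the nodes run one after another and a rewrite triggers a second retrieve-and-grade pass. The short calculation below makes that explicit; it assumes strictly sequential nodes and ignores live fallback time, and the function name is illustrative.

```python
# Rough per-node estimates from the table above (milliseconds).
LATENCY_MS = {"decide": 200, "retrieve": 80, "grade": 150, "rewrite": 180}

def time_to_first_token(rewrites: int = 0) -> int:
    """Decision + one retrieve/grade pass, plus rewrite + retrieve + grade per retry."""
    one_pass = LATENCY_MS["retrieve"] + LATENCY_MS["grade"]
    return LATENCY_MS["decide"] + one_pass + rewrites * (LATENCY_MS["rewrite"] + one_pass)

no_rewrite_ms = time_to_first_token(0)   # 200 + 80 + 150
one_rewrite_ms = time_to_first_token(1)  # adds 180 + 80 + 150
```

Under the same estimates, a request with no rewrite takes about 430ms, and one that hits the fast-path skips grading and takes roughly 280ms.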


What agentic RAG doesn’t solve

The issues above can all be addressed by tuning the implementation. But there are more fundamental limitations: not bugs to fix, but architectural tradeoffs.

Stale data in the vector DB. The grader can still accept outdated chunks if the topic matches; the information is just stale. Live fallback mitigates this but only triggers when the query carries the right signals.

Retrieval recall is bounded by embedding. If the KB uses different terminology than the user, cosine distance will be high even when the topic matches. Without hybrid search (BM25 + vector), keyword-based recall still has gaps.

The grader can hallucinate relevance. With long chunks and vague questions, the grader sometimes accepts chunks that only overlap on superficial keywords. Truncating content before sending it to the grader helps focus on the topic, but doesn’t fully solve it.

Cost increases significantly. Agentic RAG uses more LLM calls per turn: 1 decision call + N grading calls + 1 rewriter call if grading and live fallback both come up empty. With a cheap model it’s acceptable. With a more expensive model, the math changes.

These limitations aren’t just rough edges to sand down; they’re structural properties of how knowledge bases are built today. Stale data, vocabulary mismatch, and hallucinated relevance all trace back to the same root: the knowledge base is static, indexed once, and never updated based on what users actually ask. That’s also the premise behind a new class of solutions that rethink knowledge base construction entirely. Hindsight is one example worth looking at. (More on this in the next post.)


Summary

Knowing these limitations upfront helps set the right expectations: agentic RAG isn’t a silver bullet; it only solves what it was designed to solve.

Agentic RAG solves exactly three core problems with basic RAG:

  1. Not every question needs a search → decision node
  2. Retrieval doesn’t guarantee relevance → grading node
  3. One attempt isn’t enough → rewrite loop

Live fallback, abbreviation expansion, and context signals are the engineering work that makes it function in real conditions, not just on benchmarks. Benchmarks don’t have ticket ID mismatches, domain-specific abbreviations, or data freshness requirements.

If starting over, I’d still start with basic RAG and only switch to agentic when there’s concrete evidence that basic RAG isn’t enough. Start simple. Add agentic behavior only when the failure modes justify the added complexity.
