The PoC-to-Production Gap: What Changes When Real Users Touch Your AI

A PoC that works on your laptop is not a production system. The gap between the two is wider than most stakeholders expect, and it shows up in six predictable places.

Key Takeaways: The PoC-to-production gap is not a matter of polish — it is a matter of engineering. Six dimensions separate a demo from a deployed system: latency, error handling, monitoring, retraining, authorization, and trust. Each is a concrete engineering task, not a vague concern. Teams that don’t plan for them before the PoC ends will spend months retrofitting the basics after go-live.

The demo went well. Stakeholders are excited. The PoC proved the hypothesis. Now someone says “great, let’s productionize it” — and the room goes quiet, because everyone who built the PoC knows something the stakeholders don’t: what they saw was a controlled experiment, not a system.

This gap is real and it is predictable. It shows up the same way on nearly every AI project. Understanding it before you commit to production is how you avoid the slow-motion failure of deploying a PoC that was never meant to handle the real world.

Latency: Your Laptop Is Not a Production Server

In a PoC, latency is usually fine. The model runs locally, the database query returns in milliseconds, and the demo flows smoothly. Production is different.

A RAG pipeline that takes 800ms on a developer’s machine often takes 3–5 seconds under concurrent load. Add a re-ranking step, a second LLM call for answer synthesis, and a round-trip to an Odoo knowledge.article record, and you can push past 10 seconds — which is the point where users stop waiting and start complaining.

The causes are predictable:

  • Vector similarity queries in pgvector degrade without proper indexing on the embedding column (use ivfflat or hnsw)
  • LLM API calls queue when burst limits are hit
  • The PoC may have been running on GPU-enabled hardware; production may not

The fix isn’t always more hardware. Often it’s query optimization, caching embeddings for stable content, and moving LLM calls off the user-facing request path where possible. But the fix has to be designed — it doesn’t happen automatically when you deploy.
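To make the indexing and caching points concrete, here is a minimal sketch of an HNSW index on a pgvector column plus the similarity query that benefits from it. The connection string, table, and column names are illustrative assumptions, not a specific Odoo schema.

```python
import psycopg2

# Connection string and the document_chunk table/column are illustrative assumptions.
conn = psycopg2.connect("dbname=odoo_rag user=rag_service")

with conn, conn.cursor() as cur:
    # HNSW index (pgvector >= 0.5) so similarity search stops scanning every row.
    cur.execute("""
        CREATE INDEX IF NOT EXISTS document_chunk_embedding_hnsw
        ON document_chunk
        USING hnsw (embedding vector_cosine_ops);
    """)

def top_k_chunks(cur, query_embedding, k=5):
    # Cosine-distance search; <=> is the pgvector operator the index serves.
    vec = "[" + ",".join(f"{x:.6f}" for x in query_embedding) + "]"
    cur.execute(
        "SELECT id, content FROM document_chunk "
        "ORDER BY embedding <=> %s::vector LIMIT %s;",
        (vec, k),
    )
    return cur.fetchall()
```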

Rough effort: 2–5 days of profiling and optimization for a typical RAG pipeline.

Error Handling: The Invisible Failures

PoCs fail loudly, or they don’t fail at all — because the PoC was run on a curated dataset by the people who built it. Production is different. Real users try things that weren’t anticipated. Documents arrive in formats the pipeline doesn’t expect. Queries arrive in languages the model handles poorly.

A PoC that returned an empty result when the retrieval failed is not a production system. Production needs:

  • A fallback path when the model returns nothing (route to human review, surface a “no results found” message with a contact option, or return the top unfiltered result with a confidence warning)
  • A graceful degradation when a dependency is unavailable (the embedding service goes down; the LLM API returns a 429)
  • Logging of every failure with enough context to reproduce it

The difference between failing loudly and failing silently is enormous in a business workflow. An invoice matching agent that silently skips a record creates reconciliation errors that surface weeks later. One that logs the skip and flags it for review is manageable.

This is closely tied to the data quality argument we’ve made elsewhere — the integrity of your underlying data shapes what failures are even possible. But even clean data produces edge cases. The fallback architecture has to be designed before you deploy, not after.
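As a sketch of what that fallback architecture can look like in code, assuming hypothetical retrieve(), generate_answer(), and flag_for_review() hooks standing in for the project's actual retrieval, synthesis, and review-queue code: every branch either answers, degrades visibly, or routes to a human, and every branch logs.

```python
import logging

logger = logging.getLogger("rag.pipeline")

CONFIDENCE_FLOOR = 0.55  # illustrative threshold, tuned per project


class DependencyUnavailable(RuntimeError):
    """Raised by the retrieval layer when the embedding service or LLM API is down."""


def answer_query(query: str, user_id: int) -> dict:
    """Answer a query with explicit fallback paths instead of failing silently."""
    try:
        chunks = retrieve(query)  # hypothetical retrieval step
    except DependencyUnavailable as exc:  # embedding service down, LLM API 429, ...
        logger.error("retrieval unavailable: %s (query=%r user=%s)", exc, query, user_id)
        return {"status": "degraded",
                "message": "Search is temporarily unavailable. Please try again or contact support."}

    if not chunks:
        # Fallback path: say so, log it, and give the user somewhere to go.
        logger.warning("no results (query=%r user=%s)", query, user_id)
        return {"status": "no_results",
                "message": "No matching documents found.",
                "contact": "support@example.com"}

    answer = generate_answer(query, chunks)  # hypothetical LLM synthesis step
    if answer.confidence < CONFIDENCE_FLOOR:
        logger.info("low confidence %.2f (query=%r)", answer.confidence, query)
        flag_for_review(query, answer)  # route to human review rather than guessing
        return {"status": "low_confidence", "answer": answer.text,
                "warning": "Low confidence; flagged for human review."}

    return {"status": "ok", "answer": answer.text}
```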

Rough effort: 3–7 days depending on the number of failure modes in scope.

Monitoring: What Nobody Built Into the PoC

The PoC had no observability. There was no dashboard, no alert, no audit log. The team ran it manually, looked at the outputs, and called it done.

Production needs monitoring at three levels:

System-level: Is the pipeline running? What is the latency distribution? How many requests are failing? Standard infrastructure monitoring; most teams already know how to do this.

Quality-level: Are the model’s outputs getting better or worse over time? This is harder. You need to track something — answer length distribution, user rejection rates, low-confidence responses — even if you can’t run a full RAGAS evaluation suite in production. A model that was 87% accurate on the PoC test set and drifts to 71% over six months, silently, is a problem you won’t see coming without this.
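A minimal sketch of what “track something” can mean in practice: log a handful of cheap proxies per request and watch the trend over weeks. The field names and logger are illustrative assumptions.

```python
import json
import logging
import time

quality_log = logging.getLogger("rag.quality")


def log_quality_signals(query: str, answer_text: str, retrieval_scores: list[float],
                        user_rejected: bool | None = None) -> None:
    """Record cheap per-request quality proxies; the trend matters more than any single value."""
    record = {
        "ts": time.time(),
        "query_chars": len(query),
        "answer_chars": len(answer_text),
        "top_score": max(retrieval_scores, default=0.0),
        "mean_score": sum(retrieval_scores) / len(retrieval_scores) if retrieval_scores else 0.0,
        "user_rejected": user_rejected,  # filled in later if the UI captures feedback
    }
    quality_log.info(json.dumps(record))
```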

Drift detection: The data the model was trained or configured on will diverge from the data it sees in production. For a RAG system over Odoo product records, this means new products, updated pricing, changed specifications — content that exists in the database but hasn’t been re-embedded. For a predictive model over crm.lead data, it means the pipeline’s feature distribution shifts as sales behaviour changes. Neither is dramatic. Both are real.

Drift doesn’t announce itself. It’s quiet degradation that looks like user complaints before it looks like a metric problem.
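For the predictive case, a basic drift check can be as simple as comparing a feature’s recent production distribution against the training sample, for instance with a two-sample Kolmogorov-Smirnov test. A sketch, with placeholder numbers:

```python
from scipy.stats import ks_2samp

DRIFT_P_VALUE = 0.01  # illustrative alerting threshold


def feature_drifted(training_values, recent_values) -> bool:
    """Flag drift when recent values no longer look like the training sample."""
    _statistic, p_value = ks_2samp(training_values, recent_values)
    return p_value < DRIFT_P_VALUE


# Placeholder numbers: expected_revenue on crm.lead, training sample vs. last 30 days.
if feature_drifted([1200, 900, 3000, 450, 700], [8000, 9500, 12000, 7000, 11000]):
    print("Feature distribution shifted; review the model before trusting its scores.")
```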

Rough effort: 1–2 days for basic system monitoring; 3–5 days for quality-level and drift detection.

Retraining: Static Models in a Moving World

The PoC used a fixed dataset. The model — or the retrieval index — was built once and evaluated once. That’s appropriate for a PoC. It’s wrong for production.

In an Odoo context, the problem is concrete:

  • product.template records change: prices update, descriptions improve, products are discontinued
  • knowledge.article records are revised: policies change, procedures are updated, old articles go stale
  • sale.order and account.move data accumulates: the statistical patterns the model learned in Q1 may not hold in Q3

The retrieval index needs a re-indexing strategy. New or updated records should trigger re-embedding, or the index should be rebuilt on a schedule. For predictive models, there needs to be a retraining pipeline — even a basic one that runs quarterly against fresh data.
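A sketch of the trigger-based variant, assuming the job runs from an Odoo scheduled action (ir.cron) and that embed() and upsert_embedding() are stand-ins for whatever embedding model and vector store the project uses; Odoo’s write_date already tells you which records changed.

```python
from datetime import datetime


def reindex_stale_products(env, last_run: datetime) -> int:
    """Re-embed product descriptions changed since the last indexing run.

    `env` is the Odoo environment inside a scheduled action; embed() and
    upsert_embedding() are hypothetical helpers for the project's vector store.
    """
    stale = env["product.template"].search([("write_date", ">", last_run)])
    for product in stale:
        text = f"{product.name}\n{product.description_sale or ''}"
        vector = embed(text)                                      # hypothetical embedding call
        upsert_embedding("product.template", product.id, vector)  # hypothetical store update
    return len(stale)
```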

This is not a one-time engineering task. It’s an ongoing operational responsibility. Someone has to own it. If the question “who maintains the AI?” wasn’t answered before go-live, the model will gradually become wrong in ways that are slow and expensive to diagnose.

We covered the sprint structure that gets a PoC built in the 14-day sprint post. The retraining story is what comes after that sprint ends.

Rough effort: 3–8 days to build a retraining pipeline; ongoing operational cost of 2–4 hours per cycle.

Authorization: Admin Credentials Are a Timebomb

The PoC connected to Odoo with admin credentials because it was faster. Nobody thinks that will make it to production. It usually does.

A production AI system that reads account.move records, hr.payslip data, or crm.lead details needs to respect Odoo’s access control model. Row-level security rules that restrict which records a user can see should restrict what the AI can retrieve on their behalf. An AI that surfaces salary information to a sales rep because it queries with admin-level permissions is a data governance failure, not just a security one.

The fix is straightforward: the AI queries Odoo using the requesting user’s session, not a service account with elevated access. This usually requires switching from a direct PostgreSQL connection (which bypasses Odoo’s access rules) to the Odoo JSON-RPC API with authenticated sessions, or implementing the access check logic explicitly at the application layer.
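A minimal sketch of the authenticated-session pattern against Odoo’s JSON-RPC endpoint; the URL, database, and field choices are placeholders, and recent Odoo versions let the user supply an API key in place of a password.

```python
import requests

ODOO_URL = "https://erp.example.com"  # placeholder
ODOO_DB = "production"                # placeholder


def jsonrpc(service: str, method: str, args: list):
    """Call Odoo's /jsonrpc external API endpoint."""
    payload = {"jsonrpc": "2.0", "method": "call", "id": 1,
               "params": {"service": service, "method": method, "args": args}}
    result = requests.post(f"{ODOO_URL}/jsonrpc", json=payload, timeout=30).json()
    if result.get("error"):
        raise RuntimeError(result["error"])
    return result["result"]


def search_leads_as_user(login: str, api_key: str, domain: list):
    # Authenticate as the requesting user, not as a service account.
    uid = jsonrpc("common", "authenticate", [ODOO_DB, login, api_key, {}])
    # execute_kw runs under this uid, so Odoo ACLs and record rules apply to every read.
    return jsonrpc("object", "execute_kw",
                   [ODOO_DB, uid, api_key, "crm.lead", "search_read",
                    [domain], {"fields": ["name", "expected_revenue"], "limit": 20}])
```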

This matters more as AI systems spread. A single misconfigured agent accessing sensitive records can expose data to users who have no business seeing it. The PoC doesn’t test this, because the PoC runs as the developer.

Rough effort: 1–3 days to implement correct authentication; longer if the PoC was built around direct database access that needs to be rearchitected.

Trust: The Hardest Gap to Close

This one doesn’t have an effort estimate, because it isn’t a single engineering task. It’s the cumulative result of all the others, plus something harder to define.

The demo worked on curated examples. Real users will find every gap. A purchasing manager who uses an invoice matching agent daily will find the supplier reference format it can’t handle. A sales rep querying the product knowledge base will ask questions the retrieval index was never tested against. A finance controller will notice that the model’s output was confidently wrong about a VAT rule that changed three months ago.

Every one of these failures erodes trust. And trust, once lost with business users, is very difficult to rebuild. The AI that “got it wrong that one time” becomes the AI nobody uses.

The engineering disciplines above — latency, error handling, monitoring, retraining, authorization — are the prerequisites for trust. But the other ingredient is honesty in the system design: an AI that says “I’m not confident about this” and surfaces a human review path is more trustworthy than one that always produces an answer. A high-confidence wrong answer is worse than no answer.

The systems that earn long-term user trust have a clear answer to the question: “what happens when it’s wrong?” That answer needs to be designed in before go-live, not improvised after.


At Trobz, the PoC-to-production transition is where most of our advisory work happens — mapping these six dimensions to concrete engineering tasks before the stakeholder excitement turns into deployment pressure. If you’ve run a PoC and are now staring at this list, reach out — we can help scope the production build honestly.
