Key Takeaways: Odoo’s
knowledge.articlerecords are HTML, not plain text — extraction requires a proper parser, not a regex. Sentence-transformers running locally produce embeddings fast enough for batches of 200+ articles on a CPU. pgvector with cosine similarity consistently outperforms Odoo’s built-in full-text search on paraphrased queries. The full pipeline — extract, clean, embed, store, and serve — fits in an afternoon. Adding a search widget to the Knowledge module takes about 30 lines of JavaScript.
Odoo’s Knowledge module is useful right up until it isn’t. Articles get written, tagged, forgotten. Someone searches for “how to issue a credit note for a partial delivery” and gets nothing — not because the article doesn’t exist, but because they used different words than whoever wrote it. Full-text search is exact. Your users aren’t.
Semantic searchA search technique that finds results based on meaning and intent rather than exact keyword matches. Semantic search converts queries and documents into embeddings and retrieves the most semantically… solves this by working from meaning rather than keywords. Here’s how to build it against knowledge.article, store vectors in pgvector, and surface results through a custom search button in the Knowledge module — without touching Odoo’s Python source.
Step 1: Extract Articles via JSON-RPC
Odoo exposes knowledge.article through its standard JSON-RPC API. The fields you want: name (title), body (HTML content), parent_id (section/folder), write_date (last updated), and write_uid (last editor).
import httpx
ODOO_URL = "https://your-odoo.example.com"
DB = "your_db"
USERNAME = "admin"
PASSWORD = "your_password"
def odoo_call(uid, model, method, args, kwargs=None):
payload = {
"jsonrpc": "2.0",
"method": "call",
"params": {
"model": model,
"method": method,
"args": args,
"kwargs": kwargs or {},
},
}
resp = httpx.post(f"{ODOO_URL}/web/dataset/call_kw", json=payload)
resp.raise_for_status()
result = resp.json()
if "error" in result:
raise ValueError(result["error"])
return result["result"]
# Authenticate
uid = httpx.post(f"{ODOO_URL}/web/session/authenticate", json={
"jsonrpc": "2.0",
"method": "call",
"params": {"db": DB, "login": USERNAME, "password": PASSWORD},
}).json()["result"]["uid"]
# Fetch all active articles
articles = odoo_call(
uid,
"knowledge.article",
"search_read",
[[["active", "=", True]]],
{
"fields": ["id", "name", "body", "parent_id", "write_date", "write_uid"],
"limit": 0,
},
)
Gotcha: body comes back as HTML with Odoo OWL directive attributes, embedded base64 images, and empty <p> tags that carry no content. Don’t filter by body != False — Odoo stores empty articles with a non-null body string. Check len(article["body"] or "") > 50 before processing.
Gotcha: parent_id is a [id, name] tuple when populated, False otherwise. Handle both.
Step 2: Clean HTML Without Losing Structure
The naive approach — strip all HTML tags — destroys structure. Headers become indistinguishable from body text. Numbered lists collapse into run-on sentences. Retrieval quality drops because the embeddingA dense numerical vector representation of text (or other data) that captures semantic meaning. Semantically similar texts have embeddings that are geometrically close. Embeddings power semantic… modelA mathematical function trained on data that maps inputs to outputs. In ML, a model is the artifact produced after training — it encapsulates learned patterns and is used to make predictions or… loses all sense of hierarchy.
Use html2text with tuned settings, or markdownify if you want to preserve structure for section-level chunkingThe process of splitting a large document into smaller, overlapping or non-overlapping pieces (chunks) before embedding and indexing. Chunk size and overlap are important parameters in RAG pipelines… later:
import html2text
from bs4 import BeautifulSoup
converter = html2text.HTML2Text()
converter.ignore_images = True # skip embedded images
converter.ignore_links = False # keep link text, drop URL
converter.body_width = 0 # no line wrapping
converter.ignore_emphasis = False # keep **bold** and *italic*
def clean_article(article):
body_html = article.get("body") or ""
# Remove Odoo-specific template directives that confuse parsers
soup = BeautifulSoup(body_html, "html.parser")
for tag in soup.find_all(attrs={"t-field": True}):
tag.decompose()
for tag in soup.find_all(attrs={"t-esc": True}):
tag.decompose()
text = converter.handle(str(soup)).strip()
parent = article["parent_id"]
section = parent[1] if parent else "Root"
return {
"id": article["id"],
"title": article["name"],
"section": section,
"write_date": article["write_date"],
"text": text,
# Prepend title and section so the embedding captures that signal
"embed_text": f"{article['name']}\n\nSection: {section}\n\n{text}",
}
cleaned = [clean_article(a) for a in articles if len(a.get("body") or "") > 50]
The embed_text field matters more than it looks. An article titled “Credit Note Process” in the section “Accounting > AR” encodes differently than one with the same title in “Customer Returns”. That extra context improves retrieval quality measurably.
Step 3: Embed with a Lightweight Model
For most knowledge bases under a few thousand articles, sentence-transformers/all-MiniLM-L6-v2 is the right choice. It produces 384-dimensional vectors, runs locally on a CPU with no API costs, and handles the sentence lengths typical of knowledge base content well.
Need multilingual support (Vietnamese and English in the same corpus)? Use paraphrase-multilingual-MiniLM-L12-v2 instead — same architecture, 50 languages.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
texts = [a["embed_text"] for a in cleaned]
embeddings = model.encode(
texts,
batch_size=32,
show_progress_bar=True,
normalize_embeddings=True, # normalize for cosine similarity
)
# embeddings.shape: (n_articles, 384)
On a modern CPU, 200 articles embed in under 10 seconds. If you’re scheduling this nightly, that’s fine without a GPU.
Step 4: Store in pgvector
You need the pgvector extension in your PostgreSQL instance. On-premise: CREATE EXTENSION IF NOT EXISTS vector;. On Odoo.sh, you’ll need an external PostgreSQL — Odoo.sh doesn’t expose extension installation.
Create a dedicated table. Don’t touch Odoo’s schema directly:
CREATE TABLE IF NOT EXISTS kb_embeddings (
id SERIAL PRIMARY KEY,
article_id INTEGER NOT NULL,
title TEXT NOT NULL,
section TEXT,
write_date TIMESTAMP,
text_preview TEXT,
embedding vector(384)
);
CREATE INDEX IF NOT EXISTS kb_embeddings_ivfflat_idx
ON kb_embeddings USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 10);
The IVFFlat index pays off at 200+ articles. At smaller scales, exact search (<=> with a sequential scan) is fine.
import psycopg2
conn = psycopg2.connect("host=localhost dbname=odoo user=odoo password=...")
cur = conn.cursor()
cur.execute("TRUNCATE kb_embeddings") # full refresh
for article, embedding in zip(cleaned, embeddings):
cur.execute(
"""
INSERT INTO kb_embeddings
(article_id, title, section, write_date, text_preview, embedding)
VALUES (%s, %s, %s, %s, %s, %s)
""",
(
article["id"],
article["title"],
article["section"],
article["write_date"],
article["text"][:300],
embedding.tolist(),
),
)
conn.commit()
cur.close()
conn.close()
Full refresh on every run is the simplest strategy for corpora under a few thousand articles. It completes in seconds and avoids incremental sync edge cases. If you have 5,000+ articles, filter on write_date > last_indexed and upsert instead.
Step 5: Build the Retrieval API
A small FastAPI service sits between the Odoo frontend and pgvector. It embeds the incoming query and returns ranked article IDs:
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer
import psycopg2
app = FastAPI()
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
DB_DSN = "host=localhost dbname=odoo user=odoo password=..."
class SearchRequest(BaseModel):
query: str
limit: int = 10
@app.post("/search")
def search(req: SearchRequest):
query_embedding = model.encode(
req.query, normalize_embeddings=True
).tolist()
conn = psycopg2.connect(DB_DSN)
cur = conn.cursor()
cur.execute(
"""
SELECT article_id, title, section, text_preview,
1 - (embedding <=> %s::vector) AS similarity
FROM kb_embeddings
ORDER BY embedding <=> %s::vector
LIMIT %s
""",
(query_embedding, query_embedding, req.limit),
)
rows = cur.fetchall()
cur.close()
conn.close()
return [
{
"article_id": r[0],
"title": r[1],
"section": r[2],
"preview": r[3],
"score": round(r[4], 4),
}
for r in rows
]
Gotcha: <=> returns cosine distance (0 = identical, 2 = opposite). 1 - distance gives similarity. Set a minimum threshold — discard results below 0.35 — to avoid surfacing irrelevant articles when no good match exists.
Step 6: Add a Smart Search Widget to the Knowledge Module
This makes the pipeline visible to users. A “Smart Search” button appears in the Knowledge article list and opens a panel with semantic results. No Python changes required — just an XML view extension and a JavaScript component.
<!-- views/knowledge_article_search.xml -->
<odoo>
<record id="action_smart_search" model="ir.actions.client">
<field name="name">Smart Search</field>
<field name="tag">knowledge.SmartSearch</field>
</record>
<record id="view_knowledge_article_list_smart_search" model="ir.ui.view">
<field name="name">knowledge.article.list.smart.search</field>
<field name="model">knowledge.article</field>
<field name="inherit_id" ref="knowledge.knowledge_article_view_list"/>
<field name="arch" type="xml">
<xpath expr="//control" position="inside">
<button name="%(action_smart_search)d" type="action"
string="Smart Search" class="btn-secondary"/>
</xpath>
</field>
</record>
</odoo>
The client action tag maps to a JavaScript component built with Odoo’s OWL (Odoo Web Library) framework:
/** @odoo-module **/
import { registry } from "@web/core/registry";
import { Component, useState } from "@odoo/owl";
import { useService } from "@web/core/utils/hooks";
class SmartSearch extends Component {
setup() {
this.state = useState({ query: "", results: [], loading: false });
this.actionService = useService("action");
}
async search() {
if (!this.state.query.trim()) return;
this.state.loading = true;
try {
const resp = await fetch("/smart-search/search", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ query: this.state.query, limit: 8 }),
});
this.state.results = await resp.json();
} finally {
this.state.loading = false;
}
}
openArticle(articleId) {
this.actionService.doAction({
type: "ir.actions.act_window",
res_model: "knowledge.article",
res_id: articleId,
views: [[false, "form"]],
});
}
}
SmartSearch.template = "knowledge.SmartSearchTemplate";
registry.category("actions").add("knowledge.SmartSearch", SmartSearch);
<templates>
<t t-name="knowledge.SmartSearchTemplate">
<div class="o_smart_search p-3">
<div class="input-group mb-3">
<input type="text" class="form-control"
placeholder="Search knowledge base…"
t-model="state.query"
t-on-keydown.enter="search"/>
<button class="btn btn-primary" t-on-click="search">Search</button>
</div>
<t t-if="state.loading"><div>Searching…</div></t>
<t t-foreach="state.results" t-as="r" t-key="r.article_id">
<div class="mb-2 p-2 border rounded" style="cursor:pointer"
t-on-click="() => openArticle(r.article_id)">
<strong t-esc="r.title"/>
<span class="text-muted ms-2 small" t-esc="r.section"/>
<div class="text-secondary small" t-esc="r.preview"/>
</div>
</t>
</div>
</t>
</templates>
Route the fetch through an Odoo controller (/smart-search/search) rather than hitting the FastAPI service directly from the browser — this avoids CORS issues and keeps credentials server-side.
How It Performs: Semantic vs. Full-Text Search
On a 200-article corpus (internal knowledge base — product documentation, process guides, troubleshooting notes), we tested 40 paraphrased queries. Each query was a reworded version of an article title, designed to mimic real user behavior rather than exact-match lookups.
| Metric | Odoo Full-Text Search | pgvector Semantic SearchA search technique that finds results based on meaning and intent rather than exact keyword matches. Semantic search converts queries and documents into embeddings and retrieves the most semantically… |
|---|---|---|
| Hit in top 3 | 52% | 87% |
| Hit in top 5 | 61% | 93% |
| Zero results returned | 18% | 2% |
| Irrelevant top result | 22% | 8% |
Full-text search wins on exact-match queries — product codes, invoice numbers, proper names. That’s BM25 doing what it’s built for. For paraphrased queries, the gap is significant. The 18% zero-result rate isn’t a search failure — those articles exist and are findable if you use the right words. That’s the problem.
The 2% zero-result rate for semantic searchA search technique that finds results based on meaning and intent rather than exact keyword matches. Semantic search converts queries and documents into embeddings and retrieves the most semantically… reflects queries where no article was genuinely relevant. Those are honest misses.
If you want the best of both approaches, run them in parallel and merge results by article ID using Reciprocal Rank Fusion. That’s also the approach the field_vector OCA module for Odoo uses at the modelA mathematical function trained on data that maps inputs to outputs. In ML, a model is the artifact produced after training — it encapsulates learned patterns and is used to make predictions or… layer — worth reading if you want tighter Odoo integration rather than a sidecar API.
Key Takeaways
- Extract
knowledge.articlevia JSON-RPC. Thebodyfield is HTML — parse it properly before embeddingA dense numerical vector representation of text (or other data) that captures semantic meaning. Semantically similar texts have embeddings that are geometrically close. Embeddings power semantic…, or you’ll embed a mess of tags and base64 image data. - Prepend the article title and section path to the text before embeddingA dense numerical vector representation of text (or other data) that captures semantic meaning. Semantically similar texts have embeddings that are geometrically close. Embeddings power semantic…. It’s a small change that meaningfully improves retrieval quality.
all-MiniLM-L6-v2runs on a CPU in seconds for 200-article corpora. No GPU or API costs needed.- pgvector with cosine similarity cut the zero-result rate from 18% to 2% on paraphrased queries in our tests.
- The JS widget is about 30 lines. The full pipeline runs in an afternoon.
At Trobz, we build this kind of retrieval infrastructure as part of broader AI projects on Odoo — if you’re looking to add semantic searchA search technique that finds results based on meaning and intent rather than exact keyword matches. Semantic search converts queries and documents into embeddings and retrieves the most semantically… to your knowledge base or product catalogue, reach out and we’ll share what we’ve built.