Key Takeaways: Bills of lading (BoLs) are structurally consistent documents, but scan quality, rubber stamps, and mixed-language content make OCR extraction genuinely difficult. A reliable pipeline needs pre-processing, extraction, and validation as three distinct stages — not a single LLM call. Matching extracted data against `res.partner` and `product.product` before creating records prevents garbage from entering Inventory. With the right confidence thresholds, you can auto-create `stock.picking` records for 75–85% of incoming BoLs and route the rest to a human review queue.
Bills of lading arrive daily at every import-heavy operation — dozens, sometimes hundreds, in formats ranging from clean PDFs generated by major shipping lines to paper documents photographed on a dock with a phone. They all contain the same core information: who shipped what, how much, to where. Extracting that reliably and turning it into Odoo Inventory records is the unglamorous backbone of any logistics AI project.
This post walks the full pipeline: document structure, pre-processing, OCR field extraction, Odoo data validation, and automated stock.picking creation. There are real gotchas at every stage.
## What a Bill of Lading Actually Contains
Before writing a single line of code, map the document. A standard BoL has a predictable layout — most variation is in formatting, not fields.
The fields you need for Odoo Inventory:
| Field | BoL Section | Odoo Target |
|---|---|---|
| Shipper name / address | Top-left block | res.partner (vendor) |
| Consignee name / address | Top-right block | res.partner (company) |
| Container numbers | Middle section | Reference on stock.picking |
| HS codes / cargo description | Commodity section | product.product matching |
| Gross weight | Measurement section | stock.move.product_uom_qty |
| Port of loading | Header | Note on stock.picking |
| Port of discharge | Header | Destination reference |
| B/L reference number | Top header | stock.picking.origin |
The structure is consistent within each shipping line — Evergreen, COSCO, and MSC all have recognizable templates. Mix three carriers across 50 documents and you have 50 slightly different layouts.
The practical tradeoff: template-based extraction (zone detection by pixel coordinates) works if your shipping lines are consistent and scan quality is high. If either varies, you need an LLM extraction layer. Most production environments need both.
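If your carriers are stable, the template path can be sketched as a per-carrier table of normalized crop zones. The carrier name and coordinates below are illustrative placeholders, not real template measurements:

```python
import numpy as np

# Hypothetical per-carrier zone templates: field -> (x0, y0, x1, y1),
# normalized to [0, 1] so they survive resolution changes.
ZONE_TEMPLATES = {
    "MSC": {
        "shipper": (0.05, 0.05, 0.50, 0.18),
        "consignee": (0.50, 0.05, 0.95, 0.18),
        "bl_reference": (0.70, 0.00, 1.00, 0.05),
    },
}

def crop_zone(img: np.ndarray, carrier: str, field: str) -> np.ndarray:
    """Crop the pixel region where a field lives for a known carrier template."""
    h, w = img.shape[:2]
    x0, y0, x1, y1 = ZONE_TEMPLATES[carrier][field]
    return img[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)]
```

Each cropped zone then goes to OCR on its own, which sidesteps layout ambiguity entirely, right up until the carrier redesigns the form.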
## Stage 1 — Pre-Processing Before You Touch OCR
Most OCR failures trace back to bad input. A document that looks fine on screen might have 200 DPI resolution, a 3-degree skew, and a rubber stamp obscuring the consignee block.
Pre-process every document before passing it to the OCR engine:
```python
import cv2
import numpy as np

def preprocess_bol_image(image_path: str) -> np.ndarray:
    """Deskew, denoise, and binarize a BoL scan for OCR."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # 1. Upscale if resolution is too low (proxy: < 2400px wide for A4 at 300 DPI)
    h, w = img.shape
    if w < 2400:
        scale = 2400 / w
        img = cv2.resize(img, None, fx=scale, fy=scale,
                         interpolation=cv2.INTER_CUBIC)

    # 2. Deskew — minAreaRect needs int32/float32 points, not np.where's int64
    coords = np.column_stack(np.where(img < 128)).astype(np.int32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:        # OpenCV < 4.5 angle convention: [-90, 0)
        angle = 90 + angle
    elif angle > 45:       # OpenCV >= 4.5 returns angle in (0, 90]
        angle = angle - 90
    (h, w) = img.shape
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    img = cv2.warpAffine(img, M, (w, h),
                         flags=cv2.INTER_CUBIC,
                         borderMode=cv2.BORDER_REPLICATE)

    # 3. Adaptive threshold — handles uneven lighting from phone scans
    img = cv2.adaptiveThreshold(
        img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2
    )

    # 4. Denoise
    img = cv2.fastNlMeansDenoising(img, h=10)
    return img
```
The adaptive threshold step matters more than most tutorials acknowledge. Flat binarization fails on documents where a corner is darker from page curl, or where a stamp bleeds into adjacent text. Adaptive thresholding handles this per-region.
Gotcha — rubber stamps: Vietnamese customs stamps, carrier marks, and “Original” overlays cover text in ways that confuse both traditional OCR and LLM extraction. There’s no clean algorithmic fix. Detect high-ink-density regions and apply confidence penalties for fields extracted from those zones. Factor that into your review routing logic.
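A minimal sketch of that detection, assuming the binarized output of the pre-processing stage (ink pixels are 0, background is 255); the grid size and the 0.35 density threshold are starting points to tune, not standards:

```python
import numpy as np

def ink_density_map(binary_img: np.ndarray, grid: int = 8) -> np.ndarray:
    """Fraction of ink (zero-valued) pixels per grid cell of a binarized scan.

    Assumes ink is 0 and background is 255, as produced by
    cv2.adaptiveThreshold with THRESH_BINARY.
    """
    h, w = binary_img.shape
    density = np.zeros((grid, grid))
    for i in range(grid):
        for j in range(grid):
            cell = binary_img[i * h // grid:(i + 1) * h // grid,
                              j * w // grid:(j + 1) * w // grid]
            density[i, j] = np.mean(cell == 0)
    return density

def stamp_mask(density: np.ndarray, threshold: float = 0.35) -> np.ndarray:
    """Boolean mask of cells dense enough to suggest a stamp or overlay."""
    return density > threshold
```

Fields whose bounding boxes fall inside masked cells get a confidence penalty downstream, nudging the document toward the review queue.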
## Stage 2 — OCR Engine Selection
Engine choice depends on your language mix:
- Tesseract (open source): Good for clean English and Latin-script text. Vietnamese requires the `vie` language pack, and performance on degraded scans is mediocre.
- PaddleOCR (open source): Strong Vietnamese and Chinese performance, runs locally, no data leaves your infrastructure. Worth the setup cost if your documents are Vietnamese or mixed.
- Google Cloud Vision / AWS Textract / Azure Document Intelligence: All handle multi-language and degraded documents better than Tesseract. Textract has shipping-document-specific models. The tradeoff is per-page cost and the fact that your shipping data leaves your servers.
For Vietnamese importers with mixed-language BoLs, PaddleOCR is the right default:
```python
from paddleocr import PaddleOCR

def extract_text_blocks(image: np.ndarray) -> list[dict]:
    """Extract text blocks with bounding boxes and confidence scores."""
    ocr = PaddleOCR(use_angle_cls=True, lang='en', show_log=False)
    # Use lang='ch' for COSCO/Evergreen docs with Chinese annotations
    results = ocr.ocr(image, cls=True)
    blocks = []
    for line in results[0]:
        bbox, (text, confidence) = line
        blocks.append({
            'text': text,
            'confidence': confidence,
            'bbox': bbox,  # [[x1,y1],[x2,y2],[x3,y3],[x4,y4]]
            'y_center': (bbox[0][1] + bbox[2][1]) / 2,
            'x_center': (bbox[0][0] + bbox[2][0]) / 2,
        })
    # Sort spatially: top-to-bottom, left-to-right
    return sorted(blocks, key=lambda b: (b['y_center'], b['x_center']))
```
Drop blocks below 0.6 confidence before passing to extraction. Low-confidence fragments add noise and cause the LLM to hallucinate field values to compensate.
## Stage 3 — Structured Field Extraction
Clean OCR text is halfway there. Getting it into a consistent JSON structure is where the LLM layer earns its place. Regex-based field matching breaks on layout variation; a well-prompted LLM with a schema constraint handles it far more reliably.
```python
import json
import anthropic

client = anthropic.Anthropic()

BOL_SCHEMA = {
    "bl_reference": "string",
    "shipper_name": "string",
    "consignee_name": "string",
    "port_of_loading": "string",
    "port_of_discharge": "string",
    "container_numbers": ["string"],
    "cargo_lines": [
        {
            "description": "string",
            "hs_code": "string | null",
            "gross_weight_kg": "number | null",
            "packages": "number | null"
        }
    ]
}

def extract_bol_fields(text_blocks: list[dict]) -> dict:
    """Extract structured BoL fields from OCR text using Claude."""
    raw_text = "\n".join(
        b['text'] for b in text_blocks if b['confidence'] >= 0.6
    )
    prompt = f"""Extract the following fields from this bill of lading text.
Return valid JSON matching this schema exactly:

{json.dumps(BOL_SCHEMA, indent=2)}

If a field is not found, use null. Do not guess. Do not infer missing values.

BILL OF LADING TEXT:
{raw_text}
"""
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return {}
```
The instruction “if a field is not found, use null. Do not guess.” is doing real work here. Without it, models fill gaps with plausible-sounding values — which is worse than a null because the error is invisible.
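Even with that instruction, treat the model's output as untrusted input. A minimal structural check against the schema above, run before any Odoo call, catches truncated or malformed responses; `validate_extraction` is a hypothetical helper, not part of any SDK:

```python
REQUIRED_KEYS = {
    "bl_reference", "shipper_name", "consignee_name",
    "port_of_loading", "port_of_discharge", "container_numbers", "cargo_lines",
}

def validate_extraction(extracted: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload is usable.
    Values may legitimately be null, but missing keys or wrong types are not."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS - extracted.keys()]
    if not isinstance(extracted.get("container_numbers"), list):
        problems.append("container_numbers must be a list")
    for i, line in enumerate(extracted.get("cargo_lines") or []):
        if not isinstance(line, dict) or "description" not in line:
            problems.append(f"cargo_lines[{i}] malformed")
    return problems
```

Anything with a non-empty problem list goes straight to the review queue rather than to partner or product matching.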
Gotcha — HS codes: These are frequently misread. OCR will transpose adjacent characters in degraded scans (8471.30 becomes 8741.30). HS codes carry no check digit, so there is nothing to recompute: validate every extracted code by lookup against the official HS nomenclature before using it for product matching. Reject and flag rather than accept a corrupted code.
## Stage 4 — Validating Against Odoo Data
Extracted text is not Odoo data. Before creating any records, validate against what already exists in the database.
Partner matching (res.partner):
```python
import xmlrpc.client

def match_partner(
    name: str, odoo_url: str, db: str, uid: int, password: str
) -> int | None:
    """Match an extracted name against res.partner. Returns ID or None."""
    models = xmlrpc.client.ServerProxy(f'{odoo_url}/xmlrpc/2/object')
    partner_ids = models.execute_kw(
        db, uid, password, 'res.partner', 'search',
        [[['name', 'ilike', name], ['active', '=', True]]]
    )
    if len(partner_ids) == 1:
        return partner_ids[0]
    # Zero or multiple matches — route to human review
    return None
```
Don’t create new res.partner records automatically. A BoL might say “ACME TRADING CO.” when Odoo has “Acme Trading Company Ltd.” — creating a duplicate vendor is worse than routing to a review queue. Let the matching failures accumulate in your review queue and batch-resolve them.
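To make that review queue faster to work through, attach ranked candidates to each failed match so the reviewer picks from likely options instead of searching from scratch. A sketch using stdlib `difflib`; the legal-suffix list in `normalize` is an illustrative assumption:

```python
import difflib

def rank_partner_candidates(
    extracted_name: str,
    partner_names: list[tuple[int, str]],  # (partner_id, name) pairs from Odoo
    top_n: int = 3,
) -> list[tuple[int, str, float]]:
    """Rank existing partners by name similarity. Purely advisory:
    the reviewer confirms; nothing is auto-linked."""
    def normalize(s: str) -> str:
        # Strip punctuation and trailing legal suffixes that inflate distance.
        s = " ".join(s.upper().replace(",", "").replace(".", "").split())
        changed = True
        while changed:
            changed = False
            for suffix in (" COMPANY", " LTD", " LLC", " CO"):
                if s.endswith(suffix):
                    s = s[: -len(suffix)]
                    changed = True
        return s

    target = normalize(extracted_name)
    scored = [
        (pid, name, difflib.SequenceMatcher(None, target, normalize(name)).ratio())
        for pid, name in partner_names
    ]
    return sorted(scored, key=lambda t: -t[2])[:top_n]
```

With this, “ACME TRADING CO.” surfaces “Acme Trading Company Ltd.” as the top candidate, and the reviewer resolves it in one click instead of creating a duplicate.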
Product matching (product.product):
HS codes are your most reliable key when they’re clean. Cargo description matching is messier — “POLYETHYLENE TEREPHTHALATE RESIN” and “PET RESIN” are the same product, but string similarity alone won’t catch it. Vector similarity on product names against product.product.name handles this better — the field_vector OCA module makes this straightforward if you’ve already indexed your product catalogue.
Strategy: exact HS code match first; fall back to vector similarity on description; if neither clears a confidence threshold, flag for review.
## Stage 5 — Creating stock.picking Records
With validated partner IDs and product IDs, create the receipt record via XML-RPC:
```python
import xmlrpc.client

def create_stock_picking(
    odoo_url: str, db: str, uid: int, password: str,
    extracted: dict,
    vendor_id: int,
    products: list[dict],   # [{'product_id': int, 'qty': float, 'uom_id': int, 'description': str}]
    picking_type_id: int,   # Incoming shipment picking type ID
    location_id: int,       # Vendor location
    location_dest_id: int,  # Destination warehouse location
) -> int:
    """Create a draft stock.picking from validated BoL data."""
    models = xmlrpc.client.ServerProxy(f'{odoo_url}/xmlrpc/2/object')
    move_lines = [
        (0, 0, {
            'name': p.get('description', '/'),
            'product_id': p['product_id'],
            'product_uom_qty': p['qty'],
            'product_uom': p['uom_id'],
            'location_id': location_id,
            'location_dest_id': location_dest_id,
        })
        for p in products
    ]
    container_list = ', '.join(extracted.get('container_numbers', []))
    picking_id = models.execute_kw(
        db, uid, password, 'stock.picking', 'create', [{
            'partner_id': vendor_id,
            'picking_type_id': picking_type_id,
            'location_id': location_id,
            'location_dest_id': location_dest_id,
            'origin': extracted.get('bl_reference', ''),
            'note': f"Auto-created from BoL OCR. Containers: {container_list}",
            'move_ids_without_package': move_lines,
        }]
    )
    return picking_id
```
The stock.picking is created in draft state (state = 'draft'). Don’t validate it programmatically. The warehouse team confirms receipt against physical goods before the picking moves to done — bypassing that step creates inventory discrepancies that are painful to unwind.
## Putting the Pipeline Together
The full flow:
- Ingest — a BoL arrives as an email attachment or is uploaded to an Odoo record
- Pre-process — deskew, threshold, denoise
- OCR — extract text blocks with spatial coordinates and confidence scores
- Extract — LLM produces structured JSON from OCR text
- Validate — match against `res.partner` and `product.product`; compute overall confidence
- Route — high-confidence results auto-create draft `stock.picking`; the rest go to a review queue
- Confirm — warehouse team validates the draft picking against physical goods
At a threshold of 0.80+, expect 75–85% of clean PDFs from known shipping lines to clear the auto-creation step. Scanned paper documents from smaller carriers will have a higher failure rate — budget for it.
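One way to implement that routing combines per-field confidences with the stamp-zone penalty from the pre-processing stage. The 0.15 penalty and the min-gating rule are assumptions to tune against your own review-queue data, not fixed recommendations:

```python
def route_document(field_confidences: dict[str, float],
                   stamp_affected: set[str],
                   threshold: float = 0.80) -> str:
    """Route a BoL to auto-creation or human review.

    field_confidences: per-field extraction confidence in [0, 1].
    stamp_affected: fields whose zones overlapped a high-ink-density region;
    these are penalized (0.15 is an assumed, tunable penalty).
    """
    if not field_confidences:
        return "review"
    adjusted = {
        field: max(0.0, conf - (0.15 if field in stamp_affected else 0.0))
        for field, conf in field_confidences.items()
    }
    # The weakest required field gates the whole document: one unreadable
    # consignee block is enough to force review.
    overall = min(adjusted.values())
    return "auto_create" if overall >= threshold else "review"
```

Gating on the minimum rather than the average is deliberate: an average hides a single catastrophically bad field behind several good ones.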
This pipeline isn’t set-and-forget. Shipping lines update BoL templates. New carriers appear. HS code databases change. Instrument your review queue to track which document types fail most often, and plan a quarterly pass over extraction accuracy.
At Trobz, we’ve deployed this pipeline for logistics and import businesses operating in Vietnam — the details above reflect what holds up in production. If you’re building something similar and want to compare notes on the Vietnamese customs document layer specifically, reach out.