Building an OCR Pipeline for Bills of Lading That Feeds Odoo Inventory

Bills of lading arrive in dozens of formats, but your Odoo Inventory needs clean, structured data. Here's how to build a pipeline that goes from PDF scan to stock.picking automatically.

Key Takeaways: Bills of lading (BoLs) are structurally consistent documents, but scan quality, rubber stamps, and mixed-language content make OCR extraction genuinely difficult. A reliable pipeline needs pre-processing, extraction, and validation as three distinct stages — not a single LLM call. Matching extracted data against res.partner and product.product before creating records prevents garbage from entering Inventory. With the right confidence thresholds, you can auto-create stock.picking records for 75–85% of incoming BoLs and route the rest to a human review queue.

Bills of lading arrive daily at every import-heavy operation — dozens, sometimes hundreds, in formats ranging from clean PDFs generated by major shipping lines to paper documents photographed on a dock with a phone. They all contain the same core information: who shipped what, how much, to where. Extracting that reliably and turning it into Odoo Inventory records is the unglamorous backbone of any logistics AI project.

This post walks the full pipeline: document structure, pre-processing, OCR field extraction, Odoo data validation, and automated stock.picking creation. There are real gotchas at every stage.

What a Bill of Lading Actually Contains

Before writing a single line of code, map the document. A standard BoL has a predictable layout — most variation is in formatting, not fields.

The fields you need for Odoo Inventory:

Field BoL Section Odoo Target
Shipper name / address Top-left block res.partner (vendor)
Consignee name / address Top-right block res.partner (company)
Container numbers Middle section Reference on stock.picking
HS codes / cargo description Commodity section product.product matching
Gross weight Measurement section stock.move.product_uom_qty
Port of loading Header Note on stock.picking
Port of discharge Header Destination reference
B/L reference number Top header stock.picking.origin

The structure is consistent within each shipping line — Evergreen, COSCO, and MSC all have recognizable templates. Mix three carriers across 50 documents and you have 50 slightly different layouts.

The practical tradeoff: template-based extraction (zone detection by pixel coordinates) works if your shipping lines are consistent and scan quality is high. If either varies, you need an LLM extraction layer. Most production environments need both.
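To make the template route concrete, here is a sketch of zone-based extraction. The zone coordinates below are purely illustrative (not real MSC template measurements), expressed as fractions of page width and height so one template survives resolution changes; the blocks are assumed to carry the `x_center`/`y_center` keys produced by the OCR stage later in this post.

```python
# Hypothetical per-carrier zone map: field -> (x1, y1, x2, y2) as
# fractions of page width/height, so the template is resolution-independent.
CARRIER_ZONES = {
    'MSC': {
        'shipper': (0.05, 0.08, 0.50, 0.20),
        'consignee': (0.50, 0.08, 0.95, 0.20),
        'bl_reference': (0.60, 0.02, 0.95, 0.07),
    },
}

def blocks_in_zone(blocks: list[dict], zone: tuple,
                   page_w: int, page_h: int) -> list[dict]:
    """Return OCR blocks whose center falls inside a template zone."""
    x1, y1, x2, y2 = zone
    return [
        b for b in blocks
        if x1 * page_w <= b['x_center'] <= x2 * page_w
        and y1 * page_h <= b['y_center'] <= y2 * page_h
    ]
```

When a carrier redesigns its BoL, only the zone map changes — which is exactly why this approach breaks down once the carrier mix grows.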

Stage 1 — Pre-Processing Before You Touch OCR

Most OCR failures trace back to bad input. A document that looks fine on screen might have 200 DPI resolution, a 3-degree skew, and a rubber stamp obscuring the consignee block.

Pre-process every document before passing it to the OCR engine:

import cv2
import numpy as np

def preprocess_bol_image(image_path: str) -> np.ndarray:
    """Deskew, denoise, and binarize a BoL scan for OCR."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # 1. Upscale if resolution is too low (proxy: < 2400px wide for A4 at 300 DPI)
    h, w = img.shape
    if w < 2400:
        scale = 2400 / w
        img = cv2.resize(img, None, fx=scale, fy=scale,
                         interpolation=cv2.INTER_CUBIC)

    # 2. Deskew — estimate page rotation from the dark-pixel cloud.
    # minAreaRect needs float32 points; np.where returns int64.
    coords = np.column_stack(np.where(img < 128)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # Normalize: OpenCV < 4.5 returns angles in [-90, 0); >= 4.5 in (0, 90].
    if angle < -45:
        angle = 90 + angle
    elif angle > 45:
        angle -= 90
    (h, w) = img.shape
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    img = cv2.warpAffine(img, M, (w, h),
                          flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)

    # 3. Adaptive threshold — handles uneven lighting from phone scans
    img = cv2.adaptiveThreshold(
        img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2
    )

    # 4. Denoise
    img = cv2.fastNlMeansDenoising(img, h=10)

    return img

The adaptive threshold step matters more than most tutorials acknowledge. Flat binarization fails on documents where a corner is darker from page curl, or where a stamp bleeds into adjacent text. Adaptive thresholding handles this per-region.

Gotcha — rubber stamps: Vietnamese customs stamps, carrier marks, and “Original” overlays cover text in ways that confuse both traditional OCR and LLM extraction. There’s no clean algorithmic fix. Detect high-ink-density regions and apply confidence penalties for fields extracted from those zones. Factor that into your review routing logic.
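One way to implement that heuristic — a sketch, assuming the binarized output of the pre-processing step (0 = ink, 255 = background) and the four-point bounding boxes the OCR stage returns. The 0.35 density threshold and 0.5 penalty are illustrative starting points, not tuned values.

```python
import numpy as np

def ink_density(img: np.ndarray, bbox: list) -> float:
    """Fraction of dark pixels inside an OCR bounding box on a binarized
    image. Normal body text stays well under ~0.3; stamps and solid
    overlays push density far higher."""
    xs = [p[0] for p in bbox]
    ys = [p[1] for p in bbox]
    region = img[int(min(ys)):int(max(ys)), int(min(xs)):int(max(xs))]
    if region.size == 0:
        return 0.0
    return float(np.count_nonzero(region < 128)) / region.size

def stamp_penalty(density: float, threshold: float = 0.35) -> float:
    """Multiplicative confidence penalty for stamp-like regions."""
    return 0.5 if density > threshold else 1.0
```

Multiply each block's OCR confidence by the penalty before the routing stage, so fields read through a stamp land in the review queue instead of auto-creating records.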

Stage 2 — OCR Engine Selection

Engine choice depends on your language mix:

  • Tesseract (open source): Good for clean English and Latin-script text. Vietnamese requires the vie language pack, and performance on degraded scans is mediocre.
  • PaddleOCR (open source): Strong Vietnamese and Chinese performance, runs locally, no data leaves your infrastructure. Worth the setup cost if your documents are Vietnamese or mixed.
  • Google Cloud Vision / AWS Textract / Azure Document Intelligence: All handle multi-language and degraded documents better than Tesseract. Textract has shipping-document-specific models. The tradeoff is per-page cost and the fact that your shipping data leaves your servers.

For Vietnamese importers with mixed-language BoLs, PaddleOCR is the right default:

from paddleocr import PaddleOCR

def extract_text_blocks(image: np.ndarray) -> list[dict]:
    """Extract text blocks with bounding boxes and confidence scores."""
    # Instantiate once at module level in production — model loading is slow.
    # Use lang='vi' for Vietnamese BoLs, lang='ch' for COSCO/Evergreen
    # docs with Chinese annotations.
    ocr = PaddleOCR(use_angle_cls=True, lang='en', show_log=False)

    results = ocr.ocr(image, cls=True)
    if not results or results[0] is None:  # nothing detected on the page
        return []

    blocks = []
    for line in results[0]:
        bbox, (text, confidence) = line
        blocks.append({
            'text': text,
            'confidence': confidence,
            'bbox': bbox,  # [[x1,y1],[x2,y2],[x3,y3],[x4,y4]]
            'y_center': (bbox[0][1] + bbox[2][1]) / 2,
            'x_center': (bbox[0][0] + bbox[2][0]) / 2,
        })

    # Sort spatially: top-to-bottom, left-to-right
    return sorted(blocks, key=lambda b: (b['y_center'], b['x_center']))

Drop blocks below 0.6 confidence before passing to extraction. Low-confidence fragments add noise and cause the LLM to hallucinate field values to compensate.

Stage 3 — Structured Field Extraction

Clean OCR text is halfway there. Getting it into a consistent JSON structure is where the LLM layer earns its place. Regex-based field matching breaks on layout variation; a well-prompted LLM with a schema constraint handles it far more reliably.

import json
import anthropic

client = anthropic.Anthropic()

BOL_SCHEMA = {
    "bl_reference": "string",
    "shipper_name": "string",
    "consignee_name": "string",
    "port_of_loading": "string",
    "port_of_discharge": "string",
    "container_numbers": ["string"],
    "cargo_lines": [
        {
            "description": "string",
            "hs_code": "string | null",
            "gross_weight_kg": "number | null",
            "packages": "number | null"
        }
    ]
}

def extract_bol_fields(text_blocks: list[dict]) -> dict:
    """Extract structured BoL fields from OCR text using Claude."""
    raw_text = "\n".join(
        b['text'] for b in text_blocks if b['confidence'] >= 0.6
    )

    prompt = f"""Extract the following fields from this bill of lading text.
Return valid JSON matching this schema exactly:
{json.dumps(BOL_SCHEMA, indent=2)}

If a field is not found, use null. Do not guess. Do not infer missing values.

BILL OF LADING TEXT:
{raw_text}
"""

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )

    try:
        text = response.content[0].text.strip()
        # Models sometimes wrap JSON in fences or prose despite instructions;
        # fall back to the outermost braces before giving up.
        if not text.startswith("{"):
            start, end = text.find("{"), text.rfind("}")
            if start != -1:
                text = text[start:end + 1]
        return json.loads(text)
    except json.JSONDecodeError:
        return {}

The instruction “if a field is not found, use null. Do not guess.” is doing real work here. Without it, models fill gaps with plausible-sounding values — which is worse than a null because the error is invisible.
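A minimal post-extraction check keeps that null contract honest. This is a sketch: it normalizes the model's output to the schema keys and collects issues for the routing stage, rather than trusting the raw dict.

```python
REQUIRED_KEYS = {
    'bl_reference', 'shipper_name', 'consignee_name',
    'port_of_loading', 'port_of_discharge',
    'container_numbers', 'cargo_lines',
}

def validate_extraction(extracted: dict) -> tuple[dict, list[str]]:
    """Ensure every schema key is present (missing ones become None)
    and return a list of issues to feed the review-queue routing."""
    issues = []
    clean = {k: extracted.get(k) for k in REQUIRED_KEYS}
    if not clean['bl_reference']:
        issues.append('missing bl_reference')
    if not isinstance(clean['container_numbers'], list):
        issues.append('container_numbers is not a list')
        clean['container_numbers'] = []
    if not clean['cargo_lines']:
        issues.append('no cargo lines extracted')
    return clean, issues
```

An empty issues list does not mean the document is correct — only that it is structurally complete enough to attempt the Odoo matching stage.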

Gotcha — HS codes: These are frequently misread. The OCR will transpose adjacent characters in degraded scans (8471.30 becomes 8741.30). HS codes carry no check digit, so the only reliable test is whether the extracted code actually exists in the official HS nomenclature — validate every code against it before using it for product matching. Reject and flag rather than accept a corrupted code.

Stage 4 — Validating Against Odoo Data

Extracted text is not Odoo data. Before creating any records, validate against what already exists in the database.

Partner matching (res.partner):

import xmlrpc.client

def match_partner(
    name: str, odoo_url: str, db: str, uid: int, password: str
) -> int | None:
    """Match an extracted name against res.partner. Returns ID or None."""
    models = xmlrpc.client.ServerProxy(f'{odoo_url}/xmlrpc/2/object')

    partner_ids = models.execute_kw(
        db, uid, password, 'res.partner', 'search',
        [[['name', 'ilike', name], ['active', '=', True]]]
    )

    if len(partner_ids) == 1:
        return partner_ids[0]

    # Zero or multiple matches — route to human review
    return None

Don’t create new res.partner records automatically. A BoL might say “ACME TRADING CO.” when Odoo has “Acme Trading Company Ltd.” — creating a duplicate vendor is worse than routing to a review queue. Let the matching failures accumulate in your review queue and batch-resolve them.
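A light normalization pass before the ilike search reduces these spurious mismatches without risking duplicates. A sketch — the suffix list is illustrative and should grow from what your review queue actually surfaces:

```python
import re

# Common corporate suffixes that vary between BoLs and Odoo records
CORPORATE_NOISE = re.compile(
    r'\b(co|company|ltd|limited|inc|corp|corporation|jsc|llc)\b\.?',
    re.IGNORECASE,
)

def normalize_company_name(name: str) -> str:
    """Strip punctuation and corporate suffixes so 'ACME TRADING CO.'
    and 'Acme Trading Company Ltd.' reduce to the same search key."""
    name = CORPORATE_NOISE.sub(' ', name.lower())
    name = re.sub(r'[^\w\s]', ' ', name)
    return re.sub(r'\s+', ' ', name).strip()
```

Search on the normalized key, but still require a unique hit — normalization narrows the gap between BoL and database spellings; it does not make ambiguous matches safe.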

Product matching (product.product):

HS codes are your most reliable key when they’re clean. Cargo description matching is messier — “POLYETHYLENE TEREPHTHALATE RESIN” and “PET RESIN” are the same product, but string similarity alone won’t catch it. Vector similarity on product names against product.product.name handles this better — the field_vector OCA module makes this straightforward if you’ve already indexed your product catalogue.

Strategy: exact HS code match first; fall back to vector similarity on description; if neither clears a confidence threshold, flag for review.
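That fallback chain can be sketched as follows — `hs_index` and `similar` are hypothetical stand-ins for your HS-code lookup table and vector-similarity search, and the 0.80 threshold is a starting point to tune against your own catalogue:

```python
from typing import Callable, Optional

def match_product(
    hs_code: Optional[str],
    description: str,
    hs_index: dict[str, int],                      # HS code -> product.product ID
    similar: Callable[[str], tuple[int, float]],   # description -> (ID, score)
    threshold: float = 0.80,
) -> Optional[int]:
    """Exact HS match first, vector similarity on description second,
    human review (None) when neither clears the threshold."""
    if hs_code and hs_code in hs_index:
        return hs_index[hs_code]
    product_id, score = similar(description)
    return product_id if score >= threshold else None
```

Returning None rather than the best low-scoring guess is deliberate: an unmatched cargo line in the review queue is cheap, a wrong product on a stock.move is not.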

Stage 5 — Creating stock.picking Records

With validated partner IDs and product IDs, create the receipt record via XML-RPC:

def create_stock_picking(
    odoo_url: str, db: str, uid: int, password: str,
    extracted: dict,
    vendor_id: int,
    products: list[dict],  # [{'product_id': int, 'qty': float, 'uom_id': int, 'description': str}]
    picking_type_id: int,  # Incoming shipment picking type ID
    location_id: int,      # Vendor location
    location_dest_id: int, # Destination warehouse location
) -> int:
    """Create a draft stock.picking from validated BoL data."""
    models = xmlrpc.client.ServerProxy(f'{odoo_url}/xmlrpc/2/object')

    move_lines = [
        (0, 0, {
            'name': p.get('description', '/'),
            'product_id': p['product_id'],
            'product_uom_qty': p['qty'],
            'product_uom': p['uom_id'],
            'location_id': location_id,
            'location_dest_id': location_dest_id,
        })
        for p in products
    ]

    container_list = ', '.join(extracted.get('container_numbers', []))

    picking_id = models.execute_kw(
        db, uid, password, 'stock.picking', 'create', [{
            'partner_id': vendor_id,
            'picking_type_id': picking_type_id,
            'location_id': location_id,
            'location_dest_id': location_dest_id,
            'origin': extracted.get('bl_reference', ''),
            'note': f"Auto-created from BoL OCR. Containers: {container_list}",
            'move_ids_without_package': move_lines,
        }]
    )

    return picking_id

The stock.picking is created in draft state (state = 'draft'). Don’t validate it programmatically. The warehouse team confirms receipt against physical goods before the picking moves to done — bypassing that step creates inventory discrepancies that are painful to unwind.

Putting the Pipeline Together

The full flow:

  1. Ingest — BoL arrives via email attachment or uploaded to an Odoo record
  2. Pre-process — deskew, threshold, denoise
  3. OCR — extract text blocks with spatial coordinates and confidence scores
  4. Extract — LLM produces structured JSON from OCR text
  5. Validate — match against res.partner and product.product; compute overall confidence
  6. Route — high-confidence results auto-create draft stock.picking; the rest go to a review queue
  7. Confirm — warehouse team validates the draft picking against physical goods

At a threshold of 0.80+, expect 75–85% of clean PDFs from known shipping lines to clear the auto-creation step. Scanned paper documents from smaller carriers will have a higher failure rate — budget for it.
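The routing decision itself can be sketched like this. Scoring here is deliberately simplified to the mean OCR confidence of the blocks that survived the 0.6 cutoff — production scoring should also weigh stamp penalties and validation failures from the earlier stages:

```python
def route_bol(blocks: list[dict], extracted: dict,
              auto_threshold: float = 0.80) -> str:
    """Decide between auto-creating a draft stock.picking and the
    human review queue, based on overall extraction confidence."""
    kept = [b['confidence'] for b in blocks if b['confidence'] >= 0.6]
    confidence = sum(kept) / len(kept) if kept else 0.0
    if confidence >= auto_threshold and extracted.get('bl_reference'):
        return 'auto_create'
    return 'review_queue'
```

Requiring a bl_reference even at high confidence is a cheap guard: a document with no extractable reference number is almost never one you want creating records unattended.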

This pipeline isn’t set-and-forget. Shipping lines update BoL templates. New carriers appear. HS code databases change. Instrument your review queue to track which document types fail most often, and plan a quarterly pass over extraction accuracy.

At Trobz, we’ve deployed this pipeline for logistics and import businesses operating in Vietnam — the details above reflect what holds up in production. If you’re building something similar and want to compare notes on the Vietnamese customs document layer specifically, reach out.

Ready to put AI to work?

Let's explore how Trobz AI can automate your processes, enhance your ERP, and help your team make better decisions — faster.