Vietnamese Vendor Contracts: The Corner Cases That Break Every OCR Pipeline

After processing 3,000+ Vietnamese B2B contracts, the clean PDFs were never the problem. Here's what actually broke our OCR pipeline — and what we did about it.

Key Takeaways: Vietnamese B2B contracts fail OCR pipelines in ways that Western document AI systems simply aren’t built for. Bilingual clauses, handwritten annotations that legally override typed text, red seals obscuring key fields, provincial tax code variants, and non-standard payment phrasing all cluster in the same documents — the hard ones. Each failure mode needs a different fix. Some are preprocessing problems. Some require model-level language handling. A few require human review as a permanent fixture, not a temporary workaround.

After processing over 3,000 Vietnamese vendor contracts — purchase.order source documents flowing into an Odoo-based procurement workflow — we stopped counting the ways a well-tuned OCR pipeline could silently fail. The easy documents were fine. The hard documents were not randomly distributed. They came in clusters, from specific supplier categories, during specific periods. That pattern matters.

This is a field report. Each section describes what we saw, why it broke, and what actually fixed it.

Bilingual Clauses: When the Paragraph Is Half Vietnamese, Half English

What happened. A significant share of contracts from foreign-invested enterprises (FIEs) and joint ventures contain clauses written in both Vietnamese and English within the same paragraph, interleaved rather than set in parallel columns. A payment clause might read: “Bên mua sẽ thanh toán trong vòng 30 (thirty) days from the date of invoice…” The sentence opens in Vietnamese, states the numeral, repeats it in an English parenthetical, and then finishes in English.

Standard OCR models, even ones with multilingual support, treat language detection as a document-level or paragraph-level classification. When a paragraph is ambiguous — mixed script, mixed vocabulary — language detection degrades. Tesseract, configured for Vietnamese (vie), would occasionally misread English numerals and dates when the surrounding text was dense Vietnamese. Switching to English mode (eng) fixed the numerals but mangled the tonal diacritics.

Why it broke. The root cause isn’t character recognition: modern OCR handles Vietnamese diacritics reasonably well on clean scans. The problem is tokenisation and normalisation downstream. When a downstream NER (Named Entity Recognition) model tries to extract “payment terms”, it expects either a Vietnamese phrase or an English phrase. A hybrid phrase produced ambiguous confidence scores from both models, and the extraction pipeline would often drop the field entirely rather than output a low-confidence value.

The fix. We switched from language-level OCR mode selection to character-level Unicode block analysis. Before sending text to the NER model, a preprocessing step scans each extracted line and tags it by dominant script: plain Latin (English) versus Latin with Vietnamese diacritics. Payment term extraction now accepts hybrid-format fields explicitly, using a regex that matches Vietnamese payment phrasing followed by an Arabic numeral, whatever the language of the surrounding text. Accuracy on bilingual clauses improved from 61% to 89% after this change.
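A minimal sketch of the tagging step and the hybrid regex, using only the standard library. The function names, the 0.5 share threshold, and the exact phrase list in the regex are assumptions made for illustration, not the production rules.

```python
import re
import unicodedata

# Combining marks used in Vietnamese: five tone marks plus circumflex, breve, horn.
_VI_MARKS = set("\u0300\u0301\u0303\u0309\u0323\u0302\u0306\u031b")


def _is_vietnamese_word(word: str) -> bool:
    """True if the word contains đ/Đ or a letter carrying a Vietnamese mark."""
    decomposed = unicodedata.normalize("NFD", word)
    return any(ch in _VI_MARKS or ch in "đĐ" for ch in decomposed)


def tag_line(line: str, vi_share: float = 0.5) -> str:
    """Tag a line as 'vietnamese', 'latin', 'mixed', or 'none' (no words)."""
    text = unicodedata.normalize("NFC", line)
    words = re.findall(r"[^\W\d_]+", text)  # alphabetic runs only
    if not words:
        return "none"
    share = sum(_is_vietnamese_word(w) for w in words) / len(words)
    if share == 0:
        return "latin"
    return "vietnamese" if share >= vi_share else "mixed"


# Vietnamese lead-in phrase, numeral, optional parenthetical, unit in either language.
_HYBRID_TERM = re.compile(
    unicodedata.normalize(
        "NFC",
        r"(?:trong vòng|không quá)\s+(\d{1,3})\s*(?:\([^)]*\)\s*)?"
        r"(ngày làm việc|working days|ngày|days)",
    ),
    re.IGNORECASE,
)


def extract_hybrid_term(line: str):
    """Return (number, unit) for a hybrid payment phrase, or None if no match."""
    match = _HYBRID_TERM.search(unicodedata.normalize("NFC", line))
    if match is None:
        return None
    return int(match.group(1)), match.group(2).lower()
```

Unmarked Vietnamese words such as “mua” or “thanh” look like plain Latin to this heuristic, which is why the tag is based on the share of marked words in the line rather than on any single word.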

Handwritten Annotations That Override Typed Text

What happened. Vietnamese B2B contracts frequently include handwritten annotations in the margins or directly over typed fields — a practice that carries legal weight. A typed payment term of “45 days” struck through by hand and overwritten with “30 ngày” (30 days) is legally the 30-day version. Our pipeline was reading the typed text, ignoring the annotation, and populating purchase.order payment terms with the superseded value.

This wasn’t an edge case. Roughly 22% of the contracts in our dataset had at least one handwritten annotation. Of those, about 40% were payment or pricing modifications — the fields that matter most for accounts payable reconciliation against account.move.

Why it broke. Off-the-shelf OCR models are optimised for the dominant text layer. Handwriting on top of printed text creates a composite image where the two layers compete for pixel classification. The printed text, being darker and more regular, tends to win. Handwriting is treated as noise. Most document AI products — including major cloud vision APIs — do not surface “this field was annotated” as a signal. They just give you text confidence scores, and the typed version often scores higher.

The fix. We added a handwriting detection pre-pass using a separate lightweight classifier trained to flag whether a scanned page contains handwritten strokes overlaid on printed text. When the classifier fires, those pages are routed to a separate extraction pipeline that uses a vision-language model (we tested GPT-4o and Gemini 1.5 Pro; both handled this better than any text-based OCR approach). The model is prompted explicitly: “If any typed fields appear to have been crossed out or overwritten by hand, extract the handwritten version.” Human review is still triggered for annotated amount fields above a defined threshold — that’s not conservatism, it’s risk management. A wrongly extracted payment term on a large contract is expensive to unwind.
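The routing decision described above can be sketched as a small function. The classifier and the vision-language model are out of scope here; `PageSignal`, the route names, and the threshold value are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageSignal:
    has_handwriting: bool              # output of the handwriting pre-pass
    annotated_amount: Optional[float]  # amount read from an annotated field, if any


def route_page(signal: PageSignal, review_threshold: float) -> str:
    """Pick an extraction route for one scanned page."""
    if not signal.has_handwriting:
        return "standard_ocr"
    # Annotated amounts above the threshold always get a human check.
    if signal.annotated_amount is not None and signal.annotated_amount > review_threshold:
        return "vlm_then_human_review"
    return "vlm_extraction"
```

Keeping the threshold as a parameter lets it be tuned per customer or contract category without touching the routing logic.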

Seals Obscuring Key Fields

What happened. Vietnamese business documents require official seals (con dấu) — circular, red ink stamps placed on signed pages. The placement isn’t standardised. Procurement contracts will typically have the seal stamped on the final signature page, but we consistently saw seals landing directly on top of the vendor’s tax code, bank account number, or contract value. Red ink on printed black text creates a colour collision that confuses contrast-based OCR preprocessing.

On some documents, the seal covered the entire tax code field. The pipeline extracted partial text and passed it through — not flagged as missing, just wrong.

Why it broke. Standard binarisation (converting the scanned image to black-and-white before OCR) treats red ink as dark content. When a red seal overlays black text, the binarised image has two overlapping dark regions: the underlying text characters, and the seal’s circular outline and interior text. Tesseract and similar engines struggle to separate them. The result is garbled character extraction: extracted text that looks like letters but isn’t valid data.

The fix. Two changes helped significantly. First, we added colour-space preprocessing: before binarisation, we apply an HSV filter to remove pixels in the red range (roughly H: 0–10 and 160–179 on OpenCV’s 0–179 hue scale). This removes most of the seal without destroying the underlying text, since the text is black. Second, we added post-extraction validation rules for tax codes specifically. Vietnamese tax codes (mã số thuế) follow a defined format: 10 or 13 digits, with the first two digits being the province code. Any extracted tax code that fails format validation triggers a reprocessing pass with the colour filter applied, then human review if it fails again. This reduced seal-related extraction failures by about 70%.
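Both checks can be illustrated with the standard library alone. In practice the colour filter would more likely be a vectorised OpenCV call (`cv2.cvtColor` to HSV followed by `cv2.inRange`); the per-pixel version below only makes the thresholds easy to inspect. The saturation and value floors, the function names, and the optional hyphen before a 3-digit suffix are assumptions.

```python
import colorsys
import re
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0-255


def is_seal_red(r: int, g: int, b: int,
                min_sat: float = 0.4, min_val: float = 0.3) -> bool:
    """True for saturated red pixels, using OpenCV-style hue units (0-179)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    hue = h * 180
    # Black text has low saturation and value, so it never matches.
    return (hue <= 10 or hue >= 160) and s >= min_sat and v >= min_val


def strip_seal(pixels: List[List[Pixel]]) -> List[List[Pixel]]:
    """Replace seal-coloured pixels with white before binarisation."""
    white = (255, 255, 255)
    return [[white if is_seal_red(*p) else p for p in row] for row in pixels]


# 10 digits, or 13 digits written with or without a hyphen before the last 3.
_TAX_CODE = re.compile(r"\d{10}(?:-?\d{3})?")


def is_valid_tax_code_format(code: str) -> bool:
    """Format check only: it says nothing about whether the code is registered."""
    return _TAX_CODE.fullmatch(code.strip()) is not None
```

A code that fails `is_valid_tax_code_format` is a candidate for the reprocessing pass; a code that passes can still be wrong, so the check narrows the error set rather than eliminating it.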

Non-Standard Payment Term Phrasing

What happened. The canonical Vietnamese expression for payment terms isn’t fixed. Vietnamese B2B contracts use variants that carry the same legal meaning but differ enough in surface form to defeat keyword-based extraction:

“30 ngày kể từ ngày nhận hóa đơn” — 30 days from the date of receiving the invoice
“30 ngày làm việc” — 30 working days (significantly different from 30 calendar days)
“trong vòng một tháng” — within one month
“không quá 45 ngày” — not exceeding 45 days
“thanh toán trước” — advance payment (no term, payment up front)

A regex designed to extract “N days” would catch the first case. On the second it would capture the number but drop the “làm việc” qualifier, silently turning working days into calendar days. It would fail entirely on the phrase-based cases. Mapping “một tháng” (one month) to 30 days is usually close enough, but not in contracts that define a month as the calendar period, which runs from 28 to 31 days.

Why it broke. Payment term extraction in most document AI pipelines is designed around English or European contract conventions. The Vietnamese equivalents aren’t in the training data for the NER models we were using. Beyond that, the distinction between calendar days and working days is critical for account.move due date calculation in Odoo — getting this wrong shifts payment deadlines by 6–10 days on any week with public holidays.
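To make the calendar-day versus working-day gap concrete, here is a small due date calculation. It assumes a Monday to Friday working week and a caller-supplied holiday set; the function name is hypothetical and this is not how Odoo itself computes due dates.

```python
from datetime import date, timedelta
from typing import FrozenSet


def due_date(invoice_date: date, days: int, basis: str = "calendar_days",
             holidays: FrozenSet[date] = frozenset()) -> date:
    """Due date for a net term counted in calendar days or working days."""
    if basis == "calendar_days":
        return invoice_date + timedelta(days=days)
    current = invoice_date
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        # Count only Monday-Friday dates that are not listed as holidays.
        if current.weekday() < 5 and current not in holidays:
            remaining -= 1
    return current
```

For an invoice dated Friday 1 March 2024, a 15-day term falls due on 16 March counted in calendar days and on 22 March counted in working days. Every holiday inside the window pushes the working-day date out further.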

The fix. We built a Vietnamese payment term normaliser — a small lookup table combined with a lightweight language model classifier. The lookup table handles the most common surface forms. The classifier handles everything else, trained on ~500 labelled examples extracted from our contract corpus. Output is a structured object: {type: "net", days: 30, basis: "working_days"} or {type: "advance"}. This structured output maps cleanly to Odoo’s payment term configuration on account.payment.term. The classifier runs in under 50ms per clause on CPU, so latency isn’t an issue.
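The lookup-table half of the normaliser might look like the sketch below. The rule list covers only the surface forms quoted earlier, the function name is hypothetical, and returning `None` stands in for "hand off to the classifier or to review".

```python
import re
import unicodedata
from typing import Optional


def _pattern(text: str) -> "re.Pattern[str]":
    # Normalise to NFC so precomposed and decomposed input compare equal.
    return re.compile(unicodedata.normalize("NFC", text), re.IGNORECASE)


# Order matters: the working-day rule must run before the plain "N ngày" rule.
_ADVANCE = _pattern(r"thanh toán trước")
_WORKING_DAYS = _pattern(r"(\d{1,3})\s*ngày làm việc")
_CALENDAR_DAYS = _pattern(r"(\d{1,3})\s*ngày")


def normalise_payment_term(clause: str) -> Optional[dict]:
    """Map a Vietnamese payment clause to a structured term, or None if unknown."""
    text = unicodedata.normalize("NFC", clause)
    if _ADVANCE.search(text):
        return {"type": "advance"}
    match = _WORKING_DAYS.search(text)
    if match:
        return {"type": "net", "days": int(match.group(1)), "basis": "working_days"}
    match = _CALENDAR_DAYS.search(text)
    if match:
        return {"type": "net", "days": int(match.group(1)), "basis": "calendar_days"}
    # Month-based and other phrasing is left to the classifier or human review.
    return None
```

Checking the more specific pattern first is what stops “30 ngày làm việc” from being read as 30 calendar days.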

For the genuinely ambiguous cases — “một tháng” in a contract with no explicit day-count definition — we flag for human review. About 3% of documents fall into that category. That’s acceptable.

Provincial Tax Code Variants

What happened. Vietnamese tax codes (MST) begin with a two-digit province code. Ho Chi Minh City codes start with 03; Hanoi starts with 01. But suppliers registered in industrial zones, export processing zones, or recently reorganised provinces sometimes have codes that don’t match the current official province mapping. A supplier physically operating in Binh Duong might have a tax code registered when Binh Duong was still part of Song Be province — different prefix, same current address.

Our validation layer was checking extracted tax codes against the current official province code list. It was rejecting valid tax codes from older registrations as extraction errors.

Why it broke. The official tax code format documentation we used was current. The contracts we were processing weren’t. Vietnam has reorganised provinces several times since the late 1990s. Suppliers incorporated before those changes have legacy codes that are still legally valid and still in use. The gap between “current official format” and “historically valid format” caught us.

The fix. We expanded the tax code validation ruleset to include historical province codes from major reorganisation events (1997, 2004, 2008). More importantly, we added a cross-reference check: when an extracted tax code fails format validation, we query the Vietnam General Department of Taxation’s public lookup service. If the code resolves to a valid registration, it’s accepted. If it doesn’t resolve, it’s flagged for review — not auto-rejected. The distinction matters: auto-rejection generates noise. Flagging generates actionable items.
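A sketch of the accept, look up, or flag decision. The prefix sets are passed in by the caller, and `lookup` is a hypothetical callable standing in for the tax authority query, whose actual interface is not described here.

```python
from typing import Callable, FrozenSet


def check_tax_code(code: str,
                   current_prefixes: FrozenSet[str],
                   legacy_prefixes: FrozenSet[str],
                   lookup: Callable[[str], bool]) -> str:
    """Return 'accepted' or 'needs_review'; never auto-reject."""
    digits = code.replace("-", "").strip()
    well_formed = digits.isascii() and digits.isdigit() and len(digits) in (10, 13)
    if well_formed and digits[:2] in (current_prefixes | legacy_prefixes):
        return "accepted"
    # Format check failed: ask the registry before involving a person.
    if lookup(code):
        return "accepted"
    return "needs_review"
```

The function has no "rejected" outcome by design: anything it cannot confirm becomes a review item with the original code attached.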

What This Means for OCR Pipeline Design

The failure modes above share a pattern: each one looks manageable in isolation, but they cluster. A single contract from a foreign-invested enterprise in an industrial zone, signed pre-2008, with an overseas payment arrangement, annotated by a procurement manager — that document could trigger all five failure modes simultaneously.

A pipeline designed for the 80% clean case will fail on the 20% hard case. And the hard cases are rarely random. They cluster around specific supplier types, contract categories, and time periods. Building for the hard cases first isn’t over-engineering. It’s the only way to get a pipeline that actually works in production.

The architectural lesson is that document AI for Vietnamese contracts needs multiple specialised components — not one general-purpose model. A preprocessing layer for colour and annotation detection. Language-aware extraction that handles bilingual text. A normalisation layer that understands local phrasing conventions. A validation layer that knows historical formats. Human review as a structured output, not an admission of failure.
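One way to express "human review as a structured output" is to let every stage append a reason to the document instead of raising or silently dropping a field. The class, stage factory, and route names below are hypothetical, and the single stage type is only a placeholder for the real components.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    fields: Dict[str, str] = field(default_factory=dict)
    review_reasons: List[str] = field(default_factory=list)


Stage = Callable[[Document], Document]


def require_field(name: str) -> Stage:
    """Build a stage that flags the document when a field is missing or empty."""
    def stage(doc: Document) -> Document:
        if not doc.fields.get(name):
            doc.review_reasons.append(f"{name}_missing")
        return doc
    return stage


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Run the stages in order; each one may add fields or review reasons."""
    for stage in stages:
        doc = stage(doc)
    return doc


def route(doc: Document) -> str:
    """Any recorded reason sends the document to a reviewer."""
    return "human_review" if doc.review_reasons else "auto_import"
```

Because the reasons travel with the document, a reviewer sees why it was routed to them, and the same list can be aggregated to find which supplier categories generate the hard cases.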

We’ve built this into our document AI pipeline for Odoo-based procurement workflows. The pipeline is opinionated by design: it assumes the hard cases exist and routes them explicitly, rather than pretending a confidence score is the same thing as accuracy.

At Trobz, we build and maintain document AI pipelines for Vietnamese and SEA procurement contexts — from initial OCR through to validated data in Odoo. If you’re running into similar failure patterns, we’re happy to compare notes.