Womblex Development Fixtures

Building a document extraction pipeline means testing against real documents, not just clean samples. This repository provides a curated set of test fixtures drawn from established OCR research datasets and real Australian government document releases.

Each fixture preserves the original ground truth format of its source dataset, so accuracy metrics are directly comparable to published benchmarks.

Who is this for?

Developers building or benchmarking document extraction pipelines
Researchers evaluating OCR accuracy across document types
Anyone contributing to Womblex who needs test data

What's included

The collection covers four source datasets, selected for variety across difficulty and content type:

FUNSD: Form Understanding in Noisy Scanned Documents

Five samples ranging from sparse forms (25 words) to dense forms (434 words), including header-heavy layouts. PNG images with JSON annotations.

Paper: Jaume et al., ICDAR-OST 2019
Licence: See dataset terms
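
The word counts quoted for the FUNSD samples can be checked against the annotations themselves. The following is a minimal sketch, assuming the fixtures keep the published FUNSD JSON layout (a top-level "form" list whose entities each carry a "words" list). The function names are illustrative and not part of Womblex, and the counts in this README may have been produced differently.

```python
import json
from pathlib import Path


def count_words(annotation: dict) -> int:
    """Count non-empty word tokens in a FUNSD-style annotation dict."""
    total = 0
    for entity in annotation.get("form", []):
        for word in entity.get("words", []):
            # Skip tokens whose text is empty or whitespace only.
            if word.get("text", "").strip():
                total += 1
    return total


def count_words_in_file(path: Path) -> int:
    """Load a FUNSD JSON annotation file and count its words."""
    with path.open(encoding="utf-8") as handle:
        return count_words(json.load(handle))


# Tiny in-memory example in the FUNSD layout (not taken from the fixtures).
example = {
    "form": [
        {
            "id": 0,
            "label": "question",
            "text": "Date:",
            "box": [10, 10, 60, 25],
            "words": [{"text": "Date:", "box": [10, 10, 60, 25]}],
            "linking": [[0, 1]],
        },
        {
            "id": 1,
            "label": "answer",
            "text": "12 March",
            "box": [70, 10, 150, 25],
            "words": [
                {"text": "12", "box": [70, 10, 90, 25]},
                {"text": "March", "box": [95, 10, 150, 25]},
            ],
            "linking": [[0, 1]],
        },
    ]
}
print(count_words(example))  # 3
```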

IAM Handwriting Database (line level)

Five handwriting samples, from single words to long sentences and from narrow to wide images. PNG images with .gt.txt sidecar files.

Paper: Marti & Bunke, IJDAR 2002
Licence: MIT (Hugging Face distribution by Teklia)
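
Pairing each line image with its ground truth is a small directory walk. This sketch assumes a sidecar shares its image's base name (for example, sample.png next to sample.gt.txt). That naming rule and the function name are assumptions, so check them against the actual fixture layout.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_line_samples(directory: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (image path, ground-truth text) for each PNG with a .gt.txt sidecar."""
    for image_path in sorted(directory.glob("*.png")):
        # Assumed convention: "name.png" is paired with "name.gt.txt".
        gt_path = image_path.with_name(image_path.stem + ".gt.txt")
        if not gt_path.exists():
            continue  # Skip images that have no ground truth beside them.
        yield image_path, gt_path.read_text(encoding="utf-8").strip()
```

Iterating over iter_line_samples(Path("path/to/iam")) would then give the image path and expected transcription for each sample. The directory shown is a placeholder.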

DocLayNet: Document Layout Segmentation

Five samples covering diverse layouts, table-heavy documents, formulas, and sparse-to-dense text. PNG images with JSON sidecar files.

Paper: Pfitzmann et al., KDD 2022
Licence: CDLA-Permissive 2.0

Womblex Collection: Real Australian Government Documents

Mixed-format documents covering the file types Womblex targets: redacted PDFs, large native PDFs, Word documents, CSVs, and multi-sheet Excel files. Each comes with a _transcript.txt sidecar for ground truth comparison.

Source: Australian government document releases and open data portals
Licence: Commonwealth of Australia (various Crown copyright terms)

Documents include an ACT regulatory decision letter (3 pages, redacted), an ANAO Major Projects Report (406 pages), DFAT portfolio budget statements (DOCX), a childcare provider registry (10,859-row CSV), and fuel stockholding statistics (7-sheet XLSX).
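
One common way to compare extracted text with a _transcript.txt sidecar is the character error rate (CER): the edit distance between the two strings divided by the length of the reference. The sketch below is a plain-Python illustration of that metric, not Womblex's own scoring code (which this README does not describe). It applies no whitespace or case normalisation.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(character_error_rate("abcd", "abxd"))  # 0.25
```

Here the reference would be the contents of a _transcript.txt file and the hypothesis the text your pipeline extracted from the matching document.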

Pre-built extraction outputs

The repository also includes serialised extraction results and pre-chunked text for each document, so you can work with the pipeline outputs without running extraction yourself. These are stored as JSON files in _extracted/ and _chunks/ directories.
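
Because these outputs are plain JSON, the standard library is enough to read them. The following sketch loads every JSON file in one directory. It assumes nothing about the schema inside each file, which is not described here, and the helper name is illustrative.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_json_outputs(directory: Path) -> Dict[str, Any]:
    """Map each JSON file's stem (file name without .json) to its parsed content."""
    outputs: Dict[str, Any] = {}
    for json_path in sorted(directory.glob("*.json")):
        with json_path.open(encoding="utf-8") as handle:
            outputs[json_path.stem] = json.load(handle)
    return outputs
```

Calling load_json_outputs on an _extracted/ or _chunks/ directory returns a dictionary keyed by file name, which is a convenient starting point for inspecting the structure of the stored results.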

How to use

The fixtures are included as a Git submodule in the main Womblex repository. If you're developing Womblex, fetch them from inside your Womblex checkout:

git submodule update --init

If you want the fixtures standalone:

git clone https://github.com/DeepCivic/womblex-development-fixtures.git

Source code and licence

github.com/DeepCivic/womblex-development-fixtures (Apache 2.0)

Individual datasets carry their own licences as noted above.

Last updated: April 2026