Womblex Development Fixtures
Building a document extraction pipeline means testing against real documents, not just clean samples. This repository provides a curated set of test fixtures drawn from established OCR research datasets and real Australian government document releases.
Each fixture preserves the ground-truth format of its source dataset, so accuracy metrics can be computed the same way as in the published benchmarks.
Who is this for?
- Developers building or benchmarking document extraction pipelines
- Researchers evaluating OCR accuracy across document types
- Anyone contributing to Womblex who needs test data
What's included
The collection covers four source datasets, selected for variety across difficulty and content type:
FUNSD: Form Understanding in Noisy Scanned Documents
Five samples, from sparse forms (25 words) to dense forms (434 words), including header-heavy layouts. PNG images with JSON annotations.
- Paper: Jaume et al., ICDAR-OST 2019
- Licence: See dataset terms
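The word counts quoted for the FUNSD samples can be approximated from an annotation file. The following is a minimal sketch, assuming the fixtures keep the standard FUNSD annotation layout (a top-level "form" list whose entities each carry a "words" list of tokens with a "text" field). The helper names count_words and count_words_in_file are illustrative and not part of Womblex, and the result may differ from the quoted figures if those were counted differently.

```python
import json
from pathlib import Path


def count_words(annotation: dict) -> int:
    """Count non-empty word tokens in a FUNSD-style annotation dict."""
    return sum(
        1
        for entity in annotation.get("form", [])
        for word in entity.get("words", [])
        if word.get("text", "").strip()  # skip blank tokens
    )


def count_words_in_file(path: Path) -> int:
    """Load one annotation JSON file and count its word tokens."""
    with path.open(encoding="utf-8") as handle:
        return count_words(json.load(handle))
```

Counting tokens rather than splitting entity text on whitespace keeps the count aligned with the per-word bounding boxes in the annotation.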
IAM Handwriting Database (line level)
Five handwriting samples, from single words to long sentences and from narrow to wide images. PNG images with .gt.txt sidecar files.
- Paper: Marti & Bunke, IJDAR 2002
- Licence: MIT (HuggingFace distribution by Teklia)
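Pairing each line image with its sidecar is usually the first step in a handwriting test. The sketch below assumes a naming convention in which line.png sits next to line.gt.txt in the same directory. That convention and the helper name pair_with_ground_truth are assumptions for illustration, so check them against the actual fixture layout.

```python
from pathlib import Path


def pair_with_ground_truth(directory: Path) -> dict[str, str]:
    """Map each PNG file name to the text of its .gt.txt sidecar.

    Assumes `<stem>.png` is accompanied by `<stem>.gt.txt`; images
    without a sidecar are skipped.
    """
    pairs = {}
    for image in sorted(directory.glob("*.png")):
        sidecar = image.with_name(image.stem + ".gt.txt")
        if sidecar.is_file():
            # Strip the trailing newline that text editors tend to add.
            pairs[image.name] = sidecar.read_text(encoding="utf-8").strip()
    return pairs
```

Skipping unpaired images silently keeps the sketch short; a stricter test harness might prefer to raise an error so that a missing sidecar is noticed.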
DocLayNet: Document Layout Segmentation
Five samples covering diverse layouts, table-heavy documents, formulas, and sparse-to-dense text. PNG images with JSON sidecar files.
- Paper: Pfitzmann et al., KDD 2022
- Licence: CDLA-Permissive 2.0
Womblex Collection: Real Australian Government Documents
Mixed-format documents covering the file types Womblex targets: redacted PDFs, large native PDFs, Word documents, CSVs, and multi-sheet Excel files. Each comes with a _transcript.txt sidecar for ground truth comparison.
- Source: Australian government document releases and open data portals
- Licence: Commonwealth of Australia (various Crown copyright terms)
The documents include an ACT regulatory decision letter (3 pages, redacted), an ANAO Major Projects Report (406 pages), DFAT portfolio budget statements (DOCX), a childcare provider registry (CSV, 10,859 rows), and fuel stockholding statistics (XLSX, 7 sheets).
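One common way to compare extracted text against a _transcript.txt sidecar is the character error rate (CER): the edit distance between the two strings divided by the length of the reference. The sketch below is one possible implementation using only the standard library. It is not necessarily the metric Womblex itself reports, and the whitespace normalisation step is an assumption made here to avoid penalising line-wrapping differences.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(extracted: str, reference: str) -> float:
    """Edit distance over reference length, after collapsing whitespace."""
    hyp = " ".join(extracted.split())
    ref = " ".join(reference.split())
    if not ref:
        # Empty reference: perfect only if the extraction is empty too.
        return 0.0 if not hyp else 1.0
    return levenshtein(hyp, ref) / len(ref)
```

This pure-Python version runs in time proportional to the product of the two string lengths, so for the larger documents it is more practical to compare page by page or to swap in an optimised edit-distance implementation.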
Pre-built extraction outputs
The repository also includes serialised extraction results and pre-chunked text for each document, so you can work with the pipeline outputs without running extraction yourself. These are stored as JSON files in the _extracted/ and _chunks/ directories.
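Reading the pre-built outputs needs nothing beyond the standard library. The sketch below loads every JSON file found directly inside one of these directories and makes no assumption about the schema of the files. The helper name load_json_outputs is illustrative, and the exact location of the _extracted/ and _chunks/ directories within the repository should be checked against the fixture layout.

```python
import json
from pathlib import Path


def load_json_outputs(directory: Path) -> dict[str, object]:
    """Parse every *.json file in a directory, keyed by file stem."""
    outputs = {}
    for path in sorted(directory.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            outputs[path.stem] = json.load(handle)
    return outputs
```

Calling the helper once on an _extracted/ directory and once on the matching _chunks/ directory gives two dictionaries that can be joined on their keys, provided both directories use the same file stems (an assumption worth verifying).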
How to use
The fixtures are included as a Git submodule in the main Womblex repository. If you're developing Womblex, initialise the submodule from inside your checkout:
git submodule update --init
If you want the fixtures standalone:
git clone https://github.com/DeepCivic/womblex-development-fixtures.git
Source code and licence
github.com/DeepCivic/womblex-development-fixtures — Apache 2.0
Individual datasets carry their own licences as noted above.
Last updated: April 2026