Development Fixtures for Womblex
Curated samples from public datasets and real Australian government documents for extraction pipeline development. Each subset preserves the original ground truth format of its source dataset.
Datasets
FUNSD
Form Understanding in Noisy Scanned Documents.
- Authors: Guillaume Jaume, Hazim Kemal Ekenel, Jean-Philippe Thiran
- Paper: FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents (ICDAR-OST 2019)
- License: See dataset terms at https://guillaumejaume.github.io/FUNSD/
- Format: PNG images + JSON annotations (`{form: [{box, text, label, words, linking}]}`)
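As a rough illustration of the annotation shape above, the sketch below parses a FUNSD-style JSON document and tallies entities per label and words overall. The inline sample and the helper names `count_labels` and `word_count` are made up for illustration; only the field names come from the format line above.

```python
import json
from collections import Counter

# Hypothetical inline sample following the
# {form: [{box, text, label, words, linking}]} shape.
SAMPLE = """
{"form": [
  {"box": [10, 10, 80, 24], "text": "Date:", "label": "question",
   "words": [{"box": [10, 10, 80, 24], "text": "Date:"}],
   "linking": []},
  {"box": [90, 10, 200, 24], "text": "12 March", "label": "answer",
   "words": [{"box": [90, 10, 120, 24], "text": "12"},
             {"box": [125, 10, 200, 24], "text": "March"}],
   "linking": []}
]}
"""


def count_labels(annotation: dict) -> Counter:
    """Count annotated entities per label."""
    return Counter(entity["label"] for entity in annotation["form"])


def word_count(annotation: dict) -> int:
    """Total number of word-level annotations across all entities."""
    return sum(len(entity["words"]) for entity in annotation["form"])


annotation = json.loads(SAMPLE)
print(count_labels(annotation))
print(word_count(annotation))
```

Counts like these are one way to reproduce the sparse, median, and dense groupings used when picking samples.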
IAM Handwriting Database (line level)
- Authors: U. Marti, H. Bunke (University of Bern)
- Paper: The IAM-database: an English sentence database for offline handwriting recognition (IJDAR 2002)
- Homepage: https://fki.tic.heia-fr.ch/databases/iam-handwriting-database
- License: MIT (HuggingFace distribution by Teklia)
- Format: PNG images + `.gt.txt` sidecar files (standard HTR convention)
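The sidecar convention can be sketched as follows. This assumes each image `name.png` sits next to a transcript `name.gt.txt`; the helper names, the file name, and the transcript text are hypothetical and not taken from the fixtures.

```python
import tempfile
from pathlib import Path


def sidecar_path(image_path: Path) -> Path:
    """Map name.png to name.gt.txt (assumed naming: same stem, .gt.txt suffix)."""
    return image_path.with_name(image_path.stem + ".gt.txt")


def read_transcript(image_path: Path) -> str:
    """Return the ground-truth line for an image, without the trailing newline."""
    return sidecar_path(image_path).read_text(encoding="utf-8").strip()


# Demonstrate with a throwaway directory and a made-up transcript.
with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "median_15.png"
    image.write_bytes(b"")  # placeholder, not a real PNG
    sidecar_path(image).write_text(
        "a hypothetical nine word line of handwriting goes here\n",
        encoding="utf-8",
    )
    transcript = read_transcript(image)

print(transcript)
```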
DocLayNet
- Authors: Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, Peter Staar (IBM Research)
- Paper: DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation (KDD 2022)
- License: CDLA-Permissive 2.0
- Homepage: https://github.com/DS4SD/DocLayNet
- Format: PNG images + JSON sidecar files (`{bboxes, labels, words, split}`)
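A minimal sketch of reading one of these sidecars is shown below. It assumes `bboxes` and `labels` are parallel lists (one label per box), which the format line does not state. The sample values, including the string label names, are invented, and the real files may encode labels differently (for example as integer ids).

```python
import json
from collections import Counter

# Hypothetical sidecar content following the {bboxes, labels, words, split} shape.
SAMPLE = json.loads("""
{"bboxes": [[0, 0, 100, 20], [0, 30, 100, 200], [0, 210, 100, 300]],
 "labels": ["Title", "Text", "Table"],
 "words": ["Annual", "Report"],
 "split": "train"}
""")


def label_histogram(sidecar: dict) -> Counter:
    """Count layout regions per label, checking the assumed 1:1 pairing."""
    if len(sidecar["bboxes"]) != len(sidecar["labels"]):
        raise ValueError("bboxes and labels differ in length")
    return Counter(sidecar["labels"])


histogram = label_histogram(SAMPLE)
print(len(histogram), "distinct label types")
```

The number of distinct labels is the kind of figure behind a description such as "7 layout label types".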
Womblex Collection
Real Australian government documents covering the formats womblex targets.
- Source: Australian government document releases and open data portals
- License: Commonwealth of Australia (various Crown copyright terms)
- Format: Mixed (PDF, DOCX, CSV, XLSX) with `_transcript.txt` sidecar files
Documents (_documents/)
| File | Format | Pages | Description | Ground Truth |
|---|---|---|---|---|
| 00768-213A-...Redacted | PDF | 3 | ACT regulatory decision letter (redacted) | Human transcript with `<REDACTED>` tags |
| Auditor-General_Report_2020-21_19 | PDF | 406 | ANAO Major Projects Report (Defence) | Womblex transcript, pending review |
| foreign-affairs-and-trade-2025-26-portfolio-budget-statements | DOCX | 18 | DFAT portfolio budget statements | Womblex transcript, pending review |
Spreadsheets (_spreadsheets/)
| File | Format | Rows | Description | Ground Truth |
|---|---|---|---|---|
| Approved-providers-au-export_20260204 | CSV | 10,859 | Childcare provider registry | Womblex transcript, pending review |
| mso-statistics-sept-qtr-2025 | XLSX | 7 sheets | Fuel stockholding statistics (DCCEEW) | Womblex transcript, pending review |
Extracted (_extracted/)
Serialised ExtractionResult objects from running the real extraction pipeline against each source document. Each JSON file contains the full page-level text, text blocks with positions, tables, metadata, and computed full_text.
| File | Source | Results | Pages | Chars |
|---|---|---|---|---|
| throsby-redacted_extracted.json | Redacted PDF | 1 | 3 | 4,898 |
| auditor-general_extracted.json | ANAO PDF (first 30 pages) | 1 | 30 | 76,514 |
| dfat-budget-statements_extracted.json | DFAT DOCX | 1 | 1 | 99,706 |
| mso-statistics_extracted.json | Fuel MSO XLSX | 34 | 34 | 34,786 |
| approved-providers_extracted.json | Provider CSV | 10,859 | 10,859 | 2,422,497 |
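The Results, Pages, and Chars columns could be recomputed along the lines of the sketch below. The layout is an assumption: it treats a fixture as a list of result objects, each with a `pages` list and a `full_text` string. Only `full_text` is named in the description above, so the `pages` key, the `summarise` helper, and the sample data are hypothetical.

```python
# Hypothetical stand-in for a loaded *_extracted.json file:
# a list of results, each with a "pages" list and a "full_text" string.
results = [
    {"pages": [{"text": "Sheet one"}], "full_text": "Sheet one"},
    {"pages": [{"text": "Sheet two rows"}], "full_text": "Sheet two rows"},
]


def summarise(results: list[dict]) -> dict:
    """Recompute result, page, and character totals for one fixture."""
    return {
        "results": len(results),
        "pages": sum(len(r["pages"]) for r in results),
        "chars": sum(len(r["full_text"]) for r in results),
    }


summary = summarise(results)
print(summary)
```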
Chunks (_chunks/)
Pre-chunked versions of the extracted text, generated with a word-count tokeniser at chunk_size=200 (matching the test suite baseline). Each JSON file contains metadata, chunking config, stats, and per-extraction-result chunk arrays with character offsets back into the extracted full_text.
| File | Source | Chunks | Avg Chars |
|---|---|---|---|
| throsby-redacted_chunks.json | Redacted PDF | 5 | 978 |
| auditor-general_chunks.json | ANAO PDF (30 pages) | 64 | 1,194 |
| dfat-budget-statements_chunks.json | DFAT DOCX | 80 | 1,246 |
| mso-statistics_chunks.json | Fuel MSO XLSX | 36 | 966 |
| approved-providers_chunks.json | Provider CSV | 10,867 | 222 |
Regenerate both the extracted and chunk fixtures with: `uv run python scripts/generate_chunk_fixtures.py`
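For readers unfamiliar with word-count chunking, here is a minimal sketch of the idea: split on whitespace, group every `chunk_size` words, and record character offsets so each chunk can be sliced back out of `full_text`. This is not the womblex chunker. The function name, the `start`/`end`/`text` keys, and the absence of overlap between chunks are all assumptions made for illustration.

```python
import re


def chunk_by_words(text: str, chunk_size: int = 200) -> list[dict]:
    """Group whitespace-delimited words into chunks of at most chunk_size words."""
    # Character span of every word in the original text.
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    chunks = []
    for i in range(0, len(spans), chunk_size):
        group = spans[i:i + chunk_size]
        start, end = group[0][0], group[-1][1]
        chunks.append({"start": start, "end": end, "text": text[start:end]})
    return chunks


# Synthetic 450-word text: two full chunks plus one 50-word remainder.
full_text = " ".join(f"w{i}" for i in range(450))
chunks = chunk_by_words(full_text, chunk_size=200)

# Offsets should always round-trip into the source text.
for chunk in chunks:
    assert full_text[chunk["start"]:chunk["end"]] == chunk["text"]

print(len(chunks))
```

The same round-trip check on offsets is a cheap sanity test to run against the stored chunk fixtures once their actual key names are known.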
Fixture selection
Samples were selected for variety across difficulty and content type:
| Dataset | Sample | Description |
|---|---|---|
| FUNSD | 85540866 | Sparse form (25 words) |
| FUNSD | 82200067_0069 | Median density (181 words) |
| FUNSD | 87594142_87594144 | Dense form (434 words) |
| FUNSD | 87528321 | Form-heavy (60 question fields) |
| FUNSD | 87528380 | Header-heavy (11 headers, 108 fields) |
| IAM-line | short_1602 | Single word, narrow image |
| IAM-line | median_15 | Typical 9-word line |
| IAM-line | long_4 | Long 22-word sentence |
| IAM-line | wide_1739 | Widest image (4369px) |
| IAM-line | narrow_1163 | Narrowest image (187px) |
| DocLayNet | diverse_layout_49 | 7 layout label types |
| DocLayNet | table_0 | Table-heavy document |
| DocLayNet | formula_29 | Contains formulas |
| DocLayNet | sparse_text_344 | Minimal text (13 words) |
| DocLayNet | dense_text_548 | Dense text (413 words) |
| womblex-collection | Throsby...Redacted | Redacted native PDF, 3 pages (730 words) |
| womblex-collection | Auditor-General_Report_2020-21_19 | Native PDF, 406 pages (193k words) |
| womblex-collection | foreign-affairs-and-trade-2025-26-portfolio-budget-statements | DOCX portfolio budget statements |
| womblex-collection | Approved-providers-au-export | CSV, 10,859 provider records |
| womblex-collection | mso-statistics-sept-qtr-2025 | XLSX, 7 sheets of fuel statistics |
Source Code
github.com/DeepCivic/womblex-development-fixtures
Feedback
You can email the team directly: team@deepcivic.com.au