Development Fixtures for Womblex

Curated samples from public datasets and real Australian government documents for extraction pipeline development. Each subset preserves the original ground truth format of its source dataset.

Datasets

FUNSD

Form Understanding in Noisy Scanned Documents.

IAM Handwriting Database (line level)

DocLayNet

Womblex Collection

Real Australian government documents covering the formats womblex targets.

  • Source: Australian government document releases and open data portals
  • License: Commonwealth of Australia (various Crown copyright terms)
  • Format: Mixed — PDF, DOCX, CSV, XLSX with _transcript.txt sidecar files
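The sidecar convention above can be illustrated with a minimal sketch that pairs each source document with its transcript. The exact naming rule is an assumption: the sketch guesses that a sidecar sits beside its source and is named `<source stem>_transcript.txt`. The helper names `sidecar_path` and `pair_with_transcripts` are hypothetical and not part of womblex.

```python
from pathlib import Path
from typing import Optional

# Assumption: a sidecar sits next to its source file and is named
# "<source stem>_transcript.txt". The real naming rule may differ.
SIDECAR_SUFFIX = "_transcript.txt"
SOURCE_EXTENSIONS = {".pdf", ".docx", ".csv", ".xlsx"}


def sidecar_path(source: Path) -> Path:
    """Return the hypothetical transcript path for a source document."""
    return source.with_name(source.stem + SIDECAR_SUFFIX)


def pair_with_transcripts(folder: Path) -> dict[Path, Optional[Path]]:
    """Map each source document in `folder` to its sidecar, or None if absent."""
    pairs: dict[Path, Optional[Path]] = {}
    for source in sorted(folder.iterdir()):
        # Skip anything that is not one of the four listed source formats.
        if source.suffix.lower() not in SOURCE_EXTENSIONS:
            continue
        candidate = sidecar_path(source)
        pairs[source] = candidate if candidate.exists() else None
    return pairs
```

Calling `pair_with_transcripts(Path("_documents"))` would then show which documents still lack a transcript, provided the naming guess holds.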

Documents (_documents/)

| File | Format | Pages | Description | Ground Truth |
|---|---|---|---|---|
| 00768-213A-...Redacted | PDF | 3 | ACT regulatory decision letter (redacted) | Human transcript with <REDACTED> tags |
| Auditor-General_Report_2020-21_19 | PDF | 406 | ANAO Major Projects Report (Defence) | Womblex transcript, pending review |
| foreign-affairs-and-trade-2025-26-portfolio-budget-statements | DOCX | 18 | DFAT portfolio budget statements | Womblex transcript, pending review |

Spreadsheets (_spreadsheets/)

| File | Format | Rows | Description | Ground Truth |
|---|---|---|---|---|
| Approved-providers-au-export_20260204 | CSV | 10,859 | Childcare provider registry | Womblex transcript, pending review |
| mso-statistics-sept-qtr-2025 | XLSX | 7 sheets | Fuel stockholding statistics (DCCEEW) | Womblex transcript, pending review |

Extracted (_extracted/)

Serialised ExtractionResult objects from running the real extraction pipeline against each source document. Each JSON file contains the full page-level text, text blocks with positions, tables, metadata, and computed full_text.

| File | Source | Results | Pages | Chars |
|---|---|---|---|---|
| throsby-redacted_extracted.json | Redacted PDF | 1 | 3 | 4,898 |
| auditor-general_extracted.json | ANAO PDF (first 30 pages) | 1 | 30 | 76,514 |
| dfat-budget-statements_extracted.json | DFAT DOCX | 1 | 1 | 99,706 |
| mso-statistics_extracted.json | Fuel MSO XLSX | 34 | 34 | 34,786 |
| approved-providers_extracted.json | Provider CSV | 10,859 | 10,859 | 2,422,497 |
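As a rough illustration of how counts like these could be recomputed from a fixture file, the sketch below loads one JSON file and tallies results, pages and characters. The JSON layout is assumed, not documented here: the sketch guesses that a file holds either one result object or a list of them, and that each result has a `pages` list alongside the `full_text` string. The function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Any


def summarise_result(result: dict[str, Any]) -> dict[str, int]:
    """Count pages and characters for one serialised extraction result.

    Assumed keys (not confirmed by this README): "pages" is a list of page
    objects and "full_text" is a string.
    """
    return {
        "pages": len(result.get("pages", [])),
        "chars": len(result.get("full_text", "")),
    }


def summarise_file(path: Path) -> dict[str, int]:
    """Aggregate counts across every result stored in one fixture file."""
    data = json.loads(path.read_text(encoding="utf-8"))
    # Assumption: the file holds either a single result or a list of results.
    results = data if isinstance(data, list) else [data]
    summaries = [summarise_result(r) for r in results]
    return {
        "results": len(summaries),
        "pages": sum(s["pages"] for s in summaries),
        "chars": sum(s["chars"] for s in summaries),
    }
```

If the real files nest results under a different key, only the two marked assumptions need adjusting.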

Chunks (_chunks/)

Pre-chunked versions of the extracted text, generated with a word-count tokeniser at chunk_size=200 (matching the test suite baseline). Each JSON file contains metadata, chunking config, stats, and per-extraction-result chunk arrays with character offsets back into the extracted full_text.
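The chunking described above can be sketched as follows. This is not the womblex chunker: it assumes "word-count tokeniser" means splitting on whitespace and that chunks do not overlap, and the `Chunk` type and `chunk_by_words` function are hypothetical names. Its purpose is to show how character offsets can point back into `full_text`.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    start: int  # inclusive character offset into the source text
    end: int    # exclusive character offset into the source text


def chunk_by_words(full_text: str, chunk_size: int = 200) -> list[Chunk]:
    """Split text into runs of at most `chunk_size` whitespace-delimited words."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # Each match is one "word"; its span gives offsets into full_text.
    words = list(re.finditer(r"\S+", full_text))
    chunks: list[Chunk] = []
    for i in range(0, len(words), chunk_size):
        group = words[i : i + chunk_size]
        start, end = group[0].start(), group[-1].end()
        # Slice the original text so internal whitespace is preserved.
        chunks.append(Chunk(full_text[start:end], start, end))
    return chunks
```

Because each chunk is a slice of the original string, `full_text[chunk.start:chunk.end] == chunk.text` holds for every chunk, which is the property the stored offsets are meant to provide.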

| File | Source | Chunks | Avg Chars |
|---|---|---|---|
| throsby-redacted_chunks.json | Redacted PDF | 5 | 978 |
| auditor-general_chunks.json | ANAO PDF (30 pages) | 64 | 1,194 |
| dfat-budget-statements_chunks.json | DFAT DOCX | 80 | 1,246 |
| mso-statistics_chunks.json | Fuel MSO XLSX | 36 | 966 |
| approved-providers_chunks.json | Provider CSV | 10,867 | 222 |
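One quick sanity check on these figures is that the chunk count multiplied by the average chunk length should land close to the character count reported for the matching extracted file. The sketch below does that arithmetic with the numbers from the two tables. The `coverage` helper is a hypothetical name, and the check assumes non-overlapping chunks.

```python
# (source, extracted chars, chunk count, average chars per chunk),
# copied from the Extracted and Chunks tables in this README.
FIXTURES = [
    ("Redacted PDF", 4_898, 5, 978),
    ("ANAO PDF (30 pages)", 76_514, 64, 1_194),
    ("DFAT DOCX", 99_706, 80, 1_246),
    ("Fuel MSO XLSX", 34_786, 36, 966),
    ("Provider CSV", 2_422_497, 10_867, 222),
]


def coverage(chars: int, chunks: int, avg_chars: int) -> float:
    """Fraction of extracted characters accounted for by the chunks."""
    return chunks * avg_chars / chars


for name, chars, chunks, avg_chars in FIXTURES:
    print(f"{name}: {coverage(chars, chunks, avg_chars):.3f}")
```

Every ratio comes out within 1% of 1.0. The small shortfall is consistent with whitespace between chunks not being counted and with the averages being rounded, though that explanation is a guess.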

Regenerate both the extracted and chunk fixtures with: uv run python scripts/generate_chunk_fixtures.py

Fixture selection

Samples were selected for variety across difficulty and content type:

| Dataset | Sample | Description |
|---|---|---|
| FUNSD | 85540866 | Sparse form (25 words) |
| FUNSD | 82200067_0069 | Median density (181 words) |
| FUNSD | 87594142_87594144 | Dense form (434 words) |
| FUNSD | 87528321 | Form-heavy (60 question fields) |
| FUNSD | 87528380 | Header-heavy (11 headers, 108 fields) |
| IAM-line | short_1602 | Single word, narrow image |
| IAM-line | median_15 | Typical 9-word line |
| IAM-line | long_4 | Long 22-word sentence |
| IAM-line | wide_1739 | Widest image (4369px) |
| IAM-line | narrow_1163 | Narrowest image (187px) |
| DocLayNet | diverse_layout_49 | 7 layout label types |
| DocLayNet | table_0 | Table-heavy document |
| DocLayNet | formula_29 | Contains formulas |
| DocLayNet | sparse_text_344 | Minimal text (13 words) |
| DocLayNet | dense_text_548 | Dense text (413 words) |
| womblex-collection | Throsby...Redacted | Redacted native PDF, 3 pages (730 words) |
| womblex-collection | Auditor-General_Report_2020-21_19 | Native PDF, 406 pages (193k words) |
| womblex-collection | foreign-affairs-and-trade-2025-26-portfolio-budget-statements | DOCX portfolio budget statements |
| womblex-collection | Approved-providers-au-export | CSV, 10,859 provider records |
| womblex-collection | mso-statistics-sept-qtr-2025 | XLSX, 7 sheets of fuel statistics |

Source Code

github.com/DeepCivic/womblex-development-fixtures

Feedback

You can email the team directly: team@deepcivic.com.au

© 2026 DeepCivic. All rights reserved.