Womblex

Digging usable data out of civic documents.

Document extraction pipeline for converting government documents into an ML-friendly corpus. Extracts text from PDFs and Word documents (native, scanned, forms, hybrid). Spreadsheets are ingested row by row, producing one result per logical row, ready for per-record semantic analysis.

Design disclosure

This project is designed to run anywhere. All design decisions favour air-gapped edge deployment on limited resources, which means Womblex deliberately omits the heavier 'all-in-one' OCR models.

Instead, mature OCR models are used as evaluation baselines against Womblex, guiding development.

Add-ons/integrations

Optionally, outputs can be prepared for semantic analysis via Isaacus.

The Problem

Government document releases arrive as a mix of file formats:

  • PDFs — native (selectable text), scanned (narrative, forms, tables), hybrid, or redacted
  • Word documents (.docx) — paragraphs and embedded tables
  • Spreadsheets (.csv, .xlsx, .xls) — row-level data, glossaries, key-value lookups, and narrative sheets

One-size-fits-all OCR fails because each format and sub-type needs a different extraction strategy. Womblex detects the document type first, then routes to the right extractor.

Installation

pip install womblex

With Isaacus enrichment:

pip install "womblex[isaacus]"

For development:

git clone --recurse-submodules https://github.com/Team-DeepCivic/Womblex.git
cd Womblex
pip install -e ".[dev]"

If you already cloned without --recurse-submodules, fetch the test fixtures with:

git submodule update --init

System Dependencies

No system-level dependencies beyond Python. All extraction backends are pure Python packages:

  • PyMuPDF (fitz) — native PDF text and structure
  • PaddleOCR (rapidocr-onnxruntime) — scanned-page OCR with layout analysis (no Tesseract or PaddlePaddle required)
  • python-docx — Word document extraction
  • pandas + openpyxl — spreadsheet ingestion (CSV/Excel)

Once you have extraction working, semantic analysis via Isaacus (embeddings, classification, extractive QA) is straightforward.

Isaacus API Key (optional)

Required only for the enrichment stage (pip install womblex[isaacus]). Text extraction works without it.

cp .env.example .env
# Edit .env and add your key from https://isaacus.com/

Or export directly:

export ISAACUS_API_KEY="your-key-here"

Quick Start

# Process a document set using a config
womblex run --config configs/example.yaml

# Resume from checkpoint after interruption
womblex run --config configs/example.yaml --resume

# Process individual files (PDF, DOCX, CSV, Excel)
womblex extract document.pdf -o output/
womblex extract report.docx -o output/
womblex extract dataset.xlsx -o output/

How It Works

Document Type Detection

Before extraction, each file is profiled to determine the appropriate strategy:

PDFs — routed by content analysis (text layer coverage, morphological features, table patterns):

| Document Type | Detection Signal | Extraction Strategy |
| --- | --- | --- |
| Native Prose | Text layer > 100 chars/page | PyMuPDF direct |
| Native + Tables | Text layer + table patterns | PyMuPDF + structure |
| Scanned (machine) | No text layer, regular glyphs | PaddleOCR + YOLO layout |
| Scanned (handwritten) | No text layer, irregular strokes | PaddleOCR (CRNN+Attention) |
| Scanned (mixed) | No text layer, mixed regions | PaddleOCR + contour split |
| Structured | Grid/form layout | PaddleOCR + heuristic tables |
| Hybrid | Partial text layer | Text + PaddleOCR for gaps |
| Image | Photos, diagrams | PaddleOCR, flagged for review |

Other formats — routed by file extension:

| Format | Extensions | Extraction Strategy |
| --- | --- | --- |
| Word | .docx | python-docx (paragraphs + tables) |
| Spreadsheet | .csv, .xlsx, .xls | pandas per-row or per-sheet |
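The PDF routing can be sketched as a small decision function. This is an illustrative simplification, not the real detector: the function name, thresholds, and the idea of `text_coverage` as "fraction of pages with a text layer" are assumptions.

```python
def route_pdf(chars_per_page: float, text_coverage: float,
              has_table_patterns: bool, has_form_layout: bool) -> str:
    """Hypothetical routing heuristic; the real detector also uses
    morphological features and glyph regularity."""
    if text_coverage >= 1.0 and chars_per_page > 100:
        # Full text layer: native PDF, with or without table structure.
        return "pymupdf+structure" if has_table_patterns else "pymupdf"
    if text_coverage > 0.0:
        # Partial text layer: extract text, OCR the gaps.
        return "hybrid"
    if has_form_layout:
        # No text layer but grid/form layout: heuristic table extraction.
        return "paddleocr+tables"
    # No text layer at all: scanned, send to layout-aware OCR.
    return "paddleocr+layout"
```

The string labels map onto the "Extraction Strategy" column above; a real implementation would return richer profile objects.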

Extraction

Each document type routes to an appropriate extractor. extract_text() always returns a list[ExtractionResult]:

  • PDFs return a single-element list. PaddleOCR returns per-region confidence scores stored in the document profile. YOLO layout analysis populates TextBlock.block_type via COCO class mapping (paragraph, table, figure).
  • DOCX returns a single-element list with paragraphs and tables.
  • Spreadsheets return one ExtractionResult per logical row (for data and glossary sheets) or one per sheet (for narrative and key_value sheets). Sheet type is auto-classified by column count, cell length, and row count.
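The sheet-type auto-classification can be pictured as a heuristic over column count, cell length, and row count. All thresholds below are hypothetical, chosen only to illustrate the idea:

```python
def classify_sheet(n_cols: int, n_rows: int, mean_cell_len: float) -> str:
    """Illustrative sheet classifier (thresholds are made up, not Womblex's)."""
    if n_cols == 2 and n_rows <= 200:
        return "key_value"    # small two-column lookup tables
    if mean_cell_len > 200:
        return "narrative"    # long free-text cells -> one result per sheet
    if n_cols <= 3 and mean_cell_len > 80:
        return "glossary"     # term/definition style -> one result per row
    return "data"             # tabular records -> one result per row
```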

Each result carries a document_id (e.g. filename:PR-00006191) used as the primary key downstream.

Post-extraction normalisation runs automatically, fixing known font encoding artefacts (broken apostrophes, corrupted URLs, running OCR footers).
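A minimal sketch of this kind of normalisation, assuming two example artefact classes (a mojibake apostrophe and a "Page N of M" running footer); the patterns here are illustrative, not Womblex's actual rules:

```python
import re

FOOTER = re.compile(r"^Page \d+ of \d+$", re.MULTILINE)

def normalise(text: str) -> str:
    """Fix example font-encoding artefacts and strip running footers."""
    text = text.replace("â€™", "'")    # UTF-8 bytes mis-decoded as cp1252
    text = text.replace("\u2019", "'")  # curly apostrophe -> ASCII
    return FOOTER.sub("", text)
```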

Redaction

Redaction runs as a post-extraction stage, separate from extraction. This avoids false positives that occur when running redaction detection inside OCR (form fields, chart regions, and diagram fills trigger the detector).

Redacted regions can be replaced with [REDACTED] markers (preserving sentence structure) or deleted entirely. The stage is configurable: apply after chunking, after enrichment, or both.
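The marker-vs-delete behaviour can be sketched as a span-replacement helper. The function name and the `(start, end)` span representation are assumptions; spans are assumed sorted and non-overlapping:

```python
def apply_redactions(text: str, spans: list[tuple[int, int]],
                     mode: str = "marker") -> str:
    """Replace each detected (start, end) span with [REDACTED], or
    delete it entirely when mode != "marker"."""
    out, prev = [], 0
    for start, end in sorted(spans):
        out.append(text[prev:start])
        if mode == "marker":
            out.append("[REDACTED]")  # preserves sentence structure
        prev = end
    out.append(text[prev:])
    return "".join(out)
```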

Chunking

Extracted text is split into semantically meaningful chunks using semchunk with the Kanon tokeniser (default 480 tokens, leaving 32-token headroom for Isaacus 512-token context windows). Tables are converted to markdown and chunked separately, with each chunk tagged as "narrative" or "table". [REDACTED] markers are preserved across chunk boundaries.
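To make the token budget concrete, here is a naive whitespace-token chunker standing in for semchunk with the Kanon tokeniser; it only illustrates the fixed-size budget, not semantic boundary detection:

```python
def chunk_tokens(text: str, chunk_size: int = 480) -> list[str]:
    """Split text into chunks of at most chunk_size whitespace tokens
    (a stand-in for semchunk's semantic splitting)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]
```

In the real pipeline the 480-token budget leaves 32 tokens of headroom inside Isaacus's 512-token context window.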

PII Cleaning

An optional PII cleaning stage strips personal identifiers from chunk text before output or enrichment, running as an isolated pipeline stage after chunking.

Currently detects: PERSON (regex + cosine-similarity context validation via all-MiniLM-L6-v2). URL, phone, and email regex support is planned. See docs/accuracy/PII_CLEANING.md for measured baseline.

The all-MiniLM-L6-v2 model is pre-bundled in models/ and loaded from disk — no network access required at runtime.
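The regex half of PERSON detection can be sketched as below; the `[NAME]` placeholder and pattern are illustrative. A naive title-case pattern like this also matches non-names ("North Sydney"), which is exactly why the real stage validates candidates with embedding-based context similarity:

```python
import re

# Candidate PERSON pattern: two adjacent title-case words (illustrative only).
PERSON_RE = re.compile(r"\b([A-Z][a-z]+ [A-Z][a-z]+)\b")

def strip_person_names(text: str) -> str:
    """Replace candidate person names with a placeholder (no context check)."""
    return PERSON_RE.sub("[NAME]", text)
```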

Embeddings and Enrichment

Clean chunks feed into Isaacus models:

  • kanon-2-embedder: Semantic embeddings for search/retrieval
  • kanon-universal-classifier: Zero-shot document classification
  • kanon-answer-extractor: Structured field extraction (dates, names, references)

Graph construction

From Isaacus outputs, an entity graph can be constructed for further analysis.
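One simple construction over flat entity mentions is a co-mention graph: entities appearing in the same document get an edge, weighted by how many documents they share. The function name and `(document_id, entity)` input shape are assumptions for illustration:

```python
from collections import defaultdict
from itertools import combinations

def comention_edges(mentions: list[tuple[str, str]]) -> dict[tuple[str, str], int]:
    """Build co-mention edges from (document_id, entity) pairs."""
    by_doc = defaultdict(set)
    for doc_id, entity in mentions:
        by_doc[doc_id].add(entity)
    edges = defaultdict(int)
    for entities in by_doc.values():
        for a, b in combinations(sorted(entities), 2):
            edges[(a, b)] += 1  # weight = number of shared documents
    return dict(edges)
```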

Configuration

Configs are YAML files defining paths, detection thresholds, and analysis settings:

dataset:
  name: my_dataset

paths:
  input_root: ./data/raw/my_dataset
  output_root: ./data/processed/my_dataset
  checkpoint_dir: ./data/checkpoints/my_dataset

detection:
  min_text_coverage: 0.3
  form_signal_threshold: 0.5
  table_signal_threshold: 0.4

extraction:
  ocr:
    engine: paddleocr
    dpi: 200

chunking:
  tokenizer: "isaacus/kanon-2-tokenizer"
  chunk_size: 480
  enabled: true
  chunk_tables: true

processing:
  batch_size: 25
  checkpoint_every: 25

See configs/example.yaml for a complete example.

Output

Processing produces Parquet files:

documents.parquet

  • One row per extraction unit with full text, metadata, tables, forms, and confidence scores

With womblex[isaacus] enrichment enabled:

entities.parquet — Flat entity mentions for filtering

graph_edges.parquet — Relationship edges for graph queries

enrichment_meta.parquet — Document-level enrichment metadata
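Since document_id is the shared primary key, the output tables join naturally with pandas. The frames below are tiny in-memory stand-ins (the `entity` and `text` columns are hypothetical); in practice you would load them with `pd.read_parquet(...)`:

```python
import pandas as pd

# Stand-ins for documents.parquet and entities.parquet.
docs = pd.DataFrame({"document_id": ["report.pdf:PR-00006191"],
                     "text": ["extracted text"]})
ents = pd.DataFrame({"document_id": ["report.pdf:PR-00006191"] * 2,
                     "entity": ["Department of Finance", "Jane Citizen"]})

# Attach source text to each entity mention via the shared primary key.
mentions = ents.merge(docs, on="document_id", how="left")
```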

Project Structure

womblex/
├── configs/           # Dataset-specific configurations
├── docs/              # Architecture docs, ADRs, accuracy reports
├── fixtures/          # Git submodule: test fixtures (FUNSD, IAM-line, DocLayNet, womblex-collection)
├── src/womblex/
│   ├── cli.py              # CLI entry point (womblex run / womblex extract)
│   ├── config.py           # Pydantic config models
│   ├── pipeline.py         # Stage orchestration
│   ├── ingest/
│   │   ├── detect.py        # Document type detection and profiling
│   │   ├── extract.py       # ExtractionResult schema + strategy dispatch
│   │   ├── strategies.py    # PDF/DOCX extractor implementations
│   │   ├── paddle_ocr.py    # PaddleOCR wrapper via rapidocr-onnxruntime
│   │   ├── spreadsheet.py   # CSV/Excel per-row extraction
│   │   ├── heuristics_cv2.py    # OpenCV-based detection heuristics
│   │   └── heuristics_numpy.py  # NumPy-based detection heuristics
│   ├── redact/
│   │   ├── detector.py      # Redacted region detection
│   │   ├── stage.py         # Post-extraction redaction stage
│   │   └── utils.py         # Masking utilities
│   ├── pii/
│   │   ├── cleaner.py       # PII detection and stripping
│   │   └── stage.py         # PII cleaning pipeline stage
│   ├── process/
│   │   └── chunker.py       # semchunk integration
│   ├── analyse/
│   │   ├── enrich.py        # Isaacus enrichment wrappers
│   │   ├── graph.py         # Entity graph construction
│   │   └── models.py        # Enrichment data models
│   ├── store/
│   │   ├── output.py        # Parquet output writer
│   │   ├── enrichment_output.py  # Enrichment-specific output
│   │   └── checkpoint.py    # Batch checkpoint management
│   ├── utils/
│   │   ├── metrics.py       # WER/CER accuracy metrics
│   │   └── models.py        # Local model path resolution (models/ dir, HF snapshot layout)
│   └── verify/
│       └── engine.py        # Two-pass extraction quality verification
└── tests/

Development

# Install with dev dependencies
uv sync --all-extras

# Fetch test fixtures (git submodule)
git submodule update --init

# Run unit tests
uv run python -m pytest

# Run OCR and accuracy benchmarks (requires fixture images — takes ~3 min)
uv run python -m pytest tests/test_fixture_accuracy.py tests/test_womblex_collection_accuracy.py -v

# Type checking
uv run mypy src/

# Lint
uv run ruff check src/

Accuracy docs (docs/accuracy/*.md) are regenerated automatically at the end of each test run — no manual editing needed.

License

Apache 2.0

Acknowledgements

Source Code

github.com/DeepCivic/womblex

Feedback

You can email the team directly: team@deepcivic.com.au

© 2026 DeepCivic. All rights reserved.