Womblex
Digging usable data out of civic documents.
Government document releases arrive as a mix of PDFs (native, scanned, redacted, hybrid), Word documents, and spreadsheets. One-size-fits-all OCR fails because each format and sub-type needs a different extraction strategy. Womblex detects the document type first, then routes to the right extractor.
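The detect-then-route idea can be sketched in a few lines of Python. This is an illustration only, not Womblex's actual implementation: the character threshold, the function names, and the placeholder extractors are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical threshold: a page with fewer characters than this is treated as
# having no usable text layer. Womblex's real heuristics are not described here.
MIN_CHARS_PER_PAGE = 20


def classify_pdf(page_texts: List[str]) -> str:
    """Label a PDF as native, scanned, or hybrid from its per-page text layers."""
    if not page_texts:
        raise ValueError("document has no pages")
    readable = [len(text.strip()) >= MIN_CHARS_PER_PAGE for text in page_texts]
    if all(readable):
        return "native"
    if not any(readable):
        return "scanned"
    return "hybrid"


# Placeholder extractors standing in for the real ones; each just names a strategy.
EXTRACTORS: Dict[str, Callable[[List[str]], str]] = {
    "native": lambda pages: "text-layer extraction",
    "scanned": lambda pages: "OCR",
    "hybrid": lambda pages: "text-layer extraction + OCR per page",
}


def route(page_texts: List[str]) -> str:
    """Detect the sub-type first, then dispatch to the matching extractor."""
    return EXTRACTORS[classify_pdf(page_texts)](page_texts)
```

Keeping classification separate from the dispatch table means a new sub-type only needs a new label and a new table entry.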
The result is structured text in Parquet format: one row per extraction unit, with full text, metadata, tables, and confidence scores. Spreadsheets produce one result per logical row, ready for per-record analysis.
Who is this for?
- Researchers processing FOI releases or government document dumps
- Developers building pipelines over Australian civic documents
- Anyone who needs machine-readable text from a pile of mixed-format government files
Design principles
Womblex is designed to be usable by anyone, on whatever hardware they have. All design decisions favour air-gapped edge deployment on limited resources: no GPU is required, and there are no system-level dependencies beyond Python.
Womblex doesn't try to be an all-in-one OCR solution. It uses lightweight, targeted extractors for each document type and benchmarks against mature OCR models to guide development.
What it handles
| Format | What Womblex does |
|---|---|
| Native PDFs | Direct text extraction via PyMuPDF |
| Scanned PDFs | OCR with PaddleOCR + YOLO layout analysis |
| Hybrid PDFs | Text extraction for readable pages, OCR for the rest |
| Redacted PDFs | Extraction with [REDACTED] marker preservation |
| Word documents | Paragraph and table extraction |
| Spreadsheets (CSV, XLSX) | Per-row extraction with automatic sheet type classification |
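To make "per-row extraction" concrete, here is a minimal sketch using only the Python standard library. It is not Womblex's code: the field names `source`, `row_index`, and `text` are hypothetical, and sheet type classification is left out entirely.

```python
import csv
import io
from typing import Dict, Iterator


def rows_as_units(csv_text: str, source: str) -> Iterator[Dict[str, object]]:
    """Yield one extraction unit per data row of a CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for index, row in enumerate(reader):
        # Join "header: value" pairs so each row reads as a standalone record;
        # empty cells are skipped.
        text = "; ".join(f"{key}: {value}" for key, value in row.items() if value)
        yield {"source": source, "row_index": index, "text": text}
```

For example, `list(rows_as_units("agency,amount\nHealth,100\n", "grants.csv"))` yields a single unit whose text is `agency: Health; amount: 100`.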
Beyond extraction
After extracting text, Womblex can optionally:
- Chunk text into semantically meaningful pieces using semchunk
- Detect and mask redactions as a separate post-extraction stage
- Strip PII (currently person names; more entity types are planned)
- Enrich chunks with embeddings, classification, and entity extraction via Isaacus
- Build entity graphs from enrichment outputs for further analysis
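As a rough illustration of the redaction stage, the sketch below locates `[REDACTED]` markers in already-extracted text and reports how much of the text they cover. The function names are hypothetical, and Womblex's own detection may work quite differently (for instance, on page images rather than text).

```python
import re
from typing import List, Tuple

# The literal marker is the one named in the format table; real releases may
# use other conventions, which this pattern would not catch.
REDACTION_MARKER = re.compile(r"\[REDACTED\]")


def find_redactions(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of each redaction marker."""
    return [match.span() for match in REDACTION_MARKER.finditer(text)]


def redacted_fraction(text: str) -> float:
    """Share of characters in the text that belong to redaction markers."""
    if not text:
        return 0.0
    covered = sum(end - start for start, end in find_redactions(text))
    return covered / len(text)
```

A ratio like this could help flag heavily redacted documents before chunking or enrichment.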
How to get started
pip install womblex
Process individual files:
womblex extract document.pdf -o output/
womblex extract report.docx -o output/
womblex extract dataset.xlsx -o output/
Or process a full document set with a YAML config:
womblex run --config configs/example.yaml
If processing is interrupted, resume from the last checkpoint:
womblex run --config configs/example.yaml --resume
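For scripted use, the `womblex extract` invocation shown earlier can be assembled from Python. This sketch relies only on the command form documented here (`womblex extract <file> -o <dir>`). The helper names and the extension list are assumptions for illustration, and a YAML config is likely the better fit for large document sets.

```python
import subprocess
from pathlib import Path
from typing import List

# Extensions inferred from the formats listed in this README (an assumption).
SUPPORTED = {".pdf", ".docx", ".csv", ".xlsx"}


def extract_commands(input_dir: Path, output_dir: Path) -> List[List[str]]:
    """Build one `womblex extract` command per supported file in input_dir."""
    files = sorted(p for p in input_dir.iterdir() if p.suffix.lower() in SUPPORTED)
    return [["womblex", "extract", str(p), "-o", str(output_dir)] for p in files]


def run_all(commands: List[List[str]]) -> None:
    """Run each command in turn, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Building the commands separately from running them makes it easy to print and review them first.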
For semantic enrichment (embeddings, classification, entity extraction), install the Isaacus extra and set an API key:
pip install "womblex[isaacus]"
export ISAACUS_API_KEY="your-key-here"
Text extraction works without an API key. The enrichment stage is optional.
Output
Processing produces Parquet files. The base output is documents.parquet, with one row per extraction unit. With Isaacus enrichment enabled, you also get entities.parquet, graph_edges.parquet, and enrichment_meta.parquet.
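A quick way to check what a run produced is sketched below. The file names come from this README; the helper names are hypothetical, and the output directory is whatever was passed to `-o` or set in the config. No column names are assumed, because the schema is not documented here.

```python
from pathlib import Path
from typing import Dict

# File names as listed in this README; the last three appear only when
# enrichment is enabled.
EXPECTED_OUTPUTS = (
    "documents.parquet",
    "entities.parquet",
    "graph_edges.parquet",
    "enrichment_meta.parquet",
)


def available_outputs(output_dir: Path) -> Dict[str, bool]:
    """Report which of the expected Parquet files exist in output_dir."""
    return {name: (output_dir / name).is_file() for name in EXPECTED_OUTPUTS}


def load_documents(output_dir: Path):
    """Load documents.parquet as a DataFrame (needs pandas and a Parquet engine)."""
    import pandas as pd  # imported lazily so the presence check works without pandas

    return pd.read_parquet(output_dir / "documents.parquet")
```

Calling `load_documents(Path("output")).dtypes` then shows the actual columns and their types.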
Development fixtures
A companion repository of curated test documents is available for anyone developing extraction pipelines or benchmarking OCR. See Womblex Development Fixtures.
Source code and licence
github.com/DeepCivic/womblex — Apache 2.0
Last updated: April 2026