About

This is the technical methodology for the DeepCivic investigation into childcare regulatory compliance in the Australian Capital Territory. It accompanies the parent article, How to Choose Child Care in Canberra.

The investigation analyses 1,444 regulatory documents released by the ACT Government covering childcare incidents, complaints, and enforcement actions from 2020 to 2025.

Pipeline steps

1. PDF Extraction

Womblex (pdfplumber-based) extracts text blocks in reading order, detects tables, and captures filename metadata.

Input: 1,700+ PDFs | Output: 1,518 text documents

2. Cleaning

Removed: corrupted files, empty documents, cover pages, index documents, duplicates (by document ID).

Output: 1,444 documents

3. Chunking

Semchunk splits long documents into smaller pieces that fit the enrichment model’s context window. It uses semantic boundaries (paragraph breaks, then sentence boundaries) to avoid splitting mid-thought.

Configuration: isaacus/kanon-2-tokenizer, target 480 tokens, 512-token context window.
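The boundary-aware splitting can be sketched in plain Python. This is a simplified stand-in, not semchunk itself: a whitespace token counter replaces the isaacus/kanon-2-tokenizer, and the paragraph/sentence fallback mirrors the behaviour described above.

```python
# Sketch of boundary-aware chunking: prefer paragraph boundaries,
# fall back to sentence boundaries, never exceed the token budget.
# A whitespace counter stands in for the real Kanon-2 tokenizer.
import re

def count_tokens(text: str) -> int:
    return len(text.split())

def chunk(text: str, max_tokens: int = 480) -> list:
    chunks, current = [], ""
    paragraphs = re.split(r"\n\s*\n", text)
    for para in paragraphs:
        # Split an over-budget paragraph at sentence boundaries.
        pieces = [para] if count_tokens(para) <= max_tokens \
            else re.split(r"(?<=[.!?])\s+", para)
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if count_tokens(candidate) <= max_tokens:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = piece  # a single oversize sentence stays whole
    if current:
        chunks.append(current)
    return chunks
```

The 480-token target leaves headroom under the 512-token context window for the enricher's own overhead.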

4. Entity enrichment

Kanon-2 Enricher extracts structured information from each document:

  • Service IDs: SE-\d{5,}
  • Provider IDs: PR-\d{5,}
  • Notification refs: NOT-\d{8}
  • Case refs: CAS-\d{8} or CA-\d{8}
  • Person names, dates, locations
  • Document structure (sections, paragraphs)

Output format: Isaacus Legal Graph Schema (ILGS)
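The enricher itself is a model, but the reference formats above are regular expressions and can be checked directly. A minimal extraction sketch (function and dictionary names are illustrative):

```python
# The ID formats listed above, applied with Python's re module.
import re

ID_PATTERNS = {
    "service": re.compile(r"\bSE-\d{5,}\b"),
    "provider": re.compile(r"\bPR-\d{5,}\b"),
    "notification": re.compile(r"\bNOT-\d{8}\b"),
    "case": re.compile(r"\bCAS?-\d{8}\b"),  # matches both CAS-… and CA-…
}

def extract_ids(text: str) -> dict:
    """Return each kind of reference found in the text, deduplicated."""
    return {kind: sorted(set(p.findall(text))) for kind, p in ID_PATTERNS.items()}
```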

5. Document embeddings

Kanon-2 Embedder produces 768-dimensional vectors per document. Documents about similar topics produce similar vectors.

Similarity measure: cosine similarity
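Cosine similarity compares the direction of two vectors, ignoring their magnitude, so two documents with similar topic mix score near 1 regardless of length. With numpy:

```python
# Cosine similarity between two embedding vectors.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```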

6. Topic classification

Centroid-based classification:

  1. Manually select 3–5 clear example documents per topic
  2. Compute centroid: mean(embeddings[seed_docs])
  3. Score each document: cosine_similarity(doc_embedding, centroid)
  4. Flag top 25% as positive for that topic

This approach is transparent and auditable — the seed documents that define each category are recorded in semantic_tag_examples.csv.
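The four steps above can be sketched as a single function. This is a simplified illustration (function name and 25% quantile cutoff implementation are ours), assuming embeddings are rows of a numpy array:

```python
# Centroid classification: seed embeddings -> centroid -> cosine
# scores -> top 25% flagged as positive for the topic.
import numpy as np

def flag_topic(doc_embeddings: np.ndarray, seed_indices: list,
               top_fraction: float = 0.25) -> np.ndarray:
    """Boolean mask flagging the top `top_fraction` of documents."""
    centroid = doc_embeddings[seed_indices].mean(axis=0)
    # Cosine similarity of every document against the centroid.
    norms = np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(centroid)
    scores = doc_embeddings @ centroid / norms
    threshold = np.quantile(scores, 1.0 - top_fraction)
    return scores >= threshold
```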

Topics:

Category | Tags
Enforcement | penalty_applied, conditional_direction
Harm | physical_harm_occurred, sexual_behaviour
Supervision | supervision_failure, ratio_breach
Pattern | adult_alleged, recurrence_documented, multiple_children
Location | outdoor, indoor_common, private_spaces

7. Graph construction

Two graph structures are built from the enrichment entities:

Bipartite graph:

  • Nodes: documents (1,444), entities (1,142 — notifications, services, providers, cases)
  • Edges: document mentions entity (2,300 edges)

[Figure: enhanced_bipartite_graph.webp]

Document graph:

  • Nodes: documents (1,444)
  • Edges: documents share entity reference (11,368 edges, 737 connected components)

[Figure: semantic_document_graph.webp]
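Both constructions are straightforward with networkx. The document and entity IDs below are toy values; the real graphs are built from the enrichment output:

```python
# Bipartite graph (documents + entities) and the derived document
# graph (edge = two documents share at least one entity reference).
import itertools
import networkx as nx

mentions = {  # document -> entities it references (illustrative)
    "doc1": {"SE-12345", "NOT-20230101"},
    "doc2": {"SE-12345"},
    "doc3": {"PR-67890"},
}

B = nx.Graph()
for doc, entities in mentions.items():
    B.add_node(doc, kind="document")
    for ent in entities:
        B.add_node(ent, kind="entity")
        B.add_edge(doc, ent)

D = nx.Graph()
D.add_nodes_from(mentions)
for a, b in itertools.combinations(mentions, 2):
    if mentions[a] & mentions[b]:  # shared entity reference
        D.add_edge(a, b)
```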

8. Document-service linking

Many documents lack explicit SE-XXXXX references. We use three strategies to maximise linkage:

Strategy | Method | Confidence | Documents Recovered
SE reference | Direct SE-XXXXX in document | High | 546 (baseline)
Provider resolution | PR-XXXXX → lookup services | High | +143
Name matching | Service name in text (min 12 chars) | Moderate | +186

Total linked: 875 documents (60.6% of corpus). The remaining 569 documents (39.4%) are orphans that cannot be attributed to a specific service location.
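Applied in priority order, the three strategies look like this. The lookup tables and service name are illustrative stand-ins for the real master data:

```python
# Three linking strategies in priority order: direct SE reference,
# provider resolution, then name matching. Tables are toy examples.
import re

PROVIDER_SERVICES = {"PR-67890": "SE-12345"}               # provider -> service
SERVICE_NAMES = {"Sunny Hill Early Learning": "SE-12345"}  # name -> service

def link_service(text):
    # 1. Direct SE reference (high confidence).
    m = re.search(r"\bSE-\d{5,}\b", text)
    if m:
        return m.group(), "se_reference"
    # 2. Provider resolution (high confidence).
    m = re.search(r"\bPR-\d{5,}\b", text)
    if m and m.group() in PROVIDER_SERVICES:
        return PROVIDER_SERVICES[m.group()], "provider_resolution"
    # 3. Name matching, minimum 12 characters (moderate confidence).
    for name, sid in SERVICE_NAMES.items():
        if len(name) >= 12 and name.lower() in text.lower():
            return sid, "name_matching"
    return None  # orphan document
```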

9. External data merge

Source | Fields Used
ACECQA registers | Capacity, NQS ratings (overall + QA1–7, filtered to 2020+), service type, location
StartingBlocks/Toddle | Daily fees
ABS Census 2021 | SA2 population, median income
ACT Policing | Assault and sexual offence counts by suburb (2020 Q1 – 2025 Q2)
GNAF | Address → SA2 mapping

10. Harm rate calculation

Harm definition:

harm_indicated = flag_physical_harm_occurred OR flag_sexual_behaviour

Pooled harm rate:

harm_rate = sum(harm_docs) / sum(capacity) × 100 / years

Where:

  • harm_docs = count of documents with harm_indicated = True
  • capacity = licensed places (not enrolled children)
  • years = 5 (2020–2025)
  • Result = harm documents per 100 places per year

Pooled rates weight by exposure (child-places), not by service count. This answers the parent-relevant question: “For a randomly selected child-place in this category, what’s the expected harm rate?”
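The formula translates directly to code (function name is ours):

```python
# Pooled harm rate as defined above: harm documents per 100
# licensed places per year, pooled across a category.
def pooled_harm_rate(harm_docs, capacities, years=5.0):
    return sum(harm_docs) / sum(capacities) * 100 / years

# e.g. two services with 3 and 2 harm documents and 90 + 110 places:
# (3 + 2) / (90 + 110) * 100 / 5 = 0.5 harm docs per 100 places per year
```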

11. Geographic aggregation

SA2 mapping via GNAF address matching. District assignment from SA2 → district lookup table.

Aggregation: sum harm_docs and capacity per SA2/district, then apply pooled rate formula.
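A minimal pandas sketch of the aggregation, with illustrative column names and toy figures:

```python
# Sum harm documents and capacity per SA2, then apply the
# pooled-rate formula to the aggregated totals.
import pandas as pd

services = pd.DataFrame({
    "sa2": ["Belconnen", "Belconnen", "Gungahlin"],
    "harm_docs": [3, 2, 1],
    "capacity": [90, 110, 100],
})

YEARS = 5
by_sa2 = services.groupby("sa2")[["harm_docs", "capacity"]].sum()
by_sa2["harm_rate"] = by_sa2["harm_docs"] / by_sa2["capacity"] * 100 / YEARS
```

Summing before dividing is what makes the rate pooled: each area's rate is weighted by its child-places, not averaged across services.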

12. Correlation analysis

Categorical variables: Group by attribute (NQS rating, service type, capacity band), calculate pooled rate per group. NQS ratings issued before 2020 are excluded — services with stale ratings have their NQS columns nulled and are omitted from NQS groupings (but still included in non-NQS analyses).

Continuous variables: Pearson correlation coefficient between SA2-level harm rate and variable (income, crime rates).

Signal strength (based on service count in each group):

Services | Signal
< 5 | very_weak
5–14 | weak
15–29 | moderate
≥ 30 | strong
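The correlation coefficient and the signal buckets are both small, testable pieces (plain-Python sketch, function names ours):

```python
# Pearson correlation coefficient plus the signal-strength
# thresholds from the table above.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def signal_strength(n_services):
    if n_services < 5:
        return "very_weak"
    if n_services < 15:
        return "weak"
    if n_services < 30:
        return "moderate"
    return "strong"
```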

13. Enforcement classification

Same centroid approach as topic classification, applied to regulatory decision documents.

Categories:

  • Immediate penalty: suspension, cancellation, prohibition
  • Conditional/remediation: compliance notice, show cause, directions

Privacy handling

  1. Extract person names from enrichments
  2. Filter out: pseudonyms in source (“Child A”, “Educator 1”), role labels, business names
  3. Generate pseudonymisation register: real name → “Person N”
  4. Register stored in DO-NOT-PUBLISH/ and not released
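The register logic can be sketched as follows. The filter here is simplified to a prefix check; the real filtering of role labels and business names is more involved:

```python
# Pseudonymisation register: each new real name is assigned
# "Person N" in order of first appearance. Names that are already
# pseudonyms in the source ("Child A", "Educator 1") are skipped.
import re

SOURCE_PSEUDONYM = re.compile(r"^(Child|Educator|Person)\s", re.IGNORECASE)

def build_register(names):
    register = {}
    for name in names:
        if SOURCE_PSEUDONYM.match(name):
            continue  # already anonymised in the source document
        if name not in register:
            register[name] = f"Person {len(register) + 1}"
    return register
```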

Data flow

Stage 1 — document processing

flowchart TD
    A[1,700+ PDFs from 213A] --> B[Womblex extraction]
    B --> C[1,518 raw text documents]
    C --> D[Deduplication & filtering]
    D --> E[1,444 usable documents]

    E --> F[Semchunk]
    E --> J[Kanon-2 Embedder]

    F --> G[Kanon-2 Enricher]
    G --> H[Entity extraction]
    G --> I[Document structure]
    H --> Y[PII register]

    J --> K[768-dim vectors]
    K --> M[Centroid classification]

Stage 2 — linking & external data

flowchart TD
    O[ACECQA registers] --> P[Master services]
    Q[StartingBlocks / Toddle pricing] --> P
    R[ABS Census 2021] --> P
    S[ACT Policing stats] --> P
    T[GNAF addresses] --> P

Stage 3 — analysis & output

flowchart TD
    N[Document-service links] --> U[Harm rate calculation]
    P[Master services] --> U

    U --> V[Geographic aggregation]
    U --> W[Correlation analysis]
    U --> X[Visualisations]

Tools

Tool | Purpose
Womblex | PDF/Excel/CSV extraction to parquet
semchunk | Semantic chunking for enrichment context windows
Isaacus Kanon-2 Enricher | Entity extraction, document structure
Isaacus Kanon-2 Embedder | Document embeddings (768-dim)
pandas, numpy | Data manipulation
networkx | Graph construction
geopandas, contextily | Geographic mapping
matplotlib | Visualisation

Limitations

Selection bias. We only have documents the government chose to release. The full corpus may contain 20,000+ documents. We do not know the criteria used to select documents for release.

Document counts. Several documents can relate to the same incident. We deduplicated where possible, but some counts likely still include documents that do not reflect unique incidents.

Price data gaps. Not all services have pricing data available. We used ACT averages where data was missing, which may mask price variation.

Capacity vs enrolment. Rates use licensed capacity (maximum places), not actual enrolled children. True exposure rates would require enrolment data we don’t have.

Small samples. Many suburbs have only 1–3 services. Extreme rates in these areas may reflect statistical noise rather than genuine risk differences.

Correlation does not imply causation. Associations between harm rates and NQS ratings, demographics, or crime rates do not establish causal relationships.


Last updated: March 2026

© 2026 DeepCivic. All rights reserved.