Building the AGIFT graph

About

This is the technical methodology for the AGIFT Graph Builder. It accompanies the project README and source repository.

DeepCivic uses the AGIFT Graph Builder to categorise incomplete metadata from open Australian government datasets and map it to DCAT-AP, the metadata standard used by open data portals in Europe and internationally. The graph enables semantic matching between dataset descriptions and the Australian Government Interactive Functions Thesaurus (AGIFT), a controlled vocabulary of government business functions maintained by the National Archives of Australia.

AGIFT is a living vocabulary: the National Archives adds, renames, and restructures terms over time. The pipeline runs on a weekly schedule to keep the graph current.

Pipeline steps

1. Vocabulary fetch

The pipeline walks the TemaTres REST API to retrieve the full AGIFT hierarchy across three levels: top-level functions (L1), secondary functions (L2), and detailed functions (L3). Alternative labels for each term are fetched in a separate pass.

API endpoint: https://vocabularyserver.com/agift/services.php

API task	Purpose	Example
`fetchTopTerms`	Retrieve L1 functions	"Business support and regulation"
`fetchDown`	Walk children of a term	L1 to L2, L2 to L3
`fetchAlt`	Get alternative labels	"hydrology monitoring"

Input: TemaTres REST API | Output: AGIFT term records with hierarchy and alternative labels
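
The walk described above can be sketched as follows. This is a minimal illustration rather than the project's code. It assumes the endpoint accepts `task`, `arg`, and `output=json` query parameters and returns a `result` object keyed by term ID, with `term_id` and `string` fields on each entry. The function names (`build_url`, `fetch_json`, `walk_hierarchy`, `attach_alt_labels`) and the record layout are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://vocabularyserver.com/agift/services.php"


def build_url(task, arg=None):
    """Build a request URL; the parameter names are assumptions about the API."""
    params = {"task": task, "output": "json"}
    if arg is not None:
        params["arg"] = arg
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_json(task, arg=None):
    """Call the API over HTTP and decode the JSON body."""
    with urlopen(build_url(task, arg), timeout=30) as response:
        return json.load(response)


def _terms(payload):
    """Return the term dicts in a payload; a missing or empty result gives []."""
    result = payload.get("result") or {}
    return list(result.values()) if isinstance(result, dict) else list(result)


def walk_hierarchy(fetch=fetch_json, max_depth=3):
    """Breadth-first walk: L1 via fetchTopTerms, then fetchDown level by level."""
    records = []
    frontier = [(term, None) for term in _terms(fetch("fetchTopTerms"))]
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for term, parent_id in frontier:
            term_id = str(term["term_id"])
            records.append({"term_id": term_id, "label": term["string"],
                            "depth": depth, "parent_id": parent_id})
            if depth < max_depth:
                children = _terms(fetch("fetchDown", term_id))
                next_frontier.extend((child, term_id) for child in children)
        frontier = next_frontier
    return records


def attach_alt_labels(records, fetch=fetch_json):
    """Separate pass: one fetchAlt call per term."""
    for record in records:
        alts = _terms(fetch("fetchAlt", record["term_id"]))
        record["alt_labels"] = [alt["string"] for alt in alts]
    return records
```

Passing `fetch` as a parameter keeps the walk independent of the network, so it can be exercised against canned responses.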

2. Graph construction

Terms are upserted into Neo4j as :Term nodes. A MERGE on term_id ensures idempotent writes. Terms are processed in depth order (L1 before L2 before L3) so parent nodes exist before child edges are created.

Each term stores its label, a normalised lowercase label, hierarchy depth (1-3), alternative labels, and a DCAT-AP theme code. The 23 AGIFT top-level functions each map to one of 17 DCAT-AP theme categories (for example, "Health care" maps to HEAL, "Environment" maps to ENVI). Child terms inherit their parent's theme.
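
Theme inheritance can be illustrated with a short sketch. The record fields (`term_id`, `label`, `depth`, `parent_id`, `theme`) and the function name `assign_themes` are assumptions made for this example, and the mapping shows only the two pairs named above, not the full table.

```python
# Only the two mappings mentioned in the text; the real table covers all L1 functions.
TOP_LEVEL_THEMES = {
    "Health care": "HEAL",
    "Environment": "ENVI",
}


def assign_themes(records, top_level_themes=TOP_LEVEL_THEMES):
    """Give each L1 term its mapped theme and copy it down to descendants."""
    theme_by_id = {}
    # Depth order guarantees a parent's theme is known before its children.
    for record in sorted(records, key=lambda r: r["depth"]):
        if record["depth"] == 1:
            theme = top_level_themes.get(record["label"])
        else:
            theme = theme_by_id.get(record["parent_id"])
        theme_by_id[record["term_id"]] = theme
        record["theme"] = theme
    return records
```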

Structural edges (PARENT_OF) link each term to its parent in the AGIFT hierarchy.
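
A minimal sketch of the upsert and edge creation, under stated assumptions: `session` is expected to behave like a Neo4j Python driver session, whose `run` method takes a query string plus keyword parameters. The property names (`label_norm`, `alt_labels`, `theme`) and the edge direction (parent to child) are assumptions, not confirmed details of the schema.

```python
UPSERT_TERM = """
MERGE (t:Term {term_id: $term_id})
SET t.label = $label,
    t.label_norm = $label_norm,
    t.depth = $depth,
    t.alt_labels = $alt_labels,
    t.theme = $theme
"""

LINK_PARENT = """
MATCH (p:Term {term_id: $parent_id}), (c:Term {term_id: $term_id})
MERGE (p)-[:PARENT_OF]->(c)
"""


def upsert_terms(session, records):
    """Write terms in depth order so every parent exists before its edge."""
    for record in sorted(records, key=lambda r: r["depth"]):
        # MERGE on term_id makes a repeated run update rather than duplicate.
        session.run(
            UPSERT_TERM,
            term_id=record["term_id"],
            label=record["label"],
            label_norm=record["label"].strip().lower(),
            depth=record["depth"],
            alt_labels=record.get("alt_labels", []),
            theme=record.get("theme"),
        )
        if record.get("parent_id") is not None:
            session.run(LINK_PARENT, parent_id=record["parent_id"],
                        term_id=record["term_id"])
```

Because both statements use MERGE, running the same batch twice leaves the graph unchanged.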

Change detection compares incoming labels and alternative labels against stored values. Only new or changed terms are flagged for re-embedding in the next stage.
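
The comparison might look like the sketch below. It assumes stored terms are available as a dictionary keyed by `term_id` and that the order of alternative labels is not significant. Both are assumptions made for the example, as are the function names.

```python
def needs_embedding(incoming, stored):
    """Return True when a term is new or its label or alt labels changed."""
    if stored is None:
        return True  # never seen before
    if incoming["label"] != stored["label"]:
        return True
    # Compare alt labels as sorted lists so reordering alone is not a change.
    new_alts = sorted(incoming.get("alt_labels", []))
    old_alts = sorted(stored.get("alt_labels", []))
    return new_alts != old_alts


def flag_changed(incoming_records, stored_by_id):
    """List the term IDs that should be re-embedded in the next stage."""
    return [
        record["term_id"]
        for record in incoming_records
        if needs_embedding(record, stored_by_id.get(record["term_id"]))
    ]
```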

3. Embedding generation

Each term's label is first expanded into a richer text representation that includes its full hierarchical path and alternative labels. For example: "Environment > Water resources management > Water quality monitoring (also known as: hydrology monitoring)". This path-based context produces more informative embeddings than standalone labels.
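
As an illustration, the expansion can be written as a walk from the term up to its root. The `terms` dictionary layout and the function name `build_embedding_text` are assumptions, as is the use of a comma to join multiple alternative labels.

```python
def build_embedding_text(term_id, terms):
    """Join labels from the root down to the term, then append alt labels."""
    path = []
    current = terms[term_id]
    while current is not None:
        path.append(current["label"])
        parent_id = current.get("parent_id")
        current = terms[parent_id] if parent_id is not None else None
    text = " > ".join(reversed(path))
    alt_labels = terms[term_id].get("alt_labels") or []
    if alt_labels:
        text += f" (also known as: {', '.join(alt_labels)})"
    return text


terms = {
    "1": {"label": "Environment", "parent_id": None},
    "2": {"label": "Water resources management", "parent_id": "1"},
    "3": {"label": "Water quality monitoring", "parent_id": "2",
          "alt_labels": ["hydrology monitoring"]},
}
print(build_embedding_text("3", terms))
```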

Two embedding providers are supported:

Provider	Model	Dimensions	Cost
Isaacus	kanon-2-embedder	256, 384, 512, 768, 1024, 1792	Paid API
Local	all-MiniLM-L6-v2	384	Free (CPU)
Local	all-mpnet-base-v2	768	Free (CPU)

The local provider uses sentence-transformers models running on CPU. Models are downloaded on first run and cached in a Docker volume.

Only new or changed terms are embedded unless a full re-embed is requested.
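
A sketch of the selective embedding step with the local provider follows. It assumes the `sentence-transformers` package is installed. The function names, the `full` flag, and the batch size are hypothetical. An Isaacus-backed encoder would be another callable with the same shape and is not shown here.

```python
def load_local_encoder(model_name="all-MiniLM-L6-v2"):
    """Return a texts -> vectors callable backed by sentence-transformers."""
    # Imported lazily so the rest of the module works without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name, device="cpu")
    return lambda texts: model.encode(texts).tolist()


def embed_pending(texts_by_id, pending_ids, encode, full=False, batch_size=32):
    """Embed only flagged terms, or every term when a full re-embed is requested."""
    if full:
        ids = sorted(texts_by_id)
    else:
        ids = [term_id for term_id in pending_ids if term_id in texts_by_id]
    vectors = {}
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        encoded = encode([texts_by_id[term_id] for term_id in batch])
        for term_id, vector in zip(batch, encoded):
            vectors[term_id] = list(vector)
    return vectors
```

A typical call would be `embed_pending(texts, changed_ids, load_local_encoder())`, where `texts` maps term IDs to their expanded text.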

4. Semantic similarity edges

Cosine similarity is computed between all pairs of embedded terms to discover semantic relationships that the structural hierarchy does not capture.

Algorithm:

Fetch all embedded terms from Neo4j, grouped by embedding dimension
Compute pairwise cosine similarity within each dimension group
Create SIMILAR_TO edges where similarity meets the threshold (default 0.70)
Skip pairs that already share a PARENT_OF edge to avoid redundancy

All SIMILAR_TO edges are cleared and rebuilt each run.
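
The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's implementation: embeddings arrive as a `{term_id: vector}` dictionary, structural edges as `(parent, child)` tuples, and the function names are hypothetical. A production version would more likely use a vectorised matrix product.

```python
import math
from collections import defaultdict
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similar_pairs(embeddings, parent_edges, threshold=0.70):
    """Return (term_a, term_b, score) for pairs at or above the threshold."""
    structural = {frozenset(edge) for edge in parent_edges}
    # Group by vector length so only same-dimension embeddings are compared.
    by_dimension = defaultdict(list)
    for term_id, vector in embeddings.items():
        by_dimension[len(vector)].append(term_id)
    pairs = []
    for ids in by_dimension.values():
        for a, b in combinations(sorted(ids), 2):
            if frozenset((a, b)) in structural:
                continue  # already linked by PARENT_OF
            score = cosine(embeddings[a], embeddings[b])
            if score >= threshold:
                pairs.append((a, b, score))
    return pairs
```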

The two edge types carry different weights for query-time flexibility:

Edge type	Weight	Description
`PARENT_OF`	1.0	Structural hierarchy
`SIMILAR_TO`	0.5	Semantic similarity

The similarity threshold and edge weight are configurable via the dashboard or CLI.
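
One way the clear-and-rebuild step and the configurable weight could fit together is sketched below. The relationship properties (`score`, `weight`), the query text, and the function name are assumptions. `session` is expected to behave like a Neo4j Python driver session.

```python
CLEAR_SIMILAR = "MATCH (:Term)-[r:SIMILAR_TO]->(:Term) DELETE r"

CREATE_SIMILAR = """
UNWIND $pairs AS pair
MATCH (a:Term {term_id: pair.source}), (b:Term {term_id: pair.target})
MERGE (a)-[r:SIMILAR_TO]->(b)
SET r.score = pair.score, r.weight = $weight
"""


def rebuild_similarity_edges(session, pairs, weight=0.5):
    """Drop every SIMILAR_TO edge, then write the new set with one weight."""
    session.run(CLEAR_SIMILAR)
    rows = [{"source": a, "target": b, "score": score} for a, b, score in pairs]
    if rows:
        session.run(CREATE_SIMILAR, pairs=rows, weight=weight)
    return len(rows)
```

Storing the weight on each edge lets a query combine structural and semantic hops without hard-coding the values.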

Data flow

flowchart TD
    A[TemaTres REST API] --> B[Hierarchy walk: L1, L2, L3 + alt labels]
    B --> C[Neo4j upsert]
    C --> D[Term nodes + PARENT_OF edges]
    C --> E[Change detection]
    E --> F[Hierarchical text construction]
    F --> G{Embedding provider}
    G -->|Isaacus| H[kanon-2-embedder API]
    G -->|Local| I[sentence-transformers]
    H --> J[Embedding vectors stored on nodes]
    I --> J
    J --> K[Pairwise cosine similarity]
    K --> L[SIMILAR_TO edges]

Deployment

The pipeline is designed for Docker. Three containers run the system: Neo4j for the graph database, a worker container that executes the pipeline on a weekly cron schedule (Wednesday 4:00 AM UTC), and a Flask dashboard for configuration and manual run control.

Tools

Tool	Purpose
TemaTres	AGIFT vocabulary source (REST API)
Neo4j	Graph database
Isaacus kanon-2-embedder	Cloud embedding API
sentence-transformers	Local CPU embeddings
Flask	Dashboard web framework
Docker + cron	Containerised deployment and scheduling

Limitations

Pairwise scaling. Cosine similarity is computed for all term pairs, which is O(n²) in the number of terms. This is tractable for AGIFT's current vocabulary size but would need approximate nearest neighbour methods for significantly larger vocabularies.
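
To make the quadratic growth concrete, the number of unordered pairs among n terms is n(n - 1) / 2. The sizes below are hypothetical round numbers chosen for illustration, not AGIFT's actual term count.

```python
def pair_count(n):
    """Number of unordered pairs among n terms."""
    return n * (n - 1) // 2


for n in (1_000, 10_000):
    print(n, pair_count(n))
# 1000 499500
# 10000 49995000
```

A tenfold increase in terms gives roughly a hundredfold increase in comparisons.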

Threshold sensitivity. The 0.70 default cosine similarity threshold is a judgment call: it is the value that made the most sense after manual review.

No incremental semantic edges. All SIMILAR_TO edges are deleted and rebuilt each run, so the similarity stage costs the same whether one term changed or all of them did.

Cross-dimension incompatibility. Terms embedded at different dimensions cannot be compared. Changing the embedding dimension requires re-embedding all terms.

Local model quality. The sentence-transformers models are general-purpose English models, not fine-tuned for government vocabulary. The Isaacus kanon-2-embedder may produce better domain-specific results but requires a paid API key.

Source vocabulary. The pipeline reflects whatever the TemaTres API returns, including any errors in the source data.

Last updated: March 2026