Technical Graphs

Video overview

Start here: what HAIKU is and why it was built

Councillor Nathan Zamprogno introduces the Hawkesbury Artificial Intelligence Knowledge Utility before the technical diagrams below unpack its retrieval, data-processing, and answer-generation architecture.

HAIKU launch video featuring Councillor Nathan Zamprogno — A plain-English introduction to HAIKU’s purpose, capabilities, and approach to making Council records easier to explore.

System shape

Not a generic chatbot wrapped around a search box

Corpus engineering

The system ingests council business papers, minutes, attachments, policies, planning instruments, web-crawled pages, consultation sites, data-portal pages, demographic profiles, and legacy PDFs. Source metadata is preserved so answers can point back to meetings, documents, URLs, dates, and corpus families.

Hybrid retrieval

Documents are converted into Markdown or structured records, chunked, and embedded in Chroma. At question time, semantic matches are combined with exact-word FTS5/BM25 matches, fused and reranked, then reduced to a citation-ready evidence pack rather than asking the model to rely on memory.

Custom executors

Repeatable civic questions are routed to deterministic code paths where possible: CouncilStats comparisons, motion and vote analysis, meeting histories, attendance, finances, planning controls and guidance, consultation feedback, capital projects, rates, roads, grants, and other structured records.

LLM synthesis

The selected OpenAI model is used where language understanding and synthesis matter most: selecting likely sources, interpreting ambiguous questions, resolving constrained planning or statistical terms, and explaining the evidence. Hard-coded validation and retrieval still govern what reaches the answer.

Hybrid answer path

Custom executors first, LLM synthesis where it adds value

This data-flow view shows how the system avoids treating every query as a blank natural-language problem. Structured civic questions are routed through deterministic executors, open-ended questions use fused semantic and lexical retrieval, and the LLM receives a compact evidence pack for the final explanation.

Data flow diagram showing user questions routed through FastAPI, query planning, custom executors, hybrid semantic and lexical retrieval, OpenAI synthesis, and cited answer rendering. — The main efficiency gain is architectural: the application narrows the problem before invoking the model. The LLM is still central, but it works from curated facts, passages, and citations rather than broad raw context.

Executor route map: how syntax changes the answer path

The planner combines syntax cues such as rank, list all, by year, voted with, can I build, documents mention, or a street address with automatic corpus selection and constrained semantic nominations. Results are then checked against what the question actually asked for before they are accepted.

Infographic showing how HAIKU routes question syntax to validated custom executors or cited hybrid retrieval. — The custom executors act like specialised civic calculators. They are strongest for repeatable records questions such as voting, attendance, recusals, planning controls, financial series, tenders, grants, and topic histories; open-ended synthesis falls back to cited hybrid retrieval and LLM reasoning. Public chat never executes model-generated Python.
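
The cue-based routing described above can be sketched as a small pattern table. This is a minimal illustration only: the executor names and the regular expressions are hypothetical and are not taken from the HAIKU codebase, and the real planner also uses corpus selection and semantic nominations that are not shown here.

```python
import re
from typing import Optional

# Hypothetical cue table: each syntax cue maps to an illustrative executor name.
ROUTES = [
    (re.compile(r"\bvoted with\b", re.I), "vote_alignment"),
    (re.compile(r"\b(rank|list all|by year)\b", re.I), "records_table"),
    (re.compile(r"\bcan i build\b", re.I), "planning_controls"),
    (re.compile(r"\bdocuments mention\b", re.I), "document_search"),
]

def nominate_executor(question: str) -> Optional[str]:
    """Return the first executor whose syntax cue matches, else None."""
    for pattern, name in ROUTES:
        if pattern.search(question):
            return name
    return None  # no cue matched: fall back to cited hybrid retrieval
```

A question with no recognised cue returns None, which corresponds to the fallback path through hybrid retrieval and LLM synthesis.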

Interactive artifacts

Two graph views of the project

These previews are screenshots. Select either image to open the full interactive graph on this same host.

Interactive graph of the codebase structure and communities

Code graph: Codebase structure and module communities. The refreshed Graphify scan maps 5,359 public-safe code and documentation nodes, 13,836 relationships, and 243 communities. The largest communities are relabelled into human-readable functional areas.

Interactive semantic graph of document themes and document families

Semantic document atlas: Themes, subthemes, sources, and document families. The refreshed atlas groups 21,113 corpus records into 13,969 canonical document families across eight sources, 28 civic themes, and 136 subthemes, revealing more detail as the viewer zooms in.

Admin workflow

A complete evidence production pipeline, not a one-off upload

The admin interface coordinates the local evidence factory behind the chatbot: source discovery, document download, segmentation, Markdown conversion, structured sidecar generation, vector and lexical indexing, citation previews, quality checks, and deployment sync all sit in one workflow.

Admin Council Web Crawler page showing synced meeting-year folders and downloaded Council documents. — Council Web Crawler: tracks yearly meeting documents and downloads new or changed source files.

Admin Semantic Chunker page showing Markdown files selected for vector embedding. — Semantic Chunker: segments converted Markdown into context-aware chunks for vector embeddings.

Admin LLM and database management page showing vector database statistics, local models, and cloud model options. — LLM and DB Management: monitors vector-store size, the selected cloud model, and token costs.

The workflow starts by scraping configured source websites, including Council meeting repositories, Council web corpuses, consultation and data sites, and Councillor Zamprogno's site. New and changed files are downloaded, catalogued, and checked against prior metadata so repeat runs can skip unchanged material.
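
One common way to skip unchanged material on repeat runs is to compare a content digest against the value stored on the previous run. The sketch below assumes a SHA-256 digest keyed by URL; the page does not say which metadata HAIKU actually compares, so the function and catalogue shape are hypothetical.

```python
import hashlib

def content_digest(data: bytes) -> str:
    """Hex SHA-256 digest of a downloaded file's bytes."""
    return hashlib.sha256(data).hexdigest()

def is_new_or_changed(url: str, data: bytes, catalogue: dict[str, str]) -> bool:
    """Return True for new or changed content and record its digest.

    `catalogue` is a hypothetical mapping of URL -> digest from prior runs.
    """
    digest = content_digest(data)
    if catalogue.get(url) == digest:
        return False  # unchanged since the last run, safe to skip
    catalogue[url] = digest
    return True
```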

Meeting papers are then split into item-level records: agendas and minutes are segmented, attachments are distributed to their parent items, page-number provenance is correlated, and PDF material is converted into Markdown. Those Markdown files are chunked and embedded into sharded Chroma vector stores.
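
As a rough illustration of chunking converted Markdown, the sketch below splits on headings and packs paragraphs up to a size limit, carrying the heading along as context. It is an assumption-laden simplification: the function name, the character limit, and the rule that a heading sits in its own blank-line-separated block are all hypothetical, and the real Semantic Chunker may work quite differently.

```python
def chunk_markdown(text: str, max_chars: int = 1200) -> list[dict]:
    """Split Markdown on headings, then pack paragraphs up to max_chars."""
    chunks: list[dict] = []
    heading = ""
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered paragraphs as one chunk under the current heading.
        if buffer:
            chunks.append({"heading": heading, "text": "\n\n".join(buffer)})
            buffer.clear()

    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            flush()  # close the previous section before switching heading
            heading = block.lstrip("#").strip()
            continue
        if buffer and sum(len(b) for b in buffer) + len(block) > max_chars:
            flush()  # adding this paragraph would exceed the size limit
        buffer.append(block)
    flush()
    return chunks
```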

In parallel, preprocessing compiles JSON sidecars and global indexes for factual executors: votes, attendance, conflicts, keywords, financial tables, rates, capital works, road-network subtotals, procurement, grants, CouncilStats, and other recurring civic statistics. A generated SQLite FTS5 index mirrors the searchable Chroma chunks, while preview workflows create safe first-page or website snapshots for citations. Cloud synchronisation then compares and transfers the tested serving assets to the Oracle deployment.
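
A lexical index of this kind can be sketched with Python's built-in sqlite3 module, assuming the local SQLite build includes the FTS5 extension (most standard builds do). The table and column names are hypothetical, and the sample rows in the usage note are invented purely for illustration.

```python
import sqlite3

def build_index(chunks: list[tuple[str, str]]) -> sqlite3.Connection:
    """Create an in-memory FTS5 table from (chunk_id, text) pairs."""
    conn = sqlite3.connect(":memory:")
    # chunk_id is stored but not tokenised; only `text` is searchable.
    conn.execute("CREATE VIRTUAL TABLE chunks_fts USING fts5(chunk_id UNINDEXED, text)")
    conn.executemany("INSERT INTO chunks_fts (chunk_id, text) VALUES (?, ?)", chunks)
    return conn

def lexical_search(conn: sqlite3.Connection, query: str, limit: int = 5) -> list[str]:
    """Return chunk ids ordered by BM25 relevance (best match first)."""
    rows = conn.execute(
        "SELECT chunk_id FROM chunks_fts WHERE chunks_fts MATCH ? "
        "ORDER BY bm25(chunks_fts) LIMIT ?",
        (query, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

For example, after `conn = build_index([("c1", "Council resolved to adopt the flood levy"), ("c2", "Road resealing program")])`, calling `lexical_search(conn, "levy")` returns only the first chunk. FTS5's bm25() function returns smaller values for better matches, which is why the query sorts in ascending order. Raw user input may contain FTS5 query syntax, so a production path would need to sanitise or quote it first.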

Vertical preprocessing workflow showing admin stages from source scraping through item folders, Markdown conversion, embeddings, sidecar and JSON compilation, validation, cloud synchronisation, and public chatbot serving. — The preparation path is deliberately staged: scrape first, preserve page and item provenance, convert to searchable Markdown, build vectors and structured JSON, then sync the tested local build to the cloud.

Retrieval strategy

How a question becomes an answer

1. Parse intent

The backend resolves dates and Council terms, classifies the request, selects likely corpuses, and may nominate a compatible executor with typed, grounded slots. Lexical route matches retain priority over semantic nominations.

2. Select evidence

The public chat defaults to automatic source selection. Ordinary search runs Chroma semantic retrieval and SQLite FTS5/BM25 retrieval in parallel, then combines their ranked results with reciprocal-rank fusion.
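
Reciprocal-rank fusion scores each document by summing 1 / (k + rank) across the ranked lists it appears in. The sketch below is a generic implementation, not HAIKU's own code: the function name is hypothetical, and k = 60 is the constant commonly used in the literature rather than a value confirmed by this page.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of ids into one list, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

A chunk that appears near the top of both the semantic and the lexical list outranks one that appears at the top of only one of them, which is the behaviour that makes the fused pool useful.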

3. Route to tools

If a deterministic executor is a better fit, the dispatcher runs it and validates the result against the question's obligations. Useful but incomplete results can be supplemented with focused retrieval.
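
One way to picture the validation step is as a check that every field the question requires is present in the executor's rows. The sketch below is a hypothetical simplification: the three outcome labels and the idea of expressing obligations as a set of field names are assumptions, not a description of the actual dispatcher.

```python
def check_obligations(rows: list[dict], required: set[str]) -> tuple[str, set[str]]:
    """Classify an executor result as 'accept', 'supplement', or 'reject'.

    Also returns the required fields that are missing.
    """
    if not rows:
        return "reject", set(required)
    # Fields that are present in every returned row.
    present = set.intersection(*(set(row) for row in rows))
    missing = required - present
    return ("supplement" if missing else "accept"), missing
```

A "supplement" outcome corresponds to a useful but incomplete result, where focused retrieval would be asked to fill in the missing fields.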

4. Synthesise with citations

Retrieved passages and executor outputs are assembled into a prompt for the LLM. The answer is rendered with visible citations, grouped meeting-item sources, optional previews, activated-corpus disclosure, and guardrails against unsupported claims.
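
The prompt assembly step can be illustrated with numbered sources that the model is asked to cite. This is a sketch under stated assumptions: the instruction wording, the function name, and the evidence fields (title, date, text) are hypothetical and do not reproduce HAIKU's actual prompt.

```python
def build_prompt(question: str, evidence: list[dict]) -> str:
    """Assemble a compact, citation-ready prompt from evidence items."""
    lines = ["Answer using only the numbered sources below. Cite sources as [n].", ""]
    for n, item in enumerate(evidence, start=1):
        # Each source keeps its metadata so the citation can be traced back.
        lines.append(f"[{n}] {item['title']} ({item['date']}): {item['text']}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)
```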

Tooling

What each layer contributes

FastAPI backend: authentication, chat APIs, streaming responses, corpus routing, executor dispatch, and admin workflows.

Vite frontend: public chat interface, saved sessions, source controls, account overlay, and this technical graph page.

Chroma + SQLite FTS5: parallel semantic and exact-word retrieval, fused into one ranked evidence pool across meetings, policies, and web corpuses.

OpenAI models: question interpretation, answer synthesis, graph taxonomy generation, and human-readable cluster labelling.

Graphify and graph builders: repository structure visualisation plus a bespoke semantic atlas for document themes and corpus relationships.

Custom executors: deterministic, validated answers for repeatable civic-information tasks where document retrieval alone is too noisy or incomplete.

Why the hybrid approach matters

Lower token costs, better reliability

A retrieval-only chatbot can find plausible passages, but it often spends tokens rediscovering structure the application already knows: meetings have dates, agenda items have identifiers, votes have named participants, and policies have stable document families. Encoding that knowledge in custom executors reduces the amount of context sent to the model and improves answer repeatability.

The LLM remains essential, but it is used as a reasoning and explanation layer. The lower layers narrow the question, fuse semantic and exact-word evidence, validate structured results, preserve source metadata, and produce compact intermediate results. That creates a more interrogable system: people can inspect the corpus, route, executor output, citations, previews, and the visual graphs that explain the project structure.

Cybersecurity

Standard protections built into the system

The HAIKU public chatbot and admin dashboard use separate access controls, and the cloud deployment blocks local-only maintenance workflows that should remain on the workstation.

Authentication and sessions

Separate public-user and administrator authentication flows.
Google sign-in plus email registration with confirmation links.
bcrypt password hashing and tokenised password-reset flows for email accounts.
Signed JWT sessions with role and token-type checks before protected actions.
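
The role and token-type checks can be sketched as a function over claims that have already been decoded. Signature and expiry verification are assumed to have been done by a JWT library beforehand, and the claim names `role` and `token_type` are hypothetical.

```python
def authorise(claims: dict, required_role: str, expected_type: str = "access") -> bool:
    """Check token type and role on already-verified JWT claims."""
    if claims.get("token_type") != expected_type:
        return False  # e.g. a password-reset token presented as a session token
    return claims.get("role") == required_role
```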

Access control

Admin APIs and the admin chat endpoint require admin JWTs.
Cloud mode blocks local ingestion, crawler, database, LLM, and pipeline maintenance endpoints.
Configurable CORS origins restrict browser API access to approved frontend hosts.
Secrets, environment files, local databases, logs, virtual environments, and cache folders are excluded from cloud code sync targets.
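
Blocking local-only maintenance routes in cloud mode can be pictured as a simple prefix check. The path prefixes below are invented for illustration and are not HAIKU's real routes.

```python
# Hypothetical path prefixes for local-only maintenance routes.
LOCAL_ONLY_PREFIXES = (
    "/admin/ingest",
    "/admin/crawler",
    "/admin/database",
    "/admin/llm",
    "/admin/pipeline",
)

def is_blocked(path: str, cloud_mode: bool) -> bool:
    """Block local-only maintenance routes when running in cloud mode."""
    return cloud_mode and path.startswith(LOCAL_ONLY_PREFIXES)
```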

Abuse and cost controls

IP and user rate limits on login, registration, password reset, Google auth, and public chat.
Rolling public-chat quotas by query count, with optional token and estimated-cost quotas.
Maximum query length enforcement before expensive retrieval or model calls.
Administrator controls for custom user quotas and account suspension.
Generated Python is disabled for public chat and cloud serving; the optional constrained runner is local-admin only.
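
A rolling query-count quota can be sketched with a sliding window of timestamps. The class name and parameters are hypothetical, and the page does not state how HAIKU stores quota state, so this in-memory version is only an illustration of the idea.

```python
from collections import deque

class RollingQuota:
    """Allow at most `limit` events within any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self.events: deque[float] = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the rolling window.
        while self.events and now - self.events[0] >= self.window:
            self.events.popleft()
        if len(self.events) >= self.limit:
            return False
        self.events.append(now)
        return True
```

Passing the current time in as an argument keeps the sketch easy to test; a real service would typically read a monotonic clock and keep one window per user or IP address.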

Auditability

Security audit events record authentication failures, rate limits, blocked access, quota blocks, and admin actions.
Audit metadata includes request path, method, IP address, user agent, actor type, user ID, and status code where available.
Admin dashboards expose user activity, question history, quota state, cloud usage, and moderation controls.
Source citations, page links, executor methodology notes, and structured sidecars help trace answer provenance.
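
The audit metadata listed above maps naturally onto a small record type. The sketch below is hypothetical: the field names follow the list, but the class, the JSON log format, and the choice to omit unavailable fields are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class AuditEvent:
    event: str
    path: str
    method: str
    ip: str
    user_agent: str
    actor_type: str
    user_id: Optional[int] = None      # recorded only where available
    status_code: Optional[int] = None  # recorded only where available

def to_log_line(event: AuditEvent) -> str:
    """Serialise an audit event as one JSON line, omitting missing fields."""
    data = {key: value for key, value in asdict(event).items() if value is not None}
    return json.dumps(data, sort_keys=True)
```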

Project scale audit

What the system is built on

The counts below separate application code, semantic document records, vector-search chunks, and structured extractor records, so the project's size is visible at more than one level. This audit was refreshed on 18 July 2026 alongside both interactive graphs.

Evidence base: 1,151 actual dated meetings

Council meeting coverage runs from 10 February 1981 through the latest available 2026 Council and HLPP material. The published 1981-2006 archive adds 558 meeting folders to the searchable evidence base.

137,266 source lines of application, tests, and tooling code

21,113 document cards mapped in the semantic atlas

361,787 embedded retrieval chunks in Chroma

334,444 structured extracted records

Codebase

Python: 96,536 lines
HTML: 13,856 lines
JavaScript: 17,848 lines
CSS: 8,568 lines
Shell: 458 lines

Counted with tools/count_project_lines.py. Excludes generated frontend bundles, corpuses, virtual environments, vector databases, logs, and runtime data.

Vector-store footprint

Sharded serving store: 17 GB
Chroma SQLite files: 13 GB
Shard databases: 58 files
Serving path: chroma_shards

Physical disk usage for backend/database/chroma_shards, excluding legacy fallback Chroma folders outside the serving shard layout.

Semantic document atlas

Document cards: 21,113
Canonical families: 13,969
Themes / subthemes: 28 / 136
Source families: 8

The public atlas renders 1,800 representative document families while counting restricted titles without exposing them as individual nodes.

Meeting coverage

Council meetings: 1,101 actual
Cancelled placeholders: 1
HLPP meetings: 50
Oldest Council record: 10 Feb 1981

The 1981-2006 contribution is counted from meeting folders in Distributed_Corpus_1981-2006. Pre-2007 records are document-level; structured voting and attendance detail is strongest from 2007 onward.

Semantic cards by source

Council meetings: 7,288

Your Hawkesbury Your Say: 4,738

Profile.ID demographics: 3,783

Legacy PDF archive: 2,464

HCC website: 2,268

Policy and planning documents: 318

Open Data Portal: 213

Companion Animal Shelter: 41

Compiled knowledge

Financial table data points: 316,320

Issue tracker entries: 7,288

CouncilStats rows: 5,228

Capital works project records: 4,478

Conflict-of-interest entries: 592

Grant and rates records: 538

Keyword assignments: 47,276