Overview
High-level metrics for the RAG processing pipeline.
Total Documents
Processed
Failed / Errors
Pipeline Execution Logs
RAG Behavior Guidelines
Modify the foundational identity instructions for your Retrieval Augmented Generation bot. Note: citation behaviors and structural injection variables are hardcoded safely in the backend. Focus strictly on tone and conciseness rules here.
Distribute Attachments
Scans the LegacyCorpus for all files (excluding agendas/minutes) and maps them into the
flattened Distributed_Corpus item folders based on their attachment number naming
convention.
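The mapping step could be sketched roughly as follows. The real attachment naming convention is not shown on this page, so the `Attachment <n> to Item <id>` pattern, the `classify` helper, and the status strings are all hypothetical stand-ins.

```python
from __future__ import annotations

import re

# Hypothetical convention: "Attachment 2 to Item 4.1.3 - Title.pdf".
# The pipeline's actual naming convention is not documented here.
ATTACHMENT_RE = re.compile(
    r"attachment\s+(?P<att>\d+)\s+to\s+item\s+(?P<item>\d+(?:\.\d+)*)",
    re.IGNORECASE,
)
SKIP_WORDS = ("agenda", "minutes")  # these documents are excluded from distribution


def classify(filename: str) -> tuple[str, object]:
    """Return ("skipped", None), ("mapped", (item_id, attachment_no)) or ("orphaned", reason)."""
    lowered = filename.lower()
    if any(word in lowered for word in SKIP_WORDS):
        return ("skipped", None)
    match = ATTACHMENT_RE.search(filename)
    if match is None:
        return ("orphaned", "no attachment number found in filename")
    return ("mapped", (match.group("item"), int(match.group("att"))))
```

Files that come back as `"orphaned"` are the kind of entry the Orphaned Files Report would list, together with the reason string.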
Orphaned Files Report
Files that could not be confidently mapped to a Distributed_Corpus item.
| Year | Meeting | Orphaned Filename | Reason |
|---|---|---|---|
| No distribution report available. Run the distribution cycle first. | | | |
Conversion to Markdown
Convert item PDFs into semantically rich Markdown using Marker and/or Docling. Select files from the tree below and choose a conversion engine.
Distributed Corpus — PDF Files
Select PDFs for conversion. Status badges: ⬜ None 🟢 Marker 🔵 Docling ✅ Both
Semantic Chunker & Embeddings
Scans the `Distributed_Corpus` for Markdown files, splits them into 1500-character semantic blocks, enriches each block with chronological context and JSON URLs, and ingests the blocks into a local ChromaDB vector store using the BAAI/bge-m3 Hugging Face model.
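A minimal sketch of the chunking and enrichment idea, assuming paragraph-based packing up to the 1500-character limit. The actual splitting strategy is not described here, and the `meeting_date` and `source_url` field names are hypothetical. Embedding with bge-m3 and the ChromaDB ingest are left out.

```python
from __future__ import annotations

MAX_CHARS = 1500  # block size stated in the panel description


def chunk_markdown(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Pack paragraphs into chunks of at most max_chars characters."""
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # Hard-split any single paragraph that exceeds the limit on its own.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


def enrich(chunk: str, meeting_date: str, source_url: str) -> dict:
    """Attach hypothetical chronological and URL metadata to one chunk."""
    return {
        "text": f"[Meeting date: {meeting_date}]\n{chunk}",
        "metadata": {"meeting_date": meeting_date, "source_url": source_url},
    }
```

Packing whole paragraphs keeps related sentences together, at the cost of chunks that vary in length below the limit.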
🧠 Embedding Engine
Convert .md to ChromaDB Vectors
Distributed Corpus — Markdown Files
Select converted .md files to embed into Vector Space.
Document Expansion (Keywords and Issue Metadata)
Uses the currently selected LLM to read each main Agenda item report, generate better search terms, and write structured issue metadata for future longitudinal executors.
LLM Metadata Generation Pipeline
Compile Master Item and Issue Indexes
Sweeps the Distributed Corpus to rebuild the global item, voting, and issue-tracker JSON indexes used by structured executors.
Generation Logs
Attendance / Voting / Recusal Metadata Extraction
Rebuilds per-item voting sidecars, extracts conflict-of-interest and recusal sidecars, and then infers Councillor attendance from complete formal vote records.
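One way to picture the attendance inference, as a sketch only: the sidecar layout (`for`, `against`, `recused`, `absent` lists), the completeness rule, and the status labels below are assumptions, not the pipeline's actual schema.

```python
from __future__ import annotations

VOTE_KEYS = ("for", "against", "recused", "absent")  # hypothetical sidecar fields


def is_complete(vote: dict, roster: set[str]) -> bool:
    """Treat a record as complete when it accounts for every councillor on the roster."""
    named: set[str] = set()
    for key in VOTE_KEYS:
        named.update(vote.get(key, []))
    return roster <= named


def infer_attendance(roster: list[str], votes: list[dict]) -> dict[str, str]:
    """Infer per-councillor attendance for one meeting from its complete vote records."""
    members = set(roster)
    present: set[str] = set()
    used = 0
    for vote in votes:
        if not is_complete(vote, members):
            continue  # incomplete records are ignored rather than guessed at
        used += 1
        for key in ("for", "against", "recused"):
            present.update(vote.get(key, []))
    if used == 0:
        return {name: "unknown" for name in roster}
    return {name: "present" if name in present else "absent" for name in roster}
```

In this sketch a recused councillor counts as present, since a recusal is recorded during the meeting.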
Attendance / Voting / Recusal Pipeline
Extraction Logs
Website Article Scraping
Crawls councillorzamprogno.info via the WP-JSON API, pulls full posts, converts them to Markdown internally via Docling, and catalogs metadata for RAG citation.
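The crawl could look roughly like the sketch below, which pages through the standard WordPress REST endpoint `/wp-json/wp/v2/posts`. The `https` scheme, the helper names, and the catalog field names are assumptions, and the Docling conversion step is omitted.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://councillorzamprogno.info"  # scheme assumed


def posts_url(page: int, per_page: int = 100, base: str = BASE_URL) -> str:
    """Build the WP REST API URL for one page of posts (100 is the API maximum)."""
    query = urlencode({"per_page": per_page, "page": page})
    return f"{base}/wp-json/wp/v2/posts?{query}"


def catalog_entry(post: dict) -> dict:
    """Keep the fields needed for citation; the output keys are hypothetical."""
    return {
        "id": post["id"],
        "url": post["link"],
        "published": post["date"],
        "title": post["title"]["rendered"],
        "html": post["content"]["rendered"],
    }


def fetch_all_posts():
    """Yield catalog entries for every post, following the X-WP-TotalPages header."""
    page = 1
    while True:
        with urlopen(posts_url(page)) as resp:
            total_pages = int(resp.headers.get("X-WP-TotalPages", "1"))
            for post in json.load(resp):
                yield catalog_entry(post)
        if page >= total_pages:
            break
        page += 1
```

Only `fetch_all_posts` touches the network; the two helpers are pure functions and can be checked offline.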
Website Scraping Pipeline
Live Scrape Logs
Council and Related Website Scrapers
Manage the main Hawkesbury Council website plus the related data, engagement, and profile portals. Each site below has its own HTML/PDF spider, Docling parser, vector injection pipeline, and cleanup controls.
Vector Database
- Total Indexed Chunks: Loading...
- Disk Footprint: Loading...
Danger Zone
Wipe individual segments of the vector database. After wiping a segment, re-run its pipeline to restore search functionality for that content.
Local LLM Registry
Ollama: Checking...
Installed Models
API Token Cost Tracker
Aggregated cloud usage across all sessions.
Cloud-based LLM models
OpenAI: Ready
Requires an active .env file with OPENAI_API_KEY set.
Pipeline Actions
Execute these actions sequentially to ingest new policy documents.
Execution Log
Retrieval Augmented Generation (RAG) Playground
Registered Public Chat Users
Review account activity, cloud cost, quotas, suspension state, and replay public questions in the admin playground.
Assign Agenda and Minutes documents manually if automatic classification failed.
Scans all PDFs in the corpus to determine the offset between printed footer page numbers and physical PDF page indices. Older documents often have cover/TOC pages that shift the numbering. Running this step before extraction ensures correct page splitting.
This determines the offset between TOC page numbers and physical PDF pages.
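A minimal sketch of how such an offset could be estimated, assuming the footer number has already been read from each page. The majority-vote approach and the `correlate_offset` helper are illustrative, not the pipeline's confirmed method.

```python
from __future__ import annotations

from collections import Counter
from typing import Optional


def correlate_offset(printed_numbers: list[Optional[int]]) -> Optional[dict]:
    """Estimate the offset such that physical_index == printed_number + offset.

    printed_numbers[i] is the footer number read from physical page i (0-based),
    or None when no number could be read (cover pages, TOC pages, blanks).
    """
    offsets = Counter(
        index - number
        for index, number in enumerate(printed_numbers)
        if number is not None
    )
    if not offsets:
        return None
    offset, votes = offsets.most_common(1)[0]
    return {
        "offset": offset,
        "confidence": votes / sum(offsets.values()),  # share of readable pages that agree
        "pages": len(printed_numbers),
    }
```

For example, a document with three unnumbered front pages followed by pages printed 1, 2, 3, 4 and one misread footer gives an offset of 2 with confidence 0.8.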
Correlation Results
| Year | Meeting | Document | Offset | Confidence | Pages |
|---|---|---|---|---|---|
| Run correlation to see results. | | | | | |
Extracts the Table of Contents from mapped Agenda PDFs, uses LLM parsing to identify individual business items, and splits the large documents into distributed folders.
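Once the business items and their printed start pages are known, the split reduces to computing physical page ranges. The sketch below assumes each item runs until the page before the next item starts, and that the offset from the correlation step is available. The `item_page_ranges` helper and its input shape are hypothetical.

```python
from __future__ import annotations


def item_page_ranges(
    toc: list[tuple[str, int]], offset: int, total_pages: int
) -> dict[str, tuple[int, int]]:
    """Map each item ID to an inclusive (first, last) range of 0-based physical pages.

    toc holds (item_id, printed_start_page) pairs, and
    physical_index == printed_page + offset.
    """
    ordered = sorted(toc, key=lambda entry: entry[1])
    ranges: dict[str, tuple[int, int]] = {}
    for idx, (item_id, start) in enumerate(ordered):
        first = start + offset
        if idx + 1 < len(ordered):
            last = ordered[idx + 1][1] + offset - 1
        else:
            last = total_pages - 1
        # Items that start on the same page as the next item still get that page.
        ranges[item_id] = (first, max(first, last))
    return ranges
```

The resulting ranges can then be handed to whichever PDF library performs the actual page extraction.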
Monitor the real-time execution output in the Database Overview Logs.
Extraction Queue
Extraction Validation
Iterates through Distributed_Corpus to verify physical page numbers and titles.
| Status | Item ID | Issue |
|---|---|---|
| No validation report. | | |
Extracts the Table of Contents from mapped Minutes PDFs, uses the same regex parsing to identify individual business items, and distributes the minutes pages into the existing item folders.
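As an illustration of regex-based TOC parsing, the sketch below assumes a hypothetical line layout of item number, title, dot leader, and page number. The pattern actually used by the pipeline is not shown on this page.

```python
from __future__ import annotations

import re

# Hypothetical TOC line: "4.1.2  Planning Proposal for Example Street ........ 15"
TOC_LINE_RE = re.compile(
    r"^(?P<item>\d+(?:\.\d+)+)\s+(?P<title>.+?)\s*\.{2,}\s*(?P<page>\d+)\s*$"
)


def parse_toc(lines: list[str]) -> list[dict]:
    """Return one record per line that looks like a business item; skip everything else."""
    items = []
    for line in lines:
        match = TOC_LINE_RE.match(line.strip())
        if match is None:
            continue  # section headings, blank lines, page furniture
        items.append(
            {
                "item_id": match.group("item"),
                "title": match.group("title"),
                "page": int(match.group("page")),
            }
        )
    return items
```

Lines that do not fit the pattern are dropped silently here; a real run would probably log them so the validation step can flag missing items.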
Monitor the real-time execution output in the Database Overview Logs.
Extraction Queue
Minutes Extraction Validation
Iterates through Distributed_Corpus to verify physical page numbers and titles of extracted minute PDFs.
| Status | Item ID | Issue |
|---|---|---|
| No validation report. | | |