How this was built

How gazetteer was built

A portfolio walk-through for a technical reviewer: the architecture, the data model, the design decisions, and the human-directed-AI workflow that produced it. Everything below is a curated, point-in-time snapshot as of v6 (2026-06-16) — it reflects the platform at that milestone and will drift as the project evolves; the vintage label keeps it honest. ← Back to the dashboard · New here? Start with the guide → for a plain-language tour of what the tool is and how to read each view.

Gazetteer integrates free, public US place-based statistics — housing values, cost of living, income, population, jobs, health — so the same place can be compared across the same period, despite sources that natively disagree on grain, coverage, key system, cadence, and measure type. The reconciliation is the point. It runs on a 4-layer medallion architecture (DuckDB locally, portable to Snowflake), with Data Vault 2.0 as the integration layer and a Kimball star as the consumption layer.

Architecture

public sources (Census · BEA · BLS · Zillow · IRS · FHFA) → bronze (data lake, Parquet) → silver (Data Vault 2.0 — spine + measures) → gold (star + PIT + business vault) → platinum (use-case marts) → Evidence dashboards

Side loop (AD-5): platinum → ml/ forecast (read-only) → forecast lands back in gold as a measure.

Mermaid source (verbatim from docs/STATUS.md — renders as a diagram once an SVG is committed)

graph LR
    SRC["public sources<br/>Census · BEA · BLS · Zillow · IRS · FHFA"]
    BRZ["bronze<br/>data lake (Parquet) ✓"]
    SLV["silver<br/>Data Vault 2.0<br/>spine + measures ✓"]
    GLD["gold<br/>star + PIT + business vault ✓"]
    PLT["platinum<br/>use-case marts ✓"]
    BI["Evidence<br/>dashboards ✓"]
    ML["ml/ forecast ✓<br/>(AD-5 round-trip)"]
    SRC --> BRZ --> SLV --> GLD --> PLT --> BI
    PLT -.read-only.-> ML
    ML -.forecast back to vault.-> GLD
    classDef done fill:#238636,stroke:#1a6e2e,color:#fff
    classDef next fill:#9e6a03,stroke:#7a5202,color:#fff
    classDef future fill:#21262d,stroke:#3a3f48,color:#8a8f98
    class SRC,BRZ,SLV,GLD,PLT,BI,ML done

Public sources → bronze (rename/cast only) → silver (Data Vault 2.0: hub, links, satellites — hashing + structure) → gold (Kimball star + business-vault computed satellites — all business logic lives here) → platinum (use-case marts the dashboard reads). ML training reads a mart read-only and lands its forecast back in the vault as just another measure (AD-5, below).

The data model (DV2.0 + Kimball)

Object counts — 21 silver · 8 gold · 2 platinum by type: 1 hub · 3 link · 2 bridge · 15 sat · 4 cs · 1 pit · 2 dim · 1 fact · 2 mart

Silver — raw vault (the Data Vault 2.0 core)

h_geography is the spine; every measure source hangs a satellite off it, and crosswalk links/bridges relate geographies. One hub, three links, two bridges, and fifteen source satellites all key to (geo_type, geoid):

Off the hub	Objects
Hub	`h_geography` (the spine)
Crosswalk links	`l_geography_cbsa` · `l_geography_zcta_county` · `l_migration`
Bridges	`brg_bea_county_fips` · `brg_tract_2020_2010`
Measure satellites	ACS income · BEA income · BEA metro RPP · BEA state RPP · BLS unemployment · FHFA HPI · gazetteer (name/area) · PEP population · PLACES health (county + tract) · tract identity · Zillow ZHVI
Link satellites	CBSA membership · ZCTA→county allocation · migration flows

Mermaid source (verbatim, dbt-manifest-generated — renders as a graph once an SVG is committed)

graph LR
    brg_bea_county_fips["brg_bea_county_fips"]:::bridge
    brg_tract_2020_2010["brg_tract_2020_2010"]:::bridge
    h_geography["h_geography"]:::hub
    l_geography_cbsa["l_geography_cbsa"]:::link
    l_geography_zcta_county["l_geography_zcta_county"]:::link
    l_migration["l_migration"]:::link
    s_geography_acs5_income["s_geography_acs5_income"]:::sat
    s_geography_bea_income["s_geography_bea_income"]:::sat
    s_geography_bea_metro_rpp["s_geography_bea_metro_rpp"]:::sat
    s_geography_bea_state_rpp["s_geography_bea_state_rpp"]:::sat
    s_geography_bls_unemployment["s_geography_bls_unemployment"]:::sat
    s_geography_fhfa_hpi["s_geography_fhfa_hpi"]:::sat
    s_geography_gazetteer["s_geography_gazetteer"]:::sat
    s_geography_pep_population["s_geography_pep_population"]:::sat
    s_geography_places_county["s_geography_places_county"]:::sat
    s_geography_places_tract["s_geography_places_tract"]:::sat
    s_geography_tract["s_geography_tract"]:::sat
    s_geography_zillow_zhvi["s_geography_zillow_zhvi"]:::sat
    s_l_cbsa_membership["s_l_cbsa_membership"]:::sat
    s_l_zcta_county_allocation["s_l_zcta_county_allocation"]:::sat
    s_migration["s_migration"]:::sat
    h_geography --> brg_bea_county_fips
    h_geography --> l_geography_cbsa
    l_geography_cbsa --> s_l_cbsa_membership
    h_geography --> l_geography_zcta_county
    l_geography_zcta_county --> s_l_zcta_county_allocation
    h_geography --> l_migration
    l_migration --> s_migration
    h_geography --> s_geography_acs5_income
    h_geography --> s_geography_bea_income
    h_geography --> s_geography_bea_metro_rpp
    h_geography --> s_geography_bea_state_rpp
    h_geography --> s_geography_bls_unemployment
    h_geography --> s_geography_fhfa_hpi
    h_geography --> s_geography_gazetteer
    h_geography --> s_geography_pep_population
    h_geography --> s_geography_zillow_zhvi
    h_geography --> s_geography_tract
    h_geography --> s_geography_places_tract
    h_geography --> s_geography_places_county
    h_geography --> brg_tract_2020_2010
    classDef hub fill:#4a90d9,stroke:#2c5f8a,color:#fff
    classDef link fill:#7ed321,stroke:#5a9216,color:#1b3a08
    classDef bridge fill:#7ed321,stroke:#5a9216,color:#1b3a08
    classDef sat fill:#f5a623,stroke:#b9791a,color:#4e342e
    classDef masat fill:#f5a623,stroke:#b9791a,color:#4e342e
    classDef cs fill:#e8923a,stroke:#a85d12,color:#fff
    classDef pit fill:#b07cd6,stroke:#6c3483,color:#fff
    classDef dim fill:#4a90d9,stroke:#2c5f8a,color:#fff
    classDef fact fill:#c0392b,stroke:#7b241c,color:#fff
    classDef mart fill:#16a085,stroke:#0e6655,color:#fff
    classDef stg fill:#3a3f48,stroke:#21262d,color:#8a8f98

Gold — star + business vault

The Kimball star (dim_* → fct_place_year) plus the business vault (cs_* computed satellites, pit_place_year the version-resolving spine). Business logic lives here.

Role	Objects
Conformed dimensions	`dim_geography` (Type 1 identity) · `dim_date` (monthly-conformed)
Business vault (computed sats)	`cs_geography_affordability` · `cs_geography_affordability_forecast` (the ML measure) · `cs_geography_cost_adjusted_income` · `cs_migration_netflow`
PIT spine	`pit_place_year` — the (geography × year) spine over the union of all measures' coverage; resolves each measure's active version by equi-join
Fact	`fct_place_year` — driven off the PIT, one row per county-year, every measure as a column

The fact is not pinned to any one source's coverage: the PIT spans the union of years any measure covers, so each measure fills the periods it has and is NULL elsewhere. cs_geography_affordability_forecast is the ML forecast, landed back in the vault as a first-class measure (AD-5).

Mermaid source (verbatim, dbt-manifest-generated)

graph LR
    cs_geography_affordability["cs_geography_affordability"]:::cs
    cs_geography_affordability_forecast["cs_geography_affordability_forecast"]:::cs
    cs_geography_cost_adjusted_income["cs_geography_cost_adjusted_income"]:::cs
    cs_migration_netflow["cs_migration_netflow"]:::cs
    dim_date["dim_date"]:::dim
    dim_geography["dim_geography"]:::dim
    fct_place_year["fct_place_year"]:::fact
    pit_place_year["pit_place_year"]:::pit
    s_geography_acs5_income["s_geography_acs5_income"]:::ext
    s_geography_acs5_income --> cs_geography_affordability
    s_geography_zillow_zhvi["s_geography_zillow_zhvi"]:::ext
    s_geography_zillow_zhvi --> cs_geography_affordability
    s_geography_bea_income["s_geography_bea_income"]:::ext
    s_geography_bea_income --> cs_geography_cost_adjusted_income
    h_geography["h_geography"]:::ext
    h_geography --> cs_geography_cost_adjusted_income
    l_geography_cbsa["l_geography_cbsa"]:::ext
    l_geography_cbsa --> cs_geography_cost_adjusted_income
    s_geography_bea_metro_rpp["s_geography_bea_metro_rpp"]:::ext
    s_geography_bea_metro_rpp --> cs_geography_cost_adjusted_income
    s_geography_bea_state_rpp["s_geography_bea_state_rpp"]:::ext
    s_geography_bea_state_rpp --> cs_geography_cost_adjusted_income
    s_migration["s_migration"]:::ext
    s_migration --> cs_migration_netflow
    l_migration["l_migration"]:::ext
    l_migration --> cs_migration_netflow
    dim_geography --> cs_migration_netflow
    h_geography --> dim_geography
    s_geography_gazetteer["s_geography_gazetteer"]:::ext
    s_geography_gazetteer --> dim_geography
    pit_place_year --> fct_place_year
    s_geography_acs5_income --> fct_place_year
    s_geography_pep_population["s_geography_pep_population"]:::ext
    s_geography_pep_population --> fct_place_year
    s_geography_bea_income --> fct_place_year
    cs_geography_cost_adjusted_income --> fct_place_year
    s_geography_zillow_zhvi --> fct_place_year
    cs_geography_affordability --> fct_place_year
    cs_migration_netflow --> fct_place_year
    s_geography_bls_unemployment["s_geography_bls_unemployment"]:::ext
    s_geography_bls_unemployment --> fct_place_year
    s_geography_fhfa_hpi["s_geography_fhfa_hpi"]:::ext
    s_geography_fhfa_hpi --> fct_place_year
    dim_geography --> fct_place_year
    cs_geography_affordability_forecast --> fct_place_year
    s_geography_acs5_income --> pit_place_year
    s_geography_pep_population --> pit_place_year
    s_geography_bea_income --> pit_place_year
    cs_geography_cost_adjusted_income --> pit_place_year
    cs_geography_affordability --> pit_place_year
    cs_geography_affordability_forecast --> pit_place_year
    cs_migration_netflow --> pit_place_year
    s_geography_fhfa_hpi --> pit_place_year
    s_geography_zillow_zhvi --> pit_place_year
    s_geography_bls_unemployment --> pit_place_year
    classDef hub fill:#4a90d9,stroke:#2c5f8a,color:#fff
    classDef link fill:#7ed321,stroke:#5a9216,color:#1b3a08
    classDef bridge fill:#7ed321,stroke:#5a9216,color:#1b3a08
    classDef sat fill:#f5a623,stroke:#b9791a,color:#4e342e
    classDef masat fill:#f5a623,stroke:#b9791a,color:#4e342e
    classDef cs fill:#e8923a,stroke:#a85d12,color:#fff
    classDef pit fill:#b07cd6,stroke:#6c3483,color:#fff
    classDef dim fill:#4a90d9,stroke:#2c5f8a,color:#fff
    classDef fact fill:#c0392b,stroke:#7b241c,color:#fff
    classDef mart fill:#16a085,stroke:#0e6655,color:#fff
    classDef stg fill:#3a3f48,stroke:#21262d,color:#8a8f98
    classDef ext fill:#1c212b,stroke:#3a3f48,color:#8a8f98,stroke-dasharray:3 3

Platinum — use-case marts

Consumption views the dashboards read — weightable, per-place ingredients. Two marts, each built from fct_place_year + dim_geography:

mart_place_profile — every measure plus its national percentile, per county-year (the dashboard's main feed).
mart_affordability_forecast — the affordability history + 2-year forecast with its prediction band.

Mermaid source (verbatim, dbt-manifest-generated)

graph LR
    mart_affordability_forecast["mart_affordability_forecast"]:::mart
    mart_place_profile["mart_place_profile"]:::mart
    fct_place_year["fct_place_year"]:::ext
    fct_place_year --> mart_affordability_forecast
    dim_geography["dim_geography"]:::ext
    dim_geography --> mart_affordability_forecast
    fct_place_year --> mart_place_profile
    dim_geography --> mart_place_profile
    classDef mart fill:#16a085,stroke:#0e6655,color:#fff
    classDef fact fill:#c0392b,stroke:#7b241c,color:#fff
    classDef dim fill:#4a90d9,stroke:#2c5f8a,color:#fff
    classDef ext fill:#1c212b,stroke:#3a3f48,color:#8a8f98,stroke-dasharray:3 3

Why this can't drift. The data-model topology above (and its per-layer Mermaid source) is generated directly from the dbt manifest by a project script — a picture of what is actually built, not a hand-drawn approximation. The project's status, board, and analyst field-guide are generated the same way (from spec frontmatter and dbt .yml metadata). The discipline: a documentation surface that can drift, will — so generate the ones that describe the code from the code.

How this was built — human-directed AI

I architected this platform and directed AI to build it. That is a different thing from "vibe coding" — and the difference is visible in the artifacts, not just asserted. Every feature begins as a spec I author: a plain-language design with a decisions table, where each load-bearing call is tagged [mine] (I decide; the agent executes exactly) or [agent] (I delegate the call to the agent's judgment). The agent builds the spec; I evaluate the delivery against my own design. The receipts below are the ones only a directing human produces.

Receipt 1 — the `[mine]` / `[agent]` decisions table

From the geography-spine feature spec (the project's founding feature). Each row is a design call I made or chose to delegate — recorded before the code existed.

#	Decision	who	the call
D1	The four constitutional architecture decisions	`[mine]`	AD-1..AD-4 — ratified before building
D2	Derive state rows from the county file (no separate ingest)	`[mine]`	Census publishes no standalone state gazetteer file (verified). So state geoid = `left(county.geoid,2)`; area = exact sum of counties; centroid = area-weighted mean; name from a seed.
D3	Satellite hashdiff = name + area; EXCLUDE centroid	`[mine]`	The Census internal point wobbles across vintages without a real boundary change; versioning on it would churn the satellite. Centroid is payload, not a version driver.
D4	A union staging model feeds the satellite	`[agent]→ratified`	`automate_dv.sat()` takes a single source (unlike `hub()`), so the two grains are unioned first. Discovered by the agent at build, reported, and ratified.

Receipt 2 — specs predate the code

The decisions table above existed as a spec before the hub and satellite were built. The working method is spec-driven: no code without an authored spec, and the spec — not the conversation — is the durable record of authorship. The agent's standing contract: "build everything in the Detail section autonomously; execute every [mine] decision as written; use judgment on [agent] ones; if you hit an uncovered load-bearing decision, STOP and ask; verify per the DoD and report reconciliation results." The boundary that keeps ownership honest: an agent may resolve a one-answer data-quality edge and report it; it must STOP on a genuine design fork (a grain choice, a key design, a semantics decision). The line between those is the experiment.

Receipt 3 — the honest ML finding, led with the judgment

The platform includes a predictive measure — a county housing-affordability forecast — built as a leakage-aware backtest that establishes a predictability boundary. The judgment came first: I required every forecast to be backtested against a naive "next year = last year" baseline, with the verdict reported whether or not it flattered the model. The finding (see The ML finding below) is that county-intrinsic affordability is near a random walk at a 1–2 year horizon — a precise, honest negative result, kept as a first-class artifact. A documented "the features don't carry the signal" is a stronger competence signal than a cherry-picked accuracy number.

Receipt 4 — a documented course-correction

The strongest evidence a thinking human was directing the work is a corrected decision, logged. Two examples from the project's decision log:

The PIT correction. My first instinct modeled a fact's reporting grain as a link (a "county-year link"). That was wrong — a link models a relationship that exists in the source; a constructed reporting grain is a Point-in-Time table. Corrected to a PIT, and the reasoning recorded.
The RR-inflation catch. When scoring the roadmap by WSJF, a feature's Risk-Reduction term was caught inflated 9→4 — it was double-counting learning value already in the Value term. Caught, corrected, and turned into a standing anti-pattern note.

The principle. A directed-AI build is credible only with receipts shipped alongside it — a decisions table, specs-before-code, an honest experiment, a logged course-correction. Anyone can claim judgment; these are the artifacts that can only exist if a human was actually directing.

Receipt 5 — the velocity (what directed-AI actually cost in time)

The whole platform — a 4-layer Data Vault 2.0 + Kimball star, ~10 reconciled public sources, an ML forecast with an honest backtest, an interactive dashboard, and generated docs — was built solo, part-time, in roughly two weeks of calendar time. The chart is the raw git record: commits per day, every calendar day in the window (quiet days included — no cherry-picking).

Clustering those commits into work sessions gives a conservative ~38 hours of focused effort across 13 active days — a floor (design time before a commit isn't fully captured). The three peak days account for ~half of it: the long, uninterrupted flow sessions where directed-AI moves fastest. (Full per-day breakdown: the Status page's activity timeline.)

Design decisions (the constitution)

Five cross-cutting decisions govern the whole platform — ratified in a spec before the code that assumes them.

AD-1 — One geography hub, type-qualified key (geo_type, geoid). A single h_geography hub for every geographic level, not separate hubs per grain. A bare 5-digit value collides across types (county 06075 ≠ place ≠ ZCTA ≠ CBSA), so geo_type is part of the business key — the one sanctioned discriminator-in-key (different geo_type = a genuinely different real-world entity).
AD-2 / AD-2a — Hash key order fixed forever, enforced two ways. hk_geography = automate_dv.hash(['geo_type','geoid']), the column order fixed permanently. Enforced by prevention (a macro owns the order so you can't author it wrong) and detection (a relationships FK test from every consumer back to the hub, so you can't ship it wrong).
AD-3 — Epoch / effectivity lives on the crosswalk links, not the hub. The hub is a timeless set of identities; which county belongs to which CBSA in which delineation vintage is an effectivity property of the link. Pin the epoch to the data's reference year so footprints never silently mix.
AD-4 — Apportionment stores raw + factor for non-nesting geographies. ZCTA↔county doesn't nest, so the allocation factor is stored on the crosswalk link satellite — never pre-allocated. Apportionment stays auditable and re-derivable. (Auditability > convenience.)
AD-5 — The ML boundary: train downstream, land predictions back in the vault. Training is a downstream ml/ script reading a mart read-only; predictions re-enter the vault as a cs_*_forecast measure, backtested with error as first-class metadata. The app never sees the model — only its output, conformed to the star like any other measure.

The rule underneath all of it: all business logic lives in the vault / dbt — the app is dials only, zero business logic. There is one place to go for lineage, impact analysis, and evaluation; the dashboard can't grow a shadow business layer.

The ML finding

The question, framed as an experiment: can free, public county-level data forecast where housing affordability is heading — and how well? The deliverable is the honest verdict, not a flattering number.

Two models were tried behind the same forecast measure: a per-county time-series model (v3), then a pooled cross-county gradient-boosted model trained across all counties (v6). Both were graded against a persistence baseline ("next year = last year") on a held-out, leakage-free split.

model	overall MAE	skill vs persistence
pooled GBT (v6)	0.2963	−0.31
per-county Holt (v3)	0.3162	−0.52
persistence (the bar)	0.2262	—

The finding: pooling beat the per-county model (+0.06 skill — pooling specifically helped) but neither beat persistence. County-intrinsic affordability is near a random walk at a 1–2 year horizon — the signal isn't in the county-level features; the rest is macro. This is reported as a first-class result, not hidden. The next architecture, recorded for the next model: predict the persistence residual (forecast = last_value + model(Δ)) so the model corrects inertia rather than re-learning a level it can't beat.

The split is leak-free by construction, not by inspection — every feature is anchored at year A, the target at A + horizon, with a runtime assertion that no training feature comes from a year later than the cutoff. That a backtest can be trusted is the actual deliverable here.

The reasoning trail

An append-only log records the significant decisions and corrected calls across the project's life — the connective tissue between the formal architecture decisions and the per-version retros. A few entries, prioritizing the ones that were later corrected (real reasoning leaves a trail; polished hindsight doesn't):

2026-06-12 — A fact's grain is a PIT, not a link. First instinct: a "county-year link." Corrected: a link models a source relationship; a constructed reporting grain is a Point-in-Time table. fct_place_year drives off pit_place_year, resolving each measure's active version by equi-join, not window functions.
2026-06-15 — WSJF adopted, with the RR-inflation lesson. Ranking the roadmap by WSJF surfaced an anti-pattern: a Risk-Reduction term inflated 9→4, double-counting learning value already in Value. Caught and turned into a standing note.
2026-06-15 — Spec distillation made an explicit version-close step. It was intended per-feature but silently skipped for several versions (moving a file to completed/ ≠ distilling it) — caught, and promoted to a required close step.
2026-06-13 — AD-5, the ML boundary. Ratified when the first forecast was scoped: keeps ML consistent with logic-in-the-vault without pretending training is a transformation.
2026-06-08 — AD-1..4, the geography spine. The founding architecture — one hub, type-qualified key, hash order fixed forever, epoch on the links, apportionment stored raw. Superseded the per-grain-hubs alternative (which would force N hubs + N(N−1) nesting links) and the legacy habit of baking source into keys.

Snapshot as of v6 — 2026-06-16. The diagrams and counts above are a point-in-time view; the project keeps evolving. ← Back to Find Your Happy Place