How this was built
How gazetteer was built
A portfolio walk-through for a technical reviewer: the architecture, the data model, the design decisions, and the human-directed-AI workflow that produced it. Everything below is a curated, point-in-time snapshot as of v6 (2026-06-16) — it reflects the platform at that milestone and will drift as the project evolves; the vintage label keeps it honest. ← Back to the dashboard · New here? Start with the guide → for a plain-language tour of what the tool is and how to read each view.
Gazetteer integrates free, public US place-based statistics — housing values, cost of living, income, population, jobs, health — so the same place can be compared across the same period, despite sources that natively disagree on grain, coverage, key system, cadence, and measure type. The reconciliation is the point. It runs on a 4-layer medallion architecture (DuckDB locally, portable to Snowflake), with Data Vault 2.0 as the integration layer and a Kimball star as the consumption layer.
Architecture
public sources (Census · BEA · BLS · Zillow · IRS · FHFA) → bronze (data lake, Parquet) → silver (Data Vault 2.0 — spine + measures) → gold (star + PIT + business vault) → platinum (use-case marts) → Evidence dashboards
Side loop (AD-5): platinum → ml/ forecast (read-only) → forecast lands back in gold as a measure.
Mermaid source (verbatim from docs/STATUS.md — renders as a diagram once an SVG is committed)
graph LR
SRC["public sources<br/>Census · BEA · BLS · Zillow · IRS · FHFA"]
BRZ["bronze<br/>data lake (Parquet) ✓"]
SLV["silver<br/>Data Vault 2.0<br/>spine + measures ✓"]
GLD["gold<br/>star + PIT + business vault ✓"]
PLT["platinum<br/>use-case marts ✓"]
BI["Evidence<br/>dashboards ✓"]
ML["ml/ forecast ✓<br/>(AD-5 round-trip)"]
SRC --> BRZ --> SLV --> GLD --> PLT --> BI
PLT -.read-only.-> ML
ML -.forecast back to vault.-> GLD
classDef done fill:#238636,stroke:#1a6e2e,color:#fff
classDef next fill:#9e6a03,stroke:#7a5202,color:#fff
classDef future fill:#21262d,stroke:#3a3f48,color:#8a8f98
class SRC,BRZ,SLV,GLD,PLT,BI,ML donePublic sources → bronze (rename/cast only) → silver (Data Vault 2.0: hub, links, satellites — hashing + structure) → gold (Kimball star + business-vault computed satellites — all business logic lives here) → platinum (use-case marts the dashboard reads). ML training reads a mart read-only and lands its forecast back in the vault as just another measure (AD-5, below).
The data model (DV2.0 + Kimball)
Object counts — 21 silver · 8 gold · 2 platinum by type: 1 hub · 3 link · 2 bridge · 15 sat · 4 cs · 1 pit · 2 dim · 1 fact · 2 mart
Silver — raw vault (the Data Vault 2.0 core)
h_geography is the spine; every measure source hangs a satellite off it, and
crosswalk links/bridges relate geographies. One hub, three links, two bridges, and
fifteen source satellites all key to (geo_type, geoid):
| Off the hub | Objects |
|---|---|
| Hub | h_geography (the spine) |
| Crosswalk links | l_geography_cbsa · l_geography_zcta_county · l_migration |
| Bridges | brg_bea_county_fips · brg_tract_2020_2010 |
| Measure satellites | ACS income · BEA income · BEA metro RPP · BEA state RPP · BLS unemployment · FHFA HPI · gazetteer (name/area) · PEP population · PLACES health (county + tract) · tract identity · Zillow ZHVI |
| Link satellites | CBSA membership · ZCTA→county allocation · migration flows |
Mermaid source (verbatim, dbt-manifest-generated — renders as a graph once an SVG is committed)
graph LR
brg_bea_county_fips["brg_bea_county_fips"]:::bridge
brg_tract_2020_2010["brg_tract_2020_2010"]:::bridge
h_geography["h_geography"]:::hub
l_geography_cbsa["l_geography_cbsa"]:::link
l_geography_zcta_county["l_geography_zcta_county"]:::link
l_migration["l_migration"]:::link
s_geography_acs5_income["s_geography_acs5_income"]:::sat
s_geography_bea_income["s_geography_bea_income"]:::sat
s_geography_bea_metro_rpp["s_geography_bea_metro_rpp"]:::sat
s_geography_bea_state_rpp["s_geography_bea_state_rpp"]:::sat
s_geography_bls_unemployment["s_geography_bls_unemployment"]:::sat
s_geography_fhfa_hpi["s_geography_fhfa_hpi"]:::sat
s_geography_gazetteer["s_geography_gazetteer"]:::sat
s_geography_pep_population["s_geography_pep_population"]:::sat
s_geography_places_county["s_geography_places_county"]:::sat
s_geography_places_tract["s_geography_places_tract"]:::sat
s_geography_tract["s_geography_tract"]:::sat
s_geography_zillow_zhvi["s_geography_zillow_zhvi"]:::sat
s_l_cbsa_membership["s_l_cbsa_membership"]:::sat
s_l_zcta_county_allocation["s_l_zcta_county_allocation"]:::sat
s_migration["s_migration"]:::sat
h_geography --> brg_bea_county_fips
h_geography --> l_geography_cbsa
l_geography_cbsa --> s_l_cbsa_membership
h_geography --> l_geography_zcta_county
l_geography_zcta_county --> s_l_zcta_county_allocation
h_geography --> l_migration
l_migration --> s_migration
h_geography --> s_geography_acs5_income
h_geography --> s_geography_bea_income
h_geography --> s_geography_bea_metro_rpp
h_geography --> s_geography_bea_state_rpp
h_geography --> s_geography_bls_unemployment
h_geography --> s_geography_fhfa_hpi
h_geography --> s_geography_gazetteer
h_geography --> s_geography_pep_population
h_geography --> s_geography_zillow_zhvi
h_geography --> s_geography_tract
h_geography --> s_geography_places_tract
h_geography --> s_geography_places_county
h_geography --> brg_tract_2020_2010
classDef hub fill:#4a90d9,stroke:#2c5f8a,color:#fff
classDef link fill:#7ed321,stroke:#5a9216,color:#1b3a08
classDef bridge fill:#7ed321,stroke:#5a9216,color:#1b3a08
classDef sat fill:#f5a623,stroke:#b9791a,color:#4e342e
classDef masat fill:#f5a623,stroke:#b9791a,color:#4e342e
classDef cs fill:#e8923a,stroke:#a85d12,color:#fff
classDef pit fill:#b07cd6,stroke:#6c3483,color:#fff
classDef dim fill:#4a90d9,stroke:#2c5f8a,color:#fff
classDef fact fill:#c0392b,stroke:#7b241c,color:#fff
classDef mart fill:#16a085,stroke:#0e6655,color:#fff
classDef stg fill:#3a3f48,stroke:#21262d,color:#8a8f98Gold — star + business vault
The Kimball star (dim_* → fct_place_year) plus the business vault (cs_*
computed satellites, pit_place_year the version-resolving spine). Business
logic lives here.
| Role | Objects |
|---|---|
| Conformed dimensions | dim_geography (Type 1 identity) · dim_date (monthly-conformed) |
| Business vault (computed sats) | cs_geography_affordability · cs_geography_affordability_forecast (the ML measure) · cs_geography_cost_adjusted_income · cs_migration_netflow |
| PIT spine | pit_place_year — the (geography × year) spine over the union of all measures' coverage; resolves each measure's active version by equi-join |
| Fact | fct_place_year — driven off the PIT, one row per county-year, every measure as a column |
The fact is not pinned to any one source's coverage: the PIT spans the union of
years any measure covers, so each measure fills the periods it has and is NULL
elsewhere. cs_geography_affordability_forecast is the ML forecast, landed back in
the vault as a first-class measure (AD-5).
Mermaid source (verbatim, dbt-manifest-generated)
graph LR
cs_geography_affordability["cs_geography_affordability"]:::cs
cs_geography_affordability_forecast["cs_geography_affordability_forecast"]:::cs
cs_geography_cost_adjusted_income["cs_geography_cost_adjusted_income"]:::cs
cs_migration_netflow["cs_migration_netflow"]:::cs
dim_date["dim_date"]:::dim
dim_geography["dim_geography"]:::dim
fct_place_year["fct_place_year"]:::fact
pit_place_year["pit_place_year"]:::pit
s_geography_acs5_income["s_geography_acs5_income"]:::ext
s_geography_acs5_income --> cs_geography_affordability
s_geography_zillow_zhvi["s_geography_zillow_zhvi"]:::ext
s_geography_zillow_zhvi --> cs_geography_affordability
s_geography_bea_income["s_geography_bea_income"]:::ext
s_geography_bea_income --> cs_geography_cost_adjusted_income
h_geography["h_geography"]:::ext
h_geography --> cs_geography_cost_adjusted_income
l_geography_cbsa["l_geography_cbsa"]:::ext
l_geography_cbsa --> cs_geography_cost_adjusted_income
s_geography_bea_metro_rpp["s_geography_bea_metro_rpp"]:::ext
s_geography_bea_metro_rpp --> cs_geography_cost_adjusted_income
s_geography_bea_state_rpp["s_geography_bea_state_rpp"]:::ext
s_geography_bea_state_rpp --> cs_geography_cost_adjusted_income
s_migration["s_migration"]:::ext
s_migration --> cs_migration_netflow
l_migration["l_migration"]:::ext
l_migration --> cs_migration_netflow
dim_geography --> cs_migration_netflow
h_geography --> dim_geography
s_geography_gazetteer["s_geography_gazetteer"]:::ext
s_geography_gazetteer --> dim_geography
pit_place_year --> fct_place_year
s_geography_acs5_income --> fct_place_year
s_geography_pep_population["s_geography_pep_population"]:::ext
s_geography_pep_population --> fct_place_year
s_geography_bea_income --> fct_place_year
cs_geography_cost_adjusted_income --> fct_place_year
s_geography_zillow_zhvi --> fct_place_year
cs_geography_affordability --> fct_place_year
cs_migration_netflow --> fct_place_year
s_geography_bls_unemployment["s_geography_bls_unemployment"]:::ext
s_geography_bls_unemployment --> fct_place_year
s_geography_fhfa_hpi["s_geography_fhfa_hpi"]:::ext
s_geography_fhfa_hpi --> fct_place_year
dim_geography --> fct_place_year
cs_geography_affordability_forecast --> fct_place_year
s_geography_acs5_income --> pit_place_year
s_geography_pep_population --> pit_place_year
s_geography_bea_income --> pit_place_year
cs_geography_cost_adjusted_income --> pit_place_year
cs_geography_affordability --> pit_place_year
cs_geography_affordability_forecast --> pit_place_year
cs_migration_netflow --> pit_place_year
s_geography_fhfa_hpi --> pit_place_year
s_geography_zillow_zhvi --> pit_place_year
s_geography_bls_unemployment --> pit_place_year
classDef hub fill:#4a90d9,stroke:#2c5f8a,color:#fff
classDef link fill:#7ed321,stroke:#5a9216,color:#1b3a08
classDef bridge fill:#7ed321,stroke:#5a9216,color:#1b3a08
classDef sat fill:#f5a623,stroke:#b9791a,color:#4e342e
classDef masat fill:#f5a623,stroke:#b9791a,color:#4e342e
classDef cs fill:#e8923a,stroke:#a85d12,color:#fff
classDef pit fill:#b07cd6,stroke:#6c3483,color:#fff
classDef dim fill:#4a90d9,stroke:#2c5f8a,color:#fff
classDef fact fill:#c0392b,stroke:#7b241c,color:#fff
classDef mart fill:#16a085,stroke:#0e6655,color:#fff
classDef stg fill:#3a3f48,stroke:#21262d,color:#8a8f98
classDef ext fill:#1c212b,stroke:#3a3f48,color:#8a8f98,stroke-dasharray:3 3Platinum — use-case marts
Consumption views the dashboards read — weightable, per-place ingredients. Two
marts, each built from fct_place_year + dim_geography:
mart_place_profile— every measure plus its national percentile, per county-year (the dashboard's main feed).mart_affordability_forecast— the affordability history + 2-year forecast with its prediction band.
Mermaid source (verbatim, dbt-manifest-generated)
graph LR
mart_affordability_forecast["mart_affordability_forecast"]:::mart
mart_place_profile["mart_place_profile"]:::mart
fct_place_year["fct_place_year"]:::ext
fct_place_year --> mart_affordability_forecast
dim_geography["dim_geography"]:::ext
dim_geography --> mart_affordability_forecast
fct_place_year --> mart_place_profile
dim_geography --> mart_place_profile
classDef mart fill:#16a085,stroke:#0e6655,color:#fff
classDef fact fill:#c0392b,stroke:#7b241c,color:#fff
classDef dim fill:#4a90d9,stroke:#2c5f8a,color:#fff
classDef ext fill:#1c212b,stroke:#3a3f48,color:#8a8f98,stroke-dasharray:3 3Why this can't drift. The data-model topology above (and its per-layer Mermaid source) is generated directly from the dbt manifest by a project script — a picture of what is actually built, not a hand-drawn approximation. The project's status, board, and analyst field-guide are generated the same way (from spec frontmatter and dbt
.ymlmetadata). The discipline: a documentation surface that can drift, will — so generate the ones that describe the code from the code.
How this was built — human-directed AI
I architected this platform and directed AI to build it. That is a different
thing from "vibe coding" — and the difference is visible in the artifacts, not
just asserted. Every feature begins as a spec I author: a plain-language
design with a decisions table, where each load-bearing call is tagged [mine]
(I decide; the agent executes exactly) or [agent] (I delegate the call to
the agent's judgment). The agent builds the spec; I evaluate the delivery against
my own design. The receipts below are the ones only a directing human produces.
Receipt 1 — the [mine] / [agent] decisions table
From the geography-spine feature spec (the project's founding feature). Each row is a design call I made or chose to delegate — recorded before the code existed.
| # | Decision | who | the call |
|---|---|---|---|
| D1 | The four constitutional architecture decisions | [mine] | AD-1..AD-4 — ratified before building |
| D2 | Derive state rows from the county file (no separate ingest) | [mine] | Census publishes no standalone state gazetteer file (verified). So state geoid = left(county.geoid,2); area = exact sum of counties; centroid = area-weighted mean; name from a seed. |
| D3 | Satellite hashdiff = name + area; EXCLUDE centroid | [mine] | The Census internal point wobbles across vintages without a real boundary change; versioning on it would churn the satellite. Centroid is payload, not a version driver. |
| D4 | A union staging model feeds the satellite | [agent]→ratified | automate_dv.sat() takes a single source (unlike hub()), so the two grains are unioned first. Discovered by the agent at build, reported, and ratified. |
Receipt 2 — specs predate the code
The decisions table above existed as a spec before the hub and satellite were
built. The working method is spec-driven: no code without an authored spec,
and the spec — not the conversation — is the durable record of authorship. The
agent's standing contract: "build everything in the Detail section
autonomously; execute every [mine] decision as written; use judgment on
[agent] ones; if you hit an uncovered load-bearing decision, STOP and ask;
verify per the DoD and report reconciliation results." The boundary that keeps
ownership honest: an agent may resolve a one-answer data-quality edge and
report it; it must STOP on a genuine design fork (a grain choice, a key
design, a semantics decision). The line between those is the experiment.
Receipt 3 — the honest ML finding, led with the judgment
The platform includes a predictive measure — a county housing-affordability forecast — built as a leakage-aware backtest that establishes a predictability boundary. The judgment came first: I required every forecast to be backtested against a naive "next year = last year" baseline, with the verdict reported whether or not it flattered the model. The finding (see The ML finding below) is that county-intrinsic affordability is near a random walk at a 1–2 year horizon — a precise, honest negative result, kept as a first-class artifact. A documented "the features don't carry the signal" is a stronger competence signal than a cherry-picked accuracy number.
Receipt 4 — a documented course-correction
The strongest evidence a thinking human was directing the work is a corrected decision, logged. Two examples from the project's decision log:
- The PIT correction. My first instinct modeled a fact's reporting grain as a link (a "county-year link"). That was wrong — a link models a relationship that exists in the source; a constructed reporting grain is a Point-in-Time table. Corrected to a PIT, and the reasoning recorded.
- The RR-inflation catch. When scoring the roadmap by WSJF, a feature's Risk-Reduction term was caught inflated 9→4 — it was double-counting learning value already in the Value term. Caught, corrected, and turned into a standing anti-pattern note.
The principle. A directed-AI build is credible only with receipts shipped alongside it — a decisions table, specs-before-code, an honest experiment, a logged course-correction. Anyone can claim judgment; these are the artifacts that can only exist if a human was actually directing.
Receipt 5 — the velocity (what directed-AI actually cost in time)
The whole platform — a 4-layer Data Vault 2.0 + Kimball star, ~10 reconciled public sources, an ML forecast with an honest backtest, an interactive dashboard, and generated docs — was built solo, part-time, in roughly two weeks of calendar time. The chart is the raw git record: commits per day, every calendar day in the window (quiet days included — no cherry-picking).
Clustering those commits into work sessions gives a conservative ~38 hours of focused effort across 13 active days — a floor (design time before a commit isn't fully captured). The three peak days account for ~half of it: the long, uninterrupted flow sessions where directed-AI moves fastest. (Full per-day breakdown: the Status page's activity timeline.)
Design decisions (the constitution)
Five cross-cutting decisions govern the whole platform — ratified in a spec before the code that assumes them.
- AD-1 — One geography hub, type-qualified key
(geo_type, geoid). A singleh_geographyhub for every geographic level, not separate hubs per grain. A bare 5-digit value collides across types (county06075≠ place ≠ ZCTA ≠ CBSA), sogeo_typeis part of the business key — the one sanctioned discriminator-in-key (differentgeo_type= a genuinely different real-world entity). - AD-2 / AD-2a — Hash key order fixed forever, enforced two ways.
hk_geography = automate_dv.hash(['geo_type','geoid']), the column order fixed permanently. Enforced by prevention (a macro owns the order so you can't author it wrong) and detection (a relationships FK test from every consumer back to the hub, so you can't ship it wrong). - AD-3 — Epoch / effectivity lives on the crosswalk links, not the hub. The hub is a timeless set of identities; which county belongs to which CBSA in which delineation vintage is an effectivity property of the link. Pin the epoch to the data's reference year so footprints never silently mix.
- AD-4 — Apportionment stores raw + factor for non-nesting geographies. ZCTA↔county doesn't nest, so the allocation factor is stored on the crosswalk link satellite — never pre-allocated. Apportionment stays auditable and re-derivable. (Auditability > convenience.)
- AD-5 — The ML boundary: train downstream, land predictions back in the
vault. Training is a downstream
ml/script reading a mart read-only; predictions re-enter the vault as acs_*_forecastmeasure, backtested with error as first-class metadata. The app never sees the model — only its output, conformed to the star like any other measure.
The rule underneath all of it: all business logic lives in the vault / dbt — the app is dials only, zero business logic. There is one place to go for lineage, impact analysis, and evaluation; the dashboard can't grow a shadow business layer.
The ML finding
The question, framed as an experiment: can free, public county-level data forecast where housing affordability is heading — and how well? The deliverable is the honest verdict, not a flattering number.
Two models were tried behind the same forecast measure: a per-county time-series model (v3), then a pooled cross-county gradient-boosted model trained across all counties (v6). Both were graded against a persistence baseline ("next year = last year") on a held-out, leakage-free split.
| model | overall MAE | skill vs persistence |
|---|---|---|
| pooled GBT (v6) | 0.2963 | −0.31 |
| per-county Holt (v3) | 0.3162 | −0.52 |
| persistence (the bar) | 0.2262 | — |
The finding: pooling beat the per-county model (+0.06 skill — pooling
specifically helped) but neither beat persistence. County-intrinsic
affordability is near a random walk at a 1–2 year horizon — the signal isn't
in the county-level features; the rest is macro. This is reported as a
first-class result, not hidden. The next architecture, recorded for the next
model: predict the persistence residual (forecast = last_value + model(Δ)) so
the model corrects inertia rather than re-learning a level it can't beat.
The split is leak-free by construction, not by inspection — every feature is anchored at year A, the target at A + horizon, with a runtime assertion that no training feature comes from a year later than the cutoff. That a backtest can be trusted is the actual deliverable here.
The reasoning trail
An append-only log records the significant decisions and corrected calls across the project's life — the connective tissue between the formal architecture decisions and the per-version retros. A few entries, prioritizing the ones that were later corrected (real reasoning leaves a trail; polished hindsight doesn't):
- 2026-06-12 — A fact's grain is a PIT, not a link. First instinct: a
"county-year link." Corrected: a link models a source relationship; a
constructed reporting grain is a Point-in-Time table.
fct_place_yeardrives offpit_place_year, resolving each measure's active version by equi-join, not window functions. - 2026-06-15 — WSJF adopted, with the RR-inflation lesson. Ranking the roadmap by WSJF surfaced an anti-pattern: a Risk-Reduction term inflated 9→4, double-counting learning value already in Value. Caught and turned into a standing note.
- 2026-06-15 — Spec distillation made an explicit version-close step. It was
intended per-feature but silently skipped for several versions (moving a file
to
completed/≠ distilling it) — caught, and promoted to a required close step. - 2026-06-13 — AD-5, the ML boundary. Ratified when the first forecast was scoped: keeps ML consistent with logic-in-the-vault without pretending training is a transformation.
- 2026-06-08 — AD-1..4, the geography spine. The founding architecture — one hub, type-qualified key, hash order fixed forever, epoch on the links, apportionment stored raw. Superseded the per-grain-hubs alternative (which would force N hubs + N(N−1) nesting links) and the legacy habit of baking source into keys.
Snapshot as of v6 — 2026-06-16. The diagrams and counts above are a point-in-time view; the project keeps evolving. ← Back to Find Your Happy Place
