Knight Frank Data Platform

The problem

Knight Frank’s analysts needed to cross-reference roughly 7.4M registered leasehold records across England and Wales: who owns what, when leases expire, and how each property looks once you add floor area, energy rating and rateable value. The source data sits in separate public and licensed datasets (HM Land Registry leases, CCOD/OCOD company ownership, EPC, VOA, ONS geography, postcodes), in different shapes, refreshed on different schedules. None of it lines up on its own.

What I built

I authored the architecture and led delivery on a small team. A colleague co-developed the platform with me; I owned the data-platform design and the lease-expiry capability end to end, from pipeline through API to the map.

A repeatable ingestion backbone. Bruin-orchestrated Python and SQL pipelines load each dataset the same way into a PostgreSQL medallion warehouse (raw to staging to canonical), with SCD2 history, monthly full snapshots reconciled against change-only deltas, and data-quality gates that block stale or failing data before it reaches the warehouse. New datasets onboard through the same pattern rather than as one-off scripts.
A lease matcher that resolves entities deterministically. It reconciles 7.4M leases against 2.9M proprietor records using five labelled address-and-postcode strategies, then enriches leases against EPC and VOA via UPRN-exact joins plus a postcode-scoped token matcher. A single shared address-pair source of truth means the matcher and the property-page lookup cannot drift apart.
A five-tier national map. Region, council and LSOA choropleths, clustered points, then individual properties, served as MapLibre vector tiles by a Martin tile server reading an Azure Postgres read replica. The per-tile payload stays near-constant from country view down to street level.

The hard part

The token matcher. UPRN-exact joins only get you so far, so I built an inverted index keyed on (postcode, address-first-line token): postcode blocks the search, shared address tokens decide the match, and a match needs at least one real token (postcode-only is excluded). Numeric building ranges expand (52-56 becomes {52..56}); street words under four characters and a stopword set drop out. It evaluates on the order of hundreds of millions of candidate pairs and scores each match high, medium or low. There were no labelled ground-truth pairs, so I used confidence bands and spot-checks rather than claiming precision figures I could not stand behind.

Outcome

The platform’s patterns were proved with two stakeholder apps: a sanctions-screening tool first (deliberately, a small clean dataset to validate the ELT backbone), then the larger lease-expiry app. The first production MVP of the lease app landed in about three months. Off free public data, lease-to-owner matching reached the leases the source data actually supports, and combined commercial lease coverage from EPC and VOA enrichment came out around 66%.

One note on the sanctions app: its AI-generated reports use Azure OpenAI, but the matching engine itself is Yente and OpenSearch, not an LLM. The screening decisions are deterministic and explainable, which is the point.