@routineactivity

Record linkage made easy

Merging multiple datasets and gaining a view of unique persons in administrative data is a regular obstacle for data professionals. But with Splink that process can be fast, accurate and scalable.

AI image Person Link Network

We often need to collate a single-person view in criminal justice and law enforcement. It may be to understand a person’s journey through the system, identify a trajectory or help understand overall risk and harm.

With legacy systems, a person’s records may be spread across multiple databases (arrests, intelligence, crimes, stops), often with no unique ID attached to that person. Their name, date of birth or other identifiers may be affected by inconsistent data entry (different spellings, exclusion or inclusion of middle names, a miskeyed digit in the date of birth).

Newer systems that claim a ‘golden thread’ by the presence of a unique identifier across databases are not immune from duplicate creation either. Moving addresses, changing names, and providing incorrect information (knowingly or mistakenly) can lead to duplication. Providing false particulars to obfuscate and evade law enforcement sanctions can also lead to problematic data.

I’d dealt with this previously using different string matching techniques: cosine, Jaccard, Soundex and Levenshtein. I created a short notebook here using Premier League footballer names and ID numbers comparing these methods.
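To give a flavour of how these four measures behave, here is a minimal sketch using only the Python standard library. It is not the code from the notebook: the function names, the choice of character bigrams for Jaccard and cosine, and the example names are all my own assumptions for illustration.

```python
from collections import Counter
from math import sqrt


def bigrams(text):
    """Split a lower-cased string into overlapping character pairs."""
    s = text.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]


def levenshtein(a, b):
    """Minimum number of single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Shared bigrams divided by all distinct bigrams (0 to 1)."""
    sa, sb = set(bigrams(a)), set(bigrams(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Cosine similarity between bigram count vectors (0 to 1)."""
    ca, cb = Counter(bigrams(a)), Counter(bigrams(b))
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = (sqrt(sum(v * v for v in ca.values()))
            * sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


# Standard American Soundex letter groups.
_SOUNDEX = {ch: str(d) for d, group in enumerate(
    ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1) for ch in group}


def soundex(name):
    """Four-character phonetic code: first letter plus three digits."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _SOUNDEX.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":            # H and W do not separate repeated codes
            continue
        digit = _SOUNDEX.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit              # vowels reset prev to ""
    return (code + "000")[:4]


# Illustrative names only, not taken from the notebook's data.
a, b = "Jonathan Smith", "Jonathon Smyth"
print(levenshtein(a, b), jaccard(a, b), cosine(a, b))
print(soundex("Smith"), soundex("Smyth"))
```

Each measure picks up a different kind of error: Levenshtein counts keystroke-level edits, the bigram measures are more forgiving of reordered tokens, and Soundex ignores vowels entirely, so it catches spelling variants that sound alike.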


Different string matching techniques


I wasn’t aware of Splink until last year when a colleague introduced it to me. It is nothing short of amazing. The documentation is great and there are a variety of options that you can walk through over at the Splink GitHub page. Designed to be fast enough to link 100 million records, I was pleased to find I could de-dupe and assign unique identifiers to a sample dataset of over 400,000 records in less than 20 seconds.
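For readers new to the idea, the general shape of a de-duplication job (block records into candidate groups, compare pairs within each group, then cluster matches under one identifier) can be sketched in plain Python. To be clear, this is not Splink's API or its method: Splink is a probabilistic linkage library, whereas this is a simplified rule-based stand-in, and the records, field names, threshold and helper functions are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Made-up records for illustration only.
records = [
    {"id": 1, "first_name": "John", "surname": "Smith", "dob": "1985-03-12"},
    {"id": 2, "first_name": "Jon", "surname": "Smith", "dob": "1985-03-12"},
    {"id": 3, "first_name": "John", "surname": "Smyth", "dob": "1985-03-12"},
    {"id": 4, "first_name": "Mary", "surname": "Jones", "dob": "1990-07-01"},
    {"id": 5, "first_name": "Mary", "surname": "Jones", "dob": "1990-07-01"},
    {"id": 6, "first_name": "Alan", "surname": "Brown", "dob": "1985-03-12"},
]


def similar(a, b, threshold=0.75):
    """Fuzzy string comparison; the 0.75 threshold is an arbitrary choice."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def is_match(r1, r2):
    """Toy rule: both name fields must be similar."""
    return (similar(r1["first_name"], r2["first_name"])
            and similar(r1["surname"], r2["surname"]))


def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def assign_person_ids(rows):
    """Return {record id: person id}, using the smallest record id per cluster."""
    parent = {r["id"]: r["id"] for r in rows}

    # Blocking: only compare records that share a date of birth,
    # which avoids comparing every record with every other record.
    blocks = defaultdict(list)
    for r in rows:
        blocks[r["dob"]].append(r)

    for block in blocks.values():
        for r1, r2 in combinations(block, 2):
            if is_match(r1, r2):
                a, b = find(parent, r1["id"]), find(parent, r2["id"])
                if a != b:
                    parent[max(a, b)] = min(a, b)

    return {rid: find(parent, rid) for rid in parent}


person_ids = assign_person_ids(records)
print(person_ids)
```

Blocking on an exact date of birth would miss the miskeyed-digit case mentioned earlier, which is one reason a real linkage job typically combines several blocking rules rather than relying on one.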

I’ve made available a short worked example using Splink’s ‘Quick and Dirty Persons’ model which you can find at this link (data used at this link). Apologies, that is a lot of links so far…

Overview

Further reading


I originally posted this on Medium