The first phase of the matching algorithm, Database Preparation, standardizes the coding of database fields to facilitate matching, because different databases store the same fields in different ways. This phase standardizes the coding of dates of birth, genders, addresses, and names, and identifies invalid or missing values, for example in Social Security Numbers. Invalid identifiers are discovered by assessing the frequency of duplicates for each identifier. As stated above, an identifier like month-day-year of birth ought to be distributed predictably across a population, so a value that occurs far more often than expected is likely a placeholder rather than a real value.
Analysis of duplicates also reveals completely duplicate cases. In the Texas voter file, we identified approximately two hundred records in which all information was identical.
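The duplicate checks described above can be illustrated with a minimal Python sketch. It is not the code used in the case; the function names and the `max_count` threshold are assumptions introduced here for illustration only.

```python
from collections import Counter


def flag_overused_values(values, max_count):
    """Return identifier values that occur more than max_count times.

    max_count is a hypothetical threshold; in practice it would be chosen
    from the expected distribution of the identifier in the population.
    """
    counts = Counter(v for v in values if v)  # skip missing values (None or "")
    return {v for v, n in counts.items() if n > max_count}


def drop_exact_duplicates(records):
    """Keep the first copy of each fully identical record, preserving order."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))  # hashable fingerprint of all fields
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

A value flagged by the first function would be treated as missing rather than used for matching; the second function removes records in which all information is identical.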
Standardization of addresses involved extracting just the ZIP5 and street number fields. ZIP5 and street numbers are useful components of an address field because they are numeric and are stored similarly across databases. Other components of addresses, such as street name, street suffix, and apartment number, pose difficulties because they are stored differently in different databases (e.g., First ST, etc.). Standardization of gender coded all indicators of Male to 0 and all indicators of Female to 1.
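As a rough illustration of the address and gender standardization just described, consider the following Python sketch. The raw gender codes listed in `MALE` and `FEMALE` and the function names are assumptions; the source databases may use other codings.

```python
import re

# Hypothetical raw codings; actual databases may use different indicators.
MALE = {"M", "MALE"}
FEMALE = {"F", "FEMALE"}


def zip5(raw):
    """Return the first five digits of a ZIP code, or None if fewer than five."""
    digits = re.sub(r"\D", "", raw or "")
    return digits[:5] if len(digits) >= 5 else None


def street_number(address):
    """Return the leading run of digits in a street address, or None."""
    m = re.match(r"\s*(\d+)", address or "")
    return m.group(1) if m else None


def gender_code(raw):
    """Code Male as 0 and Female as 1; anything else is treated as missing."""
    token = (raw or "").strip().upper()
    if token in MALE:
        return 0
    if token in FEMALE:
        return 1
    return None
```

Only the numeric parts of the address survive, which sidesteps differences in how street names and suffixes are written.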
Standardization of names required the most attention to subtle details. Identifiers that include first names offered a higher level of uniqueness but also more false negatives owing to variations in first names, especially nicknames. Rather than use a name dictionary for first names and nicknames, we simply altered the identification fields.
Standardization of last names required several procedures. First, all names were converted to uppercase letters.
Second, the algorithm removed all apostrophes, hyphens, and other markings. A variant on the name matches accounted for variations in usage of hyphenated or compound last names, such as maiden names used as middle names. In addition to attempting matches on last names, we broke each hyphenated name into two parts and attempted to find matching records using each name. Third, all spaces in last names were removed.
The second part of the algorithm develops multiple identifiers for purposes of record linkage. The algorithm builds identifiers by combining fields related to Address, Date of Birth, Gender, and Name.
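The three last-name procedures described above (uppercasing, removing markings, removing spaces), together with the splitting of hyphenated names, might be sketched in Python as follows. This is an illustrative sketch, not the original implementation; the function names are hypothetical, and the sketch assumes plain ASCII names, so accented letters would be dropped.

```python
import re


def standardize_last_name(raw):
    """Uppercase, strip apostrophes, hyphens, and other markings, drop spaces."""
    name = raw.upper()
    name = re.sub(r"[^A-Z ]", "", name)  # assumption: ASCII letters only
    return name.replace(" ", "")


def last_name_variants(raw):
    """Return the standardized full name plus each part of a hyphenated name."""
    variants = [standardize_last_name(raw)]
    if "-" in raw:
        for part in raw.split("-"):
            std = standardize_last_name(part)
            if std and std not in variants:
                variants.append(std)
    return variants
```

Under these assumptions, a name such as O'Brien becomes OBRIEN, and a hyphenated name yields the joined form plus each of its parts as separate candidates for matching.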
Each identifier corresponds to a particular combination of fields. Importantly, each of these indicators is nearly always uniquely identifying, so long as data are not missing in the component parts. The calculations shown here are similar to ones we performed based on the original Texas file, but these come from the voter list obtained from Texas. The figure shows 17 combinations of components of A-G-D-N, some of which we did use in matching and some of which we did not.
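To make the idea of an identifier concrete, the following Python sketch builds an identifier from a chosen combination of fields and computes the share of records for which that identifier is unique. The field names (`dob`, `gender`, `last_name`), the `|` separator, and the function names are assumptions made for illustration.

```python
from collections import Counter


def build_identifier(record, fields):
    """Concatenate the named fields; return None if any component is missing."""
    parts = [record.get(f) for f in fields]
    if any(p is None or p == "" for p in parts):
        return None
    return "|".join(str(p) for p in parts)


def share_unique(records, fields):
    """Share of non-missing identifiers that occur exactly once in the file."""
    ids = [build_identifier(r, fields) for r in records]
    ids = [i for i in ids if i is not None]
    if not ids:
        return 0.0
    counts = Counter(ids)
    return sum(1 for i in ids if counts[i] == 1) / len(ids)
```

Running `share_unique` over each candidate combination of fields gives the kind of uniqueness comparison the figure reports: combinations with a share near one are suitable as matching identifiers, while those with a low share are not.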
There are a number of potential combinations that are not unique to individuals. This is important because it means that if a person is identified in another database with the same combination, it is very unlikely that this is a different person from the one we are trying to link.
Figure 1.
That is, a match was said to exist for each record in TEAM for which one or more records could be found in a corresponding database.
We utilized one-to-many matches because the matching databases, particularly the motor vehicle databases, contain multiple records per person. Multiple sweeps guard against false negatives arising from typographical errors, missing fields, or inconsistencies. For example, a person may have one last name in one database but a different last name in another, say because of a typo or a name change.
He or she would still be matched on Address, Date of Birth, and Gender. The algorithm will match the record on the identifiers that omit the category of fields containing the discrepancy, thus avoiding nonmatches due to typographical errors, nicknames, missing fields, and other inconsistencies between databases.
A TEAM record is considered matched if a given identifier in TEAM is identical to at least one corresponding identifier in an identification database.
Combinations of fields used as matching identifiers, details.
In addition, the implementation of the matching algorithm sought additional matches using a set of Secondary Sweeps performed on the TEAM records not matched in the Primary Sweeps. The secondary matches include additional variants, such as using middle initials and matching separately on components of compound surnames. For Federal databases, the Primary Sweeps are run against all qualifying Federal records with Texas addresses, while the Secondary Sweeps are run against both Texas-only records and the nationwide universe of the relevant Federal dataset.
By using multiple identifiers, the algorithm is designed to accommodate variations in names, such as nicknames and compound names, as well as typographical errors and missing information. By matching on identifiers constructed from a larger number of categories of fields (three or four), the algorithm exhausts all possible linkages among the identifiers that have a high likelihood of finding unique matches.
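The sweep logic described above can be sketched in Python as follows. The sketch assumes each record is a dictionary, that TEAM records carry a hypothetical `id` field, and that sweeps are given as named tuples of field names; none of these names come from the original implementation.

```python
def make_id(record, fields):
    """Build an identifier from the named fields, or None if any is missing."""
    parts = [record.get(f) for f in fields]
    if any(p is None or p == "" for p in parts):
        return None
    return "|".join(map(str, parts))


def sweep(team, database, fields):
    """Return ids of TEAM records whose identifier appears in the database.

    The match is one-to-many: a TEAM record counts as matched if at least
    one database record carries the same identifier.
    """
    index = {make_id(r, fields) for r in database}
    index.discard(None)  # records with missing components never match
    return {r["id"] for r in team if make_id(r, fields) in index}


def run_sweeps(team, database, primary, secondary):
    """Run all primary sweeps, then secondary sweeps on still-unmatched records."""
    results = {r["id"]: set() for r in team}
    for name, fields in primary.items():
        for rid in sweep(team, database, fields):
            results[rid].add(name)
    unmatched = [r for r in team if not results[r["id"]]]
    for name, fields in secondary.items():
        for rid in sweep(unmatched, database, fields):
            results[rid].add(name)
    return results
```

In this sketch, a TEAM record whose last name differs between databases fails any sweep that includes the name but can still match on a sweep built from Address, Date of Birth, and Gender, which is the protection against false negatives that multiple sweeps are meant to provide.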
For the federal databases, we converted the algorithm to SQL. The federal agencies, in turn, produced for us the TEAM file appended with information about matches and non-matches on each matching sweep.
The fourth phase of the matching process is the Data Gathering phase. The results of all matching sweeps are recorded for each individual TEAM record. Most records matched on all or almost all indicators, but some only matched on one or two indicators. Individuals who matched on only one or two indicators tended to be those who had a typo or a change in one of the identifiers. What about individuals who matched to two different records on two different matches in the same database? We counted this person as having a matching record, but such an occurrence is exceedingly rare.
Thus, two different sweeps will rarely match someone to two different records on a database.
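The Data Gathering step can be pictured as collapsing the per-sweep results into one summary per TEAM record. The Python sketch below assumes a hypothetical input layout in which each sweep maps a TEAM record id to the set of database record ids it matched; the function name and the summary keys are likewise assumptions.

```python
def gather(sweep_hits):
    """Summarize sweep results per TEAM record.

    sweep_hits: {sweep_name: {team_id: set of matched database record ids}}
    Returns {team_id: {"sweeps": [...], "db_ids": set, "matched": bool}}.
    """
    summary = {}
    for sweep_name, hits in sweep_hits.items():
        for team_id, db_ids in hits.items():
            entry = summary.setdefault(
                team_id, {"sweeps": [], "db_ids": set(), "matched": False}
            )
            if db_ids:
                entry["sweeps"].append(sweep_name)
                entry["db_ids"] |= db_ids
                entry["matched"] = True  # any single sweep suffices
    return summary
```

The length of `sweeps` shows whether a record matched on all, most, or only one or two indicators, and `db_ids` makes it possible to inspect the rare cases in which different sweeps point to different database records.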
For security reasons, processing had to be done on a local machine. We, as well as the federal agencies responsible for some of the matching, had a variety of experiences with the time it took to process the matches. The first implementation of the algorithm was performed as part of the case Texas v. Some federal agencies using SQL also reported hours-long computing time.
When conducting the match in the context of the case Veasey v. Perry, we upgraded our software from 2-core to 8-core STATA on a computer with processors to accommodate it. The computer performed one iteration of the matching algorithm in less than 30 minutes, compared with several hours before the upgrade.
That improvement in speed was critical to our ability to validate each step of the algorithm and to train the algorithm to catch any errors, trap special cases, and measure the performance of the matching routine. The implementation of the algorithm developed for the United States in this case matched the entire TEAM database to 10 different state and federal databases.
Table 3. Records matched and not matched to State and Federal databases, excluding deceased.
Of the 13,, records in the TEAM database, 12,, matched to at least one record corresponding to acceptable photo ID issued by the State of Texas, and 6,, records matched to at least one record corresponding to acceptable photo ID issued by the Federal government.
By acceptable, we mean acceptable according to the particulars of the Texas law. Most of the records matched to the Federal databases also matched to a State of Texas identification database. For , records on the TEAM database approximately 4.
Hence, 4. That is, to test the validity of the Primary Matching algorithm, we conducted those matches for cases with SSN9. We further examined the set of cases for which there was a primary match using Address, Date of Birth, Gender, and Name. By comparison, SSN9, which is often relied on as a unique identifier, could match In other words, the primary matches on combinations of Address, Date of Birth, Gender, and Name are almost the functional equivalent to matching on SSN9.
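One way to picture this validation is to treat the SSN9 match as a benchmark for records that have an SSN9 and to compare it with the match on Address, Date of Birth, Gender, and Name. The Python sketch below is only an illustration under that assumption; the function name and, in particular, the choice of denominators are assumptions and may differ from the definitions used in the reported tables.

```python
def error_rates(adgn_matched, ssn_matched):
    """Compare ADGN matches against SSN9 matches for records that have an SSN9.

    adgn_matched, ssn_matched: sets of TEAM record ids matched by each method.
    Returns (false_positive_rate, false_negative_rate), where, by assumption:
      false positive = matched on ADGN but not on SSN9, as a share of ADGN matches
      false negative = matched on SSN9 but not on ADGN, as a share of SSN9 matches
    """
    fp = len(adgn_matched - ssn_matched) / len(adgn_matched) if adgn_matched else 0.0
    fn = len(ssn_matched - adgn_matched) / len(ssn_matched) if ssn_matched else 0.0
    return fp, fn
```

Because the SSN9 field can itself contain errors, rates computed this way would overstate the true error of the ADGN match.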
It is, for example, comparable to the rates reported in the article of Professors Hood and Bullock on the Georgia ID law for 1. The false positive rate is not as low as the 0. Suppose this same person is listed with a driver's license as Jonathan Jonas at First Street. However, because of the discrepancies on the indicators, an election official might disqualify him from voting. For this reason, our estimates of false positive and false negative rates should be taken as upper bounds for the true rates of error.
ADGN: An Algorithm for Record Linkage Using Address, Date of Birth, Gender, and Name
In data such as these, it is always difficult to gauge accurately the rates of true and false negatives and the rates of true and false positives. Table 6. Again, these error rates should be taken as over-estimates as they may include many people for whom no SSN match was possible, say because of errors in the SSN field, or who have multiple errors in the ADGN fields.