The scale of the LIFE-M project makes hand-linking all records prohibitively costly and time-consuming. The project therefore exploits advances in record linkage to create linked longitudinal samples of families and individuals. Building on the IPUMS historical census linking methodology (Goeken et al. 2011), the LIFE-M approach relies on machine learning calibrated with newly hand-linked data. The process has three steps:
- systematically hand linking records for each linking type;
- refining and developing new linking models for each individual link type; and
- implementing cutting-edge computational methods to link records on a very large scale.
(1) Hand-Linking Data
The lack of ground truth for historical data has been a central challenge for assessing the performance of widely used linking methods and for training machine models. As a first step in addressing this critical need for high-quality historical ground truth, the LIFE-M project created large-scale, highly vetted hand-linked samples. While these data are not “ground truth” in the purest sense of the term, they were created to mimic this standard. Trained reviewers hand-linked thousands of birth records to other vital and census records. The table below presents the results of this careful and systematic process.
Link Rates and Number of Hand Links for G2
Notes: Link rates are in bold, and the numbers of links and linkable people are below the link rates. Link rates are calculated as the number of links divided by the number of linkable people in the training sample. Linkable people for the G2-siblings link are the G2 who were in the training sample (for Ohio, a random sample of people born between 1909 and 1920; for North Carolina, a random sample of G2 born between 1915 and 1919) and have non-missing parents’ names (father’s first name, last name, and mother’s first name). Linkable people for the G2-1940 Census link are the G2 who were in the training sample, were born by 1940, and have non-missing names (both first and last names). Linkable people for the G2-death, G2-marriage, and G2-child links are the G2 who are in the training sample and have non-missing names. The numbers of links and linkable people for Males and Females may not sum to the total (All) because some observations have unknown sex.
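As a minimal sketch of the link-rate definition in the notes above (number of links divided by the number of linkable people), the following Python snippet may help. The `Person` record layout, its field names, and the five example people are hypothetical and chosen only for illustration. The linkability rule shown is the simplest one from the notes (non-missing first and last names).

```python
from dataclasses import dataclass


@dataclass
class Person:
    # Hypothetical record layout; the field names are illustrative only.
    first_name: str
    last_name: str
    linked: bool  # True if a reviewer found a link for this person


def is_linkable(p: Person) -> bool:
    """A person counts as linkable when both names are non-missing."""
    return bool(p.first_name.strip()) and bool(p.last_name.strip())


def link_rate(sample: list[Person]) -> float:
    """Number of links divided by the number of linkable people."""
    linkable = [p for p in sample if is_linkable(p)]
    if not linkable:
        raise ValueError("no linkable people in sample")
    links = sum(p.linked for p in linkable)
    return links / len(linkable)


# Made-up example data, not LIFE-M records.
sample = [
    Person("Mary", "Smith", True),
    Person("John", "", False),      # missing last name: not linkable
    Person("Anna", "Jones", False),
    Person("Carl", "Weber", True),
    Person("Ruth", "Hall", False),
]
rate = link_rate(sample)
print(f"link rate: {rate:.2f}")
```

People with missing names are dropped from the denominator before the rate is computed, so a link rate is always relative to the linkable population rather than to the full sample.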
(2) Refining and Developing New Linking Methods
The original LIFE-M proposal planned to use existing machine linking methods to incorporate information. To determine the best linking method in current practice, the LIFE-M team used the newly created hand links to assess each method's performance as measured by error rates (false matches and false non-matches), representativeness, and bias in a regression problem. This exercise proved crucial for improving performance, and the results are published in the Journal of Economic Literature (Bailey et al. 2020). One finding is that, relative to clerical review, the share of false links for automated methods is higher across the board. A byproduct of this work was the development of a Stata ado-file, “autolink.ado”, which is posted in the repository at the Inter-university Consortium for Political and Social Research (ICPSR) to assist other researchers in similar analyses (Bailey and Cole 2019).
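The two error rates named above can be illustrated with a short Python sketch. It assumes, for illustration only, that each set of links is a mapping from a birth-record ID to the ID of the record it was linked to, and that hand links serve as the benchmark. The function name, the record IDs, and the exact denominators are assumptions and are not taken from Bailey et al. (2020) or from autolink.ado.

```python
def error_rates(hand_links: dict[str, str],
                machine_links: dict[str, str]) -> tuple[float, float]:
    """Return (false_match_share, false_non_match_share).

    Both arguments map a birth-record ID to the ID of the linked record;
    records without a link are simply absent from the mapping.
    Assumption: a machine link counts as false whenever it disagrees with
    the hand link, including when reviewers made no link at all.
    """
    if not hand_links or not machine_links:
        raise ValueError("both link sets must be non-empty")
    # Machine links that the hand-linked benchmark does not confirm.
    false_matches = sum(
        1 for rec, target in machine_links.items()
        if hand_links.get(rec) != target
    )
    # Hand links that the machine failed to reproduce.
    false_non_matches = sum(
        1 for rec, target in hand_links.items()
        if machine_links.get(rec) != target
    )
    return (false_matches / len(machine_links),
            false_non_matches / len(hand_links))


# Made-up IDs for illustration.
hand = {"b1": "c1", "b2": "c2", "b3": "c3", "b4": "c4"}
machine = {"b1": "c1", "b2": "c9", "b5": "c5"}
fm_share, fnm_share = error_rates(hand, machine)
print(f"false matches: {fm_share:.2f}, false non-matches: {fnm_share:.2f}")
```

In this toy example the machine confirms one of four hand links and makes two links the reviewers did not, so both shares are high. The share of false matches is the complement of precision, and the share of false non-matches is the complement of recall.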
Bailey et al. (2020) also assessed the representativeness of linked samples and of incorrect links under different linking algorithms. For all methods, the data reject the representativeness of both the linked sample and the incorrect links at the 1-percent level. Many automated methods are more likely to link boys with a higher incidence of misspelled father’s last name, but more likely to link boys with a longer mother’s name. Most methods are more likely to link children with longer names, indicating that these linked records may come from more affluent families. At the same time, some methods are more likely to link individuals with more siblings, while other methods are more likely to link individuals with fewer siblings. In short, even though no linking algorithm appears to generate representative samples, different algorithms yield samples that are non-representative in different ways. Similar conclusions hold for incorrect links, suggesting that different linking algorithms may induce different types of systematic measurement error.
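To make the idea of rejecting representativeness at the 1-percent level concrete, here is a minimal Python sketch. It uses a two-proportion z-test on a single characteristic, which is a simpler stand-in and not necessarily the test used in Bailey et al. (2020). The counts are invented for illustration.

```python
import math


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Two-sided z-test for equal proportions; returns (z, p_value).

    x1 of n1 linked people and x2 of n2 unlinked people have the
    characteristic of interest (for example, a long first name).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts: 300 of 1,000 linked people versus 200 of 1,000
# unlinked people have the characteristic.
z, p_value = two_proportion_z_test(300, 1000, 200, 1000)
reject_at_1_percent = p_value < 0.01
print(f"z = {z:.2f}, reject representativeness: {reject_at_1_percent}")
```

A rejection here says only that linked and unlinked people differ on this one characteristic. It does not say how an analysis based on the linked sample would be biased.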
(3) Scaling Up Linkage at 97 Percent Precision
These results, unknown at the outset of the project, uncovered new opportunities for improving LIFE-M’s linking. The LIFE-M team developed and applied new automated linking methods to combat the high share of false links in other automated methods. We tested the precision and recall of the model on our high-quality hand-linked data by splitting those data in half: we fit models on one half, maximizing recall in that sample, and then measured precision and recall in the hold-out half. As a result of this process, we improved precision at high rates of recall relative to previous methods (see figure below).
Precision and Recall for G2 Birth Certificates to 1940 Census
Notes: The horizontal red line indicates the 0.97 precision level, which is the LIFE-M standard for links.
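The split-sample evaluation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the LIFE-M model. The candidate-pair scores are synthetic stand-ins for the output of a linking model. The rule for choosing a cutoff (the highest-recall score cutoff whose precision in the fitting half is at least 0.97) is one plausible reading of how recall maximization and the 0.97 standard interact. All function and variable names are hypothetical.

```python
import random


def precision_recall(scored, threshold):
    """scored: list of (score, is_true_link); pairs with score >= threshold are linked."""
    tp = sum(1 for s, y in scored if s >= threshold and y)
    fp = sum(1 for s, y in scored if s >= threshold and not y)
    fn = sum(1 for s, y in scored if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pick_threshold(train, min_precision=0.97):
    """Cutoff with the highest training recall among those meeting the precision floor."""
    best = None
    for t in sorted({s for s, _ in train}):
        p, r = precision_recall(train, t)
        if p >= min_precision and (best is None or r > best[1]):
            best = (t, r)
    if best is None:
        raise ValueError("no cutoff reaches the precision floor")
    return best[0]


# Synthetic stand-in for hand-linked candidate pairs: higher scores are
# mostly true links, with deliberate noise near the boundary.
pairs = []
for i in range(200):
    is_true = i >= 100
    if 80 <= i <= 120 and i % 4 == 0:
        is_true = not is_true
    pairs.append((i / 200, is_true))

# Split the hand-linked data in half at random (fixed seed for repeatability).
rng = random.Random(0)
rng.shuffle(pairs)
train, holdout = pairs[:100], pairs[100:]

threshold = pick_threshold(train)
train_p, train_r = precision_recall(train, threshold)
hold_p, hold_r = precision_recall(holdout, threshold)
print(f"cutoff={threshold:.3f} train P/R={train_p:.2f}/{train_r:.2f} "
      f"hold-out P/R={hold_p:.2f}/{hold_r:.2f}")
```

The cutoff is chosen only on the fitting half, so the hold-out precision and recall give an honest estimate of performance on unseen records. Hold-out precision can fall below 0.97 even when the fitting-half precision meets it.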
Machine models allowed us to link millions of people to their vital and census records with 97 percent precision. The table below shows how many links were made through this process. Link rates and numbers for other generations are available here.
Link Rates and Number of Total Links for G2
Notes: Link rates are in bold, and the numbers of links and linkable people are below the link rates. Link rates are calculated as the number of links divided by the number of linkable people. Linkable people for the G2-siblings link are the G2 who were born in the core years (1900-1929) and have non-missing parents’ names (father’s first name, last name, and mother’s first name). Linkable people for the G2-1940 Census link are the G2 who were born by 1940 and have non-missing names (both first and last names). Linkable people for the G2-death, G2-marriage, and G2-child links are the G2 who have non-missing names. The numbers of links and linkable people for Males and Females may not sum to the total (All) because some observations have unknown sex.