Record Linking

The scale of the LIFE-M project makes hand linking all records cost- and time-prohibitive. The LIFE-M project therefore exploits advances in record linkage to create linked longitudinal samples of families and individuals. Building on the IPUMS historical census linking methodology (Goeken et al. 2011), the LIFE-M approach relies on machine learning calibrated with newly created hand-linked data. There are three steps in the process:

  1. systematically hand linking records for each linking type;
  2. refining and developing new linking models for each individual link type; and
  3. implementing cutting-edge computational methods to link records on a very large scale.

(1) Hand-Linking Data

The lack of ground truth for historical data has been a central challenge for assessing the performance of widely used methods and training machine models. As a first step in addressing this critical need for high-quality, historical ground truth, the LIFE-M project created large-scale, highly vetted hand-linked samples. While these data are not "ground truth" in the purest sense of the term, they were created to mimic this standard.

The LIFE-M hand-linking process first cleaned and standardized the data and then used machine learning to create sets of "candidate links" based on machine-generated similarity scores. These sets of candidates were reviewed independently by two trained reviewers, who each decided whether a link was correct or not. To be conservative, LIFE-M reviewers were trained to reject links unless they were completely certain the links were correct. If the two reviewers agreed (either on a link or on no link), this choice was taken as the truth. In cases where the two reviewers disagreed, the records were independently re-examined and coded by three new reviewers.
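The logic of this pipeline fits in a few lines of Python. The sketch below is purely illustrative, not LIFE-M code: difflib stands in for the machine-generated similarity model, and the record fields (first, last, birth_year), the 0.85 threshold, and the three-year blocking window are all assumptions.

    from difflib import SequenceMatcher

    def name_similarity(a, b):
        """Crude string similarity in [0, 1]; a stand-in for the real model."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def candidate_links(record, census_rows, threshold=0.85, year_window=3):
        """Return census rows similar enough to a birth record to merit review."""
        out = []
        for row in census_rows:
            if abs(row["birth_year"] - record["birth_year"]) > year_window:
                continue  # block on approximate birth year
            score = (0.5 * name_similarity(record["first"], row["first"])
                     + 0.5 * name_similarity(record["last"], row["last"]))
            if score >= threshold:
                out.append((score, row))
        return sorted(out, key=lambda t: t[0], reverse=True)

    def adjudicate(review_a, review_b, tiebreak_votes):
        """Two independent reviews; disagreements go to three fresh reviewers."""
        if review_a == review_b:  # agreement (on a link or on no link) is truth
            return review_a
        return max(set(tiebreak_votes), key=tiebreak_votes.count)  # majority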

Before beginning this work, reviewers participated in a rigorous training process that included mentoring by an experienced reviewer and multiple rounds of detailed feedback on the accuracy of roughly 30 hours of their work. In order to work on the LIFE-M review team, reviewers had to reach a 0.95 correlation with a truth dataset.[1]
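The hiring bar can be illustrated with made-up data; the source does not specify how the correlation was computed, so a Pearson correlation of binary link/no-link decisions is assumed here.

    # Hypothetical check of the 0.95-correlation bar (made-up decisions).
    from statistics import correlation  # Python 3.10+

    truth   = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # vetted truth labels (1 = link)
    trainee = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]  # a trainee's decisions

    r = correlation(truth, trainee)  # Pearson r (phi coefficient for 0/1 data)
    print(f"trainee-truth correlation: {r:.2f}",
          "pass" if r >= 0.95 else "more training needed")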

To maintain quality, the LIFE-M reviewers were audited once per week and given feedback on their speed and accuracy. Reviewers also met weekly to discuss common mistakes and difficult cases and to learn about historical-contextual factors affecting the quality of the records (e.g., many men and women misreported their year of birth, but not their day and month of birth, on marriage certificates in order to marry at younger ages).

To validate data quality, the Family History and Technology Lab at Brigham Young University (BYU) performed an independent check of the LIFE-M hand links. BYU compared 1,043 LIFE-M hand links to links created by genealogists and users of FamilySearch.org, which are stored on the FamilySearch Tree. For these 1,043 birth certificates linked to the 1940 Census by both LIFE-M and FamilySearch.org users, the LIFE-M hand links agreed with FamilySearch.org users 96.7 percent of the time, implying a maximum error rate in the LIFE-M hand links of around 3.3 percent.

Table 1 presents the results of this systematic and careful process. Hand links to the 1940 Census include 24,890 boys born from 1881 to 1940, for a match rate of 44 percent. While this rate appears low by modern standards with administrative data, it is much higher than in contemporaneous studies, which average around 25 percent without adjusting for high rates of linking error. In addition, LIFE-M hand links to the 1940 Census include 9,306 G2 girls born between 1881 and 1940—many of whom changed their names at marriage and were omitted from other historical linking projects. LIFE-M hand links include 25,073 matches to death certificates, 19,457 links to marriage certificates, and 31,864 links to G3 children in the birth certificate data.

Table 1. Number of Hand Links (Link Rates) Completed for G2 in the LIFE-M Data

           G2          G2 indiv.     G2 indiv.      Links to       Links to       Links to         Links to
           sampled     + siblings    born < 1940    1940 Census    death certs.   marriage certs.  G3

All                                                 0.31           0.23           0.18             0.29
           30,908      110,077       109,706        34,196         25,073         19,457           31,864
Men                                                 0.45           0.30           0.17             0.29
           16,103      57,098        56,890         24,890         17,200         9,539            16,808
Women                                               0.18           0.15           0.19             0.28
           14,805      52,979        52,816         9,306          7,873          9,918            15,056

Notes: For each group, link rates appear in the first row and the number of links appears below these figures. The first three columns report sample counts only.

(2) Refining and Developing New Linking Methods

The original LIFE-M proposal planned to use existing machine-linking methods to link records at scale. To determine the best linking method in current practice, the LIFE-M team used the newly created hand links to assess each method's performance as measured by error rates (false matches and false non-matches), the representativeness of the linked sample, and bias in a regression problem. This exercise proved crucial for improving performance and is forthcoming in the Journal of Economic Literature (Bailey et al. forthcoming); the results are also summarized below. A byproduct of this work was the development of a Stata ado-file, "autolink.ado", which is posted in the repository at the Inter-university Consortium for Political and Social Research (ICPSR) to assist other researchers in similar analyses (Bailey and Cole 2019).
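Concretely, these error metrics can be tallied by comparing each method's links against the truth links. The sketch below is illustrative Python, not the autolink.ado implementation; the dictionary representation of links is an assumption.

    def error_rates(truth, method, record_ids):
        """truth/method: dicts mapping record_id -> linked_id (None = no link)."""
        false_match = false_nonmatch = links_made = 0
        for rid in record_ids:
            t, m = truth.get(rid), method.get(rid)
            if m is not None:
                links_made += 1
                if m != t:             # wrong target, or no true match exists
                    false_match += 1
            elif t is not None:
                false_nonmatch += 1    # a true link the method missed
        n = len(record_ids)
        return {"match_rate": links_made / n,
                "type_I": false_match / links_made if links_made else 0.0,
                "type_II": false_nonmatch / n}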

Matching Errors

To be as fair as possible to automated linking methods and provide an independent metric of LIFE-M's hand-linking performance, we determined matching errors using a "police line-up" process. The process proceeded in several steps:

  1. If the hand-linking process and the automated method produced the same link, we coded the link as correct.
  2. If the coded link differed between the hand-linking process and the automated method, the record was re-reviewed: two additional reviewers examined the set of candidate links, including the LIFE-M hand link (if one was made), the link made by the automated method (if one was made), and a machine-generated set of close matches. These reviewers did not know which link was chosen by which method and were asked to determine the correct link.

The results of (1) and (2) were taken as the "truth" for the purposes of coding matching errors. This process gives the links from the hand-match and the automated method an equal chance of being chosen, avoiding preferential treatment; the blinding step is sketched below.
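As an illustration only (the helper name and data structures are assumptions, not LIFE-M code), the line-up can be assembled so that reviewers never see which method proposed which candidate:

    import random

    def build_lineup(hand_link, auto_link, close_matches, seed=0):
        """Pool candidates from both methods plus machine-generated near matches."""
        pool = {c for c in (hand_link, auto_link) if c is not None}
        pool.update(close_matches)
        lineup = list(pool)                  # de-duplicate the candidates...
        random.Random(seed).shuffle(lineup)  # ...and blind their provenance
        return lineup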

Figure 2 summarizes the results. The length of each bar represents the match rate, computed as the share of the baseline sample of G2 boys matched to the 1940 Census. The LIFE-M hand review matched 45 percent of the baseline sample. Ferrie's (1996) method matched 28 percent of the baseline sample, and Abramitzky, Boustan, and Eriksson (2014) achieve a higher link rate of 42 percent because their method does not impose Ferrie's (1996) uncommon-name restriction. Feigenbaum's (2016) regression-based method matches 52 percent of the baseline sample, both when using coefficients from his dataset (Iowa) and when using coefficients estimated on a random sample of the LIFE-M links. Abramitzky, Mill, and Pérez's (2018) implementation of Fellegi and Sunter (1969) links 46 percent of the sample when using less conservative cutoffs and 28 percent of the sample with more conservative cutoffs.

The share of the entire sample that is incorrectly linked is displayed in red. Less than 2 percent of the LIFE-M hand links were reversed, consistent with the BYU validation. The column on the right of Figure 2 shows the estimated Type I, or matching, error rate. We compute this rate by dividing the share of the total sample that is incorrect by the match rate. Because the LIFE-M match rate is 44 percent, this implies a Type I error rate of around 4 percent (approximately 0.017/0.44).
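The arithmetic, using the figures quoted above:

    # Type I error = (share of full sample incorrectly linked) / (match rate)
    incorrect_share = 0.017  # share of the entire sample that is a false link
    match_rate = 0.44        # LIFE-M hand-link match rate
    print(f"Type I error rate: {incorrect_share / match_rate:.1%}")  # ~3.9%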

Figure 2. Match Rates and False Links for Selected Linking Methods

Notes: The bars show the performance of different algorithms linking LIFE-M G2 boys to the 1940 Census.

Relative to clerical review, the share of false links for automated methods is higher across the board. These error rates are consistent with Massey (2017), who uses contemporary administrative data linked by Social Security Number as the ground truth. She finds that methods similar to Ferrie (1996) are associated with Type I error rates of 19 to 23 percent. Abramitzky, Boustan, and Eriksson's (2014) refinement of Ferrie (1996) increases match rates to 42 percent, but only half of the added links appear to be correct, and the Type I error rate increases to 32 percent. Feigenbaum's (2016) supervised, regression-based machine-learning model produces a Type I error rate of 34 percent when using the Iowa coefficients; the error rate decreases to 29 percent when the model is estimated using LIFE-M data. Finally, Abramitzky, Mill, and Pérez's (2018) less conservative cutoff results in the highest error rate, at 37 percent. The difference between the conservative and less conservative versions of Abramitzky, Mill, and Pérez (2018) highlights the sensitivity of performance to the choice of parameters.

In terms of missed links, Ferrie (1996) correctly linked the lowest share of the sample, while other methods correctly linked 24 to 29 percent. Feigenbaum's (2016) algorithm correctly linked 34 percent of the LIFE-M sample when estimated with the Iowa data and 37 percent when estimated with the LIFE-M data, yielding the lowest rate of missed links (Type II errors).

Representativeness of Linked Sample and Linking Errors

Our evaluation also assessed the representativeness of the linked samples and of the incorrect links produced by different linking algorithms. For all methods, the data reject the representativeness of both the linked sample and the incorrect links at the 1-percent level. Many automated methods are more likely to link boys with a higher incidence of misspellings in the father's last name and more likely to link boys with longer mothers' names. All methods except Feigenbaum (2016) with estimated coefficients are more likely to link children with longer names, indicating that these linked records may come from more affluent families. At the same time, some methods are more likely to link individuals with more siblings, while other methods are more likely to link individuals with fewer siblings. In short, even though no linking algorithm appears to generate representative samples, different algorithms yield samples that are non-representative in different ways. Similar conclusions hold for the incorrect links, suggesting that different linking algorithms may induce different types of systematic measurement error. Building on these results, Bailey, Cole, and Massey (2019) suggest using inverse propensity-score reweighting to improve the representativeness of machine-linked samples.
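A minimal sketch of that reweighting idea, with implementation details assumed rather than taken from Bailey, Cole, and Massey (2019): fit the probability of being linked given observables on the full sample, then weight each linked record by the inverse of its estimated link propensity so the linked sample resembles the full sample on those observables.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def ipw_weights(X_full, linked):
        """X_full: (n, k) covariates for all records; linked: (n,) 0/1 flags."""
        model = LogisticRegression(max_iter=1000).fit(X_full, linked)
        p_hat = model.predict_proba(X_full)[:, 1]    # estimated link propensity
        w = np.where(linked == 1, 1.0 / p_hat, 0.0)  # weight linked records only
        return w / w[linked == 1].mean()             # normalize to mean one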

In summary, existing methods are useful for linking large scale data, but they tend to introduce errors that could be consequential for inference. To demonstrate the combined consequences of linking errors and non-representativeness, Bailey et al. (forthcoming) show that historical rates of intergenerational mobility may be attenuated by up to 20 percent when using existing methods.

(3) Scaling Up Linkage at 90-, 95- and 97-Percent Precision

These results—unknown at the outset of the project—uncovered new opportunities for improving LIFE-M's database by developing and refining machine-linking methods. Together with Eytan Adar (University of Michigan School of Information) and Jared Murray (University of Texas at Austin), the LIFE-M team developed new automated linking methods (Murray and Bailey 2020) and is beginning to apply them in conjunction with our high-quality, hand-linked data. Presently, we are pleased to report that our revised methods and carefully created hand links have been combined to machine-link 479,139 boys (G2 birth certificates) to the 1940 Census at 97-percent precision, 591,813 at 95-percent precision, and 759,059 at 90-percent precision. In addition, we have linked 396,822 boys (G2 birth certificates) to their death certificates at 97-percent precision and 541,997 boys to their death certificates at 90-percent precision. These efforts are ongoing and should produce an even higher-quality, larger-scale LIFE-M database.
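One way to hit a precision target of this kind (an assumed procedure, not necessarily LIFE-M's exact method) is to score candidate links, validate them against hand-linked labels, and choose the lowest score cutoff whose validation precision still meets the target:

    def threshold_for_precision(scored_validation, target):
        """scored_validation: list of (score, is_correct); returns (cutoff, n)."""
        ranked = sorted(scored_validation, key=lambda t: t[0], reverse=True)
        best, correct = None, 0
        for i, (score, ok) in enumerate(ranked, start=1):
            correct += ok
            if correct / i >= target:
                best = (score, i)  # largest link set still meeting the target
        return best

    links = [(0.99, 1), (0.97, 1), (0.93, 1), (0.90, 0), (0.88, 1), (0.80, 0)]
    print(threshold_for_precision(links, 0.95))  # -> (0.93, 3): cutoff, links kept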

(4) References

Abramitzky, Ran, Leah Boustan, and Katherine Eriksson. 2014. "A Nation of Immigrants: Assimilation and Economic Outcomes in the Age of Mass Migration."  Journal of Political Economy 122 (3):467-506.

Abramitzky, Ran, Roy Mill, and Santiago Pérez. 2018. "Linking Individuals Across Historical Sources: a Fully Automated Approach."  National Bureau of Economic Research Working Paper Series No. 24324. doi: 10.3386/w24324.

Bailey, Martha, Connor Cole, and Catherine G. Massey. 2019. "Simple Strategies for Improving Inference with Linked Data: A Case Study of the 1850-1930 IPUMS Linked Representative Historical Samples."  Historical Methods: A Journal of Quantitative and Interdisciplinary History. doi: 10.1080/01615440.2019.1630343.

Bailey, Martha J., and Connor Cole. 2019. "Autolink.ado." accessed 2019-06-13. http://doi.org/10.3886/E110164V1.

Bailey, Martha J., Connor Cole, Morgan Henderson, and Catherine G. Massey. forthcoming. "How Well Do Automated Linking Methods Perform in Historical Samples? Evidence from New Ground Truth."  Journal of Economic Literature. Available at http://www-personal.umich.edu/~baileymj/Bailey_Cole_Henderson_Massey.html.

Feigenbaum, James J. 2016. "A Machine Learning Approach to Census Record Linking." http://scholar.harvard.edu/files/jfeigenbaum/files/feigenbaum-censuslink.pdf?m=1423080976. Accessed March 28, 2016.

Ferrie, Joseph P. 1996. "A New Sample of Males Linked from the 1850 Public Use Micro Sample of the Federal Census of Population to the 1860 Federal Census Manuscript Schedules."  Historical Methods 29 (4):141-156.

Massey, Catherine G. 2017. "Playing with matches: An assessment of accuracy in linked historical data."  Historical Methods: A Journal of Quantitative and Interdisciplinary History:1-15. doi: 10.1080/01615440.2017.1288598.

Murray, Jared, and Martha Bailey. 2020. "The Highlander Probability Model: Power and Precision from Imposing Constraints in One-to-One Matching."  University of Michigan Working Paper.

 

[1] This truth dataset has been vetted by multiple individuals for accuracy. The cases in this truth dataset were selected to test trainee reviewers' knowledge and decision-making across a variety of linking problems.