Data Structure

Until recently, most of the records in LIFE-M were handwritten and stored in archives. A nonprofit genealogical website is digitizing these records and making them publicly available without charge. These microdata include over 60 million births for 19 states, 70 million deaths for 41 states, and 55 million marriages for 47 states. Thanks to ongoing data transcription efforts, the number of digitized records is rapidly increasing.

LIFE-M's linkage of millions of vital records with census data will create a large-scale intergenerational and longitudinal database, containing census information as well as measures of health, family background, early life context and geography. We create weights to build representative samples of both men and women. Unprecedented sample sizes include large samples of understudied populations such as women, racial/ethnic minorities, and immigrant families.

Intergenerational Structure of LIFE-M Data

Figure 1 provides date ranges to illustrate the cohort and generational structure. LIFE-M samples from birth certificates for generation 2 (G2), who were born from around 1900 to 1930. Then, the project links to their parents (generation 1, G1) and grandparents (generation 0, G0) as well as to G2's children (generation 3, G3). Dates are approximate, but G0 is generally born before 1860 (they are contemporaries of the Union Army cohorts). G1 was born 1870-1899, and G3 was born from 1930 forward (contemporaries of the Health and Retirement Study cohorts).

Figure 1. LIFE-M Generational Structure

generational structure diagram

Notes: G2 is the core sample of infant birth certificates for which LIFE-M constructs intergenerational and longitudinal data.

Linking Methodology

Linking Variables

One common problem with name-linked samples in the census is that some names have many exact (or very similar) matches. Multiple matches have been so problematic that past work has eliminated common names entirely from samples to be linked (Ferrie 1996, Ruggles 2002). Compounding this problem is age heaping or misreporting in the census (e.g., there are far too many 50 and 60 year olds relative to 51 and 63 year olds), a problem particularly acute among blacks (Elo and Preston 1994). Another example is lying about one's age to marry, a common practice through the 1970s to circumvent minimum age at marriage requirements (Blank et al. 2009).

Rich variables in vital records permit LIFE-M to reduce multiple matches (even for common names) and often identify and correct such misreporting:

  1. Birth, marriage, and death records are official documents and contain full names (first, middle and last). Other sources, such as the census, may use middle names as "first names" or nicknames.
  2. Birth records contain multiple observations for parents who have more than one child. Because less educated and foreign-born populations tended to have more children, birth records provide more observations on parents most likely to have name misspellings (due to illiteracy, a non-Anglophone name, or English as a second language). Multiple observations for the same parents allow the cleaning procedure to identify with great accuracy the parents "Cashus Atherton married to Carrie Lavigne" as a misspelling of "Cassius Atherton married to Carrie Lavigne" (the latter string appears on four birth records from the same small Vermont birth town). Census data, in contrast, contain one observation per person.
  3. Birth and marriage records provide a cross-walk between women's birth and married names. Vital birth records contain the mother's birth name (i.e., maiden name) in 86% of cases as well as her married name (given by her spouse's birth name, which is also on the record). Marriage records supplement this information (when available), but marriage registration is incomplete until the 1970s.
  4. Birth, marriage, and death records contain rich information on family structure. In addition to the full name and exact date and place of birth of the individual associated with the event, these records often contain information on the individual's parents' full names (including in some cases mother's birth name and parents' birthplace), the event date, and the event location.
  5. Death records allow a better understanding of reasons for failure to match individuals with census or marriage records. They also can reconcile frequently-reported age differences in the census.
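
The idea of using repeated observations of the same parents to correct misspellings can be sketched as follows. This is a minimal illustration, not the project's actual cleaning procedure: the function names, the similarity threshold of 0.85, the town label "Town A", and the use of Python's difflib for string similarity are all assumptions made for the example.

```python
from collections import Counter
from difflib import SequenceMatcher


def similar(a, b, threshold=0.85):
    """True if two strings are close under difflib's ratio (assumed metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def canonical_spellings(records, threshold=0.85):
    """Map each (town, parents) string to the most frequent similar
    spelling observed in the same town.

    records: list of (town, parents_string) tuples, one per birth record.
    """
    counts = Counter(records)
    mapping = {}
    for (town, name), n in counts.items():
        best_name, best_n = name, n
        for (other_town, other_name), m in counts.items():
            # Only adopt a spelling that is strictly more frequent.
            if other_town == town and m > best_n and similar(name, other_name, threshold):
                best_name, best_n = other_name, m
        mapping[(town, name)] = best_name
    return mapping


# Illustrative data: the frequent spelling appears on four records.
good = "Cassius Atherton married to Carrie Lavigne"
bad = "Cashus Atherton married to Carrie Lavigne"
other = "John Example married to Mary Sample"  # hypothetical unrelated couple
records = [("Town A", good)] * 4 + [("Town A", bad), ("Town A", other)]
mapping = canonical_spellings(records)
```

In this sketch the rare spelling is mapped to the frequent one only when both appear in the same town, which mirrors the role that birth town plays in the example above.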

Table 2 compares the LIFE-M linking variables (characteristics that should theoretically not change over time) for the state of Ohio with linking variables used by IPUMS in its linked census samples. For all cases but race, linking variables in LIFE-M are at least as detailed as (and often more detailed than) census variables. Notably, the addition of parents' full names (including mother's birth name), exact date of birth, and exact place of birth (town/county) should significantly enhance LIFE-M's ability to generate unique, high quality matches. LIFE-M vital records do not, however, contain race (this information will be added by linking to census variables). We do not view this as particularly problematic, because the coding of race may change over time due, perhaps, to changes in cultural perceptions or changes in individual identity. Overall, Table 2 provides a strong rationale that the linking variables in LIFE-M perform at least as well as census linking variables.

Table 2. Comparison of Immutable Characteristics in LIFE-M and IPUMS Historical Samples

Notes: F=Father, M=Mother. IPUMS linkage variables are taken from Ruggles (2002: Table 1). *Not available in some collections/years.

Linking sequence

Figure 2 provides an overview of LIFE-M's linking process. The first step of the process is to reconstitute birth and marriage families of the late 19th and early 20th century birth cohorts (G2) (Figure 2, arrow 1). This requires linking birth records (G2) to one another using parents' full names (G1) and other information such as parents' birthplaces (when available). We also examine records with only one parent to identify cases of parent deaths and remarriage.
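
A minimal sketch of this first step, assuming each birth record has been parsed into a dictionary, groups children by their parents' names. The field names (child, father, mother_birth_name) and the simple normalization are hypothetical choices for illustration; the actual procedure also uses parents' birthplaces and cleaned name spellings.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase and collapse whitespace; empty string if the name is missing."""
    return " ".join((name or "").lower().split())


def reconstitute_families(births):
    """Group children whose birth records share the same parents' names."""
    families = defaultdict(list)
    for record in births:
        key = (normalize(record.get("father")),
               normalize(record.get("mother_birth_name")))
        families[key].append(record["child"])
    return dict(families)


def one_parent_families(families):
    """Families with a missing parent name, to be examined separately."""
    return {key: kids for key, kids in families.items() if "" in key}


# Made-up records for illustration only.
births = [
    {"child": "Ada", "father": "John  Example", "mother_birth_name": "Mary Sample"},
    {"child": "Ben", "father": "john example", "mother_birth_name": "Mary Sample"},
    {"child": "Cal", "father": None, "mother_birth_name": "Jane Doe"},
]
families = reconstitute_families(births)
single = one_parent_families(families)
```

Records with only one named parent are kept under their own key rather than dropped, so they can be reviewed for parent deaths and remarriage.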

Figure 2. LIFE-M Linkage Procedure from Vital Statistics to Censuses and Other Datasets


Notes: G0: born <1860 (~UA cohorts); G1: born 1870-1899; G2: born 1900-1929; G3: born 1930-1950 (~HRS cohorts). Planned links to military records and ship manifests omitted for space reasons.

The birth certificates provide links for at least two generations. Also, G2 can be linked to their own children (G3), because birth records contain mother's birth names. This step allows the reconstruction of two to three generations of interrelated families. In addition, the resulting family sizes are compared to census tabulations to examine data quality.

Our second step (arrow 2) is to link marriage records by bride and groom name, exact date of birth (allowing for over-reporting of age; Blank et al. 2009), and place of birth (when available in the collection). Although 90 percent of women born in this period were ever married (Bailey et al. 2014), marriage registration was highly incomplete and less than a full match rate is expected. This step can also be completed for some of G1 and G3.
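
One way to allow for over-reported age when comparing a birth record with a marriage record is an asymmetric tolerance on the stated birth year. The sketch below is illustrative only: the field names, the exact-name comparison, and the three-year tolerance are assumptions, not parameters taken from the project.

```python
def marriage_candidates(birth, marriages, max_overstated_years=3):
    """Return marriage records with a matching name whose stated birth year
    equals the birth-record year or is up to max_overstated_years earlier
    (an earlier stated year means the person claimed to be older)."""
    matches = []
    for m in marriages:
        gap = birth["birth_year"] - m["stated_birth_year"]
        same_name = m["name"].lower() == birth["name"].lower()
        if same_name and 0 <= gap <= max_overstated_years:
            matches.append(m)
    return matches


# Made-up records for illustration only.
birth = {"name": "Ann Example", "birth_year": 1905}
marriages = [
    {"id": 1, "name": "Ann Example", "stated_birth_year": 1903},  # claimed 2 years older
    {"id": 2, "name": "Ann Example", "stated_birth_year": 1906},  # claimed younger
    {"id": 3, "name": "Ann Example", "stated_birth_year": 1899},  # gap too large
]
candidates = marriage_candidates(birth, marriages)
```

The tolerance is one-sided because the misreporting described here runs in one direction: people overstated their age to meet minimum age requirements.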

The third step is to link G2 to their grandparents (G0) using the 1900 and 1880 censuses (arrow 3). Parents' (birth or married) names (G1) are linked to the 1900 census names, which provide key information on birthplace, age, and race. Next, we link G1 to the 1880 census using their names only or names in addition to ages, birthplace, and race (obtained from the 1900 link). This step connects G2 to G0 and is important because it allows for the addition of G1's early life family conditions, including G0 ancestry/heritage, economic circumstances such as occupation, race, and address.

The fourth step links four generations (G0, G1, G2, G3) to the full-count 1940 census (arrow 4). This step uses full names (including birth and married names of women), exact birth dates/age, and birthplace. The 1940 Census is the first census to include rich information on educational attainment, wages and salary, and many employment outcomes. This link is possible for only some of G0 (many will have passed away before 1940), but for most of G1, for G2 as adults (in their marriage families), and for G3 as children (in birth families).
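
The logic of a census link that avoids ambiguous matches can be sketched as a filter followed by a uniqueness check. This is a simplified illustration under stated assumptions: the field names, the two-year age tolerance, the 0.9 name-similarity cutoff, and the use of difflib are hypothetical and are not the project's matching rules.

```python
from difflib import SequenceMatcher


def name_score(a, b):
    """Similarity of two names in [0, 1] using difflib (assumed metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_to_census(person, census, max_age_gap=2, min_name_score=0.9):
    """Return the single census row compatible with the person, or None
    when there is no candidate or more than one candidate."""
    candidates = [
        row for row in census
        if row["birthplace"] == person["birthplace"]
        and abs(row["birth_year"] - person["birth_year"]) <= max_age_gap
        and name_score(row["name"], person["name"]) >= min_name_score
    ]
    return candidates[0] if len(candidates) == 1 else None


# Made-up records for illustration only.
census = [
    {"id": 1, "name": "Amos Example", "birth_year": 1876, "birthplace": "Vermont"},
    {"id": 2, "name": "Amos Example", "birth_year": 1850, "birthplace": "Vermont"},
    {"id": 3, "name": "John Sample", "birth_year": 1880, "birthplace": "Ohio"},
    {"id": 4, "name": "John Sample", "birth_year": 1881, "birthplace": "Ohio"},
]
unique = link_to_census(
    {"name": "Amos Example", "birth_year": 1875, "birthplace": "Vermont"}, census)
ambiguous = link_to_census(
    {"name": "John Sample", "birth_year": 1880, "birthplace": "Ohio"}, census)
```

Returning no link when several candidates remain reflects the multiple-match problem for common names; additional variables such as parents' names are what would break such ties.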

Importantly, the 1940 census allows cross-validation of the linkages in steps 1 and 2 of Figure 2 using birthplace (state or county), age, children born (sample line), age at marriage (sample line), parents' birthplace (sample line), and spouse name (sample-line respondent variables are indicated with * in Figure 2). It also links the names of some children (G3), including their birthplace and siblings, which can be compared to birth records in step 1. Because G1 and G2 can also be linked by name to the now fully indexed 1940 census (arrow 4), this step links education and wages to the parents of G3, who are observed as children in 1940.

The final step (arrow 5) is to link G0-G3 to death records. The linking variables are full birth and/or married names, exact date of birth, and parents' names and place of birth when available in the collection. Almost all of G0-G2 will have died in the time span covered by the death records, as will many of G3. Because death records span almost the entirety of the 20th century for most collections, we will observe longevity for at least three generations. We also link infant deaths to parents' names to fill in missing birth records, because many infant deaths were not recorded as births; this helps us reconstitute families further.


The final linked set of records will not be representative of the population from which they are drawn (see Bailey, Cole, Henderson and Massey forthcoming). To address this, LIFE-M creates weights for the fully linked LIFE-M data (for each dataset to which it links as well as for all combinations of linked data), which adjust the samples for under- or over-representation of certain subgroups or characteristics. See Bailey, Cole and Massey (forthcoming) for more details on this procedure.
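
To illustrate how weights can correct for under- or over-representation, the sketch below computes simple cell-based (post-stratification) weights: each linked record receives the ratio of its cell's population share to its cell's share of the linked sample. This is a generic textbook approach shown under stated assumptions, not the procedure documented in the papers cited above; the cell labels and counts are made up.

```python
from collections import Counter


def cell_weights(population_cells, linked_cells):
    """Weight per cell = population share / linked-sample share.

    population_cells: cell label for every record in the reference population.
    linked_cells: cell label for every successfully linked record.
    """
    pop = Counter(population_cells)
    linked = Counter(linked_cells)
    n_pop = sum(pop.values())
    n_linked = sum(linked.values())
    return {
        cell: (pop[cell] / n_pop) / (linked[cell] / n_linked)
        for cell in linked
    }


# Made-up example: cell "B" is under-represented among linked records.
population = ["A"] * 50 + ["B"] * 50
linked = ["A"] * 30 + ["B"] * 10
weights = cell_weights(population, linked)
weighted_a = 30 * weights["A"]
weighted_b = 10 * weights["B"]
```

After weighting, the two cells contribute equally to the linked sample, matching their equal shares in the made-up population.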