Frequently Asked Questions (FAQ)

General Information about the Project

What is LIFE-M?

What makes LIFE-M unique from other historic linking projects?

How many families are in LIFE-M?

Why is LIFE-M limited to two states?

Getting Started

Where should a new user start?

Basic Concepts

What is meant by “vital records”?

What is a “generation”?

What are “training” and “full” data?

What does “universe” mean in the variable descriptions?

Getting Data

How do I obtain data?

What format are the data in?

How long does it take to get data?

How big are the data files? What if the files are too big for me to handle?

Can I get the data with people’s names?

What information is included in restricted-use data?

How do I get access to the restricted-use data?

Using LIFE-M Data

How is a record identified?

How do I link LIFE-M datasets?

How do I link LIFE-M to external datasets?

How can I identify families?

How do I create an intergenerational dataset?

I want to include a family fixed effect in my analyses. How do I create a family ID?

How do I identify siblings?

How do I identify twins?

How do I determine birth order?

How do I calculate birth intervals?

How do I identify members of the extended family?

How should the data be weighted to be representative of my population of interest?

How do I cite LIFE-M data?

Who do I contact about additional questions or to report data issues?

Understanding the Data and Variables

Why does the same person have different HISTIDs across Census years?

Why does an individual who links to their marriage record(s) have missing spouse IDs?

Why does longevity apparently decrease over time in LIFE-M?

Why does age of marriage apparently decrease over time in LIFE-M?

General Information about the Project

What is LIFE-M?

LIFE-M, short for Longitudinal, Intergenerational Family Electronic Microdata, is a new data resource linking millions of individuals and families living in the late 19th and 20th centuries using vital records and decennial censuses. This combination of records provides a life-course and intergenerational perspective on the evolution of health and economic outcomes. Currently, the LIFE-M data contain records from Ohio and North Carolina. For more project information, check out our website and documentation.

What makes LIFE-M unique from other historic linking projects?

LIFE-M links people across vital records in addition to census records. Vital record links add information on birth family structure, marriage, and death. In addition, vital records allow LIFE-M to link large samples of women between their birth and marriage families because vital records typically include birth (or “maiden”) names.

How many families are in LIFE-M?

LIFE-M links over 5 million unique families across two generations, almost 2 million unique families across three generations, and 770,000 families across four generations.

Why is LIFE-M limited to two states?

LIFE-M is available for North Carolina and Ohio because these are the states that were able to provide relatively complete vital records (birth, marriage, and death).

Getting Started

Where should a new user start?

The documentation is the natural starting point for new LIFE-M users. The variable descriptions and user guide explain what the data contain and how to use them. If you plan to link LIFE-M data to census records, you will also want to familiarize yourself with IPUMS.

Once you have become acquainted with the project, you can download the LIFE-M data from openICPSR.

Basic Concepts

What is meant by “vital records”?

Vital records in LIFE-M include birth, marriage, and death records.

What is a “generation”?

A “generation” is defined by how records are linked. In LIFE-M, the universe of birth records (in Ohio and North Carolina) from approximately 1900 to 1929 makes up generation 2 (the core generation of LIFE-M). From these birth records, we identify the parents of generation 2, or generation 1. Generation 2 can also be linked to their children, or generation 3, using available birth records and the 1940 Census. Lastly, using vital and census records for generation 1, we identify generation 0. Using birth-to-birth linking, we also identify siblings in generation 2.

Generation	Description	Linking Source
G0	Grandparent of G2	Parent in G1’s Census or on G1’s marriage or death record
G1	Parent of G2	Parent on G2’s birth certificate
G2	Core sample	Infant on birth certificate
G3	Child of G2	Infant on birth certificate or child in 1940 Census

Generations do not necessarily correspond with birth years. For example, a core G2 may have been born in 1900 and their older sibling in 1898. Both are in G2, even though the sibling falls outside the core birth cohorts. As another example, a core G2 born in 1900 may have had a child (i.e., G3) in 1925. These two people are in different generations, even though both were born in the years used to identify G2. In addition, an individual may be in multiple “generations” because of the way the data were constructed. In the example above, someone born in 1900 with a child born in 1925 will be both a G2 and a G1, and their child will be both a G3 and a G2.

What are “training” and “full” data?

Records were linked through two methods: by hand and by machine algorithm. “Training” data refer to the records that were linked by hand (i.e., by human trainers), and “full” data refer to the records that were linked via machine algorithm. For simplicity, we refer to the data as hand-linked and machine-linked. Refer to the link variables (e.g., LNKD, LNKM, etc.) to identify how (hand/machine) records were linked to vital and census records.

The hand-linked data contain higher quality links, but machine-linked data contain substantially larger sample sizes with more complete information. The machine-linked data were linked with a 97 percent precision rate (or a 3 percent type I error rate evaluated against the hand-linked data).
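To make the relationship between the two figures concrete, here is a minimal sketch of how a precision rate could be computed by comparing machine links against hand links for the same records. The link dictionaries and record IDs are made up for illustration, and this is not necessarily the evaluation procedure the LIFE-M team used.

```python
def precision(machine_links, hand_links):
    """Share of machine links that agree with the hand link for the same record.

    Both arguments map a record ID to the ID it was linked to. Only records
    linked by both methods are compared.
    """
    compared = [rec for rec in machine_links if rec in hand_links]
    if not compared:
        raise ValueError("no overlapping records to evaluate")
    correct = sum(machine_links[rec] == hand_links[rec] for rec in compared)
    return correct / len(compared)


# Toy, made-up links: 3 of the 4 overlapping machine links match the hand links.
hand = {"a": "x1", "b": "x2", "c": "x3", "d": "x4"}
machine = {"a": "x1", "b": "x2", "c": "x9", "d": "x4", "e": "x5"}

p = precision(machine, hand)
type_i_error = 1 - p  # share of compared machine links that are wrong
```

In this toy case the precision is 0.75 and the type I error rate is 0.25; the two always sum to one, which is why a 97 percent precision rate corresponds to a 3 percent type I error rate.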

What does “universe” mean in the variable descriptions?

Similar to IPUMS, we use “universe” to identify the generations to which a variable applies. For example, county of birth (COB) comes from the birth records, and since G0 and G1 individuals are not linked to their birth records, they are not in the universe for COB. A generation being listed in the universe does not imply that the variable is non-missing for all of its records. To determine missingness for a specific variable, we direct users to the codes in the variable descriptions.

Getting Data

How do I obtain data?

You can download the LIFE-M data from openICPSR. In addition, the LIFE-M data can be linked to external datasets, including census records from IPUMS and the LIFE-M Ohio Causes of Death Data from the ICPSR Linkage Library.

What format are the data in?

LIFE-M provides data files in Stata (.dta) and R (.rds). In these programs, a row is an observation, and a column is a variable.

How long does it take to get data?

You can download the LIFE-M public-use files from openICPSR and the LIFE-M Ohio Causes of Death Data from ICPSR Linkage Library right away. However, data extraction from IPUMS requires registration and is not immediate.

How big are the data files? What if the files are too big for me to handle?

Collectively, the LIFE-M data are under 6 GB (including the master data, linked death, marriage, census crosswalks, and location file). However, when linking LIFE-M data to Census records provided by IPUMS, the files can become quite large. IPUMS provides a few ways to deal with this. In general, a best practice when analyzing large datasets is to write and debug code on a sample of the data before running it on the entire dataset. We recommend users have at least 16 GB of RAM available when using LIFE-M data and merging it to IPUMS.

Can I get the data with people’s names?

Data containing names are restricted, but users can request restricted-use data.

What information is included in restricted-use data?

In addition to all variables in the public-use data, the restricted-use data provide names and exact dates of birth and death, when available.

How do I get access to the restricted-use data?

To apply for access to the restricted-use data, submit a data request using this application. The data request usually takes 14 days.

Using LIFE-M Data

How is a record identified?

The variable LIFEMID uniquely identifies a person.

How do I link LIFE-M datasets?

The LIFE-M Master Data can be linked to the Death, Marriage, Census Crosswalk, and Location files using the variable LIFEMID. For more details, refer to the user guide.
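As an illustration of a LIFEMID merge, the sketch below joins toy master rows to toy death rows in plain Python so that it runs without the LIFE-M files. The variables GEN and DEATH_YEAR are hypothetical placeholders, and the sketch assumes the second file has at most one row per LIFEMID; in practice you would load the .dta or .rds files in your statistical software and use its merge command.

```python
def left_join(master_rows, other_rows, key="LIFEMID"):
    """Attach columns from other_rows to master_rows, keeping every master row.

    Assumes other_rows has at most one row per key value.
    """
    lookup = {row[key]: row for row in other_rows}
    joined = []
    for row in master_rows:
        match = lookup.get(row[key], {})
        # Master values take precedence if a column appears in both files.
        joined.append({**match, **row})
    return joined


# Toy rows; GEN and DEATH_YEAR are hypothetical variable names.
master = [
    {"LIFEMID": 1, "GEN": 2},
    {"LIFEMID": 2, "GEN": 2},
]
deaths = [
    {"LIFEMID": 1, "DEATH_YEAR": 1975},
]

merged = left_join(master, deaths)
```

Person 1 gains the death information, while person 2, who has no linked death record, keeps only the master columns.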

How do I link LIFE-M to external datasets?

For linking to 1880, 1900, 1910, 1920 and 1940 Census records, use the variables HISTIDYR (where YR represents the last 2 digits of a year) in the Census Crosswalk to link an individual in LIFE-M to their census records. The variables LNKCYR in the Master file will let you know how the individual has been linked to their census records for a given year. For more details, refer to the user guide. Interested users can also link to the 1930 Census using MLP links. Refer to the

For linking to the Ohio Causes of Death Data, use the variable CODID in the LIFE-M Death Data.
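For the census links described above, the following small sketch builds the year-specific variable names from a census year, taking the stated convention (YR is the last two digits of the year) literally. Please confirm the exact names against the variable descriptions before relying on them.

```python
# Census years listed above as linkable through the Census Crosswalk.
CENSUS_YEARS = (1880, 1900, 1910, 1920, 1940)


def census_vars(year):
    """Return the (HISTIDYR, LNKCYR) variable names for one census year."""
    if year not in CENSUS_YEARS:
        raise ValueError(f"{year} is not one of {CENSUS_YEARS}")
    suffix = f"{year % 100:02d}"  # last two digits, e.g. 1900 -> "00"
    return f"HISTID{suffix}", f"LNKC{suffix}"


histid_1940, lnkc_1940 = census_vars(1940)
```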

How can I identify families?

There are multiple ways to identify families, depending on how many generations you want to include in your analysis. If you are only interested in single-generation families, use SPID variables. If you are interested in two-generation families, then you will want to use MOMID, DADID, and SPID. If you are interested in three-generation and four-generation families, you will first have to determine how to define these and potentially create new variables and datasets. Refer to the user guide for examples and sample code.

How do I create an intergenerational dataset?

If you are interested in questions such as how a mother’s education affects her children’s education, or what the relationship is between a father’s occupation and his children’s occupations, you will need to create an intergenerational dataset that links children to their parents. Refer to the user guide for more details and code.

If you are interested in the transmission of human capital from grandparent to child, you will need to create an intergenerational dataset that links children to their grandparents. Refer to the user guide for more details and code.

Lastly, if you want to create a dataset of four generations, refer to the user guide for an example of how to do this.
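The idea behind these intergenerational datasets can be sketched as a repeated lookup: follow MOMID and DADID once to reach the parents, then follow the parents’ own MOMID and DADID to reach the grandparents. The example below uses toy rows, represents missing IDs as None (an assumption about coding), and invents the output names MAT_GRANDMOM, MAT_GRANDDAD, PAT_GRANDMOM, and PAT_GRANDDAD. The user guide remains the authoritative source for code.

```python
def add_grandparents(rows):
    """Return copies of rows with four hypothetical grandparent ID columns."""
    by_id = {row["LIFEMID"]: row for row in rows}
    out = []
    for child in rows:
        rec = dict(child)
        for side, parent_var in (("MAT", "MOMID"), ("PAT", "DADID")):
            # The parent may be missing or may not appear in the data at all.
            parent = by_id.get(child[parent_var])
            rec[f"{side}_GRANDMOM"] = parent["MOMID"] if parent else None
            rec[f"{side}_GRANDDAD"] = parent["DADID"] if parent else None
        out.append(rec)
    return out


# Toy family: 10 and 11 are the parents of 20; 20 and 21 are the parents of 30.
people = [
    {"LIFEMID": 10, "MOMID": None, "DADID": None},
    {"LIFEMID": 11, "MOMID": None, "DADID": None},
    {"LIFEMID": 20, "MOMID": 10, "DADID": 11},
    {"LIFEMID": 21, "MOMID": None, "DADID": None},
    {"LIFEMID": 30, "MOMID": 20, "DADID": 21},
]

linked = {rec["LIFEMID"]: rec for rec in add_grandparents(people)}
```

Applying the same lookup once more to the grandparent columns would extend the chain to a fourth generation.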

I want to include a family fixed effect in my analyses. How do I create a family ID?

First, make sure your data are in the preferred format for analysis. Then, you can create a family ID based on your definition of a family. Refer to the user guide for some examples.
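As one possible definition, the sketch below treats a family as all children who share the same MOMID and DADID and assigns each such pair a sequential ID. The name FAMID is hypothetical, missing parent IDs are assumed to be None, and children with a missing parent are left without a family ID; other definitions of a family would need different rules.

```python
def assign_family_ids(rows):
    """Add a hypothetical FAMID shared by children with the same two parents."""
    family_of = {}
    for row in rows:
        parents = (row["MOMID"], row["DADID"])
        if None in parents:
            row["FAMID"] = None  # cannot place the child in a two-parent family
            continue
        # First time a parent pair is seen, it receives the next integer ID.
        row["FAMID"] = family_of.setdefault(parents, len(family_of) + 1)
    return rows


children = assign_family_ids([
    {"LIFEMID": 1, "MOMID": "m1", "DADID": "d1"},
    {"LIFEMID": 2, "MOMID": "m1", "DADID": "d1"},
    {"LIFEMID": 3, "MOMID": "m2", "DADID": "d2"},
    {"LIFEMID": 4, "MOMID": "m3", "DADID": None},
])
```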

How do I identify siblings?

Siblings can be identified using MOMID and DADID. Full siblings will have the same MOMID and DADID, whereas half-siblings will only have one parent in common. For more details, refer to the user guide. If you are interested in analyses of blended families (step-siblings and step-parents), you will need to use MOMID, DADID, and SPID variables. However, such an analysis is more complex.
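The full versus half sibling rule can be written as a small comparison of parent IDs. This sketch assumes missing parent IDs are represented as None and adds an “at least half” label for pairs that share one parent while the other parent is unknown, since such a pair cannot be classified with certainty.

```python
def sibling_type(a, b):
    """Classify two people by the parents they share."""
    shared = 0
    unknown = 0
    for parent in ("MOMID", "DADID"):
        if a[parent] is None or b[parent] is None:
            unknown += 1
        elif a[parent] == b[parent]:
            shared += 1
    if shared == 2:
        return "full"
    if shared == 1 and unknown == 0:
        return "half"
    if shared == 1:
        return "at least half"  # other parent is missing for one or both
    return None  # no evidence of a shared parent


ann = {"MOMID": "m1", "DADID": "d1"}
bob = {"MOMID": "m1", "DADID": "d1"}
cat = {"MOMID": "m1", "DADID": "d2"}
dan = {"MOMID": "m1", "DADID": None}
eve = {"MOMID": "m3", "DADID": "d3"}
```

With these toy records, `sibling_type(ann, bob)` gives "full", `sibling_type(ann, cat)` gives "half", and `sibling_type(ann, dan)` gives "at least half".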

How do I identify twins?

Twins (and triplets) will have the same mother and birth date. Twins can be identified using MOMID and DOB. The original date of birth is only provided in the restricted-use data, so the dates of birth in the public-use data may not match exactly for twins, but the year and month of birth should be the same within a set of twins. Refer to the user guide for sample code.
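A minimal sketch of this rule groups children by mother, birth year, and birth month, and flags groups with more than one child. It assumes DOB has already been parsed into a date value and that missing values are None; the toy rows are invented.

```python
from collections import defaultdict
from datetime import date


def multiple_births(rows):
    """Return lists of LIFEMIDs that share a mother, birth year, and birth month."""
    groups = defaultdict(list)
    for row in rows:
        if row["MOMID"] is None or row["DOB"] is None:
            continue
        key = (row["MOMID"], row["DOB"].year, row["DOB"].month)
        groups[key].append(row["LIFEMID"])
    return [ids for ids in groups.values() if len(ids) > 1]


births = [
    {"LIFEMID": 1, "MOMID": "m1", "DOB": date(1905, 3, 2)},
    {"LIFEMID": 2, "MOMID": "m1", "DOB": date(1905, 3, 9)},  # day differs
    {"LIFEMID": 3, "MOMID": "m1", "DOB": date(1907, 6, 1)},
    {"LIFEMID": 4, "MOMID": "m2", "DOB": date(1905, 3, 2)},
]

twin_sets = multiple_births(births)
```

Here children 1 and 2 are flagged as a twin pair even though their days of birth differ, which mirrors the public-use situation described above.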

How do I determine birth order?

You can determine birth order using MOMID, DADID, and DOB. Refer to the user guide for sample code.
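One way to sketch this is to group children by their (MOMID, DADID) pair, sort each group by DOB, and number the children. The example assumes DOB is a date value and missing values are None. Note that the resulting order only counts children observed in the data, and the order within twin pairs may not be reliable in the public-use files.

```python
from collections import defaultdict
from datetime import date


def birth_order(rows):
    """Map LIFEMID to birth order among children with the same two parents."""
    families = defaultdict(list)
    for row in rows:
        if None in (row["MOMID"], row["DADID"], row["DOB"]):
            continue  # order is undefined without both parents and a DOB
        families[(row["MOMID"], row["DADID"])].append(row)
    order = {}
    for children in families.values():
        children.sort(key=lambda r: r["DOB"])
        for position, child in enumerate(children, start=1):
            order[child["LIFEMID"]] = position
    return order


kids = [
    {"LIFEMID": 1, "MOMID": "m1", "DADID": "d1", "DOB": date(1905, 3, 10)},
    {"LIFEMID": 2, "MOMID": "m1", "DADID": "d1", "DOB": date(1903, 1, 20)},
    {"LIFEMID": 3, "MOMID": "m1", "DADID": "d1", "DOB": date(1908, 11, 2)},
    {"LIFEMID": 4, "MOMID": "m2", "DADID": "d2", "DOB": date(1904, 5, 5)},
    {"LIFEMID": 5, "MOMID": "m3", "DADID": None, "DOB": date(1906, 7, 7)},
]

order = birth_order(kids)
```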

How do I calculate birth intervals?

Birth intervals measure the time (in months) between the dates of birth of successive children. Once you have determined birth order, you can calculate birth intervals using DOB. Refer to the user guide for sample code.
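As a worked illustration, the sketch below computes intervals in whole calendar months between consecutive births within one family. It assumes DOB is a date value and ignores the day of the month, which seems a reasonable simplification given that public-use dates may not be exact.

```python
from datetime import date


def months_between(earlier, later):
    """Whole calendar months from one date to a later date, ignoring the day."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def birth_intervals(dobs):
    """Intervals in months between consecutive births, in birth order."""
    ordered = sorted(dobs)
    return [months_between(a, b) for a, b in zip(ordered, ordered[1:])]


# Toy family with births in January 1903, March 1905, and November 1908.
intervals = birth_intervals([date(1905, 3, 10), date(1903, 1, 20), date(1908, 11, 2)])
```

For this toy family the intervals are 26 months (January 1903 to March 1905) and 44 months (March 1905 to November 1908).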

How do I identify members of the extended family?

Familial relationships can be identified through parent and spouse IDs. We provide code to identify grandparents in the user guide. To identify aunts and uncles, you need to first identify siblings, and then use these relationships and parent IDs. Lastly, to identify in-laws, you need to use spouse and parent IDs.
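The aunt and uncle step can be sketched as follows: find each parent of the person, then collect everyone who shares a mother or a father with that parent. The rows are toy data with missing IDs as None, and the sketch covers only relatives by birth; aunts and uncles by marriage would additionally require the spouse IDs.

```python
def aunts_and_uncles(person_id, rows):
    """Return LIFEMIDs of people who share a parent with one of the person's parents."""
    by_id = {row["LIFEMID"]: row for row in rows}
    person = by_id[person_id]
    found = set()
    for parent_var in ("MOMID", "DADID"):
        parent = by_id.get(person[parent_var])
        if parent is None:
            continue  # parent missing or not present in the data
        for other in rows:
            if other["LIFEMID"] == parent["LIFEMID"]:
                continue
            shares_mom = parent["MOMID"] is not None and other["MOMID"] == parent["MOMID"]
            shares_dad = parent["DADID"] is not None and other["DADID"] == parent["DADID"]
            if shares_mom or shares_dad:
                found.add(other["LIFEMID"])
    return found


# Toy family: 10 is the mother of 20 and 22; 20 is the mother of 30.
relatives = [
    {"LIFEMID": 10, "MOMID": None, "DADID": None},
    {"LIFEMID": 20, "MOMID": 10, "DADID": None},
    {"LIFEMID": 22, "MOMID": 10, "DADID": None},
    {"LIFEMID": 30, "MOMID": 20, "DADID": None},
    {"LIFEMID": 40, "MOMID": None, "DADID": None},
]

result = aunts_and_uncles(30, relatives)
```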

How should the data be weighted to be representative of my population of interest?

Each population of interest will be different depending on the research question. For that reason, we do not provide weights in the data, but Bailey, Cole, and Massey (2019) provide a step-by-step process to create weights for specific subsamples and purposes. See the user guide for some generic example code.
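To give a sense of what such weights do, here is a generic cell-based sketch: each linked record is weighted by the ratio of the population count to the linked-sample count in its cell, so that under-linked groups count for more. This is a common textbook approach shown purely for illustration with invented cells; it is not necessarily the procedure described in Bailey, Cole, and Massey (2019).

```python
from collections import Counter


def cell_weights(population_cells, linked_cells):
    """Weight for each cell: population count divided by linked-sample count."""
    population = Counter(population_cells)
    linked = Counter(linked_cells)
    return {cell: population[cell] / n for cell, n in linked.items()}


# Toy example: the population has 4 women and 4 men, but only 1 woman and
# 2 men are linked, so each linked woman stands in for 4 people.
population = ["F", "F", "F", "F", "M", "M", "M", "M"]
linked_sample = ["F", "M", "M"]

weights = cell_weights(population, linked_sample)
```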

How do I cite LIFE-M data?

We appreciate your support. Please cite LIFE-M using the citation generated by openICPSR.

Please send a reference for any published work to lifemstudy@ucla.edu so we can add your study to our Research Page.

Who do I contact about additional questions or to report data issues?

Please send questions to lifemstudy@ucla.edu.

Understanding the Data and Variables

Why does the same person have different HISTIDs across Census years?

HISTIDYR (where YR represents a year) in LIFE-M follows the structure of HISTID in IPUMS. HISTID in IPUMS is a census-specific, person-unique identifier. The same person in different Census years will have a different HISTID. You can download MLP links from IPUMS to connect the same person from adjacent census years.

Why does an individual who links to their marriage record(s) have missing spouse IDs?

SPIDX indicates the LIFEMID of a spouse, so if the spouse on the marriage record is not in the LIFE-M data, then there will be no spouse ID. Similarly, an individual can have a SPIDX without linking to any marriage record(s), because some spouses were identified from other records. See the user guide and variable descriptions for more details.

Why does longevity apparently decrease over time in LIFE-M?

There is incomplete digitization of state death records and two-sided censoring of age at death. Individuals from older cohorts must have died after the death records collection begins; therefore, they are selected into living longer lives, and the mean age at death for this selected group is greater than for their cohorts as a whole. Individuals from younger cohorts must have died before the death records collection ends; therefore, these individuals are selected to die younger, and the mean age at death for this selected group is lower than for their cohorts as a whole. This leads to the apparent decline in longevity across cohorts. For a more detailed explanation and visuals, refer to “Additional Data Details” of the documentation.
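The censoring mechanism can be seen in a small, entirely made-up example. Suppose every cohort has the same true ages at death, but deaths are only observed within a fixed window of calendar years. The ages and the 1960 to 1990 window below are invented for illustration and do not describe actual LIFE-M coverage.

```python
from statistics import mean

# Hypothetical: every cohort has the same true distribution of ages at death.
AGES_AT_DEATH = [40, 50, 60, 70, 80, 90]
# Hypothetical window of years in which death records were collected.
WINDOW = (1960, 1990)


def observed_mean_age(birth_year):
    """Mean age at death among deaths that fall inside the collection window."""
    observed = [
        age for age in AGES_AT_DEATH
        if WINDOW[0] <= birth_year + age <= WINDOW[1]
    ]
    return mean(observed)


true_mean = mean(AGES_AT_DEATH)          # 65 for every cohort
older_cohort = observed_mean_age(1900)   # only deaths at ages 60 to 90 observed
younger_cohort = observed_mean_age(1920) # only deaths at ages 40 to 70 observed
```

Although the true mean is 65 for both cohorts, the observed mean is 75 for the 1900 cohort and 55 for the 1920 cohort, so longevity appears to fall even though nothing has changed. The same logic applies to age at marriage.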

Why does age of marriage apparently decrease over time in LIFE-M?

As with the death records, there is incomplete digitization of state marriage records and two-sided censoring of age at marriage. Individuals from older cohorts must have married after the marriage records collection begins; therefore, they are selected into marrying later, and the mean age at marriage for this selected group is greater than for their cohorts as a whole. Individuals from younger cohorts must have married before the marriage records collection ends; therefore, these individuals are selected to marry younger, and the mean age at marriage for this selected group is lower than for their cohorts as a whole. This leads to the apparent decline in age of marriage across cohorts. For a more detailed explanation and visuals, refer to “Additional Data Details” of the documentation.