NAPP Linked Samples
The North Atlantic Population Project microdata permit linking of individuals between census years for longitudinal analysis. The current version of NAPP includes samples of the United States that link 1880 to seven other census years, and 6 samples for Norway that link males and couples between the 1865, 1875, and 1900 censuses. Each sample contains data on all linked individuals and records for other members of the linked person's household from both years. Currently, the NAPP linked datasets contain information on nearly 600,000 people at two points in time (180,000 in the U.S. and 400,000 in Norway).
Our procedures for linking individuals in the different North Atlantic countries are similar. We developed our basic methods by linking the U.S. complete count 1880 census to the sample datasets from 1850 to 1930. We then extended the method to other North Atlantic countries, adapting our approach to address differences in the information available for linking. Our primary goal in constructing these longitudinal databases has been to minimize selection bias and maximize representativeness of the linked cases. This approach has required a very conservative linking strategy.
The discussion below describes the detailed linking procedure for the United States. That discussion is followed by a section describing modifications necessary to develop the linked data for Norway. For a more detailed description of the U.S. linking methods see the linking documentation on the IPUMS USA website.
UNITED STATES LINKED SAMPLES, 1850-1930
Linked representative samples for the United States have been developed for 7 pairs of years involving the 1880 NAPP sample: 1850-1880, 1860-1880, 1870-1880, 1880-1900, 1880-1910, 1880-1920, and 1880-1930. Each of these contains three independent linked samples: one of men, one of women, and one of married couples.
All files contain weights that allow users to generate representative population estimates. The universes of people represented in the male files and couple files are straightforward. The linked male files represent all men who were in the U.S. in both census years. The linked couples files represent all married couples who were in the U.S. and married to one another in both census years. The universe of people represented in the linked female files is more complicated: these files represent women who were in the U.S. in both census years and who did not change their surname due to marriage between the two census years. For this reason, the female files represent only women who were single at both time periods, were married to the same person at both time periods, or transitioned from married to widowed or divorced between time periods. The female files do not represent women who entered into marriage between time periods.
There are three variables that are unique to the Linked Representative Samples: HHSEQ, LINKTYPE, and LINKWT. HHSEQ is used in combination with SERIAL to identify all members of the linked person's household in both census years. This variable is necessary because households with more than one linked person are written out multiple times. LINKTYPE specifies why each case was included in the Linked Data Sample. LINKWT indicates the weight value assigned to each linked case. These are all weighted samples, and the LINKWT variable must be used to generate representative population estimates. Each of these variables is described in more detail below.
Basic linking strategy:
Each linked sample was produced by comparing cases from the complete-count 1880 data file to those from a sample of the U.S. population in another census year. For example, the 1870-1880 linked samples use the 1880 100% data and a 1% sample from the 1870 U.S. population census.
The U.S. linking strategy relies on five variables that should theoretically not change over time: birth year, state of birth, given name, surname, and race. Records were only compared in the linking process if they had an exact match on race and state of birth. The age and name variables, on the other hand, were permitted to have some variation. Name information was "cleaned," following the procedures specified below. Age was allowed to be up to seven years higher or lower than would be expected.
One of the major strategic decisions was to ignore information from other co-resident individuals, because of potential bias issues. For example, if linking decisions were based on the characteristics of co-resident family members, then the samples would be biased in favor of those with co-resident family members. Similarly, using place of residence information in linking decisions would have resulted in the over-representation of non-movers.
Each person in a household is evaluated independently for a longitudinal link. Consequently, multiple individuals in the same household may receive a primary link. Once primary links are established, we attempt to link members of their households across census years, using a distinct linking process. A household member receives a household link only if they do not receive a primary link and have consistent race, birthplace, name, and age information in the two census years. These household links were made for the convenience of researchers interested in the stability of primary linked persons' households.
These strategies were designed to produce a conservative, but representative set of linked data. We are confident that the links provided in these data are solid, and that, taken as a whole, the database is representative of those surviving from one census to the next.
Linked file structure and important linking variables: All linked samples consist of independently linked records and all persons who lived in a linked person's household during either census year. Consequently, the linked files contain many individuals other than those who were linked using the systematic linking procedures. In addition, if a household contains multiple individuals with a primary link, the household will be repeated in the data file multiple times (once for each primary link). The linking variables LINKTYPE and HHSEQ can be used to identify individuals receiving primary links, as well as other residents in their households.
Researchers who are not interested in co-resident persons can greatly simplify the dataset by removing all cases where LINKTYPE is greater than 0. This will remove all persons who were not linked in the primary linking process. Once this is done, LINKTYPE and HHSEQ can be ignored. Information on co-resident persons is one of the most powerful features of the linked samples, however, and we expect that this information will be useful for a variety of research purposes. For instance, co-resident persons can be used to create variables describing characteristics such as birth order, father's occupation, or the employment status of household members.
LINKTYPE indicates the reason an individual was included in the linked data file. The codes are as follows:
0 = Primary link.
1 = Additional primary link. This person will have a LINKTYPE value of 0 in a repeat of the non-1880 household (or the 1875 Norwegian household).
5 = Household link. This person was linked after the systematic linkage process, and is present in the household of a primary linked person in two census years. These links were made for the convenience of researchers interested in the stability of primary linked persons' households.
9 = Unlinked person, present in the household of a primary linked person during only one census year.
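As a minimal sketch of how these codes might be used, the Python fragment below keeps only the primary links (LINKTYPE = 0), which is the simplification available to researchers who do not need co-resident persons. The record layout (a list of dictionaries) and the sample rows are assumptions made for illustration; only the variable name LINKTYPE and its codes come from the documentation.

```python
# LINKTYPE codes as documented above.
LINKTYPE_LABELS = {
    0: "primary link",
    1: "additional primary link",
    5: "household link",
    9: "unlinked person",
}

def primary_links_only(rows):
    """Drop every record whose LINKTYPE is greater than 0."""
    return [row for row in rows if row["LINKTYPE"] == 0]

# Hypothetical records, invented for illustration only.
records = [
    {"person": "a", "LINKTYPE": 0},
    {"person": "b", "LINKTYPE": 1},
    {"person": "c", "LINKTYPE": 5},
    {"person": "d", "LINKTYPE": 9},
]

primaries = primary_links_only(records)
```

Because a person coded 1 appears with code 0 in a repeated copy of the household, filtering on LINKTYPE = 0 over the whole file should still retain every primary link once.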
HHSEQ is the linked data sample household identification number, to be used in conjunction with SERIAL. Each primary linked person has a unique combination of SERIAL and HHSEQ that is shared by all members of the linked person's households in both census years. HHSEQ values in the U.S. range from 1 to 9.
Examples of households from the 1870-1880 linked file are presented in Table 1.
In the first example, the primary link (indicated by LINKTYPE = 0) is the third record in the 1870 household, James H A, who was 10 years old in the 1870 census. Based on consistent race, birthplace, name and age information, we also establish household links for all other members of the 1870 household with the exception of the head of household, who does not appear to have been present in the 1880 household (and this is indicated by LINKTYPE = 9). In addition, the 1880 household has one member, Mary Gillan, a 40-year-old domestic servant, who was not present in 1870. Because the linked data file contains variables from two census years, suffixes are used to identify the source: variables ending in "_1" are data drawn from a sample year, e.g. 1870, while variables drawn from 1880 are indicated by the suffix "_2".
The second household in Table 1 gives an example of an 1870 household with two primary links. Since we have two primary links, the 1870 household is written out twice (indicated by HHSEQ), with each version providing links to a different 1880 household. In this case the first primary link is the head of the household, 57-year-old Moses Woodfin. Of the seven household members in 1870, all are also co-resident in 1880 with the exception of William and 15-year-old Moses. However, 15-year-old Moses is also a primary link, indicated by LINKTYPE = 1 in the first version of the household. The second version of the 1870 household (i.e., HHSEQ = 2) shows 15-year-old Moses as the primary link; in 1880 he was 25 years old, living with his wife and three young children.
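The repeated-household layout can be handled by grouping on the combination of SERIAL and HHSEQ. The sketch below is illustrative only: it assumes one row per person, and the rows are invented rather than taken from Table 1.

```python
from collections import defaultdict

def group_households(rows):
    """Group person records by (SERIAL, HHSEQ), one group per household copy."""
    households = defaultdict(list)
    for row in rows:
        households[(row["SERIAL"], row["HHSEQ"])].append(row)
    return dict(households)

def primary_person(members):
    """Return the single LINKTYPE = 0 person in one household copy."""
    primaries = [m for m in members if m["LINKTYPE"] == 0]
    if len(primaries) != 1:
        raise ValueError("expected exactly one primary link per household copy")
    return primaries[0]

# Hypothetical household written out twice because it has two primary links.
rows = [
    {"SERIAL": 100, "HHSEQ": 1, "person": "father", "LINKTYPE": 0},
    {"SERIAL": 100, "HHSEQ": 1, "person": "son", "LINKTYPE": 1},
    {"SERIAL": 100, "HHSEQ": 2, "person": "father", "LINKTYPE": 1},
    {"SERIAL": 100, "HHSEQ": 2, "person": "son", "LINKTYPE": 0},
]

households = group_households(rows)
```

Here each copy of the household has its own primary person: the father where HHSEQ = 1 and the son where HHSEQ = 2.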
Detailed linking procedure:
Standardization of the first name strings: This was necessary due to the inconsistent use of abbreviations, initials, diminutives, and simple mis-spellings. Our approach to these issues was to standardize first name strings before beginning the linking process. The first step in our name standardization was to construct "dictionaries" of first names containing all unique name strings from the 1880 complete-count data and the various samples. This resulted in approximately 1.7 million unique strings for first names. Table 2 shows common first names for men, with the FNAME field containing the raw string. When a person had an initial, or a listed middle name, the raw string was parsed into two fields: XNAME1 and XNAME2. The resulting standardized name is in the field STD_XNAME1. Given the large number of unique strings, we standardized only first names that appeared more than 100 times, of which there were approximately 20,000. Most of this work involved standardizing commonly used abbreviations and diminutives, such as "Wm" to "William", "Joe" to "Joseph", and "Tom" to "Thomas".
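The parsing and dictionary lookup described above might look roughly like the following sketch. The field names FNAME, XNAME1, XNAME2, and STD_XNAME1 come from the text; the three-entry dictionary, the upper-casing, and the handling of periods are assumptions made for illustration, not the project's actual rules.

```python
# Tiny stand-in for the name dictionary; the real one covered roughly
# 20,000 frequently occurring strings.
NAME_DICTIONARY = {"WM": "WILLIAM", "JOE": "JOSEPH", "TOM": "THOMAS"}

def standardize_first_name(fname):
    """Split a raw first-name string and standardize its first component."""
    # Assumption: periods are treated as separators and case is ignored.
    parts = fname.replace(".", " ").upper().split()
    xname1 = parts[0] if parts else ""
    xname2 = " ".join(parts[1:])
    return {
        "FNAME": fname,
        "XNAME1": xname1,
        "XNAME2": xname2,
        # Strings not found in the dictionary are left unchanged.
        "STD_XNAME1": NAME_DICTIONARY.get(xname1, xname1),
    }

example = standardize_first_name("Wm. H")
```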
Identifying potential links: The creation of the linked samples required millions of individual comparisons. Linked persons were required to have the same sex, race, and birthplace, so the comparison of these variables was straightforward. But since more tolerance was permitted with the name and age variables, many potential links required more complex decision-making. Fortunately, several software packages have been developed to deal with the ambiguities in record linkage. We relied on the Freely Extensible Biomedical Record Linkage (FEBRL), software created by Peter Christen and Tim Churches at the Australian National University's Data Mining Group.
A typical input file to the FEBRL software would contain individuals with a given race, sex, birthplace, and--for the married samples--marital status. For example, for the creation of the 1870-1880 linked male sample, the software was tasked with comparing two groups of people: white males born in Michigan who were in the 1870 sample, and white males born in Michigan who were present in the 1880 data.
Given these constants, the FEBRL software made comparisons based on age, first name, and last name. Age was allowed to vary up to seven years in either direction. Thus, 30-year-olds in the 1870 sample (who would be expected to be 40 in 1880) were compared to all persons between the ages of 33 and 47 in 1880. When an age match was not perfect, the similarity of ages was quantified as the difference from the expected age (in years) as a proportion of the individual's age in the non-1880 year. As was true with age, we did not set the software to look for only exact name matches. First names and last names had already been standardized to account for common abbreviations and diminutives (as described above). The quantification of name comparisons relied on the "Jaro-Winkler distance algorithm," which takes a mathematical approach to dealing with transpositions of particular letters or syllables in comparing the similarity of any two alphabetic strings (Jaro 1989; Porter and Winkler 1997; Winkler 1990, 1993, 2002). Using these approaches for comparing ages and names, the FEBRL software generated "similarity scores" for each of the three variables, for every comparison made (i.e., often thousands of comparisons for each person). The next step was to convert these millions of similarity scores into either true or false links.
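The age and name comparisons can be illustrated with a short, self-contained sketch. This is a textbook Jaro-Winkler implementation and a literal reading of the age rule, not FEBRL's own code; the function names, the signed `years_to_1880` parameter, and the guard against dividing by an age of zero are assumptions.

```python
def candidate_age_range(age_sample, years_to_1880, tolerance=7):
    """Ages in 1880 that are compared with a sample record of the given age."""
    expected = age_sample + years_to_1880  # negative offset for post-1880 samples
    return expected - tolerance, expected + tolerance

def age_dissimilarity(age_sample, age_1880, years_to_1880):
    """Gap from the expected age as a proportion of the non-1880 age."""
    expected = age_sample + years_to_1880
    # Assumption: treat age 0 as 1 to avoid division by zero.
    return abs(age_1880 - expected) / max(age_sample, 1)

def jaro(s1, s2):
    """Jaro similarity between two strings (1.0 means identical)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)

age_range = candidate_age_range(30, 10)   # 30-year-old in 1870
name_score = jaro_winkler("MARTHA", "MARHTA")
```

With these definitions a 30-year-old from 1870 is compared with 1880 ages 33 through 47, matching the example in the text.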
True and False links: After all files from a given pair of census years had been through similarity score construction, we needed to classify all potential links as "true" or "false". The identification and classification of data patterns is a major issue in computer science, particularly in research on data mining and machine learning. Our procedures draw heavily on the concept of "Support Vector Machines," or SVMs (Vapnik 1998; Cristianini 2000; Abe 2005). SVMs are a set of methods and tools used to conduct machine learning. Our use of SVMs relied on the LIBSVM software (developed at the National Taiwan University's Department of Computer Science).
In our case, there were two main inputs to the SVM software. The key input was our database of "similarity scores" of age and name variables, representing millions of individual person-to-person comparisons. The SVM was tasked with reviewing each comparison and classifying it as true or false. The second input to the SVM was "training data" that has already been classified properly. To create this training set, data entry operators considered a random sample of potential links and coded each one as either good or bad. Operators were not permitted to view the original census forms; it was important that they have access only to the same information that would be input to the SVM. When a majority of operators classified a potential link as a "yes", then it was coded as a "yes" in the training data. The remaining cases were coded as "no".
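The majority rule used to build the training data can be written in a few lines. The sketch below assumes each operator's judgment is recorded as a boolean; treating a tie as "no" follows from the statement that all remaining cases were coded "no".

```python
def majority_label(votes):
    """Return True ("yes") only when a strict majority of operators said yes."""
    yes_votes = sum(1 for vote in votes if vote)
    return yes_votes * 2 > len(votes)

# Hypothetical operator judgments for three potential links.
training_labels = [
    majority_label([True, True, False]),   # two of three say yes
    majority_label([True, False]),         # tie, so coded "no"
    majority_label([False, False, True]),  # minority says yes
]
```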
Using the training data and the similarity scores created by FEBRL, we ran our potential links through the LIBSVM software. The end result of the SVM classifier process was a file consisting of all potential links and the SVM-produced "confidence score," where a positive score meant a "true" link and a negative score meant a "false" link. The SVM classification often resulted in more than one "true" link being made to a given sample record. Whenever a person from a non-1880 sample had more than one true link in the 1880 data, we excluded the case from the linked files. This meant that there were many cases where we declined to make a link even if only one perfect match was located in the 1880 population. Table 3 displays potential pairings and their confidence scores for John Bradley of South Carolina. Although the first pairing is the only exact match, we do not link these records because three other potential pairings receive a positive confidence score. Our experience with nineteenth-century census data suggests that this type of conservative approach is necessary.
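The rule for discarding ambiguous cases can be sketched as follows. The tuple layout (sample record, 1880 record, confidence score) and the identifiers are assumptions made for illustration; the scores are invented and are not those shown in Table 3.

```python
from collections import defaultdict

def accept_unique_links(scored_pairs):
    """Keep a sample record only if exactly one 1880 candidate scores above zero."""
    positives = defaultdict(list)
    for sample_id, record_1880, score in scored_pairs:
        if score > 0:  # positive confidence score means a "true" link
            positives[sample_id].append(record_1880)
    return {
        sample_id: candidates[0]
        for sample_id, candidates in positives.items()
        if len(candidates) == 1
    }

# Hypothetical classifier output.
scored_pairs = [
    ("s1", "r10", 1.4), ("s1", "r11", 0.3), ("s1", "r12", -0.8),  # ambiguous
    ("s2", "r20", 0.9), ("s2", "r21", -1.1),                      # unique
    ("s3", "r30", -0.2),                                          # no true link
]

links = accept_unique_links(scored_pairs)
```

Only the second sample record survives: the first has two positive candidates and the third has none.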
Final checks and usability testing: An important part of the process was a set of links for 1870-1880 and 1880-1900 produced by Pleiades Software. The comparison of our preliminary linked data to the Pleiades linked data disclosed that we had produced accurate linked data, but our process was highly dependent on the precision of the data. More specifically, if the individuals that we attempted to link were accurately enumerated in both census years, we would either make the link or, in the case of multiple potential links, reject the link as ambiguous. But if the correct link was unidentifiable (because of mortality, underenumeration, or misenumeration of linking variable information), we would make an incorrect link if there was another person with similar characteristics in the 1880 complete-count data.
We dealt with this problem by constructing formal measures for the commonness of names. A fairly standard approach in record linkage projects is to construct frequency tables for names, which more or less assess the probability of a correct link. For example, based on frequency tables, record linkers would be more confident in linking someone with the name Roland Marsupial as opposed to John Smith. But given our minimalist approach to name cleaning and standardization, we ran into a problem with minor typos and misspellings, which show up as low-frequency names even though many of them are highly similar to high-frequency names (and would in fact show up as potential links to records with high-frequency names). Our solution was to construct name similarity scores based on the following: for a given sample record we determined the proportion of records (by race, birthplace, and sex) in the 1880 complete-count data that have a Jaro-Winkler similarity score greater than 0.9. The choice of this threshold is somewhat arbitrary, but based on the preliminary linked data, we rarely linked records that did not exceed this threshold. We also constructed a density of birth measure, which is the proportion of 1880 records for specific birthplaces, by race and sex.
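A name commonness score of this kind can be sketched as a simple proportion. To keep the example self-contained, the default similarity function below is difflib's ratio from the Python standard library, used purely as a stand-in; the project used Jaro-Winkler, and any such function can be passed in through the `similarity` parameter. The function names and the list of names are illustrative assumptions.

```python
from difflib import SequenceMatcher

def stand_in_similarity(a, b):
    """Stand-in string similarity in [0, 1]; not Jaro-Winkler."""
    return SequenceMatcher(None, a, b).ratio()

def name_commonness(name, block_names, similarity=stand_in_similarity,
                    threshold=0.9):
    """Share of 1880 records in the same race/birthplace/sex block whose
    name is more similar to `name` than the threshold."""
    if not block_names:
        return 0.0
    similar = sum(1 for other in block_names
                  if similarity(name, other) > threshold)
    return similar / len(block_names)

# Hypothetical block of 1880 surnames sharing race, birthplace, and sex.
block = ["SMITH", "SMITH", "JONES", "BROWN"]
score = name_commonness("SMITH", block)
```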
However, there is one other basic factor to take into account when evaluating the name commonness scores. The basic pattern of a higher linkage rate for less common names is evident regardless of birthplace rank. For records in the least common name category, the birthplace categories do not show much difference in linkage rates. But as we move to the more common name categories differentials between less and more populated states become more extreme. For example, only the most common name categories show a linkage rate significantly lower than the average for all records from the small states.
We expected to see this pattern in the linked data. The pattern largely results from our decision not to link in ambiguous cases (i.e., where we had more than one potential link). But by generating name commonness scores for all records, we have also forced the classifiers not to make links in cases where a name is relatively common, even in the absence of competing potential links. We do have a biased sample of linked data; in general, we are more likely to link records from the smaller states, and we are more likely to link records with less common names. We always assumed that we would deal with linkage differentials by constructing weights for the linked records; that all things being equal, linked individuals born in Delaware would have a lower weight than those born in New York. But we also assumed that there would not be an inherent bias between more versus less common names. The construction of the name scores allows us to examine this in a fairly straightforward way.
Construction and use of weights: The purpose of the linked sample weights (LINKWT) is to compensate for the under- and over-representation of linked cases. These weights are designed to produce population estimates for the universe of potential links in the sample's terminal year. For example, in the male 1880-1900 linked sample, LINKWT produces population estimates of males who were at least 20 years old in 1900, and who were in the United States in 1880. LINKWT achieves this representativeness both by inflating the value of all cases (since the cases come from a 1% sample), and by applying higher weight values to some cases than to others.
LINKWT: indicates the number of persons represented by each linked case. LINKWT values range up to 5000 and are only constructed for the primary links.
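In practice, weighted estimates are obtained by summing LINKWT over primary links rather than counting cases. The sketch below assumes a list-of-dictionaries layout; the `farmer` field and all values are invented for illustration.

```python
def weighted_count(rows, predicate=lambda row: True):
    """Estimated number of persons: sum of LINKWT over primary links."""
    return sum(row["LINKWT"] for row in rows
               if row["LINKTYPE"] == 0 and predicate(row))

def weighted_share(rows, predicate):
    """Weighted proportion of primary links satisfying the predicate."""
    total = weighted_count(rows)
    return weighted_count(rows, predicate) / total if total else 0.0

# Hypothetical records; non-primary rows carry no usable weight.
rows = [
    {"LINKTYPE": 0, "LINKWT": 120.0, "farmer": True},
    {"LINKTYPE": 0, "LINKWT": 80.0, "farmer": False},
    {"LINKTYPE": 5, "LINKWT": 0.0, "farmer": True},
]

estimated_population = weighted_count(rows)
farmer_share = weighted_share(rows, lambda row: row["farmer"])
```

An unweighted tabulation of the same rows would put farmers at one half of primary links, whereas the weighted share here is 0.6.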
The creation of LINKWT was necessary because some records were clearly more likely to be linked than others. For instance, persons from a birth-state with a small population always had a higher likelihood of successful linkage than persons from a birth-state with a large population. The larger the population for a place of birth, the more likely it was that the linkage engine would find more than one potential link for a given record in the sample data. When the linkage procedures revealed more than one high-quality potential link, no link was made and the case was discarded.
To account for these and other types of biases, the LINKWT variable adjusts the value of each case based on the following characteristics: relationship to household head, birthplace, age (divided into 5-year age groups), population size of place, state or country of birth, occupation, and marital status (for the female samples).
We construct the weights based on terminal year population characteristics (i.e., 1880 characteristics for the pre-1880 samples and sample year characteristics for the post-1880 samples). Weights are based on an estimate of the "linkable" population. Using the 1870-1880 samples as an example, the linkable population for native-born groups is anyone who was 10 years or older in the 1880 census. The universe in the linked female files represents women who were in the U.S. in both census years and who did not change their surname due to marriage (or remarriage) between the two census years. The linked male and couples files, however, represent all men and married couples who were in the U.S. in both census years. Because year of immigration information is not available before 1900, additional adjustments were necessary for foreign-born persons. For a detailed description of sample weights in the United States data linking project, refer to our forthcoming article in Historical Methods journal and to the IPUMS USA website.
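The exact weighting formula is not given here, so the following is only a generic illustration of how weights can offset differential linkage rates: within each cell of characteristics, the weight is the linkable population count divided by the number of linked cases. The cell definition (birthplace only) and all counts are assumptions; the actual LINKWT construction uses more characteristics and is described in the sources cited above.

```python
from collections import Counter

def cell_weights(linkable_population, linked_cases, cell_of):
    """Generic cell-ratio weights: population count / linked count per cell."""
    population_counts = Counter(cell_of(p) for p in linkable_population)
    linked_counts = Counter(cell_of(p) for p in linked_cases)
    return {cell: population_counts[cell] / n
            for cell, n in linked_counts.items()}

# Hypothetical counts: the small state links at a higher rate.
population = [{"bpl": "Delaware"}] * 4 + [{"bpl": "New York"}] * 20
linked = [{"bpl": "Delaware"}] * 2 + [{"bpl": "New York"}] * 4

weights = cell_weights(population, linked, cell_of=lambda p: p["bpl"])
```

In this toy case the Delaware cell receives the lower weight, in line with the expectation stated earlier that persons born in small states link more often and so need less inflation.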
The basic procedure for producing linked Norwegian samples is similar to that used for the United States linked samples. However, there are certain unique aspects of the Norwegian datasets that required modifications to our method. The following discussion covers the 3 linked male samples and 3 linked couple samples we have created as of summer 2010 (1865-1875, 1875-1900, and 1865-1900). Each linked data file contains variables drawn from both census years; variables from the earlier census year receive the suffix "_1", while variables from the later file receive the suffix "_2".
Linking strategy for Norway: The linking strategy relies on four variables (instead of five for the United States) that should theoretically not change over time: birth year, municipality of birth, given name, and surname. Records were only compared in the linking process if they had an exact match on municipality of birth. The age and name variables, on the other hand, were permitted to have some variation similar to the United States. Age was allowed to be up to three years higher or lower than would be expected for the male samples, and five years for the couple samples.
Identifying potential links: Unlike for the U.S., we did not use FEBRL software to identify potential links for Norway. Instead, we created stand-alone Perl scripts that contained the Jaro-Winkler name similarity algorithm, the same algorithm we accessed when using FEBRL to calculate name similarity scores and list potential links. We also calculated age similarity scores using Perl scripts. We made these changes for Norway because we found that file processing for these tasks was quicker than with the FEBRL software.
De jure versus de facto census: In a de jure census people were enumerated according to their regular or legal residence. A de facto census enumerates people where they are found on census night. In Norway, the 1865 census was de jure; in fact this was the last Norwegian census to include only the de jure part of the population. The 1875 census of Norway was the first census that introduced the explicit distinction between the de jure and the de facto population, with special fields for noting the whereabouts of absent people and the origins of temporary residents. The 1900 census for Norway was both de facto and de jure, similar to the 1875 census, and includes both persons absent from their households and people temporarily visiting other households. To address this issue and to avoid double counting of people, we employ special logic. The NAPP database provides the RESIDENT variable that defines the usual or temporary residence of a person in the household. If we link one record in 1865 to two different records in 1875 we consider the RESIDENT variable to see if it is possible that the two 1875 records represent the same person. If the two 1875 records have certain combinations of RESIDENT codes, we retain records according to the following order of precedence: permanent residents present on census day; permanent residents absent on census day; and finally, persons temporarily present on census day.
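The order of precedence can be expressed as a ranking over residence statuses. The sketch below uses descriptive string labels in a hypothetical `resident_status` field because the numeric RESIDENT codes, and the specific combinations of codes that trigger this rule, are not listed here.

```python
# Lower rank wins; labels are illustrative stand-ins for RESIDENT codes.
PRECEDENCE = {
    "permanent_present": 0,    # permanent resident, present on census day
    "permanent_absent": 1,     # permanent resident, absent on census day
    "temporary_present": 2,    # temporarily present on census day
}

def choose_record(candidates):
    """Pick the record to retain when two records may be the same person."""
    return min(candidates, key=lambda rec: PRECEDENCE[rec["resident_status"]])

# Hypothetical pair of 1875 records linked to one 1865 record.
candidates = [
    {"id": "visitor_entry", "resident_status": "temporary_present"},
    {"id": "home_entry", "resident_status": "permanent_absent"},
]

retained = choose_record(candidates)
```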
Patronymic naming schemes: Two different naming schemes were in place in Norway during the censuses of 1865, 1875, and 1900. Most commonly, individuals received a patronym, a component of a personal name based on the name of one's father. Wilson (son of William) and Carlsson (son of Carl) are common. Sigrid Håkonsdatter was the daughter of Håkon and she would retain her name upon marriage. Gradually, patronyms were replaced by a system of inherited family names, in which a man passed his surname on to his wife and children. The use of family names was rare prior to 1900 (except among the urban upper class) and the use of a fixed family name was not made compulsory by law in Norway until 1925.
In addition, a third name was often used in Norway, usually a farm name. This "surname" did not necessarily identify a family or a relationship; instead, it signified a place of residence. If farmer Ole Olsen Li moved from Li to another farm, such as Dal, he would then be known as Ole Olsen Dal.
In general, patronyms cause little difficulty for linking individuals. We truncated surnames before the patronymic suffix (Andersen to Ander) in order to minimize inconsistencies arising from variant spellings of suffixes or from migration. Although a small number of individuals may have changed their names during the transition from patronymic to family names, this transition typically occurred with the next generation and should have little impact on linking rates in the late 19th century. Males in both the linked male and couple files must match on first name, middle name, and surname, while females in the linked couple files must match on first name and middle initial.
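Suffix truncation of this kind might be sketched as below. Only the Andersen to Ander example comes from the text; the suffix list, the longest-suffix-first ordering, and the minimum stem length of two characters are assumptions made for illustration.

```python
# Hypothetical suffix list, longest first so "ssen" is tried before "sen".
PATRONYMIC_SUFFIXES = (
    "sdatter", "sdotter", "datter", "dotter", "ssen", "sson", "sen", "son",
)

def truncate_patronym(surname, min_stem=2):
    """Strip a patronymic suffix, leaving the stem used for comparison."""
    lowered = surname.lower()
    for suffix in PATRONYMIC_SUFFIXES:
        stem_length = len(lowered) - len(suffix)
        if lowered.endswith(suffix) and stem_length >= min_stem:
            return surname[:stem_length]
    return surname  # farm names and other surnames are left unchanged

stems = [truncate_patronym(s) for s in ("Andersen", "Hansen", "Hanssen", "Dal")]
```

Under these assumptions, variant spellings such as Hansen and Hanssen reduce to the same stem, while a farm name such as Dal passes through untouched.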
Allowable age matches in the Norwegian censuses: The 1865 census of Norway asked for each person's age, whereas the 1875 and the 1900 censuses recorded year of birth. Although we can construct the AGE variable from birth year (BIRTHYR), most of the time there is an offset of 1 year. A 20-year old in the 1865 Norwegian census could report a birth year of either 1845 or 1846 in the 1875 census, thereby making his age either 30 years or 29 years in the 1875 census. During the testing phase we found little difference in linking rates when using a 9-year or 10-year age difference. The final linked samples adjust the earlier sample age by 10 years when linking between 1865 and 1875, 25 years between 1875 and 1900, and 35 years between 1865 and 1900.
We initially applied the U.S. rule, allowing final links to occur as long as the later age fell within 7 years of the expected age. After evaluating these links using indirect measures of the false link rates, we concluded that a more conservative approach was necessary. We limited the final male datasets to links where the dissimilarity in age was no greater than +/- 3 years. The linked couple files allow an age window of +/- 5 years.
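The Norwegian age rule combines a fixed offset per census pair with a tolerance that depends on the sample type. A minimal sketch, with function and dictionary names chosen for illustration:

```python
# Years added to the earlier age for each census pair (from the text).
AGE_OFFSETS = {(1865, 1875): 10, (1875, 1900): 25, (1865, 1900): 35}

# Allowed deviation from the expected age, in years (from the text).
AGE_TOLERANCE = {"male": 3, "couple": 5}

def age_is_allowable(age_early, age_late, census_pair, sample_type):
    """True if the later age is within the allowed window of the expected age."""
    expected = age_early + AGE_OFFSETS[census_pair]
    return abs(age_late - expected) <= AGE_TOLERANCE[sample_type]

# A 20-year-old in 1865 recorded as 29 in 1875 is one year off the expected 30.
ok_male = age_is_allowable(20, 29, (1865, 1875), "male")
# A four-year gap is too wide for the male files but allowed for couples.
too_wide_male = age_is_allowable(20, 34, (1865, 1875), "male")
ok_couple = age_is_allowable(20, 34, (1865, 1875), "couple")
```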
Changing municipality boundaries in Norway: Usually, to link a person across two Norwegian censuses we require an exact match on municipality of birth. Municipal boundaries in Norway changed frequently, however, and a large number of new municipalities were formed between the 1865, 1875, and 1900 censuses. Boundary changes were most common in urban areas and in the southwestern part of the country. To improve linking accuracy, we combined newly formed municipalities with the municipalities from which they originated. Additional information on boundary changes is provided here.
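Combining new municipalities with their parents amounts to mapping each birthplace code to a harmonized code before requiring an exact match. The sketch below follows parent links until it reaches a municipality with no recorded parent; the municipality names and the mapping are invented for illustration.

```python
def harmonize_municipality(municipality, parent_of):
    """Follow parent links back to the municipality of origin."""
    seen = set()
    while municipality in parent_of and municipality not in seen:
        seen.add(municipality)  # guards against accidental cycles
        municipality = parent_of[municipality]
    return municipality

# Hypothetical splits: B was carved out of A, and C later out of B.
parent_of = {"B": "A", "C": "B"}

same_birthplace = (
    harmonize_municipality("C", parent_of)
    == harmonize_municipality("A", parent_of)
)
```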
Variable sampling density of the 1875 sample: The 1875 census sample includes 100% of households in large cities and the northernmost provinces and a 2% sample of households in the remainder of the country. As a result, linking rates vary geographically.
Construction of linking weights in Norway: We followed the U.S. method for constructing weights with two modifications. Weights are constructed separately for the 2% and 100% samples in 1875. In addition, residence status (whether a person is enumerated in their usual residence) is included as an additional weighting factor.
Large households in Norway: The Norwegian census samples include a small number of extremely large households. Many of these households contain multiple subfamilies, each beginning with a person identified as the head. For ease of use, we have split these large households into smaller subfamily units. The household serial numbers and the person number in the first census sample have been altered for persons in these households, but the original serial numbers and person numbers are stored in the variables "SERIAL_ORIG_1", "SERIAL_ORIG_2" and "PERNUM_ORIG_1". These changes have implications when used in conjunction with other variables. Be aware that the pointer variables MOMLOC, POPLOC, and SPLOC from the first census year only are based on the original person numbers and therefore should be used with "SERIAL_ORIG_1" and "PERNUM_ORIG_1". Also, several constructed household variables that contain descriptive household counts, such as NFAMS, were calculated before these extremely large households were divided for the linked data files.