Frequently Asked Questions (FAQ)
What is NAPP?
What are the sources of NAPP data?
Do enumeration rules differ between the different censuses?
How does NAPP add value to the original census data?
Where should a new user start?
How do I get access to NAPP data?
What are the restrictions on use?
Can the NAPP data be used for genealogical purposes?
What are microdata?
What is the subject content of NAPP data?
How is occupation/industry harmonized across various censuses?
What are "integrated variables"?
What are "unharmonized variables"?
What are "pointer variables"?
Does NAPP data have "weights"?
What does "universe" mean in the variable descriptions?
How do I obtain data?
What format are the data in?
What is the best way to use the extract system?
How long does a data extract take?
What if the samples are too big for me to handle?
How does "sample selection" work on the NAPP web site?
What does "Add to cart" mean?
Why can't I open the data file?
Is there a preferred statistical package for using the NAPP data?
Can I get the original data?
How is a record uniquely identified?
Using NAPP data
Are there tricky aspects of NAPP data to be particularly aware of?
What are the major limitations of the data?
Can I study multiple countries? One country?
There were questions asked in the census that I don't see in NAPP. Where are the data?
Can I find particular individuals in the NAPP data?
Using the variables page
Variables page menu
Variables page details
Using the data extract system
Your data cart
Why are some variables in my data cart preselected?
What is "Type"?
Extract request page
Extract definition: Data format
Extract definition: Data structure
Extract option: Customize sample sizes
Extract option: Select cases
Extract option: Attach characteristics
Extract option: Describe your extract
General information about the project
What is NAPP?
NAPP (North Atlantic Population Project) is a machine-readable database of historical complete-count and sample census data for countries of that region. These data have only recently become available for social science research, and they collectively comprise our richest source of quantitative information on the population of the North Atlantic world in the late nineteenth century.
To allow the comparative analysis of human behavior across these countries, the NAPP collaborators have harmonized the record layouts, coding schemes, and documentation for the different censuses. NAPP assigns uniform codes across all the censuses and brings relevant documentation into a coherent form to facilitate analysis of social and economic change. NAPP data are compatible with the existing IPUMS series of U.S. census samples and are the foundation for a long-term collaborative enterprise to reconstruct the population of this region from the mid-nineteenth century to the present. Scholars interested only in the United States are better served using IPUMS-USA, which is optimized for U.S. research.
What are the sources of NAPP data?
Creation of complete count census databases for research builds on the existing samples of historical censuses in all six countries. Complete count 1880/81 censuses of the United States, Great Britain, and Canada were transcribed in the 1980s and 1990s by the Church of Jesus Christ of Latter-day Saints (LDS) for genealogical purposes. LDS donated the data for academic use in exchange for cleaning and enhancing the data. The Norwegian, Swedish, Icelandic, Danish, and other Canadian censuses were digitized by academic researchers and archivists.
Our research and data partners in the participating NAPP countries are listed here. The Minnesota Population Center also has agreements with the LDS, who allow us to freely distribute the data to academic researchers in exchange for cleaning and enhancing the data.
Do enumeration rules differ between the different censuses?
Yes. Enumeration rules differ between the censuses, which has some implications for understanding patterns of household relationships.
A de jure census enumerates people at their permanent residence. Thus, people temporarily away from home are reported to the census enumerator by the people remaining at home. A de facto census enumerates people where they are found on census night. Norway and Iceland were unusual in using both rules for some censuses. This means some people are enumerated twice and identified as such in the data. Users should exclude one set of these duplicate records to obtain accurate population statistics. For Norway 1875-1910 and Iceland 1901 and 1910, the individuals who are enumerated twice are identified in the variable RESIDENT. More information on Norwegian enumeration practices can be found here.
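As a rough illustration of excluding one set of double-enumerated persons, the Python sketch below filters person records on RESIDENT. The three code values are hypothetical placeholders, not the actual RESIDENT codes; check the codes page for the samples you use before adapting it.

```python
# Hypothetical RESIDENT codes, for illustration only. The real values are
# listed on the RESIDENT codes page and may differ by sample.
PRESENT_RESIDENT = 1       # enumerated at home and present on census night
ABSENT_RESIDENT = 2        # listed at the permanent residence, but away
TEMPORARILY_PRESENT = 3    # listed where found, but resident elsewhere


def de_jure_population(persons):
    """Count each person once, at the permanent residence."""
    return [p for p in persons if p["RESIDENT"] != TEMPORARILY_PRESENT]


def de_facto_population(persons):
    """Count each person once, where found on census night."""
    return [p for p in persons if p["RESIDENT"] != ABSENT_RESIDENT]


persons = [
    {"PERNUM": 1, "RESIDENT": PRESENT_RESIDENT},
    {"PERNUM": 2, "RESIDENT": ABSENT_RESIDENT},
    {"PERNUM": 3, "RESIDENT": TEMPORARILY_PRESENT},
]
print(len(de_jure_population(persons)), len(de_facto_population(persons)))
```

Either filter removes one of the two overlapping sets; which one is appropriate depends on whether your analysis needs the resident or the present population.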
For Sweden, the censuses were not taken by census workers going out with a questionnaire and interviewing people in their homes on a particular day. Instead the censuses were taken by vicars and parish priests who made extracts from the already existing church examination records (1860-1890) and parish books (1900-1945).
How does NAPP add value to the original census data?
The process of integration itself adds value to the data by fully documenting all codes and compiling all variable documentation in a hyperlinked web format. But we do many other things as well:
NAPP creates a consistent set of constructed variables for all samples. Most important are the family interrelationship "pointer" variables that indicate the location within the household of every person's mother, father, and spouse.
NAPP is working on missing data allocation and consistency edits on important variables. This kind of data editing performs logical fixes when possible or finds a donor record that shares key characteristics with the person in question and substitutes their response for the missing variable. Allocation is a more statistically sound way to deal with missing data than simply excluding such cases from analyses.
Where should a new user start?
The natural starting point is the "Select Data" or "Browse and Select Data" links on the left navigation bar and the top banner. These links open the variables page: the primary tool for exploring the contents of NAPP. By default, the variables page displays one variable group at a time for all samples in the data series. You can change the view option to show all groups simultaneously, but the page can get very large and slow to load. You can filter the information at any point to include only the samples of interest to you ("Select samples"). Initially, the variables screen is set to display the integrated variables. Select the "view unharmonized variables" button to browse the variables that are specific to individual samples. More detailed information on using the variable menu is available.
When you select samples, the page will display only variables present in those censuses. An "x" indicates the availability of a variable for a particular sample.
On the variables page, clicking on a variable name brings up its documentation. The information about the variable is contained on a number of tabs. The default tab is the brief description of the variable. More information is usually available on the "comparability" tab, which discusses international and intra-national comparability issues. The "questionnaire text" tab compiles all the questionnaire text and instructions pertaining to the census question for every sample. The variables page also has direct links to the codes page for each variable (they are also accessible as a tab in the variable description). The codes page shows the codes and labels for the variable, and the availability of categories across samples. These categories can suggest the types of research possible with a given sample.
Throughout the variable documentation system there are buttons to "Add to cart." Any variables you select in this way are put in your data cart to include in a data extract. Your selections only last for the current web session.
The Data Cart in the upper right keeps track of your variable and sample selections. Once you have made some selections you can click on "View Cart" to review your choices. If you have selected variables and samples you can enter the data extract system. To make a data extract you must be registered to use NAPP. The instructions for the extraction system are here.
How do I get access to NAPP data?
Access to the documentation is freely available without restriction; however, users must apply for access to the data. The application system requires a description of an applicant's proposed research and asks for the user's institutional affiliation and other information to verify identity. Every application is individually reviewed by project staff. We may ask for additional information if we are uncertain about the suitability of the intended research. Applicants are required to agree to a number of conditions to use the data. Access to the system enables a user to extract data from any country in the database. To apply for access go here.
What are the restrictions on use?
Our agreements with the various countries and the Church of Jesus Christ of Latter-day Saints (LDS) require that NAPP data be used only for scholarly and educational purposes. Commercial use of the data is prohibited. To gain access, applicants are required to agree to a number of conditions that amount to a legal contract. Chief among them are a prohibition against redistributing the data, a requirement to cite NAPP data appropriately in publications and research reports, and a prohibition against using the data for genealogical purposes.
Can the NAPP data be used for genealogical purposes?
No. NAPP data cannot be used for genealogical purposes.
The original transcription of the United States 1880, British 1881, and Canadian 1881 censuses was undertaken by the Church of Jesus Christ of Latter-day Saints (LDS), which provided the data for social scientific research with the stipulation that it not be available for genealogical research. Similarly, the Swedish censuses were transcribed by the National Archives of Sweden, which charges for access to the data for genealogical purposes. We screen all applications for use to ensure that genealogists who wish to use the data are directed to a more appropriate site.
The LDS provide a free searchable interface to the data at Family Search. Please use this website for searching these three censuses. The Norwegian Historical Data Center maintains a searchable interface to the Norwegian censuses in Norwegian and English. A searchable interface to the Swedish censuses is available.
Our continuing provision of the data for social science research relies on these provisions against genealogical use of the data being upheld. Misusing the data by violating any of the conditions detailed above constitutes a violation of the user agreement and may lead to professional censure, loss of employment, or civil prosecution under relevant national and international laws, and to sanctions against your institution, at the discretion of the University of Minnesota and the other institutions collaborating on the North Atlantic Population Project.
Further resources for genealogical research
For further resources on genealogical research, please see the National Archives of the following participating countries:
Other limited genealogical information can be found at Ancestry, Rootsweb, 1901 Census of England and Wales, and Index to the 1901 Census of Canada. Please note that links to these sites do not imply an endorsement of their contents by the North Atlantic Population Project.
What are microdata?
Census microdata are composed of individual records containing information collected on persons and households. The unit of observation is the individual. The responses of each person to the different census questions are recorded in separate variables.
Microdata stand in contrast to more familiar "summary" or "aggregate" data. Aggregate data are compiled statistics, such as a table of marital status by sex for some locality. There are no such tabular or summary statistics in the NAPP data.
NAPP data are microdata, which means that they provide information about individual persons and households. This makes it possible for researchers to create tabulations tailored to their particular questions. Since NAPP includes nearly all the detail originally recorded by the census enumerations, users can construct a great variety of tabulations interrelating any desired set of variables. The flexibility offered by microdata is particularly important for historical research because the aggregate tabulations produced by the national statistical agencies are often not comparable across time, and, until recently, the subject coverage of census publications was limited.
What is the subject content of NAPP data?
The data series includes information on a broad range of population characteristics, including fertility, nuptiality, life-course transitions, immigration, internal migration, labor-force participation, occupational structure, education, ethnicity, and household composition.
Users should note that not all the subjects enumerated in the censuses were transcribed for the machine-readable datasets available through NAPP. The British, Canadian, and United States data were transcribed by the Church of Jesus Christ of Latter-day Saints, which omitted some variables. For example, in the United States information on school attendance, literacy, months unemployed, and sickness was omitted. Similar information on school attendance and "infirmities" was omitted in the transcription of the 1881 Canadian census.
How is occupation/industry harmonized across various censuses?
Occupation and industry are among the most important variables for analyses of comparative social and economic structure because these censuses provide few alternative indicators of socioeconomic status or labor-force participation. The national statistical agencies employed quite different classification systems. With access to the individual-level occupational responses for all people in the censuses, we have coded occupations into a common classification system based on the HISCO classification scheme. In Great Britain, Norway, and the United States, we also have coded occupations into domestic coding schemes.
What are "integrated variables"?
Integration -- or "harmonization" -- is the process of making data from different censuses and countries comparable. For example, most censuses ask about marital status; however, they differ both in their classification schemes (one census might recognize only a general category of "married," while another might distinguish between civil and religious marriages) and in the numeric codes assigned to each category ("divorced" might be coded as a "4" in one census and as a "2" in another). To create an integrated variable for marital status we recode the marital status variable from each census into a unified coding scheme that we design. Most of this work is carried out using correspondence tables.
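The recoding step can be pictured with a small Python sketch of a correspondence table. The sample names, source codes, and unified codes below are invented for illustration and are not the actual NAPP marital status codes.

```python
# Hypothetical unified coding scheme (not the real NAPP codes).
UNIFIED_LABELS = {1: "married", 2: "never married", 3: "widowed", 4: "divorced"}

# Hypothetical correspondence tables: one per census, mapping that census's
# original code to the unified code.
CORRESPONDENCE = {
    "census_a": {1: 2, 2: 1, 3: 3, 4: 4},   # here "divorced" was coded 4
    "census_b": {0: 2, 1: 1, 2: 4, 3: 3},   # here "divorced" was coded 2
}


def harmonize(sample, source_code):
    """Translate a census-specific code into the unified scheme."""
    return CORRESPONDENCE[sample][source_code]


print(UNIFIED_LABELS[harmonize("census_a", 4)])
print(UNIFIED_LABELS[harmonize("census_b", 2)])
```

Both calls resolve to the same unified category even though the two censuses used different numeric codes for it.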
Because some censuses provide more detail than others, a coding scheme that reduced variables down to the lowest common denominator across all samples would inevitably lose important information. As a result, many NAPP integrated variables use composite coding schemes. The first one or two digits of the code provide information available across all samples; the next one or two digits provide additional information available in a broad subset of samples. Finally, trailing digits provide detail only rarely available. All meaningful detail in the original enumerations is therefore available to researchers if they need it, but they can confine their attention to the less-detailed digits if they wish.
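A minimal Python sketch of working with the less-detailed digits follows. It assumes a hypothetical four-digit composite code whose first two digits carry the general category; the number of digits differs by variable, so consult the codes page for the variable in question.

```python
def general_code(detailed_code, detail_digits=2):
    """Drop the trailing detail digits of a composite code.

    detail_digits is an assumption about the variable's layout: it is the
    number of trailing digits that carry the extra detail.
    """
    return detailed_code // 10 ** detail_digits


# Two hypothetical detailed codes that share the same general category.
print(general_code(2110), general_code(2120))
```

Collapsing codes this way gives categories that are comparable across all samples containing the variable, at the cost of the extra detail.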
Another component of integration is the variable documentation. The documentation aims to highlight important comparability issues that are not self-evident from the coding structure for the variable. A general comparability discussion emphasizes issues for international comparisons, and country-specific discussions note comparability concerns when making intra-national comparisons over time. NAPP staff must exercise their judgment in composing this documentation -- there is no formula for it. But users need not depend totally on us: the variable documentation provides links to both English-language and original-language census questionnaires and instructions. This material is readily available on every variable description page through the link to "enumeration text."
What are "unharmonized variables"?
All variables in NAPP are processed to varying degrees. They are documented in English and associated with the relevant sections of the original census instructions. The data are analyzed and often recoded for technical and other considerations. But not all NAPP variables are "integrated" for international and inter-temporal comparability.
The regular NAPP variables -- the ones on the main variable availability screen -- are integrated: the same codes and labels apply across all the samples that contain the variable. Unharmonized variables, in contrast, are unique to each sample. They generally correspond to the variables in the original datasets submitted to the NAPP project by the various countries. The unharmonized variable codes and labels are not consistent across samples, but the variables have been processed to make them more regularized. Stray values are recoded; all data are converted to numeric values; data universes are empirically determined; unknown and NIU categories are coded consistently. In addition, each unharmonized variable is assigned a unique name in the NAPP database, and the value labels and other variable documentation are written in English.
Many unharmonized variables serve as inputs for the integrated NAPP variables. For example, underlying the integrated variable for marital status, MARST, are numerous unharmonized variables, typically one per sample -- CA81A428 for Canada 1881, GB81B409 for Scotland 1881, and so forth. Each integrated NAPP variable description has a link to the unharmonized variables that served as its input. The unharmonized variables are also accessible in a comprehensive list using the button near the top of the variables page. The variable description for each unharmonized variable lists the integrated variables for which it provides the source data.
The unharmonized variables can be included in data extracts. Thus researchers can get both the integrated and unharmonized forms of specific variables (for example, the internationally comparable employment status variable, OCC, and the employment status variable specific to 1900 Norway, NO00A434). Perhaps more importantly, the unharmonized variables give researchers access to data that NAPP has not been able to incorporate in an internationally comparable manner.
What are "pointer variables"?
The NAPP family interrelationship "pointer" variables indicate the location within the household of every person's mother, father, and spouse. Nearly all samples indicate the relationship of each person to the head of household, but it is much harder to relate individuals to persons other than the head (for example, grandchildren to children, sons-in-law to daughters, or unrelated persons to each other). We have developed a complex core algorithm to make such connections, and we customize it as needed to account for peculiarities of specific samples. The pointer variables are called MOMLOC, POPLOC, and SPLOC in the NAPP system. The variables MOMRULE, POPRULE, and SPRULE indicate the conditions under which a specific link was made. The parental pointer variables identify social parents, not strictly biological ones.
The pointer variables make it possible to construct individual-level variables representing the characteristics of co-resident persons, such as occupation of spouse, age of mother, or educational attainment of father. The data extraction system can perform this step for you. The "Variable options" step includes a feature to "Attach characteristics" of other persons in the household based on the pointer variables. These attached characteristics appear as new variables in your data extract. For maximum flexibility, you can also do this matching yourself. You need to include the serial and person ID variables (SERIAL and PERNUM) in your extract, as well as the pointer variables themselves, to perform the necessary data manipulations.
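For users who do the matching themselves, the Python sketch below shows one way to attach a spouse's age using SPLOC. It assumes that SPLOC holds the spouse's PERNUM within the same household and is 0 when no spouse is present, and it includes SAMPLE in the key so pooled extracts match correctly. The output name AGE_SP and the example records are illustrative only.

```python
def attach_spouse_age(persons):
    """Return copies of the person records with the spouse's AGE attached.

    Assumes SPLOC is the spouse's PERNUM in the same household (0 = none).
    """
    by_key = {(p["SAMPLE"], p["SERIAL"], p["PERNUM"]): p for p in persons}
    result = []
    for p in persons:
        spouse = by_key.get((p["SAMPLE"], p["SERIAL"], p["SPLOC"]))
        # AGE_SP is a hypothetical name for the attached characteristic.
        result.append({**p, "AGE_SP": spouse["AGE"] if spouse else None})
    return result


# One invented household: a married couple and a child.
household = [
    {"SAMPLE": 1, "SERIAL": 1, "PERNUM": 1, "AGE": 40, "SPLOC": 2},
    {"SAMPLE": 1, "SERIAL": 1, "PERNUM": 2, "AGE": 38, "SPLOC": 1},
    {"SAMPLE": 1, "SERIAL": 1, "PERNUM": 3, "AGE": 10, "SPLOC": 0},
]
for person in attach_spouse_age(household):
    print(person["PERNUM"], person["AGE_SP"])
```

The same pattern works for MOMLOC and POPLOC by swapping the pointer variable and the characteristic being attached.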
Does NAPP data have "weights"?
Most of the NAPP data are complete-count data, so weights are not required for generating accurate population statistics. However, the 1875 census of Norway and the 1871 and 1901 censuses of Canada are samples and have person weights (PERWT) assigned to them. For all complete-count data, PERWT is set to 1. Users who are merging NAPP data with IPUMS data should keep in mind that PERWT is referred to as WTPER in IPUMS International.
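As a simple illustration of how the weights are used, the Python sketch below estimates a population total by summing PERWT rather than counting records. The weight value of 20 in the sample records is invented for the example.

```python
def weighted_count(persons):
    """Estimate the number of persons represented by a set of records."""
    return sum(p["PERWT"] for p in persons)


# Complete-count data: every record represents one person.
complete_count = [{"PERWT": 1}, {"PERWT": 1}, {"PERWT": 1}]

# Sample data with a hypothetical weight of 20 persons per record.
sample = [{"PERWT": 20}, {"PERWT": 20}]

print(weighted_count(complete_count), weighted_count(sample))
```

Because PERWT is 1 in complete-count data, the same code gives correct totals for both kinds of sample, which is convenient when an extract pools them.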
What does "universe" mean in the variable descriptions?
The universe is the population at risk of having a response for the variable in question. In most cases, these are the households or persons to whom the census question was asked as reflected on the census questionnaire. For example, children are not usually asked employment questions, and men and children are not asked fertility questions. In some instances, however, the universe suggested by the census questionnaire is not accurate because of post-enumeration data processing. NAPP empirically verifies universes to obtain the most accurate statement possible of the universe. In some cases, there is no independent information in a sample to verify a universe.
Cases that are outside of the universe for a variable are labeled "NIU (not in universe)" on the codes page. Differences in a variable's universe across samples are a common data comparability issue.
The universes will not always be free of apparently erroneous cases. Some persons or households that should not have answered the question did, and some that should have answered may be included in the "NIU" category. But until we perform comprehensive data editing and allocation, we do not know whether the variable in question is in error, or whether the variables that define the universe (for example, age or employment status) are incorrect.
Additionally, users should note the following definitions of the universe for each particular country:
Canada: Individual-level data on the Indian population in the territories is not available. The published census volumes provide aggregate information on these populations who were not individually enumerated. Thus, the published population totals for Canada in 1881 exceed the number of individuals in the 1881 Canadian census data file.
Denmark: The census includes the complete population of the Kingdom of Denmark.
Great Britain: The census includes the population of England, Wales, and Scotland, including "Islands in the British Seas" (viz. the Channel Islands and the Isle of Man). Ireland is not included.
Iceland: The census includes the population of the main island of Iceland, and off-shore islands.
Norway: The census includes the complete population of Norway, including indigenous minorities in the three northern provinces of Troms, Finnmark and Nordland.
Sweden: The Swedish census was fundamentally different from censuses in other countries. In Sweden the censuses were not taken by census workers going out with a questionnaire and interviewing people in their homes. Instead, the censuses were taken by vicars and parish priests who made extracts from the already existing parish books. The parish books were updated continuously by the vicar or parish priest. They kept track of persons, families, and households, including births, marriages, and deaths, and recorded whenever a person moved within or between parishes. The priest also recorded a person's attendance at the church examinations, their knowledge of Christian teachings, their ability to read and write, and many other things. The parish books were kept in all of Sweden except in the city of Stockholm, where the censuses were based on the tax census.
United States: All states, territories that eventually became states, and the District of Columbia, are included. However, the census enumeration excluded "Indians not taxed," and thus does not contain Indian Territory, contiguous with the present-day state of Oklahoma. Alaska and Hawaii were also excluded from the enumeration.
How do I obtain data?
All NAPP data are delivered through our data extraction system. Instructions for the data extraction system are available here. Users select the variables and samples they are interested in, and the system creates a custom-made extract containing only this information. The system will pool data from multiple samples into a single data file; in fact, it was primarily designed for this purpose.
Data are generated on our server. The system sends out an email message to the user when the extract is completed. The user must download the extract and analyze it on their local machine. The extract system is accessed here. If you are not yet registered, you can log in as "guest" and examine the interface.
What format are the data in?
NAPP produces fixed-column ASCII data. Data are entirely numeric. By default, the extraction system rectangularizes the data: that is, it puts household information on the person records and does not retain the households as separate records. No information is lost, and this is the format preferred by most researchers; however, it can be overridden in the extract system to yield hierarchical data.
In addition to the ASCII data file, the system creates a statistical package syntax file to accompany each extract. The syntax file is designed to read in the ASCII data while applying appropriate variable and value labels. SPSS, SAS, and STATA are supported. You must download the syntax file with the extract or you will be unable to read the data. The syntax file requires minor editing to identify the location of the data file on your local computer. Alternatively, you can request your data formatted for SPSS (.sav), SAS (.sas7bdat), STATA (.dta), or as a comma delimited file (.csv) on the Extract Request page.
A codebook file is also created with each extract. It records the characteristics of your extract and should be downloaded for record-keeping.
All data files are created in gzip compressed format. You must uncompress the file to analyze it. Most data compression utilities will handle the files.
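For readers working outside the supported packages, the Python sketch below shows the general idea of reading a gzipped fixed-column file. The column positions in LAYOUT are invented; the real positions for your extract are given in the syntax and codebook files that accompany it.

```python
import gzip

# Hypothetical layout: variable name -> (start, end) character positions,
# zero-based with an exclusive end. Take the real positions from the
# codebook or syntax file delivered with your extract.
LAYOUT = {"SAMPLE": (0, 4), "SERIAL": (4, 12), "PERNUM": (12, 15)}


def parse_line(line, layout=LAYOUT):
    """Slice one fixed-column record into integer-valued variables."""
    return {name: int(line[start:end]) for name, (start, end) in layout.items()}


def read_extract(path, layout=LAYOUT):
    """Read a gzip-compressed fixed-column ASCII file into a list of dicts."""
    with gzip.open(path, "rt", encoding="ascii") as handle:
        return [parse_line(line.rstrip("\n"), layout) for line in handle]


print(parse_line("188000000042003"))
```

Converting every field with int() relies on the data being entirely numeric, as described above. Value labels are not applied here; they remain in the syntax and codebook files.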
What is the best way to use the extract system?
The data extraction system is a flexible tool. There is no need to download variables or samples you don't expect to use for your current analysis. The system records every extract you make. You can reload and modify an old extract, dropping or adding variables or samples. Go to the "Download or Revise Extracts" page and click on the "Revise" link.
Since most of the NAPP samples are complete count data, if you choose lots of variables in a large sample or several samples at the same time, you can make extremely large extracts that will be cumbersome to analyze. The extract system is designed to minimize this problem. The "Extract Request" screen predicts the size of your data extract and provides options for reducing the size of your dataset. The system will inform you if your extract violates the maximum size allowable.
Some variables are preselected for you. They identify the sample, in case your extract pools data from multiple sources, and provide other technical information. Some of the samples are truly weighted, with different records representing more persons in the population than others, so the system preselects the person weight variable (PERWT).
How long does a data extract take?
The time needed to make an extract differs depending on the number and size of samples requested, whether case selection is performed, and the load on our server. Extracts can take from a few minutes to a couple of hours or more. The system sends an email when the extract is completed, so there is no need to stay active on the NAPP site while the extract is being made.
What if the samples are too big for me to handle?
It is possible to make extracts that are extremely large. Extracts over a certain size will not be allowed by our server. If you have a legitimate reason to make larger extracts, email us to request a higher threshold.
Roughly speaking, the number of records times the number of columns of data requested yields the file size of your extract in bytes.
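This rule of thumb is easy to apply before submitting a request. The Python sketch below multiplies records by columns; the record and column counts are made-up figures, and the result ignores line terminators and compression.

```python
def estimated_bytes(n_records, n_columns):
    """Rough uncompressed size of a fixed-column extract, in bytes."""
    return n_records * n_columns


# Hypothetical request: 50 million person records, 120 columns of data.
size = estimated_bytes(50_000_000, 120)
print(f"{size / 1e9:.1f} GB")
```

A figure of this size suggests trimming variables, selecting cases, or drawing a smaller sample before downloading.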
The "extract request" screen predicts the size of your data file. If it is too large for your purposes, there are several things you can do.
1) Select fewer variables and/or samples.
2) Use the "Select cases" feature on the extract page to include only the particular kinds of people or households you want to include in your data. (Note: the estimated file size does not include case selection in its calculation, so this number will not change.)
3) Use the "Customize sample sizes" feature to draw smaller subsets of some or all of the samples in your extract. Entering numbers in any of the cells in the right half of the screen will tell the system to systematically extract the corresponding number of households from the selected sample(s). The households will be drawn evenly from across the entire country, and the sample weights in the data will be adjusted appropriately.
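The effect of drawing a smaller subset can be pictured with the Python sketch below. It is a generic systematic sample of households with a matching weight adjustment, written for a single sample, and is not a description of the procedure the extract system actually runs.

```python
def systematic_household_sample(persons, n_households):
    """Keep every person in evenly spaced households and inflate PERWT.

    Assumes all records come from one sample, so SERIAL alone identifies
    a household.
    """
    serials = sorted({p["SERIAL"] for p in persons})
    if n_households >= len(serials):
        return [dict(p) for p in persons]
    step = len(serials) / n_households
    chosen = {serials[int(i * step)] for i in range(n_households)}
    # Each retained household now stands in for `step` households.
    return [
        {**p, "PERWT": p["PERWT"] * step}
        for p in persons
        if p["SERIAL"] in chosen
    ]


# Ten invented one-person households from complete-count data (PERWT = 1).
households = [{"SERIAL": s, "PERNUM": 1, "PERWT": 1} for s in range(1, 11)]
subset = systematic_household_sample(households, 5)
print(len(subset), sum(p["PERWT"] for p in subset))
```

Summing the adjusted weights recovers the original population size, which is the purpose of adjusting the weights when a subset is drawn.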
How does "sample selection" work on the NAPP web site?
When a user first enters the variable documentation system, all samples are selected by default. Every variable in the system will display on all relevant screens.
Users can filter the information displayed by selecting only the samples of interest to them. Only the variables available in one of the selected samples will appear in the variable lists. The integrated variable descriptions and codes pages will also be filtered to display only the text and columns corresponding to the selected samples. Sample selections can be altered at any time in your session. Selections do not persist beyond the current session.
When a user enters the extract system after selecting samples while browsing the variables, the extract system will pre-select those samples.
What does "Add to cart" mean?
While browsing variables in the documentation system, you can select them to include in a data extract, sending them to your data cart. During the data extract checkout process these variables are labeled "Include in extract" and unchecking the box will deselect the variable. Once in the extract system, you can return to the variable list to make more selections.
Why can't I open the data file?
There are two likely explanations:
1) The data produced by the extract system are gzipped (the file has a .gz extension). You must use a data compression utility to uncompress the file before you can analyze it.
2) You cannot open the default ASCII data file directly with a statistical package. The extract system generates a syntax (set-up) file to read the ASCII file into your statistical package. You must download the syntax file along with the data file from our server, open the syntax file with your statistical package, and edit the path in the syntax file to point to the location of the data on your local computer. Now you are ready to read in the data. More detailed instructions for downloading and reading the data are available here.
Alternatively, you can request your data formatted for SPSS (.sav), SAS (.sas7bdat), STATA (.dta), or as a comma delimited file (.csv) on the Extract Request page.
Is there a preferred statistical package for using the NAPP data?
NAPP supports SPSS, SAS, and STATA. By default, the extract system generates an ASCII data file (.dat) and provides SPSS, SAS, and STATA syntax files with which to read the data. You can request your data formatted for SPSS (.sav), SAS (.sas7bdat), STATA (.dta), or as a comma delimited file (.csv) on the Extract Request page.
Can I get the original data? [top]
In accordance with our agreements with the host countries and the Church of Latter-Day Saints (LDS), NAPP does not distribute the original samples provided by our international partners. We do provide access to unharmonized variables, but even these have undergone processing. We clean up stray codes, translate the labels into English, document the variable, and sometimes perform additional programming. We take care not to lose meaningful information in these transformations, so researchers retain access to the full power of the original information in the input variables.
How is a record uniquely identified? [top]
Three variables constitute a unique identifier for each record in the NAPP: SAMPLE, SERIAL, and PERNUM (sample, household index number, and person index within household). SERIAL is a unique household identifier.
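As an illustration, the three-variable identifier can be treated as a composite key. The sketch below assumes the extract has already been read into a list of dictionaries keyed by variable name; the sample codes and serial numbers are made up for the example, and SERIAL is assumed to be unique only within a sample.

```python
from collections import Counter

def record_key(row):
    """Return the (SAMPLE, SERIAL, PERNUM) tuple that identifies one person record."""
    return (row["SAMPLE"], row["SERIAL"], row["PERNUM"])

def duplicate_keys(rows):
    """Return every key that occurs more than once; an empty list means all keys are unique."""
    counts = Counter(record_key(row) for row in rows)
    return [key for key, n in counts.items() if n > 1]

# Toy records with made-up values (not real NAPP codes).
rows = [
    {"SAMPLE": "A", "SERIAL": 1, "PERNUM": 1},
    {"SAMPLE": "A", "SERIAL": 1, "PERNUM": 2},
    {"SAMPLE": "B", "SERIAL": 1, "PERNUM": 1},  # same SERIAL, different sample
]
print(duplicate_keys(rows))
```

A check like this is a quick way to confirm that a merge or append of several extracts has not duplicated records.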
Using NAPP data
Are there tricky aspects of NAPP data to be particularly aware of? [top]
It is important to examine the documentation for the variables you are using. The codes and labels for variable categories do not tell the whole story. In other words, the syntax labels are not enough. Read the variable comparability discussions for the samples you are interested in. Important comparability issues should be mentioned there. If a variable is of particular importance in your research (for example, it is your dependent variable), you are also well served to read the enumeration text associated with it. This text is linked directly to the variable, so it is quite easy to call it up.
By default, the extract system rectangularizes the data: it puts the household information on the person records and drops the separate household record. This can distort analyses at the household level, because the number of observations is inflated to the number of person records. To get the proper number of household observations, you can either select the first person in each household (PERNUM = 1) or select the "hierarchical" box in the extract system. The rectangularizing feature also drops any vacant households, which are otherwise available in some samples. Despite these complications, the great majority of researchers prefer the rectangularized format, which is why it is the default output of our system.
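The first-person approach can be sketched as follows. This is only an illustration, assuming a rectangular extract already loaded as a list of dictionaries; the records are toy values, not real data.

```python
def household_rows(person_rows):
    """Keep one row per household: the person whose PERNUM is 1."""
    return [row for row in person_rows if int(row["PERNUM"]) == 1]

# Toy rectangular extract: two households with three and two members.
persons = [
    {"SAMPLE": "A", "SERIAL": 1, "PERNUM": 1},
    {"SAMPLE": "A", "SERIAL": 1, "PERNUM": 2},
    {"SAMPLE": "A", "SERIAL": 1, "PERNUM": 3},
    {"SAMPLE": "A", "SERIAL": 2, "PERNUM": 1},
    {"SAMPLE": "A", "SERIAL": 2, "PERNUM": 2},
]
households = household_rows(persons)
print(len(persons), "person records,", len(households), "households")
```

Household-level tabulations run on the filtered rows count each household once rather than once per member.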
What are the major limitations of the data? [top]
The data are composed entirely of individual person and household records from population censuses. There are no macroeconomic, business, or aggregate statistics. We do not deliver the published statistics from the population censuses.
Most of the NAPP samples are complete count census data, resulting in larger extract sizes. If you are interested in obtaining a sample of the complete count data, change the "Sample size" near the end of the data extract process. More information on dealing with large extract size can be found here.
Can I study multiple countries? One country? [top]
NAPP is designed to facilitate cross-national and cross-temporal research, but there is no restriction against single-country studies. Data extracts can contain records from every country in the entire data series, or records from only a single country. Researchers interested in studying only the United States, however, would be best served by going to IPUMS-USA, which is optimized for U.S. research and has greater temporal depth.
There were questions asked in the census that I don't see in NAPP. Where are the data? [top]
The Church of Latter-Day Saints (LDS), in collaboration with local genealogical societies, laboriously digitized the censuses of Great Britain, Canada and the United States, to provide a resource for genealogical research. Some of the variables have not been digitized by the LDS and are not included in the NAPP datasets. Also, some countries did not supply variables for every census question. The census responses may never have been processed or digitized. It is also possible we have the data, but have insufficient metadata to make it available at this time. When we add a sample, we harmonize only the core group of variables that we think most researchers would desire. Other variables can be accessed through the unharmonized variables.
Can I find particular individuals in the NAPP data? [top]
Yes. Samples will contain names (NAMELAST, NAMEFRST) and addresses (ADDRESS). However, some data are only samples, so there is no guarantee any given individual will be in the dataset.
Please note that NAPP data can be used only for academic research. Use of the data for genealogy is expressly prohibited in the user license agreement to which all persons must agree.
Using the variables page
Variables page menu [top]
Use the left side of the menu to browse variables:
Household: household variables by group
Person: person variables by group
A-Z or Sample: integrated variables by letter / unharmonized variables by sample
Search: display only variables that contain specified text in particular fields
Use the buttons and links on the right side of the menu to:
Select Samples: limit the display of variable information to selected samples
View . . . Variables: toggle between viewing integrated and unharmonized variables
Options: alter how the variable list is displayed or get help for this page
Variables page details [top]
The variables page allows you to browse integrated and unharmonized variables while limiting and controlling how the information is displayed.
When you "Select samples" you limit the variable list to display only variables that are available in at least one of those samples. But the effect of selecting samples extends into all the variable descriptions and codes pages you can access through the variable system. Only information relevant to your selected samples will be displayed in any context while you browse the variables. You can change your sample selections at any point.
Selecting samples is a good practice when exploring NAPP because the amount of information can be unwieldy. On the other hand, sometimes you need to see everything to determine what kind of research is possible using the database.
"Search" lets you specify search terms for specific fields of variable metadata. The system will return a list of variables that include any of the search terms you indicate. Both integrated and unharmonized variables can be searched.
The final choices are "Options" and "Help." The "Options" item brings up a screen that offers a number of choices regarding the display of the variable list. Each selection has a default choice.
Use short country codes / Use long country codes
Switch between the 2-letter country abbreviations and longer abbreviations. The short codes are the default.
View one group / View all groups together
Switch between viewing one variable group at a time and viewing all variable groups on one screen. Unless you have a limited number of samples selected, your browser may be slow to display all groups. The default view is one group at a time.
Show availability detail / Show availability summary
Switch between displaying the full sample-specific availability matrix, and a view that only displays the total number of samples that contain each variable. Both views only display or sum the samples that the user has selected in "Select samples." The default view is the detailed availability information. This option is disabled while viewing unharmonized variables.
View available variables / View all variables
Switch between a view that displays only the variables present in at least one of your selected samples, and a view that displays every variable, including those that are not available. The default view is to display only available variables.
Use long unharmonized variable names / Use short unharmonized variable names
Switch between descriptive names for the unharmonized variables (up to 16 characters) and cryptic 8-character names. The default view is to display long names.
Samples are displayed oldest to newest / Samples . . . newest to oldest
Display the samples columns indicating variable availability in chronological order or reverse chronological order. The default is oldest to newest.
The Variable List
As you browse the variables, they are displayed in a list containing a number of columns. The variable name links to the variable description, which includes detailed comparability discussions, universes, and enumeration text. The variable codes -- and their associated labels -- can be accessed directly using the "codes" links. The "type" column indicates if it is a person or household variable. In some contexts, like the alphabetic view, the two types are pooled together.
The area to the right of the "codes" column differs between integrated and unharmonized variables. For integrated variables, the default view displays a column for every sample that the user chose in "Select samples." By default, all samples are selected. The country abbreviation and last two digits of the sample year identify each sample at the top of every column. Hover over the country code with the mouse to see the full country name. If a variable is available in a given sample, an "x" is printed in that column.
The unharmonized variables by definition are only available in a single sample; therefore, they do not require detailed availability information.
In the column labeled "Add to cart", on the far left, each variable has a yellow circle with a "+". Click a circle to add that variable to your data cart (the circle turns green when you hover over it, indicating that the variable can be selected). Once clicked, the icon changes to a checked box, indicating that the variable is in your data cart. To remove the variable from your data cart, simply click the checkbox.
Using the data extract system
Your data cart [top]
You cannot create data from the extract system unless you are a registered user. If you are not registered, you must apply for access.
At the top right corner of the variables page is a summary of your data cart. This box displays the number of variables and samples you have selected. Clicking the yellow circle next to a variable places it in your data cart. You can view your data cart at any time by clicking "View Cart". The "View Cart" link only becomes operative when you have selected a variable or sample.
The data cart lists the variables preselected by the extract system as well as any variables you selected while browsing the documentation. As with the variable selection page, you can remove variables from your extract in this step by clicking the checkbox next to the variable in the "Add to cart" column. If you chose a variable but subsequently altered your sample selections in such a way that the variable is no longer available, it is indicated by an "i" icon.
The data cart also includes record type, links to codes pages, and sample availability for the variables in your cart.
Buttons are provided to return to the variable list to make more selections or to alter your sample choices. If you return to the variable list, click on "View Cart" again to return to the data cart.
When you are satisfied with your data selections, click "Create data extract" to finalize your extract request.
Why are some variables in my data cart preselected? [top]
Certain variables appear in your data cart even if you did not select them, and they are not included in the constantly updated count of variables in your data cart.
Unless you are absolutely certain you will not need one of these variables, we recommend that you not remove them from your data cart.
What is "Type"? [top]
The "Type" column on the variables selection pages and in your data cart indicates the record type of the variable. The variables with a "P" are from the person record, and the variables with an "H" are from the household record. Data at the household level pertain to each person in the household, and are identical on each person record within a household in the rectangular data file.
Extract request page [top]
When you click "Create data extract" in the Data Cart, you come to the Extract Request page. All of the actions on this page are optional. If you wish, you can simply hit the "Submit" button and create your data extract. You will be prompted to log in if you have not done so already.
The page summarizes your data extract and provides a number of options for customizing it. A link at the top expands to show the samples you selected. If any samples have notes associated with them, a message will appear on the samples bar to encourage you to review that information. Click the appropriate links to go back to the variable browsing and sample selection pages to alter your choices. You return to the extract request page via the data cart, where you can review the availability matrix for selections and easily drop variables by unchecking them.
All data extracts include a text data file (fixed-width format), along with Stata, SPSS, and SAS syntax files to load those data. On this page, you can elect to receive the data in an alternative format.
A separate link lets you choose the preferred data structure for your extract: rectangular or hierarchical. Rectangular format is the default.
Another row on the page estimates the size of your extract. If the estimated size is too large, click on the link to reduce extract size. Two of the methods for reducing the size of extracts involve options buttons on the lower half of the extract request page.
When you submit an extract, there will be a delay ranging from minutes to hours, depending on the size of the job. You do not need to wait on our site for the job to be completed. Our system will send you an email when your extract is ready.
The definitions of every extract will remain on our server indefinitely, but the data files are subject to deletion after three days. However, the screen where you download extracts has a feature that lets you revise old extracts. When you click on "revise," all your selections for that extract will be loaded into the system, after which you can edit or regenerate it. Note, however, that each successive data release can create difficulties for recreating old extracts, because codes might change.
Extract definition: Data format [top]
By default, the extract system generates an ASCII data file (.dat) and provides SPSS, SAS, and STATA syntax files with which to read the data. You can request your data formatted for SPSS (.sav), SAS (.sas7bdat), STATA (.dta), or as a comma delimited file (.csv).
Extract definition: Data structure [top]
You can choose the preferred file structure for your extract. Rectangular data contain only person records -- requested household information is attached to each household member. Hierarchical data contain a distinct household record followed by a separate person record for each member of the household. The system defaults to rectangular format, which is the overwhelming choice of researchers.
Vacant housing units can only be extracted using the hierarchical data structure.
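The relationship between the two structures can be sketched in a few lines. This is not the extract system's own code; it assumes the hierarchical file has been parsed into dictionaries, and the record-type field name RECTYPE with values "H" and "P" is an assumption made for the example.

```python
def rectangularize(records):
    """Copy each household record onto the person records that follow it.

    `records` is a hierarchical list: one "H" record, then that household's
    "P" records. RECTYPE is an assumed field name for the record type.
    """
    rows = []
    household = None
    for rec in records:
        if rec["RECTYPE"] == "H":
            household = rec  # remember the current household
            continue
        if household is None:
            raise ValueError("person record found before any household record")
        merged = {k: v for k, v in household.items() if k != "RECTYPE"}
        merged.update({k: v for k, v in rec.items() if k != "RECTYPE"})
        rows.append(merged)
    return rows

# Toy hierarchical file: household 2 is vacant (no person records follow it).
hierarchical = [
    {"RECTYPE": "H", "SERIAL": 1},
    {"RECTYPE": "P", "PERNUM": 1},
    {"RECTYPE": "P", "PERNUM": 2},
    {"RECTYPE": "H", "SERIAL": 2},
    {"RECTYPE": "H", "SERIAL": 3},
    {"RECTYPE": "P", "PERNUM": 1},
]
rectangular = rectangularize(hierarchical)
print(sorted({row["SERIAL"] for row in rectangular}))
```

Because a vacant unit has no person record to carry its household information, it leaves no trace in the rectangular output, which is why such units require the hierarchical structure.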
Extract option: Customize sample sizes [top]
Near the bottom of the screen is the expected size of your data extract. Note: the predicted extract size does not take account of any case selection you may have implemented using the "Select cases" option. If you used case selection, your extract probably will be smaller than the size reported on this screen.
To alter the size of a sample, enter the desired number in one of its boxes in terms of households, persons, or sample density. For any sample, you can enter only one number to define the density; the other two cells will be calculated from that number. An entry in the first row of boxes, "All samples", will apply the same selection to every sample in your extract. The minimum number of cases for any sample is 10,000 households. If you enter a number larger than the number of cases in a sample, the tool will indicate the maximum number.
At any point you can clear your selections and return to the full sizes for every sample.
The sampling unit for the sample-size tool is the household. The system will draw a systematic sample of every Nth household -- after a random start -- at the proper density to produce the number of cases you requested. Your data extract will have altered weights that reflect the new sample densities. Thus, your subsamples will still be representative of the full population, but some divergence from the full-sample estimates should be expected, particularly for estimates of small geographic areas or uncommon categories of cases.
If a sample contains vacant housing units and you request the default rectangularized data structure, the actual number of households in your extract will fall somewhat below the number displayed here.
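To make the sampling procedure concrete, here is a rough sketch of a systematic household sample with a random start. It is an illustration under simplifying assumptions, not the extract system's algorithm: the interval N is rounded down to a whole number, any surplus households at the end are trimmed, and the returned weight factor (how many original households each selected household represents) is one plausible way to adjust weights.

```python
import random

def systematic_household_sample(serials, target, rng=None):
    """Pick `target` households: every Nth one after a random start.

    Returns (selected_serials, weight_factor), where weight_factor is the
    number of original households each selected household stands for.
    """
    rng = rng or random.Random()
    total = len(serials)
    if target <= 0:
        raise ValueError("target must be positive")
    if target >= total:
        return list(serials), 1.0       # nothing to thin out
    step = total // target              # N: keep every Nth household
    start = rng.randrange(step)         # random start within the first interval
    selected = serials[start::step][:target]
    return selected, total / len(selected)

# Toy example: 100 households thinned to 10.
households = list(range(1, 101))
selected, factor = systematic_household_sample(households, 10, random.Random(0))
print(selected, factor)
```

Sampling whole households keeps all members of a selected household together, so household composition can still be analyzed in the subsample.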
Extract option: Select cases [top]
The "select cases" feature allows users to limit their dataset to contain only records with specific values for selected variables, such as persons age 65 and older. Multiple variables can be used in combination during case selection. Selections for multiple variables are additive, each being implicitly connected by a logical "AND" for processing purposes. You can only perform case selection on either the general or the detailed version of a variable, not both.
Simply extracting selected cases can be too crude, however, because you may need the people who co-resided with your selected population. Accordingly, the case selection function also lets you choose to include everyone living in a household with a person with the selected characteristics.
Users should be careful with the case selection feature. It is possible to select a specific variable category (e.g., polygamous marriage) that does not exist across all the samples in your extract, thereby inadvertently excluding those samples from your dataset.
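The logic of case selection can be sketched as follows, assuming a rectangular extract loaded as a list of dictionaries. The function and its parameters are hypothetical, and the ages and codes are toy values; the point is only to show the implicit "AND" across variables and the option to keep co-residents.

```python
def select_cases(rows, conditions, include_household=False):
    """Keep rows whose values satisfy every condition (implicit logical AND).

    `conditions` maps a variable name to a collection of accepted values.
    With include_household=True, everyone sharing a household with a
    matching person is kept as well.
    """
    def matches(row):
        return all(row[var] in accepted for var, accepted in conditions.items())

    if not include_household:
        return [row for row in rows if matches(row)]
    keep = {(row["SAMPLE"], row["SERIAL"]) for row in rows if matches(row)}
    return [row for row in rows if (row["SAMPLE"], row["SERIAL"]) in keep]

# Toy data: household 1 contains a person aged 70, household 2 does not.
people = [
    {"SAMPLE": "A", "SERIAL": 1, "PERNUM": 1, "AGE": 70},
    {"SAMPLE": "A", "SERIAL": 1, "PERNUM": 2, "AGE": 30},
    {"SAMPLE": "A", "SERIAL": 2, "PERNUM": 1, "AGE": 40},
    {"SAMPLE": "A", "SERIAL": 2, "PERNUM": 2, "AGE": 10},
]
age_65_plus = {"AGE": range(65, 200)}
print(len(select_cases(people, age_65_plus)))
print(len(select_cases(people, age_65_plus, include_household=True)))
```

If a condition names a value that never occurs in one of the samples, every record from that sample fails the test, which is how a sample can silently drop out of an extract.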
Extract option: Attach characteristics [top]
The data extract system can attach a characteristic of a person's mother, father, or spouse as a new variable on the person's record. It can also attach the characteristics of the household head. For example, using the variable "Occupation," it can make a new variable for "Occupation of mother." All persons in the extract who reside in a household with their mother would receive a value for this new variable. Persons without a mother present in the household would receive a missing value. The extract system automatically generates a unique name for the new variable.
The attached-characteristics feature uses the constructed NAPP family interrelationship "pointer variables" that identify co-resident mothers, fathers, and spouses for each person. The pointer variables identify social mothers and fathers, not strictly biological parents.
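How a pointer variable drives this feature can be sketched as follows. The sketch assumes the mother pointer is named MOMLOC and holds the PERNUM of the co-resident mother, with 0 meaning no mother is present; the new variable name OCC_MOM and the occupation codes are hypothetical, and this is not the extract system's own code.

```python
def attach_mother_characteristic(rows, var, new_name=None):
    """Return copies of the rows with the mother's value of `var` attached.

    MOMLOC is assumed to hold the mother's PERNUM within the same household,
    or 0 when no mother is present. Persons without a co-resident mother
    receive None (a missing value).
    """
    new_name = new_name or var + "_MOM"
    by_key = {(r["SAMPLE"], r["SERIAL"], r["PERNUM"]): r for r in rows}
    result = []
    for r in rows:
        mother = None
        if r["MOMLOC"]:
            mother = by_key.get((r["SAMPLE"], r["SERIAL"], r["MOMLOC"]))
        new_row = dict(r)
        new_row[new_name] = mother[var] if mother else None
        result.append(new_row)
    return result

# Toy household: person 1 is the mother of person 2; codes are made up.
family = [
    {"SAMPLE": "A", "SERIAL": 1, "PERNUM": 1, "MOMLOC": 0, "OCC": 110},
    {"SAMPLE": "A", "SERIAL": 1, "PERNUM": 2, "MOMLOC": 1, "OCC": 999},
]
for row in attach_mother_characteristic(family, "OCC"):
    print(row["PERNUM"], row["OCC_MOM"])
```

The same lookup works for fathers and spouses by substituting the corresponding pointer variable.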
Extract option: Describe your extract [top]
You can describe your extract for future reference. Our system will display the description on the page where you download your data extract.