General information about the project
What is NAPP?
NAPP (North Atlantic Population Project) is a machine-readable database of the complete censuses of Canada (1881), Great Britain (1881), Iceland (1870, 1880, 1901), Norway (1865, 1900), Sweden (1900), and the United States (1880). These data have only recently become available for social science research. The nine censuses collectively comprise our richest source of quantitative information on the population of the North Atlantic world in the late nineteenth century.
To allow the comparative analysis of human behavior across these countries, the NAPP collaborators have harmonized the record layouts, coding schemes, and documentation for the different censuses. NAPP assigns uniform codes across all the censuses and brings relevant documentation into a coherent form to facilitate analysis of social and economic change. NAPP data are compatible with the existing IPUMS series of U.S. census samples and are the foundation for a long-term collaborative enterprise to reconstruct the population of this region from the mid-nineteenth century to the present. Scholars interested only in the United States are better served using IPUMS-USA, which is optimized for U.S. research.
What are the sources of NAPP data?
Creation of complete count census databases for research builds on the existing samples of historical censuses in all six countries. Complete count 1880/81 censuses of the United States, Great Britain, and Canada were transcribed in the 1980s and 1990s by the Church of Jesus Christ of Latter-day Saints (LDS) for genealogical purposes. The LDS donated the data for academic use in exchange for cleaning and enhancing the data. The Norwegian, Swedish, and Icelandic censuses were digitized by academic researchers and archivists.
Our research partners in Canada are the University of Montréal Department of Demography, the University of Ottawa Institute of Canadian Studies, and the University of Alberta. For Great Britain data, we partner with the UK Data Archive. In Iceland, NAPP works with Statistics Iceland. Our Norwegian partners are the University of Bergen (1865 Norwegian Census) and the University of Tromsø (1900 Norwegian Census). For the Swedish census, we work with the National Archives of Sweden, Stockholms stadsarkiv, and Umeå University. The Minnesota Population Center also has agreements with the LDS, who allow us to freely distribute the data to academic researchers in exchange for cleaning and enhancing the data.
Do enumeration rules differ between the different censuses?
Yes, enumeration rules differ between the different censuses, which has some implications for understanding patterns of household relationships.
A de jure census enumerates people at their permanent residence. People temporarily away from home are therefore reported to the census enumerator by those remaining at home. A de facto census enumerates people where they are found on census night. Norway and Iceland were unusual in using both rules for some censuses. This means some people are enumerated twice; they are identified as such in the data. Users should exclude one set of the duplicated individuals to obtain accurate population statistics. For the Norway 1900 census, the individuals who are enumerated twice are identified in the variable RESIDENT. More information on the Norwegian census can be found here.
For Sweden, the censuses were not taken by census workers going out with a questionnaire and interviewing people in their homes on a particular day. Instead the censuses were taken by vicars and parish priests who made extracts from the already existing church examination records (1860-1890) and parish books (1900-1945).
Does NAPP add value to the original census data?
The process of integration itself adds value to the data by fully documenting all codes and compiling all variable documentation in a hyperlinked web format. But we do many other things as well:
NAPP creates a consistent set of constructed variables for all samples. Most important are the family interrelationship "pointer" variables that indicate the location within the household of every person's mother, father, and spouse.
NAPP is working on missing data allocation and consistency edits on important variables. This kind of data editing performs logical fixes when possible or finds a donor record that shares key characteristics with the person in question and substitutes their response for the missing variable. Allocation is a more statistically sound way to deal with missing data than simply excluding such cases from analyses.
Getting started
Where should a new user start?
The natural starting point is the "Variables" page linked on the left navigation bar and the top banner. The variables page is the primary tool for exploring the contents of NAPP. By default, the variables page displays one variable group at a time for all samples in the data series. You can change the view option to show all groups simultaneously, but the page can get very large and slow to load. You can filter the information at any point to include only the samples of interest to you ("Select samples"). Initially, the variables screen is set to display the integrated variables. Click "view unharmonized variables" to browse the variables that are specific to individual samples. More detailed information on using the variable menu is available.
When you select samples, the page will display only variables present in those censuses. An "x" indicates the availability of a variable for a particular sample.
On the variables page, clicking on a variable name brings up its documentation. It contains a description of the variable and discussions of cross country comparability issues. The "enumeration text" link compiles all the questionnaire text and instructions pertaining to the census question for every sample. The variables page also has direct links to the codes page for each variable. The codes page shows the coding structure and labels for the variable and the availability of categories across samples. These categories can suggest the types of research possible with a given sample.
Throughout the variable documentation system there are checkboxes and buttons to "Include in Extract." Any variables you identify in this way will be pre-selected for you when you enter the data extract system. Note that your selections only last for the current web session.
Click on "Get data" or "Create an extract" to enter the data extract system. To make a data extract you must be registered to use NAPP; otherwise, you can identify yourself as a "guest" and explore. The instructions for the extraction system are here.
How do I get access to NAPP data?
Access to the documentation is freely available without restriction; however, users must apply for access to the data. The application system requires a description of an applicant's proposed research and asks for the user's institutional affiliation and other information to verify identity. Every application is individually reviewed by project staff. We may ask for additional information if we are uncertain about the suitability of the intended research. Applicants are required to agree to a number of conditions to use the data. Access to the system enables a user to extract data from any country in the database. To apply for access go here.
What are the restrictions on use?
Our agreements with the various countries and the Church of Jesus Christ of Latter-day Saints (LDS) require that NAPP data be used only for scholarly and educational purposes. Commercial use of the data is prohibited. To gain access, applicants are required to agree to a number of conditions that amount to a legal contract. Chief among them are a prohibition on redistributing the data, a requirement to cite NAPP data appropriately in publications and research reports, and a prohibition on using the data for genealogical purposes.
Can the NAPP data be used for genealogical purposes?
No. NAPP data cannot be used for genealogical purposes.
The original transcription of the United States 1880, British 1881, and Canadian 1881 censuses was undertaken by the Church of Jesus Christ of Latter-day Saints (LDS), which provided the data for social scientific research with the stipulation that it not be available for genealogical research. Similarly, the Swedish censuses were transcribed by the National Archives of Sweden, which charges for access to the data for genealogical purposes. We screen all applications for use to ensure that genealogists who wish to use the data are directed to a more appropriate site.
The LDS provide a free searchable interface to the data at Family Search. Please use this website for searching these three censuses. The Norwegian Historical Data Center maintains a searchable interface to the Norwegian censuses in Norwegian and English. A searchable interface to the Swedish censuses is available.
Our continuing provision of the data for social science research relies on these provisions against genealogical use of the data being upheld. Misusing the data by violating any of the conditions detailed above constitutes a violation of the user agreement and may lead to professional censure, loss of employment, or civil prosecution under relevant national and international laws, and to sanctions against your institution, at the discretion of the University of Minnesota and the other institutions collaborating on the North Atlantic Population Project.
Further resources for genealogical research
For further resources on genealogical research, please see the National Archives of the following participating countries: Canada, Great Britain, Iceland, Norway, Sweden, and the United States.
Other limited genealogical information can be found at Ancestry, Rootsweb, 1901 Census of England and Wales, and Index to the 1901 Census of Canada. Please note that links to these sites do not imply an endorsement of their contents by the North Atlantic Population Project.
Basic concepts
What are microdata?
Census microdata are composed of individual records containing information collected on persons and households. The unit of observation is the individual. The responses of each person to the different census questions are recorded in separate variables.
Microdata stand in contrast to more familiar "summary" or "aggregate" data. Aggregate data are compiled statistics, such as a table of marital status by sex for some locality. There are no such tabular or summary statistics in the NAPP data.
Because NAPP data are microdata, researchers can create tabulations tailored to their particular questions. Since NAPP includes nearly all the detail originally recorded by the census enumerations, users can construct a great variety of tabulations interrelating any desired set of variables. The flexibility offered by microdata is particularly important for historical research because the aggregate tabulations produced by the national statistical agencies are often not comparable across time, and, until recently, the subject coverage of census publications was limited.
What is the subject content of NAPP data?
The data series includes information on a broad range of population characteristics, including fertility, nuptiality, life-course transitions, immigration, internal migration, labor-force participation, occupational structure, education, ethnicity, and household composition.
Users should note that not all the subjects enumerated in the censuses were transcribed for the machine-readable datasets available through NAPP. The British, Canadian, and United States data were transcribed by the Church of Jesus Christ of Latter-day Saints, which omitted some variables. For example, in the United States, information on school attendance, literacy, months unemployed, and sickness was omitted. Similar information on school attendance and "infirmities" was omitted in the transcription of the 1881 Canadian census.
How is occupation/industry harmonized across various censuses?
Occupation and industry are among the most important variables for analyses of comparative social and economic structure because these censuses provide few alternative indicators of socioeconomic status or labor-force participation. The national statistical agencies employed quite different classification systems. With access to the individual-level occupational responses for all people in the censuses, we have coded occupations into a common classification system based on the HISCO classification scheme. In Great Britain, Norway, and the United States, we also have coded occupations into domestic coding schemes.
What are "integrated variables"?
Integration -- or "harmonization" -- is the process of making data from different censuses and countries comparable. For example, most censuses ask about marital status; however, they differ both in their classification schemes (one census might recognize only a general category of "married," while another might distinguish between civil and religious marriages) and in the numeric codes assigned to each category ("divorced" might be coded as a "4" in one census and as a "2" in another). To create an integrated variable for marital status we recode the marital status variable from each census into a unified coding scheme that we design. Most of this work is carried out using correspondence tables.
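A correspondence table can be pictured as a lookup from each sample's original code to the integrated code. The sketch below is illustrative only: the sample names, the source codes, and the integrated codes are all invented for the example and are not NAPP's actual marital status codes.

```python
# Hypothetical integrated coding scheme (invented for illustration).
INTEGRATED_LABELS = {1: "married", 2: "divorced", 3: "never married", 9: "unknown"}

# Hypothetical correspondence tables, one per sample:
# original code in that sample -> integrated code.
CORRESPONDENCE = {
    "sample_a": {1: 3, 2: 1, 4: 2},  # this sample codes "divorced" as 4
    "sample_b": {0: 3, 1: 1, 2: 2},  # this sample codes "divorced" as 2
}


def harmonize(sample, source_code):
    """Return the integrated code for a source code, or 9 (unknown) if unmapped."""
    return CORRESPONDENCE[sample].get(source_code, 9)


# "Divorced" ends up with the same integrated code in both samples.
print(INTEGRATED_LABELS[harmonize("sample_a", 4)])
print(INTEGRATED_LABELS[harmonize("sample_b", 2)])
```

Keeping one table per sample means a new sample can be added without touching the mappings for existing ones.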
Because some censuses provide more detail than others, a coding scheme that reduced variables down to the lowest common denominator across all samples would inevitably lose important information. As a result, many NAPP integrated variables use composite coding schemes. The first one or two digits of the code provide information available across all samples; the next one or two digits provide additional information available in a broad subset of samples. Finally, trailing digits provide detail only rarely available. All meaningful detail in the original enumerations is therefore available to researchers if they need it, but they can confine their attention to the less-detailed digits if they wish.
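If you want to confine your attention to the less-detailed digits, one way is to strip the trailing digits arithmetically. This is a minimal sketch that assumes the code is stored as an integer of fixed width; the four-digit code used here is invented, and the number of general and detail digits differs from variable to variable, so check the codes page first.

```python
def general_code(detailed_code, detail_digits):
    """Drop the trailing detail digits and keep the leading general digits."""
    return detailed_code // (10 ** detail_digits)


# Hypothetical four-digit composite code: 1 general digit,
# then 1 intermediate digit, then 2 rarely available detail digits.
code = 2110
print(general_code(code, 3))  # general category only
print(general_code(code, 2))  # general plus intermediate detail
```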
Another component of integration is the variable documentation. The documentation aims to highlight important comparability issues that are not self-evident from the coding structure for the variable. A general comparability discussion emphasizes issues for international comparisons, and country-specific discussions note comparability concerns when making intra-national comparisons over time. NAPP staff must exercise their judgment in composing this documentation -- there is no formula for it. But users need not depend totally on us: the variable documentation provides links to both English-language and original-language census questionnaires and instructions. This material is readily available on every variable description page through the link to "enumeration text."
What are "unharmonized variables"?
All variables in NAPP are processed to varying degrees. They are documented in English and associated with the relevant sections of the original census instructions. The data are analyzed and often recoded for technical and other considerations. However, not all NAPP variables are "integrated" for international and inter-temporal comparability.
The regular NAPP variables -- the ones on the main variable availability screen -- are integrated: the same codes and labels apply across all the samples that contain the variable. Unharmonized variables, in contrast, are unique to each sample. They generally correspond to the variables in the original datasets submitted to the NAPP project by the various countries. The unharmonized variable codes and labels are not consistent across samples, but the variables have been processed to make them more regularized. Stray values are recoded; all data are converted to numeric values; data universes are empirically determined; unknown and NIU categories are coded consistently. In addition, each unharmonized variable is assigned a unique name in the NAPP database, and the value labels and other variable documentation are written in English.
Many unharmonized variables serve as inputs for the integrated NAPP variables. For example, underlying the integrated variable for marital status, MARST, are numerous unharmonized variables, typically one per sample -- CA81A428 for Canada 1881, GB81B409 for Scotland 1881, and so forth. Each integrated NAPP variable description has a link to the unharmonized variables that served as its input. The unharmonized variables are also accessible in a comprehensive list using the button near the top of the variables page.
The unharmonized variables can be included in data extracts. Thus researchers can get both the integrated and unharmonized forms of specific variables (for example, the internationally comparable employment status variable, OCC, and the employment status variable specific to 1900 Norway, NO00A434). Perhaps more importantly, the unharmonized variables give researchers access to data that NAPP has not been able to incorporate in an internationally comparable manner.
What are "pointer variables"?
The NAPP family interrelationship "pointer" variables indicate the location within the household of every person's mother, father, and spouse. Nearly all samples indicate the relationship of each person to the head of household, but it is much harder to relate individuals to persons other than the head (for example, grandchildren to children, sons-in-law to daughters, or unrelated persons to each other). We have developed a complex core algorithm to make such connections, and we customize it as needed to account for peculiarities of specific samples. The pointer variables are called MOMLOC, POPLOC, and SPLOC in the NAPP system. The variables MOMRULE, POPRULE, and SPRULE indicate the conditions under which a specific link was made. The parental pointer variables identify social parents, not strictly biological ones.
The pointer variables make it possible to construct individual-level variables representing the characteristics of co-resident persons, such as occupation of spouse, age of mother, or educational attainment of father. The data extraction system can perform this step for you. The "Variable options" step includes a feature to "Attach characteristics" of other persons in the household based on the pointer variables. These attached characteristics appear as new variables in your data extract. For maximum flexibility, you can also do this matching yourself. You need to include the serial and person ID variables (SERIAL and PERNUM) in your extract, as well as the pointer variables themselves, to perform the necessary data manipulations.
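If you do the matching yourself, the idea is a lookup within the household: use SERIAL plus the pointer value to find the other person's record. The sketch below assumes that SPLOC holds the spouse's PERNUM and that 0 means no spouse is present; confirm both against the variable descriptions. The function name and the AGE values are invented for illustration, and the lookup key covers a single sample only (for pooled extracts, add CNTRY and YEAR to the key).

```python
def attach_spouse_value(persons, field):
    """Return copies of the person rows with '<field>_SP' taken from the spouse.

    Assumes SPLOC is the spouse's PERNUM in the same household, 0 if none.
    """
    by_key = {(p["SERIAL"], p["PERNUM"]): p for p in persons}
    result = []
    for p in persons:
        spouse = by_key.get((p["SERIAL"], p["SPLOC"])) if p["SPLOC"] else None
        row = dict(p)
        row[field + "_SP"] = spouse[field] if spouse else None
        result.append(row)
    return result


# Invented household: a couple and a child.
household = [
    {"SERIAL": 1, "PERNUM": 1, "SPLOC": 2, "AGE": 40},
    {"SERIAL": 1, "PERNUM": 2, "SPLOC": 1, "AGE": 38},
    {"SERIAL": 1, "PERNUM": 3, "SPLOC": 0, "AGE": 10},
]
for row in attach_spouse_value(household, "AGE"):
    print(row["PERNUM"], row["AGE_SP"])
```

The same pattern works for MOMLOC and POPLOC by swapping the pointer variable.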
Does NAPP data have "weights"?
Most of the NAPP data are complete-count data, for which weights are not required to generate accurate population statistics. However, the 1875 census of Norway and the 1871 and 1901 censuses of Canada are samples and have person weights (PERWT) assigned to them. For all complete count data, PERWT is set to 1. Users who are merging NAPP data with IPUMS data should keep in mind that PERWT is referred to as WTPER in IPUMS International.
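When an extract mixes complete-count data with samples, summing PERWT rather than counting rows gives comparable population totals. This is a minimal sketch; the records, the SEX codes, and the weight of 20 are invented for illustration.

```python
from collections import defaultdict


def weighted_counts(records, by):
    """Sum PERWT within each category of the variable named in `by`."""
    totals = defaultdict(float)
    for record in records:
        totals[record[by]] += record["PERWT"]
    return dict(totals)


# Invented rows: two from a complete count (PERWT = 1), one from a sample.
records = [
    {"SEX": 1, "PERWT": 1.0},
    {"SEX": 2, "PERWT": 1.0},
    {"SEX": 2, "PERWT": 20.0},
]
print(weighted_counts(records, "SEX"))
```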
What does "universe" mean in the variable descriptions?
The universe is the population at risk of having a response for the variable in question. In most cases, these are the households or persons to whom the census question was asked as reflected on the census questionnaire. For example, children are not usually asked employment questions, and men and children are not asked fertility questions. Because of post-enumeration data processing, however, the universe suggested by the census questionnaire is not always accurate. NAPP therefore empirically verifies universes to obtain the most accurate statement possible of the universe. In some cases, there is no independent information in a sample to verify a universe.
Cases that are outside of the universe for a variable are labeled "NIU (not in universe)" on the codes page. Differences in a variable's universe across samples are a common data comparability issue.
The universes will not always be free of apparently erroneous cases. Some persons or households that should not have answered the question did, and some that should have answered may be included in the "NIU" category. But until we perform comprehensive data editing and allocation, we do not know whether the variable in question is in error, or whether the variables that define the universe (for example, age or employment status) are incorrect.
Additionally, users should note the following definitions of the universe for each particular country:
Canada: Individual-level data on the Indian population in the territories is not available. The published census volumes provide aggregate information on these populations who were not individually enumerated. Thus, the published population totals for Canada in 1881 exceed the number of individuals in the 1881 Canadian census data file.
Great Britain: The census includes the population of England, Wales, and Scotland, as well as the "Islands in the British Seas" (viz. the Channel Islands and the Isle of Man). Ireland is not included.
Iceland: When released, the census will include the population of the main island of Iceland, and off-shore islands.
Norway: The census includes the complete population of Norway, including indigenous minorities in the three northern provinces of Troms, Finnmark and Nordland.
United States: All states, territories that eventually became states, and the District of Columbia, are included. However, the census enumeration excluded "Indians not taxed," and thus does not contain Indian Territory, contiguous with the present-day state of Oklahoma. Alaska and Hawaii were also excluded from the enumeration.
Getting data
How do I obtain data?
All NAPP data are delivered through our data extraction system. Instructions for the data extraction system are available here. Users select the variables and samples they are interested in, and the system creates a custom-made extract containing only this information. The system will pool data from multiple samples into a single data file; in fact, it was primarily designed for this purpose.
Data are generated on our server. The system sends out an email message to the user when the extract is completed. The user must download the extract and analyze it on their local machine. The extract system is accessed here. If you are not yet registered, you can log in as "guest" and examine the interface.
What format are the data in?
NAPP produces fixed-column ASCII data. Data are entirely numeric. By default, the extraction system rectangularizes the data: that is, it puts household information on the person records and does not retain the households as separate records. No information is lost, and this is the format preferred by most researchers; however, it can be overridden in the extract system to yield hierarchical data.
In addition to the ASCII data file, the system creates a statistical package syntax file to accompany each extract. The syntax file is designed to read in the ASCII data while applying appropriate variable and value labels. SPSS, SAS, and Stata are supported. You must download the syntax file with the extract or you will be unable to read the data. The syntax file requires minor editing to identify the location of the data file on your local computer.
A codebook file is also created with each extract. It records the characteristics of your extract and should be downloaded for record-keeping.
All data files are created in gzip compressed format. You must uncompress the file to analyze it. Most data compression utilities will handle the files.
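The syntax files are the supported way to read an extract, but the file format itself is simple enough to illustrate. The sketch below opens a gzip file and slices each line at fixed column positions. The layout shown is hypothetical: the real start and end columns for your extract are listed in the codebook and syntax file that accompany it.

```python
import gzip

# Hypothetical layout: (variable, start, end), 0-based and end-exclusive.
# Take the real positions from your extract's codebook.
LAYOUT = [("YEAR", 0, 4), ("SERIAL", 4, 12), ("PERNUM", 12, 16)]


def parse_line(line, layout=LAYOUT):
    """Slice one fixed-column line into a dict of integer values."""
    return {name: int(line[start:end]) for name, start, end in layout}


def read_extract(path, layout=LAYOUT):
    """Read a gzip-compressed fixed-column ASCII file into a list of dicts."""
    with gzip.open(path, "rt", encoding="ascii") as handle:
        return [parse_line(line.rstrip("\n"), layout) for line in handle if line.strip()]


print(parse_line("1880000000010001"))
```

Because the data are entirely numeric, every field can be converted with int(); value labels still have to come from the codebook.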
What is the best way to use the extract system?
The data extraction system is a flexible tool. There is no need to download variables or samples you don't expect to use for your current analysis. The system records every extract you make. You can reload and modify an old extract, dropping or adding variables or samples. Go to the "Download or Revise Extracts" page and click on the "Revise" link.
Since most of the NAPP samples are complete count data, if you choose lots of variables in a large sample or several samples at the same time, you can make extremely large extracts that will be cumbersome to analyze. The extract system is designed to minimize this problem. Near the end of the extract process is a "Sample size" screen that predicts the size of your data extract and allows you to draw smaller subsamples. This subsampling feature is described in more detail elsewhere. The system will inform you if your extract violates the maximum size allowable.
If you have multiple samples in your extract, you should include the country of residence (CNTRY) and census year (YEAR). If you want to do analyses across persons within households, you should also include the household index number (SERIAL).
SERIAL and person index within household (PERNUM) are pre-selected in all extracts, but you can choose to unselect them.
How long does a data extract take?
The time needed to make an extract differs depending on the number and size of samples requested, whether case selection is performed, and the load on our server. Extracts can take from a few minutes to a couple of hours or more. The system sends an email when the extract is completed, so there is no need to stay active on the NAPP site while the extract is being made.
What if the samples are too big for me to handle?
It is possible to make samples that are extremely large. Extracts over ten gigabytes will not be allowed by our server. If you have a legitimate reason to make larger extracts, email us to request a higher threshold.
Roughly speaking, the number of records times the number of columns of data requested yields the file size of your extract in bytes. The number of records in each sample is given on the sample quick reference page. Among the current samples, the 1880 U.S. and 1881 England and Wales datasets are particularly large.
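The rule of thumb above is easy to apply before you submit an extract. In this sketch the record count and column counts are invented for illustration, and a gigabyte is assumed to mean 10^9 bytes; the server may measure size differently.

```python
def estimated_extract_bytes(n_records, n_columns):
    """Rough uncompressed size: records times columns of data requested."""
    return n_records * n_columns


def exceeds_limit(n_records, n_columns, limit_bytes=10 * 10**9):
    """True if the rough estimate is over the ten-gigabyte ceiling."""
    return estimated_extract_bytes(n_records, n_columns) > limit_bytes


# Invented example: 50 million records, 100 columns of data.
print(estimated_extract_bytes(50_000_000, 100))
print(exceeds_limit(50_000_000, 100))
```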
Near the end of the data extract process, the "Sample size" screen predicts the size of your data file. If it is over 1 gigabyte, it is highlighted to draw your attention. If it is too large for your purposes, use the "Sample size" screen to draw smaller subsets of some or all of the samples in your extract. Entering numbers in any of the cells in the right half of the screen will tell the system to systematically extract the corresponding number of households from the selected sample(s). The households will be drawn evenly from across the entire country, and the syntax file created by the extract system will contain code to correctly adjust the sampling weights.
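To show what a systematic household subsample with adjusted weights involves, here is a small sketch. It is not the extract system's actual procedure: it assumes households are ordered by SERIAL, picks them at a fixed interval, and multiplies PERWT by the ratio of households available to households drawn. The function name and data are invented.

```python
def systematic_household_sample(persons, n_households):
    """Draw households at a fixed interval and scale PERWT to compensate."""
    serials = sorted({p["SERIAL"] for p in persons})
    if n_households <= 0 or n_households >= len(serials):
        return [dict(p) for p in persons]
    step = len(serials) / n_households
    chosen = {serials[int(i * step)] for i in range(n_households)}
    # Each kept household stands in for `step` households.
    return [dict(p, PERWT=p["PERWT"] * step) for p in persons if p["SERIAL"] in chosen]


# Invented data: ten one-person households from a complete count.
people = [{"SERIAL": s, "PERNUM": 1, "PERWT": 1.0} for s in range(1, 11)]
subsample = systematic_household_sample(people, 5)
print(len(subsample), sum(p["PERWT"] for p in subsample))
```

Sampling whole households rather than individual persons keeps the pointer variables usable, since everyone a pointer refers to stays in the file.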
If you do not wish to use the "Sample size" feature, you can remove variables or samples from your extract to reduce its size.
How does "sample selection" work on the NAPP web site?
When a user first enters the variable documentation system, all samples are selected by default. Every variable in the system will display on all relevant screens.
Users can filter the information displayed by selecting only the samples of interest to them. Only the variables available in at least one of the selected samples will appear in the variable lists. The integrated variable descriptions and codes pages will also be filtered to display only the text and columns corresponding to the selected samples. Sample selections can be altered at any time in your session. Selections do not persist beyond the current session.
When a user enters the extract system after selecting samples while browsing the variables, the extract system will pre-select those samples.
What does "include in extract" mean?
While browsing variables in the documentation system, it is possible to earmark them to include in a data extract. Checkboxes and buttons labeled "Include in extract" are available in different contexts for this purpose. Any variables you identify in this way will be pre-selected for you when you enter the data extract system. You need not search to find those variables again during the extract process. Your selections only last for the current web session. You do not need to make variable selections outside of the extract system -- it is a convenience.
Why can't I open the data file?
There are two likely explanations:
1) The data produced by the extract system are gzipped (the file has a .gz extension). You must use a data compression utility to uncompress the file before you can analyze it.
2) You cannot open the data file directly with a statistical package. The file is a simple ASCII file, not a system file in the format of any statistical package. The extract system does, however, generate a syntax (set-up) file to read the ASCII file into your statistical package. You must download the syntax file along with the data file from our server, open the syntax file with your statistical package, and edit the path in the syntax file to point to the location of the data on your local computer. Now you are ready to read in the data.
Is there a preferred statistical package for using the NAPP data?
NAPP supports SPSS, SAS, and Stata. The system does not make data files in those formats, but does generate syntax files with which to read in the ASCII data.
Can I get the original data?
In accordance with our agreements with the host countries and the Church of Jesus Christ of Latter-day Saints (LDS), NAPP does not distribute the original samples provided by our international partners. We do provide access to unharmonized variables, but even these have undergone processing. We clean up stray codes, translate the labels into English, document the variable, and sometimes perform additional programming. We take care not to lose meaningful information in these transformations, so researchers retain access to the full power of the original information in the input variables.
How is a record uniquely identified?
Four variables constitute a unique identifier for each record in the NAPP: CNTRY, YEAR, SERIAL, and PERNUM (country of residence, census year, household index number, and person index within household). SERIAL is a unique household identifier.
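After pooling samples, it can be worth confirming that the four-variable key really is unique in your working file, since a merge on fewer variables can silently duplicate rows. This is a minimal sketch; the records and the CNTRY codes are invented.

```python
from collections import Counter


def record_key(record):
    """Build the four-part identifier for one person record."""
    return (record["CNTRY"], record["YEAR"], record["SERIAL"], record["PERNUM"])


def duplicate_keys(records):
    """Return every key that appears on more than one record."""
    counts = Counter(record_key(r) for r in records)
    return [key for key, n in counts.items() if n > 1]


# Invented rows: the same SERIAL and PERNUM occur in two different samples.
rows = [
    {"CNTRY": 1, "YEAR": 1881, "SERIAL": 1, "PERNUM": 1},
    {"CNTRY": 2, "YEAR": 1880, "SERIAL": 1, "PERNUM": 1},
]
print(duplicate_keys(rows))
```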
Using NAPP data
Are there tricky aspects of NAPP data to be particularly aware of?
It is important to examine the documentation for the variables you are using. The codes and labels for variable categories do not tell the whole story. In other words, the syntax labels are not enough. Read the variable comparability discussions for the samples you are interested in. Important comparability issues should be mentioned there. If a variable is of particular importance in your research (for example, it is your dependent variable), you would also be well served by reading the enumeration text associated with it. This text is linked directly to the variable, so it is quite easy to call it up.
By default, the extract system rectangularizes the data: it puts the household information on the person records and drops the separate household record. This can distort analyses at the household level. The number of observations will be inflated to the number of person records. You can either select the first person in each household (PERNUM) or select the "hierarchical" box in the extract system to get the proper number of household observations. The rectangularizing feature also drops any vacant households, which are otherwise available in some samples. Despite these complications, the great majority of researchers prefer the rectangularized format, which is why it is the default output of our system.
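In a rectangularized file, selecting the first person in each household gives one row per household. The sketch below assumes PERNUM starts at 1 within each household; the data and the function name are invented. Note that vacant households cannot be recovered this way, because they have no person records.

```python
def household_rows(persons):
    """Keep one row per household: the person with PERNUM == 1."""
    return [p for p in persons if p["PERNUM"] == 1]


# Invented rectangular data: household 1 has two people, household 2 has one.
persons = [
    {"SERIAL": 1, "PERNUM": 1},
    {"SERIAL": 1, "PERNUM": 2},
    {"SERIAL": 2, "PERNUM": 1},
]
print(len(persons), len(household_rows(persons)))
```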
What are the major limitations of the data?
The data are composed entirely of individual person and household records from population censuses. There are no macroeconomic, business, or aggregate statistics. We do not deliver the published statistics from the population censuses.
Most of the NAPP samples are complete count census data, resulting in larger extract sizes. If you are interested in obtaining a sample of the complete count data, change the "Sample size" near the end of the data extract process. More information on dealing with large extract size can be found here.
Can I study multiple countries? One country?
[top]
NAPP is designed to facilitate cross-national and cross-temporal research, but there is no restriction against single-country studies. Data extracts can contain records from every country in the entire data series, or records from only a single country. Researchers interested in studying only the United States, however, would be best served by going to IPUMS-USA, which is optimized for U.S. research and has greater temporal depth.
There were questions asked in the census that I don't see in NAPP. Where are the data?
[top]
The Church of Latter-Day Saints (LDS), in collaboration with local genealogical societies, laboriously digitized the censuses of Great Britain, Canada, and the United States to provide a resource for genealogical research. Some of the variables have not been digitized by the LDS and are not included in the NAPP datasets. Also, some countries did not supply variables for every census question. The census responses may never have been processed or digitized. It is also possible we have the data, but have insufficient metadata to make it available at this time. When we add a sample, we harmonize only the core group of variables that we think most researchers would desire. Other variables can be accessed through the unharmonized variables.
Can I find particular individuals in the NAPP data?
[top]
Yes. Samples will contain names (NAMELAST, NAMEFRST) and addresses (ADDRESS). However, some data are only samples, so there is no guarantee any given individual will be in the dataset.
Please note that NAPP data can be used only for academic research. Use of the data for genealogy is expressly prohibited in the user license agreement to which all persons must agree.
Using the variables page
Variables page menu
[top]
Use the left side of the menu to browse variables:
Household: household variables by group
Person: person variables by group
A-Z or Sample: integrated variables by letter/unharmonized variables by sample
Use the buttons on the right side of the menu to:
Switch to . . . Variables: toggle between viewing integrated and unharmonized variables
Select Samples: limit the display of variable information to selected samples
Options and Help: alter how the variable list is displayed or get help for this page
Variables page details
[top]
The Menu
The variables page allows you to browse integrated and unharmonized variables while limiting and controlling how the information is displayed.
The left side of the menu is for browsing the variables. Clicking the "Switch ..." button toggles the left side between integrated and unharmonized variables.
When you "Select samples" you limit the variable list to display only variables that are available in at least one of those samples. But the effect of selecting samples extends into all the variable descriptions and codes pages you can access through the variable system. Only information relevant to your selected samples will be displayed in any context while you browse the variables. You can change your sample selections at any point.
Selecting samples is a good practice when exploring NAPP because the amount of information can be unwieldy. On the other hand, sometimes you need to see everything to determine what kind of research is possible using the database.
The final menu button is "Options and Help," which is a drop-down with a number of choices. The first item on the list restores the default viewing options for the variables page. The last item invokes this help text. Each of the remaining items on the list is a toggle that provides an alternative view from the default behavior. Most options are only visible when some variables are being displayed.
Use short country codes / Use long country codes: Switch between the 2-letter country abbreviations and longer abbreviations. The short codes are the default.
View one group / View all groups: Switch between viewing one variable group at a time and viewing all variable groups on one screen. Unless you have a limited number of samples selected, your browser may be slow to display all groups. The default view is one group at a time.
Show availability detail / Show availability summary: Switch between displaying the full sample-specific availability matrix, and a view that only displays the total number of samples that contain each variable. Both views only display or sum the samples that the user has selected in "Select samples." The default view is the detailed availability information. This option is disabled while viewing unharmonized variables.
View available variables / View all variables: Switch between a view that only displays variables present in at least one of your selected samples, and a view that displays every variable, including those that are not available in your selected samples. The default view is to only display available variables.
The Variable List
As you browse the variables, they are displayed in a list containing a number of columns. The variable name links to the variable description, which includes detailed comparability discussions, universes, and enumeration text. The variable codes -- and their associated labels -- can be accessed directly using the "codes" links. The "type" column indicates whether it is a person or household variable. In some contexts, like the alphabetic view, the two types are pooled together.
The area to the right of the "codes" column differs between integrated and unharmonized variables. For integrated variables, the default view displays a column for every sample that the user chose in "Select samples." By default, all samples are selected. The country abbreviation and last two digits of the sample year identify each sample at the top of every column. Hover over the country code with the mouse to see the full country name. If a variable is available in a given sample, an "x" is printed in that column.
The unharmonized variables by definition are only available in a single sample; therefore, they do not require detailed availability information.
Each variable has a box on the far left in the column labeled "Include in Extract." Use these to identify variables you wish to include in a data extract. Any variable so identified will be pre-selected for you when you enter the data extract system. This feature is optional; you will have the opportunity to select variables again inside the extract system. Note that your variable selections only last for the current web session.
Using the data extract system
Extract step 1: login
[top]
You must log in to access the data extract system. You can also change your password or reset it on this screen.
If you are not already registered to use the NAPP, you must apply for access. Expect two working days for a response. In the meantime, you can log in as a "guest" and explore the functionality of the system, but no data will be produced.
Extract step 2: select samples
[top]
In Step 2 of the extract procedure you define some general characteristics of your desired extract.
Choose the preferred file structure for your extract: rectangular (all household information attached to respective household members) or hierarchical (household record followed by person records). The system defaults to rectangular format, which is the overwhelming choice of researchers.
You can choose to use long or short variable names for the unharmonized variables -- the variables unique to each dataset that have not been integrated. The short names are always 8 characters in length and are not descriptive. The long names are more interpretable, but they can reach 16 characters in length. The system defaults to the long names.
At the bottom of the page, you select the particular census samples or combination of samples you want to include in your extract. You can choose entire countries at a time using the left-most boxes. Samples that have particularly noteworthy issues with respect to data coverage or other limiting factors have an information icon that displays a short note if you hover over it. Clicking on a sample brings up the short sample description in a separate window.
If you selected samples while browsing the variable documentation, those samples are pre-selected in the extract system.
Extract step 3: select variables -- review
[top]
Step 3 begins and ends with the review screen. When you first encounter the screen, it lists the variables pre-selected by the extract system as well as any variables you selected while browsing the documentation. If you selected variables outside the extract system that are unavailable in the samples you chose for this data extract, those variables will have their checkboxes grayed out. You can return to sample selection and alter your choices now. The grayed-out variables will not persist to subsequent viewings of this page.
When you first reach the review screen, you will typically want to "add more variables," but you can continue to the next step if you are satisfied with your selections. If you choose to add more variables, you must return to this screen before proceeding to the next step.
Extract step 3: select variables -- add
[top]
The heart of the extract process is selecting variables. Use the left side of the menu bar to select integrated variables by group or by letter. You can switch to unharmonized variables at any time, viewing them by group and by sample. Additional viewing options are available using the "Options and Help" menu. Check the box to the left of a variable name to include it in your extract. The variable availability grid is on the right side of the integrated variable list. The "x"s indicate an integrated variable is available in a given sample. Hovering over the country abbreviation will reveal its full name.
You can browse the variable codes and descriptions using the appropriate links. The variable documentation will open in a new tab or window.
More detailed information on using the variable menu is available.
After selecting the variables that you want to include in your extract, click the "Review selected variables" button to return to the review screen. From that screen you can proceed to the next step of the extract process or return to add more variables.
Extract step 4: variable options
[top]
In Step 4 you can specify actions to undertake with the variables included in your extract. These include making new variables using information from different person records ("attach characteristics"), or selecting specific kinds of cases to include in your extract ("select cases"). The variable actions are not applicable to every variable.
You can remove a variable from your extract in this step by unchecking its box in the "include in extract" column.
All actions on this screen are optional.
Extract step 4: variable options (select cases)
[top]
The "select cases" feature allows users to limit their dataset to records with specific values for selected variables, such as males only from the variable sex (SEX). Multiple variables can be used in combination during case selection. Selections on multiple variables are joined by an implicit logical "AND": a record must satisfy every criterion to be included.
You can choose to include only persons who meet your selection criteria, or to include everyone living in a household with a person who meets the criteria. This is an important distinction.
Users should be careful with the case selection feature. It is possible to select a specific variable category (e.g., polygamous marriage) that does not exist across all the samples in your extract, thereby inadvertently excluding those samples from your dataset.
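The difference between the two inclusion rules can be sketched in Python. This is an illustration of the logic only, not the extract system's own code; the records are invented, and the coding SEX = 1 for males and the use of an AGE variable are assumptions made for the example.

```python
# Hypothetical person records; SEX == 1 is assumed to mean male.
persons = [
    {"SERIAL": 1, "PERNUM": 1, "SEX": 1, "AGE": 40},
    {"SERIAL": 1, "PERNUM": 2, "SEX": 2, "AGE": 38},
    {"SERIAL": 2, "PERNUM": 1, "SEX": 1, "AGE": 17},
    {"SERIAL": 2, "PERNUM": 2, "SEX": 2, "AGE": 45},
]

# One test per variable; a record must pass all of them (logical AND).
criteria = {
    "SEX": lambda v: v == 1,
    "AGE": lambda v: v >= 18,
}


def meets_all(person):
    """True only if the person satisfies every criterion."""
    return all(test(person[var]) for var, test in criteria.items())


# Rule 1: keep only the individuals who meet the criteria.
individuals = [p for p in persons if meets_all(p)]

# Rule 2: keep everyone in a household containing a qualifying person.
qualifying_households = {p["SERIAL"] for p in individuals}
whole_households = [p for p in persons if p["SERIAL"] in qualifying_households]
```

With these records the first rule returns one person, and the second returns both members of household 1.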
Extract step 4: variable options (attach characteristics)
[top]
The data extract system can attach a characteristic of a person's mother, father, or spouse as a new variable on the person's record. It can also attach the characteristics of the household head. For example, using the variable "Occupation," it can make a new variable for "Occupation of mother." All persons in the extract who reside in a household with their mother would receive a value for this new variable. Persons without a mother present in the household would receive a missing value. The extract system automatically generates a unique name for the new variable.
The attached-characteristics feature uses the constructed IPUMS family interrelationship "pointer variables" that identify co-resident mothers, fathers, and spouses for each person. The pointer variables identify social mothers and fathers, not strictly biological parents.
Characteristics cannot be attached in samples in which persons are not organized into households.
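A rough Python sketch of what attaching a mother's characteristic involves is shown below. It assumes a pointer variable named MOMLOC that holds the mother's PERNUM within the same household, with 0 meaning no mother is present. The names MOMLOC, OCC, and OCC_MOM and all values are assumptions for illustration; the extract system generates its own variable names.

```python
# Hypothetical records. MOMLOC is assumed to hold the mother's PERNUM
# within the same household, or 0 when no mother is present.
persons = [
    {"SERIAL": 1, "PERNUM": 1, "MOMLOC": 0, "OCC": 110},
    {"SERIAL": 1, "PERNUM": 2, "MOMLOC": 1, "OCC": 0},
    {"SERIAL": 2, "PERNUM": 1, "MOMLOC": 0, "OCC": 220},
]

# Index every person by household and position for direct lookup.
by_position = {(p["SERIAL"], p["PERNUM"]): p for p in persons}

for p in persons:
    mother = by_position.get((p["SERIAL"], p["MOMLOC"])) if p["MOMLOC"] else None
    # Persons without a co-resident mother receive a missing value (None).
    p["OCC_MOM"] = mother["OCC"] if mother is not None else None
```

The same lookup would work for fathers or spouses by swapping in the corresponding pointer variable.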
Extract step 5: customize sample sizes
[top]
This step is optional. At the top of the screen the extract system calculates the expected size of your data extract. If this is too large, or if you want to manage the sizes of the individual samples in your extract, click on the link at the bottom of the screen to use the sample size tool. If you proceed to the next step without making alterations, you will receive all cases from each sample.
Note: the predicted extract size does not take account of any case selection you may have implemented in Step 4. If you used case selection, your extract probably will be smaller than the size reported on this screen.
To customize sample sizes, type a number in an empty cell of the sample-size tool. For any sample, you can enter only one number to define the density; the other two cells will be calculated from that number. An entry in the first row, "All samples," applies the same selection to every sample in your extract. The minimum number of cases for any sample is 10,000 households. If you enter a number larger than the number of cases in a sample, the tool will indicate the maximum number.
At any point you can clear your selections and return to the full sizes for every sample.
The sampling unit for the sample-size tool is the household. The system will draw a systematic sample of every Nth household at the proper density to produce the number of cases you requested. The syntax files created by the extract system contain programming to alter the household and person weights to reflect the new sample densities.
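The idea of drawing every Nth household and adjusting the weights can be sketched in Python as follows. This is a simplified illustration rather than the extract system's actual procedure: it always starts from the first household, and the weight variable name PERWT is an assumption.

```python
def systematic_household_sample(persons, n):
    """Keep every n-th household and multiply person weights by n."""
    serials = sorted({p["SERIAL"] for p in persons})
    kept = set(serials[::n])  # the 1st, (n+1)th, (2n+1)th household, ...
    sample = []
    for p in persons:
        if p["SERIAL"] in kept:
            q = dict(p)                   # copy so the input is not modified
            q["PERWT"] = p["PERWT"] * n   # each kept person now represents n
            sample.append(q)
    return sample


# Ten hypothetical households of two persons each, all with weight 1.0.
persons = [
    {"SERIAL": s, "PERNUM": k, "PERWT": 1.0}
    for s in range(1, 11)
    for k in (1, 2)
]

sample = systematic_household_sample(persons, 5)
weighted_total = sum(q["PERWT"] for q in sample)
```

Because whole households are kept or dropped together, household composition is preserved, and the weighted total still matches the original 20 persons.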
Extract step 6: submit
[top]
Step 6 summarizes the extract you have defined. If it is acceptable, click the button to create your extract. There is a text entry box to write a note to describe the content of your extract for your future reference.
Extracts above a certain size are not allowed by the system. If your extract exceeds the maximum, you will receive a warning to this effect. You must reduce the size of your extract by selecting fewer samples or fewer variables, by performing case selection, or by customizing the size of your samples (Step 5).
A series of buttons near the bottom of the screen allows you to jump back to particular points in the extract process to modify your choices. Use these rather than backing through the system, or some of your selections may be lost. When you use these buttons, you will have to proceed forward through all of the steps of the extract process back to the summary screen. But as it loads each screen, the system will remember all of your selections that still apply.
When you submit an extract, there will be a delay ranging from minutes to a couple of hours, depending on the size of the job. You do not need to wait on our site for the job to be completed. Our system will send you an email when your extract is ready.
All data are produced in gzip-compressed format. Most compression software has no difficulty with the files. The system produces only fixed-column ASCII data, but with each extract it generates SAS, SPSS, and Stata command files to read the data into one of those statistical packages. The system also creates a codebook file that describes the content of your extract.
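For users working outside those packages, a gzip-compressed fixed-column file can also be read with a few lines of Python. The sketch below is only an illustration: the column layout, the file name, and the demo records are invented, and the real column positions must be taken from the codebook that accompanies each extract.

```python
import gzip
import os
import tempfile

# Hypothetical layout: (variable name, start column, end column), 0-based,
# end exclusive. Real positions come from the extract's codebook.
LAYOUT = [("YEAR", 0, 4), ("SERIAL", 4, 12), ("PERNUM", 12, 16), ("AGE", 16, 19)]


def read_extract(path, layout=LAYOUT):
    """Read a gzip-compressed fixed-column ASCII file into a list of dicts."""
    rows = []
    with gzip.open(path, "rt", encoding="ascii") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            rows.append({name: int(line[a:b]) for name, a, b in layout})
    return rows


# Write a tiny demo file so the sketch can be run end to end.
demo_path = os.path.join(tempfile.mkdtemp(), "napp_demo.dat.gz")
with gzip.open(demo_path, "wt", encoding="ascii") as fh:
    fh.write("1881000000010001040\n")
    fh.write("1881000000010002038\n")

rows = read_extract(demo_path)
```

Reading the file line by line keeps memory use modest, which matters for complete count extracts.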
The definitions of every extract will remain on our server indefinitely, but the data files are subject to deletion after three days. However, the screen where you download extracts has a feature that lets you revise old extracts. When you click on "revise," all your selections for that extract will be loaded into the system, after which you can edit or regenerate it. Note, however, that each successive data release can create difficulties for recreating old extracts because variable names or codes might change.