Appendix A: Description of the Survey
A.1 Sample Design
The 2001 National Household Survey on Drug Abuse (NHSDA) sample design was part of a coordinated 5-year sample design that will provide estimates for all 50 States plus the District of Columbia for the years 1999 through 2003. The coordinated design facilitates 50 percent overlap in first-stage units (area segments) between each 2 successive years.
For the 5-year 50-State design, 8 States were designated as large sample States (California, Florida, Illinois, Michigan, New York, Ohio, Pennsylvania, and Texas) with samples large enough to support direct State estimates. Sample sizes in these States ranged from 3,502 to 4,023. For the remaining 42 States and the District of Columbia, smaller, but adequate, samples were selected to support State estimates using small area estimation (SAE) techniques. Sample sizes in these States ranged from 852 to 1,069 in 2001.
States were first stratified into a total of 900 field interviewer (FI) regions (48 regions in each large sample State and 12 regions in each small sample State). These regions were contiguous geographic areas designed to yield the same number of interviews on average. Within FI regions, adjacent Census blocks were combined to form the first-stage sampling units, called area segments. A total of 96 segments per FI region were selected with probability proportional to population size in order to support the 5-year sample and any supplemental studies that the Substance Abuse and Mental Health Services Administration (SAMHSA) may choose to field. Eight sample segments per FI region were fielded during the 2001 survey year.
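The counts in this paragraph can be checked with a few lines of arithmetic, and the selection of segments with probability proportional to population size can be sketched as follows. This is only an illustration: the text does not say which probability-proportional-to-size scheme was used, so the systematic method, the function name `pps_systematic`, and the segment population counts below are all assumptions.

```python
import random
from bisect import bisect_right
from itertools import accumulate

# Design counts taken from the text.
LARGE_STATES = 8
SMALL_STATES = 43  # 42 States plus the District of Columbia
fi_regions = LARGE_STATES * 48 + SMALL_STATES * 12
segments_selected = fi_regions * 96      # pool supporting the 5-year sample
segments_fielded_2001 = fi_regions * 8   # fielded in the 2001 survey year


def pps_systematic(sizes, n, rng):
    """Pick n unit indices with probability proportional to size.

    Assumed scheme: systematic sampling along the cumulated sizes.
    A unit larger than the sampling interval can be hit more than once.
    """
    step = sum(sizes) / n
    start = rng.uniform(0, step)
    cumulative = list(accumulate(sizes))
    picks = []
    for i in range(n):
        idx = bisect_right(cumulative, start + i * step)
        picks.append(min(idx, len(sizes) - 1))  # guard against float round-off
    return picks


# Hypothetical population counts for the candidate segments of one FI region.
segment_sizes = [120, 340, 95, 410, 220, 180, 305, 150]
chosen = pps_systematic(segment_sizes, 3, random.Random(2001))
```

Larger segments occupy a longer stretch of the cumulated size scale, so they are more likely to contain one of the equally spaced selection points.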
These sampled segments were allocated equally into four separate samples, one for each 3-month period during the year, so that the survey is essentially continuous in the field. In each of these area segments, a listing of all addresses was made, from which a sample of 203,544 addresses was selected. This sample includes a special supplement added in the New York City area in quarter 4 to provide greater precision for any analyses of the effect of the September 11th events. Of the selected addresses, 171,519 were determined to be eligible sample units. In these sample units (which can be either households or units within group quarters), sample persons were randomly selected using an automated screening procedure programmed in a handheld computer carried by the interviewers. The number of sample units completing the screening was 157,471. Youths (aged 12 to 17 years) and young adults (aged 18 to 25 years) were oversampled at this stage. Because of the large sample size associated with this sample, there was no need to oversample racial/ethnic groups, as was done on NHSDAs prior to 1999. A total of 89,745 persons were selected nationwide. Consistent with previous NHSDAs, the final respondent sample of 68,929 persons was representative of the U.S. general population (since 1991, the civilian, noninstitutionalized population) aged 12 or older. In addition, State samples were representative of their respective State populations. More detailed information on the disposition of the national screening and interview sample can be found in Appendix B. Also, additional tables showing sample sizes and estimated population counts for various demographic and geographic subgroups are presented in Appendix G. Definitions of key terms are provided in Appendix D.
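The counts above imply simple unweighted rates at each stage, computed in the short sketch below. These are plain ratios of the figures quoted in the text and are not the official response rates, which may be weighted (see Appendix B); the variable names are illustrative.

```python
# Counts quoted in the text for the 2001 survey.
addresses_selected = 203_544
eligible_units = 171_519
screenings_completed = 157_471
persons_selected = 89_745
respondents = 68_929

# Unweighted ratios only; official rates may be weighted.
eligibility_rate = eligible_units / addresses_selected
screening_rate = screenings_completed / eligible_units
interview_rate = respondents / persons_selected

for label, value in [
    ("eligible among selected addresses", eligibility_rate),
    ("screened among eligible units", screening_rate),
    ("interviewed among selected persons", interview_rate),
]:
    print(f"{label}: {value:.1%}")
```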
The survey covers residents of households (living in houses/townhouses, apartments, condominiums, etc.), noninstitutional group quarters (e.g., shelters, rooming/boarding houses, college dormitories, migratory workers' camps, halfway houses), and civilians living on military bases. Although the survey covers these types of units (they are given a nonzero probability of selection), sample sizes of most specific groups are too small to provide separate estimates. Persons excluded from the survey include homeless people who do not use shelters, active military personnel, and residents of institutional group quarters, such as correctional facilities, nursing homes, mental institutions, and long-term hospitals.
To evaluate the effectiveness of respondent incentives in improving response rates in the NHSDA, an experiment was conducted during the first two quarters of the 2001 survey. A randomized, split-sample, experimental design was embedded within 251 of the main study FI regions to compare the impact of $20 and $40 incentive treatments with a $0 control group on measures of respondent cooperation, data quality, survey costs, and population substance use estimates. To control for interviewer effects, the same FIs were required to work all of the control and treatment cases in an FI region whenever possible. A total of 9,600 respondents participated in the experiment, including 4,233 who received $0, 2,489 who received $20, and 2,878 who received $40. All 9,600 respondents were included in the computation of 2001 NHSDA estimates. For a discussion of the potential impact of the incentive experiment, see Section C.3 in Appendix C.
A.2 Data Collection Methodology
The data collection method used in the NHSDA involves in-person interviews with sample persons, incorporating procedures likely to increase respondents' cooperation and willingness to report honestly about their illicit drug use behavior. Confidentiality is stressed in all written and oral communications with potential respondents, respondents' names are not collected with the data, and computer-assisted interviewing (CAI) methods, including audio computer-assisted self-interviewing (ACASI), are used to provide a private and confidential setting in which to complete the interview.
Introductory letters are sent to sampled addresses, followed by an interviewer visit. A 5-minute screening procedure conducted using a handheld computer involves listing all household members along with their basic demographic data. The computer uses the demographic data in a preprogrammed selection algorithm to select 0-2 sample person(s), depending on the composition of the household. This selection process is designed to provide the necessary sample sizes for the specified population age groupings.
Interviewers attempt to immediately conduct the NHSDA interview with each selected person in the household. The interviewer requests the selected respondent to identify a private area in the home away from other household members to conduct the interview. The interview averages about an hour and includes a combination of CAPI (computer-assisted personal interviewing) and ACASI. The interview begins in CAPI mode with the FI reading the questions from the computer screen and entering the respondent's replies into the computer. The interview then transitions to the ACASI mode for the sensitive questions. In this mode, the respondent can read the questions silently on the computer screen and/or listen to the questions read through headphones and enter his or her responses directly into the computer. At the conclusion of the ACASI section, the interview returns to the CAPI mode with the interviewer completing the questionnaire.
No personal identifying information is captured in the CAI record for the respondent. At the end of the day when an interviewer has completed one or more interviews, he or she transmits the data to RTI in Research Triangle Park, North Carolina, via home telephone lines.
A.3 Data Processing
Interviewers initiate nightly data transmissions of interview data and call records on days when they work. Computers at RTI direct the information to a raw data file that consists of one record for each completed interview. Even though much editing and consistency checking is done by the CAI program during the interview, additional, more complex edits and consistency checks are completed at RTI. Cases are retained only if respondents provided data on lifetime use of cigarettes and at least nine other substances. An important aspect of subsequent editing routines involves assignment of codes when respondents legitimately skipped out of questions that definitely did not apply to them (e.g., if respondents never used a drug of interest). For key drug use measures, the editing procedures identify inconsistencies between related variables. Inconsistencies in variables pertaining to the most recent period in which respondents used a drug are edited by assigning an "indefinite" period of use (e.g., use at some point in the lifetime, which could mean use in the past 30 days or past 12 months). Inconsistencies in other key drug use variables are edited by assigning missing data codes. These indefinite and missing values are then resolved through statistical imputation procedures, as discussed below.
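The case retention rule stated above can be expressed as a small check. The data layout is an assumption made for illustration: each case is represented as a dictionary mapping a substance name to True or False for lifetime use, or None when the respondent provided no data. The function name and substance labels are hypothetical.

```python
def is_usable_case(lifetime_use):
    """Apply the retention rule: lifetime cigarette data plus data on
    at least nine other substances.

    lifetime_use maps substance name -> True, False, or None (no data).
    This layout is hypothetical, not the survey's actual record format.
    """
    if lifetime_use.get("cigarettes") is None:
        return False
    others_with_data = sum(
        1
        for substance, answer in lifetime_use.items()
        if substance != "cigarettes" and answer is not None
    )
    return others_with_data >= 9


# Hypothetical case: cigarette data plus nine other answered substances.
example_case = {"cigarettes": True}
example_case.update({f"substance_{i}": False for i in range(1, 10)})
print(is_usable_case(example_case))
```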
A.3.1 Statistical Imputation
For some key variables that still have missing or ambiguous values after editing, statistical imputation is used to replace ambiguous or missing data with appropriate response codes. For example, the response is ambiguous if the editing procedures assigned a respondent's most recent use of a drug to "use at some point in the lifetime," with no definite period within the lifetime. In this case, the imputation procedures assign a definite value for when the respondent last used the drug (e.g., in the past 30 days, more than 30 days ago but within the past 12 months, more than 12 months ago). Similarly, if the response is completely missing, the imputation procedures replace the missing value with a nonmissing one.
Missing or ambiguous values are imputed using a methodology developed specifically for the NHSDA in 1999 and called predictive mean neighborhoods (PMN). PMN is a combination of a model-assisted imputation methodology and a random nearest neighbor hot-deck procedure. Whenever feasible, the imputation of variables using PMN is multivariate, in which imputation is accomplished on several response variables at once. Variables requiring imputation were the core demographic variables, core drug use variables (recency of use, frequency of use, and age at first use), income, health insurance, and a variety of roster-derived variables.
In the modeling stage of PMN, the model chosen depends on the nature of the response variable Y. In the 2001 NHSDA, the models included binomial logistic regression, multinomial logistic regression, Poisson regression, and ordinary linear regression, where the models incorporate the design weights.
In general, hot-deck imputation replaces a missing or ambiguous value with a value taken from a "similar" respondent who has complete data. For random nearest neighbor hot-deck imputation, the missing or ambiguous value is replaced by a responding value from a donor randomly selected from a set of potential donors. Potential donors are those defined to be "close" to the unit with the missing or ambiguous value, according to a predefined function, called a distance metric. In the hot-deck stage of PMN, the set of candidate donors (the "neighborhood") consists of respondents with complete data who have a predicted mean close to that of the item nonrespondent. In particular, the neighborhood consists of either the set of the closest 30 respondents, or the set of respondents with a predicted mean (or means) within 5 percent of the predicted mean(s) of the item nonrespondent, whichever set is smaller. If no respondents are available who have a predicted mean (or means) within 5 percent of that of the item nonrespondent, the respondent with the predicted mean(s) closest to that of the item nonrespondent is selected as the donor.
In the univariate case, the neighborhood of potential donors is determined by calculating the relative distance between the predicted mean for an item nonrespondent and the predicted mean for each potential donor, then retaining the donors whose predicted means are close under that distance metric. The pool of donors is further restricted to satisfy logical constraints whenever necessary (e.g., age at first crack use must not be younger than age at first cocaine use).
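The univariate neighborhood rule can be sketched as follows. The exact form of the relative distance is not given in the text, so the sketch assumes |donor mean - recipient mean| / |recipient mean| and a nonzero recipient mean. The function name, parameter names, and the `is_allowed` hook for logical constraints are hypothetical.

```python
import random


def pmn_neighborhood(recipient_mean, donor_means, delta=0.05, max_size=30,
                     is_allowed=None):
    """Return indices of candidate donors for one item nonrespondent.

    Assumed relative distance: |donor - recipient| / |recipient|.
    is_allowed(i) optionally encodes logical constraints on donor i.
    """
    pool = sorted(
        (abs(mean - recipient_mean) / abs(recipient_mean), i)
        for i, mean in enumerate(donor_means)
        if is_allowed is None or is_allowed(i)
    )
    if not pool:
        return []
    within = [i for dist, i in pool if dist <= delta]
    if not within:
        return [pool[0][1]]      # nobody within 5 percent: nearest donor only
    return within[:max_size]     # smaller of "within 5 percent" and "closest 30"


# Hypothetical predicted means for four complete-data respondents.
donor_means = [0.10, 0.102, 0.20, 0.099]
neighborhood = pmn_neighborhood(0.10, donor_means)
donor = random.Random(7).choice(neighborhood)  # random draw within the neighborhood
```

Because the pool is sorted by distance, truncating the within-5-percent set at 30 yields whichever of the two candidate sets is smaller.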
Whenever possible, missing or ambiguous values for more than one response variable are considered at a time. In this (multivariate) case, the distance metric is a Mahalanobis distance rather than a relative Euclidean distance. Whether the imputation is univariate or multivariate, only missing or ambiguous values are replaced, and donors are restricted to be logically consistent with the response variables that are not missing. Furthermore, donors are restricted to satisfy "likeness constraints" whenever possible. That is, donors are required to have the same values for variables highly correlated with the response. For example, donors for the age at first use variable are required to be of the same age as recipients, if at all possible. If no donors are available that meet these conditions, the likeness constraints can be loosened.
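For readers unfamiliar with the Mahalanobis distance, a minimal two-variable sketch follows. The text does not state which covariance matrix is used, so the matrix passed in here is an assumption supplied by the caller, and the function name is hypothetical.

```python
import math


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between two 2-vectors of predicted means.

    cov is a symmetric 2x2 covariance matrix ((a, b), (b, c)); which
    covariance the survey used is not stated, so it is a caller input.
    """
    (a, b), (_, c) = cov
    det = a * c - b * b
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = x[0] - y[0], x[1] - y[1]
    # (dx, dy) @ inverse(cov) @ (dx, dy), using the closed-form 2x2 inverse
    squared = (c * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return math.sqrt(squared)


# With an identity covariance the result reduces to Euclidean distance.
print(mahalanobis_2d((0.0, 0.0), (3.0, 4.0), ((1.0, 0.0), (0.0, 1.0))))
```

Unlike a plain Euclidean distance, this metric rescales each predicted mean by its variance and accounts for correlation between the response variables.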
Although statistical imputation could not proceed separately within each State due to insufficient pools of donors, information about each respondent's State of residence was incorporated in the modeling and hot-deck steps. For most drugs, respondents were separated into three "State usage" categories as follows: respondents from States with high usage of a given drug were placed in one category, respondents from States with medium usage into another, and the remainder into a third category. This categorical "State rank" variable was used as one set of covariates in the imputation models. In addition, eligible donors for each item nonrespondent were restricted to be of the same State usage category (i.e., the same "State rank") as the nonrespondent.
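The "State rank" grouping and the donor restriction can be sketched as below. The text does not say where the cut points between high, medium, and remaining States fall, so equal thirds by rank are assumed here. The State codes, prevalence values, and function names are hypothetical.

```python
def state_rank(prevalence_by_state):
    """Assign each State to 'high', 'medium', or 'low' usage.

    Assumption: States are ranked by prevalence and cut into equal thirds.
    """
    ordered = sorted(prevalence_by_state, key=prevalence_by_state.get, reverse=True)
    n = len(ordered)
    ranks = {}
    for position, state in enumerate(ordered):
        if position < n / 3:
            ranks[state] = "high"
        elif position < 2 * n / 3:
            ranks[state] = "medium"
        else:
            ranks[state] = "low"
    return ranks


def eligible_donors(recipient_state, donors, ranks):
    """Keep only donors whose State falls in the recipient's usage category."""
    return [d for d in donors if ranks[d["state"]] == ranks[recipient_state]]


# Hypothetical prevalence estimates for six made-up State codes.
ranks = state_rank({"AA": 0.09, "BB": 0.07, "CC": 0.06,
                    "DD": 0.05, "EE": 0.04, "FF": 0.03})
donors = [{"id": 1, "state": "AA"}, {"id": 2, "state": "CC"}, {"id": 3, "state": "BB"}]
same_category = eligible_donors("BB", donors, ranks)
```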
A.3.2 Development of Analysis Weights
The general approach to developing and calibrating analysis weights involved developing design-based weights, $d_k$, as the inverse of the selection probabilities of the households and persons. Adjustment factors, $a_k(\lambda)$, were then applied to the design-based weights to adjust for nonresponse, to poststratify to known population control totals, and to control for extreme weights when necessary. In view of the importance of State-level estimates with the new 50-State design, it was necessary to control for a much larger number of known population totals. Several other modifications to the general weight adjustment strategy that had been used in past NHSDAs were also implemented for the first time beginning with the 1999 CAI sample.
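Weight adjustments were based on a generalization of Deville and Särndal's (1992) logit model. This generalized exponential model (GEM) (Folsom & Singh, 2000) incorporates unit-specific bounds $(l_k, u_k)$, $k \in s$, for the adjustment factor $a_k(\lambda)$ as follows:
$$a_k(\lambda) = \frac{l_k (u_k - c_k) + u_k (c_k - l_k) \exp(A_k x_k' \lambda)}{(u_k - c_k) + (c_k - l_k) \exp(A_k x_k' \lambda)},$$
where $c_k$ are prespecified centering constants, such that $l_k < c_k < u_k$ and $A_k = (u_k - l_k) / [(u_k - c_k)(c_k - l_k)]$. The variables $l_k$, $c_k$, and $u_k$ are user-specified bounds, and $\lambda$ is the column vector of $p$ model parameters corresponding to the $p$ covariates $x$. The $\lambda$-parameters are estimated by solving
$$\sum_{k \in s} x_k d_k a_k(\lambda) - \tilde{T}_x = 0,$$
where $\tilde{T}_x$ denotes control totals that could be either nonrandom, as is generally the case with poststratification, or random, as is generally the case for nonresponse adjustment.
The final weights $w_k = d_k a_k(\lambda)$ minimize the distance function $\Delta(w, d)$ defined as
$$\Delta(w, d) = \sum_{k \in s} \frac{d_k}{A_k} \left\{ (a_k - l_k) \log \frac{a_k - l_k}{c_k - l_k} + (u_k - a_k) \log \frac{u_k - a_k}{u_k - c_k} \right\}.$$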
This general approach was used at several stages of the weight adjustment process including (1) adjustment of household weights for nonresponse at the screener level, (2) poststratification of household weights to meet population controls for various demographic groups by State, (3) adjustment of household weights for extremes, (4) poststratification of selected person weights, (5) adjustment of person weights for nonresponse at the questionnaire level, (6) poststratification of person weights, and (7) adjustment of person weights for extremes.
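A minimal numerical sketch of the bounded adjustment factor applied at each of these stages is given below. It assumes the commonly cited form of the GEM factor from Folsom and Singh (2000), a_k = [l(u - c) + u(c - l)exp(A x'λ)] / [(u - c) + (c - l)exp(A x'λ)] with A = (u - l) / [(u - c)(c - l)], and it solves the calibration equation for a single covariate by bisection instead of the multivariate solver a production system would need. The function names, bounds, weights, and control total are hypothetical.

```python
import math


def gem_factor(x, lam, lower, center, upper):
    """GEM adjustment factor for one unit (assumed Folsom-Singh form)."""
    a_k = (upper - lower) / ((upper - center) * (center - lower))
    e = math.exp(a_k * sum(xi * li for xi, li in zip(x, lam)))
    numerator = lower * (upper - center) + upper * (center - lower) * e
    denominator = (upper - center) + (center - lower) * e
    return numerator / denominator


def calibrate_scalar(d, x, target, lower, center, upper, lo=-20.0, hi=20.0):
    """Find scalar lambda with sum(d_k * a_k(lambda) * x_k) == target.

    Bisection is valid here because the total increases with lambda when
    every x_k is positive; the target must be reachable within the bounds.
    """
    def total(lam):
        return sum(dk * gem_factor([xk], [lam], lower, center, upper) * xk
                   for dk, xk in zip(d, x))

    for _ in range(200):
        mid = (lo + hi) / 2
        if total(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical design weights, an intercept-only covariate, and a control total.
design_weights = [100.0, 200.0, 300.0]
covariate = [1.0, 1.0, 1.0]
lam_hat = calibrate_scalar(design_weights, covariate, 660.0, 0.5, 1.0, 3.0)
final_weights = [dk * gem_factor([xk], [lam_hat], 0.5, 1.0, 3.0)
                 for dk, xk in zip(design_weights, covariate)]
```

At λ = 0 the factor equals the centering constant, and for any λ it stays strictly between the lower and upper bounds, which is what allows extreme weights to be controlled within the same model.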
Every effort was made to include as many relevant State-specific covariates (typically defined by demographic domains within States) as possible in the multivariate models used to calibrate the weights (nonresponse adjustment and poststratification steps). Because further subdivision of State samples by demographic covariates often produced small cell sample sizes, it was not possible to retain all State-specific covariates (even after meaningful collapsing of covariate categories) and still estimate the necessary model parameters with reasonable precision. Therefore, a hierarchical structure was used in grouping States, with covariates defined at the national level, at the Census division level within the Nation, at the State-group level within Census division, and, whenever possible, at the State level. In every case, the controls for total population within State and the five age groups within State were maintained.
Census control totals by age, race, gender, and Hispanicity were required for the civilian, noninstitutionalized population of each State. Unlike the estimates used for the 1999 and 2000 NHSDAs, population estimates for the year 2001 based on the 1990 Census (after taking account of known demographic changes) were not published, because 2000 Census data would naturally be used for this purpose. However, because of the extensive processing needed for the 2000 Census data, the required controls were not available in time for the 2001 NHSDA data processing. As an alternative, the Population Estimates Branch of the U.S. Bureau of the Census produced, in response to a special request, the necessary population estimates based on the 1990 Census. Use of the 1990 Census-based controls for the 2001 population estimates helped maintain comparability with previous years' controls. The method used to derive the controls differed, however. For 2001, the demographic estimation method was used. In previous years, the 1990 Census 5 percent public use microdata file (U.S. Bureau of the Census, 1992) was used to obtain an initial breakdown of the published State-level Census projections of the total residential population (which includes military and institutionalized persons) for demographic domains into two groups; the raking ratio method was then applied to meet both the State-level residential population counts and the national-level civilian and noncivilian counts for each domain.
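The raking ratio method mentioned here alternately rescales a table so that its rows and then its columns match two sets of known margins. The sketch below shows the idea on a small two-way table; the table, the margins, and the function name are hypothetical and are not the Census figures used in the survey.

```python
def rake(table, row_targets, col_targets, max_iter=100, tol=1e-10):
    """Iteratively scale a two-way table to match row and column totals.

    Assumes the two sets of targets sum to the same grand total.
    """
    cells = [row[:] for row in table]
    for _ in range(max_iter):
        # Scale each row to its target total.
        for i, row in enumerate(cells):
            row_sum = sum(row)
            if row_sum > 0:
                cells[i] = [v * row_targets[i] / row_sum for v in row]
        # Scale each column to its target total.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in cells)
            if col_sum > 0:
                for row in cells:
                    row[j] *= target / col_sum
        # Stop once the row totals still hold after the column step.
        if all(abs(sum(row) - t) <= tol for row, t in zip(cells, row_targets)):
            break
    return cells


# Hypothetical starting breakdown and margins.
raked = rake([[1.0, 1.0], [1.0, 1.0]], row_targets=[30.0, 70.0],
             col_targets=[40.0, 60.0])
```

In the setting described above, one margin would correspond to State-level residential population counts and the other to national-level civilian and noncivilian counts for a domain.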
Several other enhancements to the weighting procedures were also implemented starting in 1999. The control of extreme weights through winsorization was incorporated into the calibration processes for both nonresponse and poststratification adjustment. Winsorization was used to set bounds for extreme values at prespecified levels, and the GEM model was used to adjust the weights within bounds for both extreme and nonextreme weights such that the desired calibration controls were met. A step was added to poststratify the household-level weights to obtain Census-consistent estimates based on the household rosters from all screened households; these household roster-based estimates then provided the control totals needed to calibrate the respondent pair weights for subsequent planned analyses. Also, the adjusted screened household roster-based estimates provided the control totals for the additional step of poststratifying the selected persons sample. This additional step takes advantage of the inherent two phase nature of the NHSDA design. The final step in poststratification related the respondent person sample to external census data (defined within State whenever possible as discussed above).
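The bound-setting part of winsorization can be illustrated in a few lines. This sketch only clips weights at caller-supplied bounds; it does not reproduce the survey's procedure, in which the GEM model then adjusts weights within those bounds so that the calibration controls are still met. The function name, bounds, and weights are hypothetical.

```python
def winsorize(weights, lower, upper):
    """Pull weights outside [lower, upper] back to the nearest bound."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    return [min(max(w, lower), upper) for w in weights]


# Hypothetical weights with one extreme value at each end.
clipped = winsorize([1.0, 5.0, 100.0], lower=2.0, upper=50.0)
```

Plain clipping changes the weighted total, which is why the bounds are embedded in the calibration step rather than applied on their own.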
This page was last updated on June 16, 2008.