
2000 State Estimates of Substance Use & Mental Health


Appendix D: Description of the Survey

D.1 Sample Design

The 2000 National Household Survey on Drug Abuse (NHSDA) sample design was part of a coordinated 5-year sample design to provide estimates for all 50 States plus the District of Columbia for the years from 1999 through 2003. The coordinated design facilitates 50 percent overlap in first-stage sampling units (called "area segments") between each 2 successive years.

For the 5-year 50-State design, eight States were designated as large sample States (California, Florida, Illinois, Michigan, New York, Ohio, Pennsylvania, and Texas) with samples large enough to support direct State estimates. Sample sizes in these States ranged from 3,478 to 5,022. For the remaining 42 States and the District of Columbia, smaller, but adequate, samples were selected to support State estimates using small area estimation (SAE) techniques. Sample sizes in these States ranged from 828 to 1,200.

States were first stratified into a total of 900 field interviewer (FI) regions (48 regions in each large sample State and 12 regions in each small sample State). These regions were contiguous geographic areas designed to yield the same number of interviews on average. Within FI regions, adjacent Census blocks were combined to form the area segments. A total of 96 area segments per FI region were selected with probability proportional to population size in order to support the 5-year sample and any supplemental studies that the Substance Abuse and Mental Health Services Administration (SAMHSA) may choose to field. Eight sample segments per FI region were fielded during the 2000 survey year.
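The region and segment counts above can be cross-checked with simple arithmetic. The sketch below uses only the figures quoted in this section; the two segment totals are implied by those figures rather than stated in the text, and the variable names are illustrative.

```python
# Counts taken from the sample design description above.
LARGE_STATES = 8
SMALL_STATES = 42 + 1               # 42 States plus the District of Columbia
REGIONS_PER_LARGE_STATE = 48
REGIONS_PER_SMALL_STATE = 12
SEGMENTS_SELECTED_PER_REGION = 96   # selected to support the 5-year design
SEGMENTS_FIELDED_PER_REGION = 8     # fielded during the 2000 survey year


def design_counts():
    """Return the FI region count and the implied segment totals."""
    fi_regions = (LARGE_STATES * REGIONS_PER_LARGE_STATE
                  + SMALL_STATES * REGIONS_PER_SMALL_STATE)
    return {
        "fi_regions": fi_regions,
        # Implied totals (not stated in the text): regions times segments.
        "segments_selected": fi_regions * SEGMENTS_SELECTED_PER_REGION,
        "segments_fielded_2000": fi_regions * SEGMENTS_FIELDED_PER_REGION,
    }


counts = design_counts()
print(counts)
```

The region total reproduces the 900 FI regions reported above.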

These sampled segments were allocated equally into four separate samples, one for each 3-month period during the year, so that the survey is essentially continuous in the field. In each of these area segments, a listing of all addresses was made, from which a sample of 215,860 addresses was selected (see Table E.1 in Appendix E). Of these, 182,576 were determined to be eligible sample units. In these sample units (which can be either households or units within group quarters), sample persons were randomly selected using an automated screening procedure programmed in a handheld computer carried by the interviewers. The number of sample units completing the screening was 169,769. Youths aged 12 to 17 years and young adults aged 18 to 25 years were oversampled at this stage. Because the overall sample was large, there was no need to oversample racial/ethnic groups, as was done in NHSDAs prior to 1999. A total of 91,961 persons were selected nationwide (see Table E.2 in Appendix E). Consistent with previous NHSDAs, the final respondent sample of 71,764 persons was representative of the U.S. general population (since 1991, the civilian, noninstitutional population) aged 12 or older. In addition, State samples were representative of their respective State populations. More detailed information on the disposition of the national screening and interview sample can be found in Appendix E.
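As a rough illustration of how the counts above relate to one another, the following sketch computes unweighted ratios between successive stages. These are not the official response rates, which are weighted and reported in Appendix E; in particular, the interview ratio here divides by all selected persons. The function and variable names are illustrative.

```python
# Counts quoted in the paragraph above.
ADDRESSES_SELECTED = 215_860
ELIGIBLE_UNITS = 182_576
SCREENINGS_COMPLETED = 169_769
PERSONS_SELECTED = 91_961
INTERVIEWS_COMPLETED = 71_764


def unweighted_ratio(numerator, denominator):
    """Simple unweighted ratio of two sample counts."""
    return numerator / denominator


eligibility_ratio = unweighted_ratio(ELIGIBLE_UNITS, ADDRESSES_SELECTED)
screening_ratio = unweighted_ratio(SCREENINGS_COMPLETED, ELIGIBLE_UNITS)
interview_ratio = unweighted_ratio(INTERVIEWS_COMPLETED, PERSONS_SELECTED)

for label, value in [("eligible / selected addresses", eligibility_ratio),
                     ("screened / eligible units", screening_ratio),
                     ("interviewed / selected persons", interview_ratio)]:
    print(f"{label}: {value:.1%}")
```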

The survey covers residents of the following civilian domiciles, as well as civilians living on military bases: (a) households (houses/townhouses, apartments, condominiums, etc.), and (b) noninstitutional group quarters (shelters, rooming/boarding houses, college dormitories, migratory workers' camps, halfway houses, etc.). Although the survey covers these types of units (they are given a nonzero probability of selection), sample sizes of most specific groups are too small to provide separate estimates. Persons excluded from the survey include homeless people who do not use shelters, active military personnel, and residents of institutional group quarters, such as correctional facilities, nursing homes, mental institutions, and hospitals.

Unlike the 1999 NHSDA, which also included a supplemental sample using the paper-and-pencil interviewing (PAPI) mode for the purposes of measuring trends with estimates comparable with estimates from 1998 and prior years, the 2000 NHSDA was fielded entirely using computer-assisted interviewing (CAI) methods.

D.2 Data Collection Methodology

The data collection method used in the NHSDA involves in-person interviews with sample persons, incorporating procedures that would be likely to increase respondents' cooperation and willingness to report honestly about their illicit drug use behavior. Confidentiality is stressed in all written and oral communications with potential respondents, respondents' names are not collected with the data, and CAI methods, including audio computer-assisted self-interviewing (ACASI), are used to provide a private and confidential setting to complete the interview.

Introductory letters are sent to sampled addresses, followed by an interviewer visit. A 5-minute screening procedure conducted using a handheld computer involves listing all household members along with their basic demographic data. The computer uses the demographic data in a preprogrammed selection algorithm to select 0–2 sample person(s), depending on the composition of the household. This selection process is designed to provide the necessary sample sizes for the specified population age groupings.

Interviewers attempt to immediately conduct the NHSDA interview with each selected person in the household. The interviewer requests the selected respondent to identify a private area in the home away from other household members to conduct the interview. The interview averages about an hour and includes a combination of CAPI (computer-assisted personal interviewing) and ACASI. The interview begins in CAPI mode with the FI reading the questions from the computer screen and entering the respondent's replies into the computer. The interview then transitions to the ACASI mode for the sensitive questions. In this mode, the respondent can read the questions silently on the computer screen and/or listen to the questions read through headphones and enter his or her responses directly into the computer. At the conclusion of the ACASI section, the interview returns to the CAPI mode with the interviewer completing the questionnaire.

No personal identifying information is captured in the CAI record for the respondent. At the end of the day when an interviewer has completed one or more interviews, he or she transmits the data to RTI in Research Triangle Park, North Carolina, via home telephone lines.

D.3 Data Processing

Interviewers initiate nightly data transmissions of interview data and call records on days when they work. Computers at RTI direct the information to a raw data file that consists of one record for each completed interview. Even though much editing and consistency checking is done by the CAI program during the interview, additional, more complex edits and consistency checks were completed at RTI. Most inconsistencies and missing data were resolved using machine editing routines that were developed specifically for the CAI instrument. Cases were retained only if the respondent provided data on lifetime use of cigarettes and at least nine other substances.

D.3.1 Statistical Imputation

For some key variables that still have missing values after the application of editing, statistical imputation is used to replace missing data with appropriate response codes.

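Considerable changes in the imputation procedures used in past NHSDAs were introduced beginning with the 1999 CAI sample. Three types of statistical imputation procedures are used: unweighted sequential hot deck, univariate predictive mean neighborhoods, and multivariate predictive mean neighborhoods.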

Because the primary demographic variables (e.g., age, gender, race/ethnicity, employment, education) are imputed first, few variables are available for model-based imputation. Moreover, most demographic variables have a very low level of missingness. Hence, unweighted sequential hot deck is used to impute missing values for demographic variables. The demographic variables can then be used as covariates in models for drug use measures. These models also include other drug use variables as covariates. For example, the model for cocaine use includes cigarette, alcohol, and marijuana use as covariates. The univariate predictive mean neighborhood method is used as an intermediate imputation procedure for recency of use, 12-month frequency of use, 30-day frequency of use, and 30-day binge drinking frequency for all drugs where these variables occur. The final imputed values for these variables are determined using multivariate predictive mean neighborhoods. The final imputed values for age at first use for all drugs and age at first daily cigarette use are determined using univariate predictive mean neighborhoods.

Hot-deck imputation involves replacing a missing value with a valid code taken from another respondent who is "similar" and has complete data. Responding and nonresponding units are sorted together by a variable or collection of variables closely related to the variable of interest Y. For sequential hot-deck imputation, a missing value of Y is replaced by the nearest responding value preceding it in the sequence. With random nearest neighbor hot-deck imputation, the missing value of Y is replaced by a responding value from a donor randomly selected from a set of potential donors close to the unit with the missing value according to some distance metric.

The predictive mean neighborhood imputation involves determining a predicted mean using a model, such as a linear regression or logistic regression, depending on the response variable, where the models incorporate the design weights. In the univariate case, the neighborhood of potential donors is determined by calculating the relative distance between the predicted mean for an item nonrespondent and the predicted mean for each potential donor, then choosing those within a small preset value (this is the "distance metric"). The pool of donors is further restricted to satisfy logical constraints whenever necessary (e.g., age at first crack use must not be younger than age at first cocaine use).

Whenever possible, more than one response variable was considered at a time. In that (multivariate) case, the Mahalanobis distance across a vector of several response variables' predicted means is calculated between a given item nonrespondent and each candidate donor. The k smallest Mahalanobis distances (say, k = 30) determine the neighborhood of candidate donors, and the nonrespondent's missing values in this vector are replaced by those of the randomly selected donor. A respondent may be missing only some of the responses within this vector of response variables; in that case, only the missing values were replaced, and donors were restricted to be logically consistent with the response variables that were not missing.
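The sequential hot-deck rule described above (sort respondents and nonrespondents together, then fill each missing value from the nearest preceding respondent) can be sketched as follows. This is a minimal illustration, not the survey's production code: the record layout, the sort variable, and the example data are hypothetical.

```python
def sequential_hot_deck(records, sort_key, target):
    """Fill missing `target` values from the nearest preceding respondent.

    `records` is a list of dicts; a missing value is represented by None.
    Records are sorted by `sort_key` (a variable related to `target`).
    Only reported values act as donors, so consecutive missing cases
    receive the same donor value. A missing case with no preceding
    respondent is left missing.
    """
    ordered = sorted(records, key=sort_key)
    last_reported = None
    result = []
    for record in ordered:
        record = dict(record)              # do not modify the input
        if record[target] is None:
            record["imputed"] = last_reported is not None
            record[target] = last_reported
        else:
            record["imputed"] = False
            last_reported = record[target]
        result.append(record)
    return result


# Hypothetical example: impute education after sorting by age.
sample = [
    {"id": 1, "age": 34, "education": "college"},
    {"id": 2, "age": 19, "education": "high school"},
    {"id": 3, "age": 20, "education": None},
    {"id": 4, "age": 36, "education": None},
    {"id": 5, "age": 52, "education": "graduate"},
]
completed = sequential_hot_deck(sample, lambda r: r["age"], "education")
by_id = {r["id"]: r for r in completed}
print(by_id[3]["education"], "|", by_id[4]["education"])
```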

Although statistical imputation could not proceed separately within each State due to insufficient pools of donors, information about the State of residence of each respondent is incorporated in the modeling and hot-deck steps. For most drugs, respondents were separated into three State usage categories for each drug depending on the response variable of interest. Respondents from States with high usage of a given drug were placed in one category, respondents from medium usage States into another, and the remainder into a third category. This categorical "State rank" variable was used as one set of covariates in the imputation models. In addition, eligible donors for each item nonrespondent were restricted to be of the same State usage category (the same "State rank") as the item nonrespondent.
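The univariate neighborhood step, combined with the State-rank restriction on donors, might look like the sketch below. It assumes that predicted means have already been produced by a weighted model (not shown). The field names, the 5 percent threshold, and the definition of relative distance as the absolute difference divided by the nonrespondent's predicted mean are illustrative assumptions, and logical-consistency constraints are omitted.

```python
import random


def donor_pool(recipient, candidates, max_rel_distance=0.05):
    """Candidates in the same State rank with a close predicted mean.

    Assumes recipient["pred_mean"] is strictly positive.
    """
    pool = []
    for cand in candidates:
        if cand["state_rank"] != recipient["state_rank"]:
            continue                       # donors must share the State rank
        rel = abs(cand["pred_mean"] - recipient["pred_mean"]) / recipient["pred_mean"]
        if rel <= max_rel_distance:
            pool.append(cand)
    return pool


def impute_from_neighborhood(recipient, candidates, rng, max_rel_distance=0.05):
    """Pick a random donor from the neighborhood; None if it is empty."""
    pool = donor_pool(recipient, candidates, max_rel_distance)
    if not pool:
        return None
    return rng.choice(pool)["value"]


# Hypothetical data: `value` is a past month use indicator.
nonrespondent = {"pred_mean": 0.20, "state_rank": "high"}
respondents = [
    {"id": "A", "pred_mean": 0.205, "state_rank": "high", "value": 1},
    {"id": "B", "pred_mean": 0.195, "state_rank": "high", "value": 0},
    {"id": "C", "pred_mean": 0.201, "state_rank": "medium", "value": 1},
    {"id": "D", "pred_mean": 0.300, "state_rank": "high", "value": 1},
]
pool_ids = [c["id"] for c in donor_pool(nonrespondent, respondents)]
imputed_value = impute_from_neighborhood(nonrespondent, respondents, random.Random(2000))
print(pool_ids, imputed_value)
```

Respondent C is excluded despite its close predicted mean because it belongs to a different State usage category, and respondent D is excluded because its predicted mean is too far away.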

During the processing of the 2000 NHSDA data, an error was detected in the computer programs that assigned imputed values for drug use variables that had missing information in the 1999 NHSDA data file. These variables are used in making estimates of substance use incidence and prevalence. In preparing the Summary of Findings from the 2000 NHSDA (OAS, 2001), the 1999 data were adjusted to correct for the error. For most substance use measures, the impact of the revision is small. Estimates of lifetime use of substances were not affected at all. Estimates of past year and past month use were all revised, but the updated numbers in many cases are nearly identical to the old ones. The effects of the error are noticeable for only four substances (alcohol, marijuana, inhalants, and heroin), in addition to the composite measures "any illicit drug use" and "any illicit drug other than marijuana." For these substances, all of the revised estimates are lower than the previous ones. For inhalants, the revised estimates are considerably lower, especially among youths. See Appendix E for more detailed information.

D.3.2 Development of Analysis Weights

The general approach to developing and calibrating analysis weights involved developing design-based weights, $d_k$, as the inverse of the selection probabilities of the households and persons. Adjustment factors, $a_k(\lambda)$, were then applied to the design-based weights to adjust for nonresponse, to control for extreme weights when necessary, and to poststratify to known population control totals. In view of the importance of State-level estimates with the new 50-State design, it was necessary to control for a much larger number of known population totals. Several other modifications to the general weight adjustment strategy that had been used in past NHSDAs were also implemented for the first time beginning with the 1999 CAI sample.

Weight adjustments were based on a generalization of Deville and Särndal's (1992) logit model. This generalized exponential model (GEM) (Folsom & Singh, 2000) incorporates unit-specific bounds $(\ell_k, u_k)$, $k \in s$, for the adjustment factor $a_k(\lambda)$ as follows:

$$
a_k(\lambda) = \frac{\ell_k (u_k - c_k) + u_k (c_k - \ell_k) \exp(A_k x_k' \lambda)}{(u_k - c_k) + (c_k - \ell_k) \exp(A_k x_k' \lambda)},
$$

where $c_k$ are prespecified centering constants, such that $\ell_k < c_k < u_k$ and $A_k = (u_k - \ell_k) / [(u_k - c_k)(c_k - \ell_k)]$. The variables $\ell_k$, $c_k$, and $u_k$ are user-specified bounds, and $\lambda$ is the column vector of $p$ model parameters corresponding to the $p$ covariates $x$. The $\lambda$-parameters are estimated by solving

$$
\sum_{k \in s} x_k d_k a_k(\lambda) - \tilde{T}_x = 0,
$$

where $\tilde{T}_x$ denotes control totals, which could be either nonrandom, as is generally the case with poststratification, or random, as is generally the case for nonresponse adjustment.
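To make the GEM adjustment concrete, the sketch below evaluates the adjustment factor for given bounds and solves the calibration equation for a single covariate. The source does not state which solver was used; bisection is swapped in here because it is easy to verify, and it relies on the covariate being nonnegative so that the calibrated total increases with the parameter. All data values and function names are hypothetical.

```python
import math


def gem_factor(eta, lower, center, upper):
    """GEM adjustment factor for linear predictor eta = x_k' lambda."""
    if not lower < center < upper:
        raise ValueError("bounds must satisfy lower < center < upper")
    big_a = (upper - lower) / ((upper - center) * (center - lower))
    growth = math.exp(big_a * eta)
    numerator = lower * (upper - center) + upper * (center - lower) * growth
    denominator = (upper - center) + (center - lower) * growth
    return numerator / denominator


def calibrate_one_covariate(d, x, total, lower, center, upper,
                            lo=-20.0, hi=20.0, tol=1e-10):
    """Find lambda so that sum(d_k * x_k * a_k(lambda)) equals `total`."""
    def excess(lam):
        return sum(dk * xk * gem_factor(lam * xk, lower, center, upper)
                   for dk, xk in zip(d, x)) - total

    if excess(lo) > 0 or excess(hi) < 0:
        raise ValueError("control total is not attainable within the bounds")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if excess(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical data: x indicates membership in one demographic domain.
design_weights = [100.0, 150.0, 250.0]
covariate = [1, 0, 1]
control_total = 420.0
lam = calibrate_one_covariate(design_weights, covariate, control_total,
                              lower=0.5, center=1.0, upper=2.0)
factors = [gem_factor(lam * xk, 0.5, 1.0, 2.0) for xk in covariate]
final_weights = [dk * ak for dk, ak in zip(design_weights, factors)]
print([round(w, 3) for w in final_weights])
```

With a zero linear predictor the factor equals the centering constant, so the unit outside the domain keeps its design weight, while the two units inside the domain are scaled up to meet the control total without leaving the interval between the lower and upper bounds.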

The final weights $w_k = d_k a_k(\lambda)$ minimize the distance function $\Delta(w, d)$ defined as

$$
\Delta(w, d) = \sum_{k \in s} \frac{d_k}{A_k} \left\{ (a_k - \ell_k) \log \frac{a_k - \ell_k}{c_k - \ell_k} + (u_k - a_k) \log \frac{u_k - a_k}{u_k - c_k} \right\}.
$$

This general approach was used at several stages of the weight adjustment process, including (1) adjustment of household weights for nonresponse at the screener level, (2) poststratification of household weights to meet population controls for various demographic groups by State, (3) adjustment of household weights for extremes, (4) poststratification of selected person weights, (5) adjustment of person weights for nonresponse at the questionnaire level, (6) poststratification of person weights, and (7) adjustment of person weights for extremes.

Every effort was made to include as many relevant State-specific covariates (typically defined by demographic domains within States) as possible in the multivariate models used to calibrate the weights (nonresponse adjustment and poststratification steps). Because further subdivision of State samples by demographic covariates often produced small cell sample sizes, it was not possible to retain all State-specific covariates and still estimate the necessary model parameters with reasonable precision. Therefore, a hierarchical structure was used in grouping States with covariates defined at the national level, at the Census division level within the Nation, at the State-group level within the Census division, and, whenever possible, at the State level. In every case, the controls for total population within State and the five age groups within State were maintained. Census control totals by age and race were required for the civilian, noninstitutionalized population of each State. Published Census projections (U.S. Bureau of the Census, 2000) reflected the total residential population (which includes military and institutionalized). The 1990 Census 5 percent public use micro data file (U.S. Bureau of the Census, 1992) was used to distribute the State residential population into two groups, then the method of raking-ratio adjustment was used to get the desired domain-level counts such that they respect both the State-level residential population counts as well as the national-level civilian and noncivilian counts for each domain. This was done for the midpoint of each NHSDA data collection period (i.e., quarter) such that counts aggregated over the quarters correspond to the annual counts.
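The raking-ratio adjustment mentioned above alternately rescales the rows and columns of a seed table until both sets of margins are met. The sketch below is a generic version of that technique, not the procedure actually run for the survey; the two-State table, the seed counts, and the targets are invented for illustration.

```python
def rake(seed, row_targets, col_targets, max_iter=100, tol=1e-8):
    """Iteratively scale `seed` so row and column sums match the targets.

    Assumes all seed cells are positive and that the row targets and
    column targets add up to the same grand total.
    """
    cells = [row[:] for row in seed]
    for _ in range(max_iter):
        # Scale each row to its target.
        for i, target in enumerate(row_targets):
            row_sum = sum(cells[i])
            cells[i] = [value * target / row_sum for value in cells[i]]
        # Scale each column to its target.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in cells)
            for row in cells:
                row[j] *= target / col_sum
        # Stop once the rows still match after the column step.
        if all(abs(sum(row) - t) <= tol for row, t in zip(cells, row_targets)):
            break
    return cells


# Hypothetical example. Rows: two States. Columns: civilian
# noninstitutionalized population versus everyone else.
seed_counts = [[95.0, 5.0], [90.0, 10.0]]     # e.g., shares from an older file
state_resident_totals = [1000.0, 2000.0]      # row targets
national_group_totals = [2800.0, 200.0]       # column targets
raked = rake(seed_counts, state_resident_totals, national_group_totals)
for row in raked:
    print([round(value, 1) for value in row])
```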

Several other enhancements to the weighting procedures were also implemented starting in 1999. The control of extreme weights through winsorization was incorporated into the calibration processes. Winsorization truncates extreme values at prespecified levels and distributes the trimmed portions of weights to the nontruncated cases; note that this process was carried out using the GEM model discussed above. A step was added to poststratify the household-level weights to obtain Census-consistent estimates based on the household rosters from all screened households; these household roster-based estimates then provided the control totals needed to calibrate the respondent pair weights for subsequent planned analyses. An additional step poststratified the selected person sample to conform with the adjusted roster estimates. The final step in poststratification related the respondent person sample to external Census data (defined within State whenever possible as discussed above).
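The survey handled extreme weights inside the GEM calibration. As a simpler standalone illustration of the idea of truncating extreme weights and redistributing the trimmed portion, the sketch below caps weights at an upper level and spreads the excess proportionally over the untruncated cases. The cap, the data, and the proportional redistribution rule are assumptions made for this example.

```python
def winsorize_weights(weights, cap):
    """Cap weights at `cap` and redistribute the excess proportionally.

    Assumes all weights are positive. Redistribution can push other
    weights over the cap, so the step repeats until none exceed it.
    The sum of the weights is preserved.
    """
    adjusted = list(weights)
    if cap * len(adjusted) < sum(adjusted):
        raise ValueError("cap is too low to preserve the weight total")
    while True:
        over = [i for i, w in enumerate(adjusted) if w > cap]
        if not over:
            return adjusted
        excess = sum(adjusted[i] - cap for i in over)
        for i in over:
            adjusted[i] = cap
        free = [i for i, w in enumerate(adjusted) if w < cap]
        free_total = sum(adjusted[i] for i in free)
        for i in free:
            adjusted[i] += excess * adjusted[i] / free_total


# Hypothetical weights with one extreme value.
original = [10.0, 12.0, 15.0, 90.0]
trimmed = winsorize_weights(original, cap=40.0)
print([round(w, 2) for w in trimmed])
```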

D.4 References

Deville, J. C., & Särndal, C. E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87, 376–382.

Folsom, R. E., & Singh, A. C. (2000, August). The generalized exponential model for sampling weight calibration for extreme values, nonresponse, and poststratification. Presented at the Joint Statistical Meetings of the American Statistical Association, Indianapolis, IN.

Office of Applied Studies. (2001). Summary of findings from the 2000 National Household Survey on Drug Abuse (DHHS Publication No. SMA 01–3549, NHSDA Series H–13; available at /p0000016.htm#standard). Rockville, MD: Substance Abuse and Mental Health Services Administration.

U.S. Bureau of the Census [producer and distributor]. (1992). Census 1990 Microdata—Census of Population and Housing, 1990: Public use microdata U.S. [machine-readable file]. Washington, DC: The Census Bureau.

U.S. Bureau of the Census. (2000). Census projections: State population projections: 1995 to 2000. Retrieved March 16, 2001, from http://www.census.gov/ and www.census.gov/population/www/projections/st_yr95to00.html


This page was last updated on December 30, 2008.