
Section B: Substate Region Estimation Methodology

This report includes substate region–level estimates of 23 substance use measures (see Section B.2) using the combined data from the 2004, 2005, and 2006 National Surveys on Drug Use and Health (NSDUHs).

The survey-weighted hierarchical Bayes (SWHB) methodology used in the production of State estimates from the 1999-2006 surveys also was used in the production of the 2004-2006 substate estimates. The SWHB methodology is described by Folsom, Shah, and Vaish (1999). A brief discussion of the precision of the estimates and interpretation of the prediction intervals (PIs) is given in Section B.1. Section B.2 lists the 23 substance use measures for which substate-level small area estimates were produced. The list of predictors used in the 2004-2006 substate-level small area estimation (SAE) modeling is given in Section B.3. The methodology used to select relevant predictors is described in Section B.4. Procedures used to adjust the NSDUH weights for the purpose of obtaining substate small area estimates are described briefly in Section B.5. The goals of the SAE modeling, the general model description, and the implementation of SAE modeling remain the same and are described in Appendix E of the 2001 State report (Wright, 2003). A general model description is given in Section B.6. A short description of the calculation of the rate of first use of marijuana, serious psychological distress (SPD), major depressive episode (MDE), and underage drinking is included in Section B.7.

Small area estimates obtained using the SWHB methodology are design consistent (i.e., for States or substates with large sample sizes, the small area estimates are close to the robust design-based estimates). When aggregated using the appropriate population totals, the substate small area estimates yield national small area estimates that are very close to the national design-based estimates. However, for many reasons, including internal consistency, it is desirable to have national small area estimates exactly match the national design-based estimates. Beginning in 2002, exact benchmarking was introduced (see Appendix A, Section A.4, in Wright & Sathe, 2005). The small area estimates presented here have been benchmarked to the national design-based estimates.
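As a rough illustration of what exact benchmarking accomplishes, the sketch below applies a simple ratio adjustment so that the population-weighted aggregate of a set of small area estimates equals a national design-based estimate. Ratio benchmarking is assumed here only for illustration; the procedure actually used is the one described in Wright and Sathe (2005). The function name and all numbers are hypothetical.

```python
def ratio_benchmark(estimates, populations, national_target):
    """Scale small area prevalence estimates by one common factor so that
    their population-weighted mean equals national_target.

    Simple ratio benchmarking, assumed for illustration only; it is not
    necessarily the procedure used in the report.
    """
    total_pop = sum(populations)
    aggregate = sum(p * n for p, n in zip(estimates, populations)) / total_pop
    factor = national_target / aggregate
    return [p * factor for p in estimates]


# Hypothetical prevalence estimates and population totals for three areas.
example_estimates = [0.040, 0.065, 0.058]
example_populations = [200_000, 500_000, 300_000]
example_benchmarked = ratio_benchmark(
    example_estimates, example_populations, national_target=0.061
)
```

After the adjustment, the population-weighted mean of the returned values matches the target, while the relative ordering of the areas is unchanged.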

B.1. Precision and Validation of the Estimates

The primary purpose of this report is to give policy officials a better perspective on the range of prevalence estimates within and across States. Because the data were collected in a consistent manner by field interviewers who adhered to the same procedures and administered the same questions across all States and substate regions, the results are comparable across the 50 States and the District of Columbia.

The 95 percent PI associated with each estimate provides a measure of the accuracy of the estimate. It defines the range within which the true value can be expected to fall 95 percent of the time. For example, the prevalence of past month use of marijuana in Region 1 in Alabama is approximately 4.0 percent, and the 95 percent PI ranges from 3.0 to 5.3 percent. Therefore, the probability is 0.95 that the true value is within that range. The PI indicates the uncertainty due to both sampling variability and model bias. The key assumption underlying the validity of the PIs is that the State- and substate-level error (or bias) terms in the models behave like random effects with zero means and common variance components.

A comparison of the standard errors (SEs) among substate regions with small (n ≤ 500), medium (500 < n ≤ 1,000), and large (n > 1,000) sample sizes for the 23 measures in this report shows that the small area estimates behave in predictable ways. Regardless of whether or not the substate region is from one of the eight States with a large annual sample size (3,000 to 4,000) or one of the other States (n = 900 annually), the sizes of the PIs are very similar and are primarily a function of the sample size of the substate region and the prevalence estimate of the measure. Substate regions with large sample sizes had the smallest SEs.

For past month use of alcohol, where the national prevalence for all persons aged 12 or older was 51.0 percent (for 2004-2006), the average relative standard error (RSE; see End Note 3) was about 5.5 percent, and the RSE for substate regions with a sample size greater than 1,000 was about 3.4 percent. For substate regions with sample sizes between 500 and 1,000, the average RSE was 4.7 percent; for sample sizes smaller than 500, the average RSE was 6.3 percent.

For past month use of marijuana (with a national prevalence of 6.1 percent), the average RSE was 10.1 percent for substate regions with large samples. For medium sample sizes, the average RSE was 13.0 percent, and for samples smaller than 500, the RSE was 15.6 percent. Substance use measures with lower prevalences, such as past year use of cocaine (2.4 percent nationally), displayed larger average RSEs. For sample sizes greater than 1,000, the average RSE was 14.8 percent. For substate regions of medium sample sizes, the average RSE was 17.7 percent, and for samples smaller than 500, the average RSE was 19.9 percent.
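The RSE figures above follow from the definition given in the end notes (posterior SE divided by the estimate) together with the sample size classes used in this section. A minimal sketch of both calculations is shown below; the function names and the example values are hypothetical and are not taken from the report's tables.

```python
def relative_standard_error(estimate, posterior_se):
    """Return the RSE as a percentage: posterior SE divided by the estimate."""
    if estimate <= 0:
        raise ValueError("estimate must be positive")
    return 100.0 * posterior_se / estimate


def sample_size_class(n):
    """Classify a substate region by sample size, using the cut points
    stated in this section: small (n <= 500), medium (500 < n <= 1,000),
    and large (n > 1,000)."""
    if n <= 500:
        return "small"
    if n <= 1000:
        return "medium"
    return "large"


# Hypothetical region: 5.0 percent prevalence with a posterior SE of 0.5 points.
example_rse = relative_standard_error(estimate=0.05, posterior_se=0.005)
example_class = sample_size_class(750)
```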

The SAE methods used for substate regions in this report were previously validated for the NSDUH State-by-age group small area estimates (Wright, 2002). This validation exercise used direct estimates from pairs of large sample States (n = 7,200) as internal benchmarks. These internal benchmarks were compared with small area estimates based on random subsamples (n = 900) that mimicked a single year small State sample. The associated age group–specific small area estimates were based on sample sizes targeted at n = 300. Therefore, validation of the State-by-age group small area estimates should lend some validity to the small sample size substate small area estimates reported here.

B.2. Variables Modeled

Substate-level small area estimates were produced for the following set of 23 binary (0, 1) substance use and mental health measures, using combined data from the 2004-2006 NSDUHs:

  1. past month use of illicit drugs,
  2. past month use of illicit drugs other than marijuana,
  3. past month use of marijuana,
  4. average annual rate of first use of marijuana,
  5. perceptions of great risk of smoking marijuana once a month,
  6. past year use of marijuana,
  7. past year use of cocaine,
  8. past year nonmedical use of pain relievers,
  9. past month use of alcohol,
  10. past month binge alcohol use,
  11. perceptions of great risk of having five or more drinks of an alcoholic beverage once or twice a week,
  12. past month use of cigarettes,
  13. past month use of tobacco products,
  14. perceptions of great risk of smoking one or more packs of cigarettes per day,
  15. past year alcohol dependence,
  16. past year illicit drug dependence,
  17. past year alcohol dependence or abuse,
  18. past year illicit drug dependence or abuse,
  19. past year dependence on or abuse of illicit drugs or alcohol,
  20. needing but not receiving treatment for alcohol use in the past year,
  21. needing but not receiving treatment for illicit drug use in the past year,
  22. past year serious psychological distress (SPD), and
  23. past year major depressive episode (MDE).

In addition to the 23 measures listed above, estimates also have been produced for the underage (aged 12 to 20) use of alcohol and underage binge alcohol use.

B.3. Predictors Used in Logistic Regression Models

Local area data used as potential predictor variables in the mixed logistic regression models were obtained from several sources, including Claritas, the U.S. Census Bureau, the Federal Bureau of Investigation (Uniform Crime Reports), Health Resources and Services Administration (Area Resource File), the Bureau of Labor Statistics, the Bureau of Economic Analysis, the Substance Abuse and Mental Health Services Administration (SAMHSA) (National Survey of Substance Abuse Treatment Services [N-SSATS]), and the National Center for Health Statistics (mortality data). The list of sources of data used in the modeling is provided below.

To obtain a detailed list of predictors, please see Appendix A, Section A.2, of the 2005-2006 State estimates report (Hughes et al., 2008).

B.4. Selection of Independent Variables for the Models

No new variable selection was done. The same fixed-effect predictors that were used in modeling the 2004-2005 and 2005-2006 State estimates and the 2002-2004 substate estimates were used to model the 2004-2006 substate estimates.

B.5. Adjustment of Weights

The person-level NSDUH weights are poststratified (adjusted) to match census population counts at the State level. Because the objective here was to produce small area estimates for substate regions, the person-level sampling weights were ratio adjusted to population projections (available from Claritas as shown in the Section E table) at the substate by age group by gender level. The advantage of this ratio adjustment is that the adjusted sampling weights better reflect the demography of the substate regions. The downside is that the national design-based estimates based on the unadjusted sampling weights may differ slightly from those obtained from the adjusted weights. However, because the aim was to produce reliable substate region-level small area estimates, this ratio adjustment to the weights seemed more appropriate. Note that this ratio adjustment was done at the substate region (363 regions) by age group (12 to 17, 18 to 25, 26 to 34, and 35 or older) by gender (male and female) level collectively over the 3 years (2004, 2005, and 2006) of data.
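The ratio adjustment described above can be sketched as follows: within each substate region by age group by gender cell, every weight is multiplied by the ratio of the projected population to the current weighted total. The record layout, field names, and numbers are hypothetical, and the sketch does not attempt to reproduce how the report handled pooling across the 3 survey years.

```python
from collections import defaultdict


def ratio_adjust_weights(records, projections):
    """Ratio adjust person-level weights to cell-level population projections.

    records:     list of dicts with keys 'region', 'age_group', 'gender',
                 and 'weight' (hypothetical layout).
    projections: dict mapping (region, age_group, gender) to the projected
                 population of that cell.
    Returns the adjusted weights in the same order as the input records.
    """
    # First pass: sum the current weights within each cell.
    cell_totals = defaultdict(float)
    for rec in records:
        cell = (rec["region"], rec["age_group"], rec["gender"])
        cell_totals[cell] += rec["weight"]

    # Second pass: scale each weight so the cell total matches its projection.
    adjusted = []
    for rec in records:
        cell = (rec["region"], rec["age_group"], rec["gender"])
        adjusted.append(rec["weight"] * projections[cell] / cell_totals[cell])
    return adjusted
```

After the adjustment, the weights in each cell sum to that cell's projected population, which is the property the section relies on.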

B.6. General Model Description

The model described here is similar to the logistic mixed hierarchical Bayes (HB) model that was used to produce the 1999-2001 and the 2002-2004 substate small area estimates (Office of Applied Studies [OAS], 2005a, 2006). The following model was used:

$$\log\left[\frac{\pi_{aijk}}{1 - \pi_{aijk}}\right] = x_{aijk}'\,\beta_a + \eta_{ai} + v_{aij},$$

where $\pi_{aijk}$ is the probability of engaging in the behavior of interest (e.g., using marijuana in the past month) for person $k$ belonging to age group $a$ in substate region $j$ of State $i$. Let $x_{aijk}$ denote a $p_a \times 1$ vector of auxiliary variables associated with age group $a$ (12 to 17, 18 to 25, 26 to 34, and 35 or older) and $\beta_a$ denote the associated vector of regression parameters. The age group-specific vectors of auxiliary variables are defined for every block group in the Nation and also include person-level demographic variables, such as race/ethnicity and gender. The vectors of random effects $\eta_i = (\eta_{1i}, \ldots, \eta_{Ai})'$ and $v_{ij} = (v_{1ij}, \ldots, v_{Aij})'$ are assumed to be mutually independent with $\eta_i \sim N_A(0, D_\eta)$ and $v_{ij} \sim N_A(0, D_v)$, where $A$ is the total number of individual age groups modeled (generally $A = 4$). For HB estimation purposes, an improper uniform prior distribution is assumed for $\beta_a$, and proper Wishart prior distributions are assumed for $D_\eta^{-1}$ and $D_v^{-1}$. The HB solution for $\pi_{aijk}$ involves a series of complex Markov Chain Monte Carlo (MCMC) steps to generate values of the desired fixed and random effects from the underlying joint distribution. The basic process is described in Folsom et al. (1999), Shah, Barnwell, Folsom, and Vaish (2000), and Wright (2003).
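To make the structure of the model above concrete, the sketch below evaluates the predicted probability for one person given a set of fixed-effect coefficients and the State-level and substate-level random effects for that person's age group. It shows only the inverse-logit step for one set of parameter values, not the MCMC estimation itself; the function names and all numeric values are hypothetical.

```python
import math


def inverse_logit(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predicted_probability(x, beta, eta_ai, v_aij):
    """Probability implied by the logistic mixed model for one person.

    x:      auxiliary variables for the person (length p_a)
    beta:   regression parameters for the person's age group (length p_a)
    eta_ai: State-level random effect for age group a in State i
    v_aij:  substate-level random effect for age group a in region j
    """
    if len(x) != len(beta):
        raise ValueError("x and beta must have the same length")
    linear_predictor = sum(xk * bk for xk, bk in zip(x, beta)) + eta_ai + v_aij
    return inverse_logit(linear_predictor)


# Hypothetical values: an intercept plus two predictors.
example_p = predicted_probability(
    x=[1.0, 0.3, 1.0], beta=[-3.0, 0.8, 0.2], eta_ai=0.10, v_aij=-0.05
)
```

In an MCMC setting, this evaluation would be repeated for each sampled set of fixed and random effects.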

Once the required number of MCMC samples for the parameters of interest are generated and tested for convergence properties (see Raftery & Lewis, 1992), the small area estimates for each age group by race/ethnicity by gender cell within a block group can be obtained. These block group–level small area estimates then can be aggregated using the appropriate population count projections to form substate- and State-level small area estimates for the desired age group(s). These small area estimates then are benchmarked to the national design-based estimates (see Appendix A, Section A.4, in Hughes et al., 2008).
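The aggregation step can be sketched as a population-weighted average of cell-level estimates. The second function below assumes, for illustration only, that the aggregation is applied to each MCMC draw and the draws are then averaged; the source does not spell out this detail, and the function names and inputs are hypothetical.

```python
def aggregate(cell_estimates, cell_populations):
    """Population-weighted average of cell-level prevalence estimates."""
    total = sum(cell_populations)
    if total <= 0:
        raise ValueError("total population must be positive")
    weighted = sum(p * n for p, n in zip(cell_estimates, cell_populations))
    return weighted / total


def posterior_mean_estimate(draws, cell_populations):
    """Aggregate each MCMC draw to the area level, then average the draws.

    draws: list of lists; each inner list holds one draw's cell-level
           probabilities, in the same order as cell_populations.
    (Per-draw aggregation is an assumption made for this sketch.)
    """
    area_draws = [aggregate(draw, cell_populations) for draw in draws]
    return sum(area_draws) / len(area_draws)
```

The same `aggregate` function applies whether the cells are rolled up to a substate region or to a State; only the set of cells and population counts changes.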

B.7. Calculation of Average Annual Rate (Incidence) of First Use of Marijuana, Serious Psychological Distress, Major Depressive Episode, and Underage Drinking

Incidence rates typically are calculated as the number of new initiates of a substance during a period of time (such as in the past year) divided by an estimate of the number of person years of exposure (in thousands). The incidence definition used in this report employs a simpler form of the at-risk population based on the model-based methodology. This model-based average annual incidence rate for first use of marijuana is defined as follows:

$$\text{Average annual rate} = 100 \times \left\{\left[\frac{X_1}{0.5\,X_1 + X_2}\right] \div 2\right\},$$

where $X_1$ is the number of marijuana initiates in the past 24 months and $X_2$ is the number of persons who never used marijuana. For details on calculating the average annual rate of first use of marijuana from the NSDUH data, see Appendix A, Section A.5, of the 2005-2006 State estimates report (Hughes et al., 2008).
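A short worked example of the formula above may help. The counts below are hypothetical and were chosen so that the arithmetic is easy to follow: with 40 initiates in the past 24 months and 980 persons who never used marijuana, the at-risk denominator is 0.5 × 40 + 980 = 1,000, the 2-year rate is 40 ÷ 1,000 = 4.0 percent, and the average annual rate is 2.0 percent.

```python
def average_annual_rate(x1, x2):
    """Average annual rate of first use, in percent.

    x1: number of initiates in the past 24 months
    x2: number of persons who never used the substance
    Implements 100 * {[x1 / (0.5 * x1 + x2)] / 2}.
    """
    at_risk = 0.5 * x1 + x2
    if at_risk <= 0:
        raise ValueError("at-risk population must be positive")
    return 100.0 * (x1 / at_risk) / 2.0


# Hypothetical counts matching the worked example in the text.
example_rate = average_annual_rate(x1=40, x2=980)
```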

Serious psychological distress (SPD) was measured using the K6 screening instrument for nonspecific psychological distress (Kessler et al., 2003). Responses to the six questions in the scale are combined to generate a score ranging from 0 to 24, with SPD defined as a score of 13 or greater. In the 2004 NSDUH, a random half of the sample of respondents aged 18 or older was administered a "long-form" module, which included additional mental health items preceding the K6 items (sample A), while the other half of the sample was administered a "short-form" module consisting only of the K6 items (sample B). The "short-form" module was continued in the full adult samples in 2005 and 2006. Because of the differential reporting on the K6 items in the context of the "long-form," an adjustment to the K6 data from that half-sample in 2004 was necessary in order to produce the pooled 2004-2006 SPD estimates. The 2004 sample A "long-form" scores were transformed to match the distributional characteristics of the 2004 sample B "short-form" scores using the cumulative distribution function (CDF) adjustment method described in Section A.6, Appendix A, of Wright and Sathe (2006). These adjusted 2004 sample A scores were used in conjunction with the 2004 sample B "short-form" scores and the 2005 and 2006 "short-form" SPD scores to produce the 2004-2006 pooled SPD estimates.
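The scoring rule for the K6 can be sketched as follows. The sketch assumes each of the six items is coded from 0 to 4, which is consistent with the stated total range of 0 to 24, and it applies the cutoff of 13 given above. It does not implement the CDF adjustment applied to the 2004 "long-form" scores; the function names are hypothetical.

```python
def k6_score(responses):
    """Sum six K6 item responses into a total score from 0 to 24.

    Each response is assumed to be coded 0 through 4 (an assumption
    consistent with the 0-to-24 range stated in the text).
    """
    if len(responses) != 6:
        raise ValueError("the K6 has exactly six items")
    if any(r not in range(5) for r in responses):
        raise ValueError("each item must be coded 0, 1, 2, 3, or 4")
    return sum(responses)


def has_spd(responses, cutoff=13):
    """Return True when the K6 total meets the SPD threshold (13 or greater)."""
    return k6_score(responses) >= cutoff
```

For example, responses of `[3, 2, 2, 2, 2, 2]` total 13 and would be classified as SPD, whereas `[2, 2, 2, 2, 2, 2]` total 12 and would not.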

The 2002-2004 substate SPD estimates (which were based on the "long-form" scores) are, therefore, not comparable with the 2004-2006 estimates in this report.

Beginning in 2004, a module was included in the NSDUH questionnaire that obtained data related to having a major depressive episode (MDE); the module was based on the criteria specified for major depression in the Diagnostic and Statistical Manual of Mental Disorders, 4th edition (DSM-IV) (American Psychiatric Association [APA], 1994). These questions permit estimates to be calculated for lifetime and past year prevalence of MDE, treatment for MDE, and role impairment resulting from MDE. For this report, estimates were produced only for having MDE in the past year. Due to minor wording differences in the questions in the adult and adolescent MDE modules, data from youths aged 12 to 17 were not combined with data from persons aged 18 or older to get an overall estimate for those aged 12 or older. Instead, an estimate for those aged 18 or older was produced. For details on how MDE is defined, see Section A.9 in Appendix A of the 2005-2006 State estimates report (Hughes et al., 2008).

To obtain small area estimates for persons aged 12 to 20 for past month alcohol use and binge alcohol use, a separate set of models was fit for these two outcomes for the 12 to 17 age group and the 18 to 20 age group (similar to what was done for producing substate estimates using the 2002-2004 NSDUH data). For details on underage drinking, see Section A.6, Appendix A, of the 2005-2006 State estimates report (Hughes et al., 2008).


End Notes

3 The RSE of an estimate is the posterior SE divided by the estimate itself. Note that the RSEs have been calculated based on the unbenchmarked small area estimates.


This page was last updated on June 19, 2008.