Survey Sampling and Weighting




A sample survey is a method for collecting data from or about the members of a population so that inferences about the entire population can be obtained from a subset, or sample, of the population members. As an example, it may be desired to know the average length of stay in a hospital for surgical versus nonsurgical stays in the US and its territories for the 2012 calendar year. In this situation, a sample of hospital discharges would be obtained along with the duration of stay for each discharge. Estimates of the average length of stay for surgical and nonsurgical discharges would then be calculated and compared. A properly conducted sample survey supports scientifically valid inference from the sample about the population. This article focuses on probability sampling and weighting to support such inference.

The discussion is organized around four major steps: (1) survey requirements, (2) sampling design, (3) weighting, and (4) design effect. Although these steps are presented as a linear process progressing in order, in practice much iteration between the steps will occur while planning a sample survey. For example, the initial requirements may prove to be financially infeasible when determining the sample design and compromises will need to be made in the requirements.




Survey Requirements

An important first step is to establish the objectives of the survey that will drive the design. Major areas to consider include:

Target Population

The target population is the finite set of all elements or units about which inferences or conclusions are to be drawn. The definition of the target population should be exact in terms of element, place, and time. For the hospital length of stay example, the target population is all hospital discharges (element) in the US and its territories (place) during calendar year 2012 (time). Various subpopulations are also identified which will be important in the subsequent analysis. For example, it may be important to differentiate between urban versus rural hospitals or public versus private hospitals.

Survey Variables

Associated with each element of the target population are the survey variables to be measured. A survey is conducted to gain information about one or more population characteristics or parameters, which are defined in terms of the survey variables. For example, the survey variables might be the length of stay for a hospital visit and whether the stay was surgical. The population parameters of interest might be the average length of stay, the median length of stay, or the total number of hospital inpatient days, all by type of discharge (surgical vs. nonsurgical).

Objectives

The objectives of the survey can be descriptive, analytic, or both. The objectives are stated in terms of the population parameters derived from the survey variables. For the hospital stay example, descriptive objectives would include estimating the average length of stay for surgical and nonsurgical stays. Such estimates would be important to planners in determining the number of hospital beds needed in a new hospital or in a service region. The estimates might also be used to determine the anticipated total amount of reimbursement a payer might incur for hospital stays. Alternatively, analytic goals might be to determine factors related to length of hospital stay so that best practices can be established to reduce the average length of stay. For example, average lengths of stay might be compared between surgical modalities or condition treatment plans.

Precision Requirements

The degree of precision required for the survey objectives is needed to establish the final sample design and the sample sizes. For descriptive objectives, precision is usually stated in terms of the maximum standard error of the estimate or in terms of the maximum length of a confidence interval around an estimate. For example, the requirement might be to estimate the average length of stay such that its 95% confidence interval is no wider than plus or minus 0.5 day. Precision for analytic objectives is usually stated in terms of the power, or probability, of rejecting a null hypothesis in favor of an alternative hypothesis for a given value of the population parameters that describe the hypothesis. For example, when testing whether the average surgical length of stay is the same as the average nonsurgical length of stay, it might be required to have an 80% chance of rejecting the null hypothesis of equality when the actual, but unknown, population values differ by more than 1 day, with a type I error rate of 5%.
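The two precision statements above can be turned into rough sample sizes under simple random sampling. The following is a minimal sketch, not a prescribed method: it uses the usual normal approximation, ignores the finite population correction and any design effect, and assumes a hypothetical planning value of 6 days for the standard deviation of length of stay. The function names are illustrative.

```python
import math
from statistics import NormalDist


def n_for_half_width(std_dev, half_width, confidence=0.95):
    """Smallest n with z * std_dev / sqrt(n) <= half_width (SRS, normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * std_dev / half_width) ** 2)


def n_per_group_for_power(std_dev, difference, power=0.80, alpha=0.05):
    """Per-group n for a two-sided two-sample z-test with equal variances and group sizes."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * std_dev / difference) ** 2)


# Hypothetical planning value (an assumption, not from the article): SD = 6 days.
n_descriptive = n_for_half_width(std_dev=6.0, half_width=0.5)
n_analytic = n_per_group_for_power(std_dev=6.0, difference=1.0)
print(n_descriptive, n_analytic)
```

These figures are starting points only; a stratified or clustered design would change them.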

Sampling Design

The sampling design consists of the procedures by which elements are selected into the sample from the population. The major attributes of a sampling design are presented next.

Survey Population

The survey population includes any modification to the target population established because of resource limitations or other feasibility factors on the survey. The survey population usually limits the target population in some way so that the survey is more readily conducted. For example, it might be cost prohibitive to conduct the survey of hospital discharges in the US territories and the survey population would limit the scope of the survey to the 50 states and the District of Columbia of the US.

Sampling Frame

A sampling frame is an important tool in the process of selecting a sample. The sampling frame is the materials or methods which identify and provide access to the elements of the target population. The sampling frame also includes any auxiliary information required to select the sample or to analyze the resulting data. A rule must exist that allows enumeration of all of the elements of the target population. The sampling frame can be a simple listing of all of the members of the target population; for example, a list of all of the hospitals in the US and its territories. More commonly, the sampling frame consists of processes and rules that provide access to the target population. For the hospital length of stay example, a complete listing of all hospital discharges does not exist. However, a multistage approach can be used where lists of hospitals can be used to contact selected hospitals each of which can provide access to a listing of its discharges.

Stratified Sampling

Stratified sampling is one of the major design features used in almost all sample surveys. Stratification is the process of dividing the population into mutually exclusive and exhaustive groups, called strata, and then selecting a separate independent sample from each stratum. When the observations within each stratum are more homogeneous than observations from different strata, the variance of the resulting estimate will be reduced. More importantly, stratification is used to ensure that an adequate sample size is obtained for analysis from the various subpopulations included in the survey objectives. For example, it is likely that a nonstratified random sample of hospitals will not contain enough hospitals from rural areas for analysis purposes. In this situation, stratifying the sample of hospitals by urban versus rural areas allows an adequate sample size of rural hospitals to be selected to support the survey’s analytic objectives.
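As a minimal sketch of this idea, the following selects an independent simple random sample within each stratum. The frame of 90 urban and 10 rural hospitals and the per-stratum sample sizes are hypothetical, chosen only to show how rural hospitals can be deliberately oversampled.

```python
import random


def stratified_sample(frame, stratum_of, n_per_stratum, seed=0):
    """Draw an independent simple random sample (without replacement) in each stratum."""
    rng = random.Random(seed)
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = {}
    for name, units in strata.items():
        # Never request more units than the stratum contains.
        n_h = min(n_per_stratum[name], len(units))
        sample[name] = rng.sample(units, n_h)
    return sample


# Hypothetical frame: each hospital is a (stratum, id) pair.
frame = [("urban", i) for i in range(90)] + [("rural", i) for i in range(10)]
sample = stratified_sample(
    frame,
    stratum_of=lambda unit: unit[0],
    n_per_stratum={"urban": 15, "rural": 5},
)
print({name: len(units) for name, units in sample.items()})
```

Here rural hospitals are sampled at a rate of 5/10 and urban hospitals at 15/90, so the two strata would carry different sampling weights in the analysis.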

Multistage Sampling

Another important and commonly used design feature is multistage sampling. This is a process by which sampling is carried out in two or more stages. At the first, or primary stage, clusters of the sampling elements are formed and a sample of the clusters is selected. At the second stage, a subsample of the elements within each of the selected first-stage clusters is selected. In the hospital length of stay example, the discharges could be clustered by the hospitals where they occurred. A sample of hospitals would then be selected at the first stage followed by a sample of discharges from each of the selected hospitals at the second stage. More than two stages of sampling may be used. For example, a sample of geographic areas, such as US counties, could be selected at the first stage, followed by a second sample of hospitals and then a sample of discharges at the third stage. As noted when discussing sampling frames, multistage sampling is useful when a sampling frame must be constructed in stages. It is also used to control the cost of conducting a survey by concentrating the data collection effort at a limited, and predetermined, number of locations. For example, if a random sample of discharges was selected without clustering, then data collection would occur at a large number of hospitals across the US leading to high data collection costs. However, a two-stage sample of hospitals followed by discharges within selected hospitals would be less expensive as the data collection effort can be concentrated at a smaller number of hospitals.
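A two-stage selection can be sketched as follows. For simplicity this sketch uses simple random sampling at both stages; the hospital and discharge identifiers are hypothetical, and a real design would often use a different first-stage method.

```python
import random


def two_stage_sample(clusters, n_clusters, m_per_cluster, seed=0):
    """Stage 1: sample clusters. Stage 2: sample elements within each selected cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {
        c: rng.sample(clusters[c], min(m_per_cluster, len(clusters[c])))
        for c in chosen
    }


# Hypothetical frame: 20 hospitals, each with a list of 100 discharges.
hospitals = {
    f"H{h:02d}": [f"H{h:02d}-D{d:03d}" for d in range(100)] for h in range(20)
}
sample = two_stage_sample(hospitals, n_clusters=4, m_per_cluster=5)
for hospital, discharges in sample.items():
    print(hospital, discharges)
```

Data collection for the 20 sampled discharges is confined to 4 hospitals, which is the cost advantage described above.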

Probability Sampling

Probability sampling is the mechanism through which inference is extended from the sample to the population. A probability sampling plan associates a nonzero probability of selection with each and every member of the survey population such that the selection probability can be determined for every member of the sample. A random process is used to select the sample so that the desired probabilities of selection are achieved. To demonstrate how probability sampling supports population inference, assume that a probability sample of size $n$ is selected from a survey population of $N$ elements. Let $\delta_i$ be 1 if the $i$-th element of the survey population is selected into the sample and 0 if it is not selected. The probability that the $i$-th element is selected into the sample is $\pi_i = E(\delta_i)$, where the expectation is over the random process used to select the sample. Associated with each survey population element is the value of a survey variable $Y_i$, with the observed value for each sample member being $y_j$. The population total is $Y_+ = \sum_{i=1}^{N} Y_i$, with the sample total estimator being $y_+ = \sum_{j=1}^{n} y_j/\pi_j$. It follows that $E(y_+) = E\left[\sum_{j=1}^{n} y_j/\pi_j\right] = E\left[\sum_{i=1}^{N} \delta_i Y_i/\pi_i\right] = \sum_{i=1}^{N} E(\delta_i) Y_i/\pi_i = Y_+$, showing that the sample total estimator is an unbiased estimate of its corresponding population total.
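The unbiasedness argument can be checked numerically on a tiny population by listing every possible sample. The sketch below assumes simple random sampling without replacement, so every element has selection probability n/N; the five population values are made up for illustration.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical survey values for a population of N = 5 elements.
population = [3, 1, 7, 4, 10]
N, n = len(population), 2

# Under simple random sampling without replacement every element
# has the same selection probability n/N.
pi = Fraction(n, N)

# Enumerate all equally likely samples and compute the estimator
# (sum of sampled values divided by their selection probabilities) for each.
samples = list(combinations(range(N), n))
estimates = [sum(Fraction(population[i]) / pi for i in s) for s in samples]

# The expectation over the sampling process is the plain average of the estimates.
expected_estimate = sum(estimates) / len(samples)
print(expected_estimate, sum(population))
```

Individual estimates differ from the true total, but their average over all possible samples equals it exactly.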

Simple Random Sampling

Simple random sampling is one of the most easily implemented types of probability sampling. The two forms of simple random sampling are with replacement and without replacement. Without replacement sampling assigns the same chance of selection to all $\binom{N}{n}$ possible without replacement samples of size $n$ from a survey population of $N$ elements. With replacement sampling assigns the same chance of selection to all $N^n$ possible with replacement samples. Without replacement, the selection probability for any member of the survey population is $n/N$; with replacement, each member has probability $1/N$ of being selected on each draw, so its expected number of selections is $n/N$. Simple random sampling without replacement is more commonly used and is appropriate when each member of the survey population is of equal interest or importance.
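Both forms can be drawn with Python's standard library, as in the following minimal sketch. The frame of 10 numbered elements and the seed are arbitrary choices for illustration.

```python
import math
import random

N, n = 10, 3
frame = list(range(1, N + 1))  # hypothetical element identifiers

# Number of possible samples under each scheme.
n_samples_without = math.comb(N, n)  # unordered samples without replacement
n_samples_with = N ** n              # ordered samples with replacement

rng = random.Random(2012)
srs_without = rng.sample(frame, n)   # no element can appear twice
srs_with = rng.choices(frame, k=n)   # an element may appear more than once

print(n_samples_without, n_samples_with)
print(srs_without, srs_with)
```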

Probability Proportional To Size Sampling

Probability proportional to size (PPS) sampling is commonly used when selecting multistage samples. PPS sampling, as its name implies, results in each sample member having a selection probability proportional to a measure of its size. For example, the size of a hospital might be measured by its annual number of discharges or its number of beds. Similarly, the size of a geographic unit might be the number of persons living in the unit. The PPS selection probability for a unit is $\pi_i = nS_i/S_+$, where $n$ is the sample size, $S_i$ is the size measure for the $i$-th unit, and $S_+ = \sum_{i=1}^{N} S_i$ is the total of all size measures for units in the survey population. When a very large sampling unit has a size measure such that $S_i > S_+/n$, the unit is called a self-representing unit because the formula would give it a selection probability greater than one. In this situation, all self-representing units are included in the sample with probability one and the remainder of the sample is selected PPS from the survey population excluding the self-representing units.
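The handling of self-representing units can be sketched as a short loop: units whose size exceeds the remaining total divided by the remaining sample size are taken with certainty, and the check is repeated on what is left. This is an illustrative sketch that computes selection probabilities only, not a sample-selection routine; the function name and the size measures are hypothetical.

```python
def pps_probabilities(sizes, n):
    """Selection probabilities proportional to size, with certainty units set to 1."""
    probs = [0.0] * len(sizes)
    remaining = set(range(len(sizes)))
    n_left = n
    while True:
        total = sum(sizes[i] for i in remaining)
        # Units whose size exceeds total / n_left would get a probability above one.
        certain = {i for i in remaining if n_left * sizes[i] > total}
        if not certain:
            break
        for i in certain:
            probs[i] = 1.0
        remaining -= certain
        n_left -= len(certain)
    # Remaining units share the remaining sample size in proportion to size.
    for i in remaining:
        probs[i] = n_left * sizes[i] / total
    return probs


# Hypothetical annual discharge counts (in tens) for five hospitals; sample n = 2.
sizes = [500, 40, 30, 20, 10]
probs = pps_probabilities(sizes, n=2)
print(probs)
```

The first hospital is self-representing, and the other four share the one remaining selection in proportion to their sizes, so the probabilities sum to the sample size.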

Equal Probability Of Selection Method

Equal probability of selection method (EPSEM) is any sampling design that yields equal selection probabilities for the ultimate sampling elements used in the analysis. Having equal selection probabilities for the analysis units is often a desirable property as it usually reduces the variance of the survey estimates. In multistage sampling, this is achieved by combining PPS and simple random sampling at different stages of sampling. For the hospital length of stay example, assume that a two-stage sample of hospitals followed by discharges within the selected hospitals is planned. A common approach would be to select a PPS sample of $n$ hospitals, where the size measure is the number of discharges from the hospital ($S_i$). This would then be followed by a simple random sample of $m$ discharges from each selected hospital. The selection probability for the $j$-th discharge from the $i$-th hospital is the product of the hospital’s selection probability and the conditional selection probability of the discharge within its hospital. This is $\pi_{ij} = \pi_i \times \pi_{j|i} = (nS_i/S_+) \times (m/S_i) = nm/S_+$ and is the same for all discharges regardless of the hospital from which a discharge is selected. In most situations it is not possible to have a size measure from which exactly equal selection probabilities are achieved. However, using a measure of size that is approximately proportional to the desired measure of size will yield nearly equal selection probabilities. In the example, the number of hospital beds or the number of discharges from a previous year would usually be good measures of size.
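The cancellation of the hospital size in the two-stage product can be verified with exact arithmetic. The discharge counts and sample sizes below are hypothetical, and the sketch assumes no hospital is large enough to be self-representing.

```python
from fractions import Fraction

# Hypothetical number of discharges per hospital (the size measure).
discharges = {"A": 1200, "B": 800, "C": 400, "D": 1600}
n_hospitals = 2   # first-stage sample size
m = 50            # discharges sampled per selected hospital
size_total = sum(discharges.values())

overall = {}
for hospital, size in discharges.items():
    p_hospital = Fraction(n_hospitals * size, size_total)  # PPS first stage
    p_within = Fraction(m, size)                           # SRS second stage
    overall[hospital] = p_hospital * p_within

print(overall)
```

Every discharge has the same overall selection probability, n times m divided by the total size, which is 1/40 in this example.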

Weighting

As was shown above, probability sampling provides a process for drawing valid inferences from a small sample about the population parameters of a large population. This is done by defining a sampling weight for each sample member that is the inverse of its sample selection probability. Symbolically, the sampling weight for the $j$-th sample member is $w_j = \pi_j^{-1}$. The sampling weight can be roughly thought of as the number of population members that a sample member represents. The sampling weights are used to expand the sample members up to approximate the population. When all of the selected sample members respond and cooperate with the survey, unbiased estimates of linear population parameters, like population totals, are obtained when the sampling weights are used to expand the sample data. Nonlinear population parameters are consistently estimated through functions of weighted estimates of totals.

To illustrate this process, assume that a probability sample of $n$ hospital discharges, both surgical and nonsurgical stays, has been selected from a total population of $N$ hospital discharges. Two population parameters of interest are the total number of days spent in hospital and the average length of stay, both for surgical hospital stays. Let $Y_i$ be the length of stay associated with the $i$-th discharge in the population and let $\gamma_i$ be 1 if the discharge is for a surgical stay and 0 otherwise. The population total number of days spent in the hospital for surgical stays is $Y_{S+} = \sum_{i=1}^{N} \gamma_i Y_i$ and the population total number of surgical stays is $N_{S+} = \sum_{i=1}^{N} \gamma_i$. Thus, the population average length of surgical stay is $A_S = Y_{S+}/N_{S+}$. The unbiased estimators of $Y_{S+}$ and $N_{S+}$ are the weighted sample values $y_{S+} = \sum_{j=1}^{n} w_j \gamma_j y_j$ and $n_{S+} = \sum_{j=1}^{n} w_j \gamma_j$, respectively. A consistent estimator of the population average length of surgical stay, $A_S$, is the ratio of the two weighted sample values, $a_S = y_{S+}/n_{S+}$. This example demonstrates the general process of using the sampling weights to expand the survey values associated with the sample members to unbiasedly estimate the population totals. The estimated totals are then combined to consistently estimate other population parameters such as means, percentages, and regression coefficients.
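The weighted estimators in this example reduce to a few sums. The sketch below uses five made-up sample records, each with a sampling weight, a surgical indicator, and a length of stay; the values are hypothetical and serve only to show the arithmetic.

```python
# Each record: (sampling weight, surgical indicator, length of stay in days).
# All values are hypothetical.
sample = [
    (100.0, 1, 5.0),
    (100.0, 0, 2.0),
    (250.0, 1, 3.0),
    (250.0, 0, 4.0),
    (50.0, 1, 8.0),
]

# Weighted estimate of total days spent in hospital for surgical stays.
est_surgical_days = sum(w * g * y for w, g, y in sample)

# Weighted estimate of the number of surgical stays in the population.
est_surgical_stays = sum(w * g for w, g, _ in sample)

# Ratio estimator of the average length of a surgical stay.
est_mean_surgical_stay = est_surgical_days / est_surgical_stays

print(est_surgical_days, est_surgical_stays, est_mean_surgical_stay)
```

Note that the unweighted mean of the three surgical records would be 16/3, about 5.33 days, while the weighted estimate is 4.125 days, because the records represent different numbers of population members.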

In almost all surveys some selected sample members will not respond and their data will be missing. Simply omitting the nonrespondents from the analysis will generally bias the resulting estimates. To mitigate the effect of the missing data, adjustments to the sampling weights are used to create analysis weights, which compensate for the nonrespondents in the analyses. Weight adjustment methods are beyond the scope of this article.

Design Effect

Complex sample surveys rarely result in a set of independent and identically distributed observations because of sample design features such as stratification, multistage sampling, and unequal weighting. Such features affect the variance of survey estimates and specialized software is needed for the analysis that allows the sample design to be used when estimating the variances. For example, survey data analysis software is available in SUDAAN®, SAS®, and Stata®.

To understand the effect of the design features, the concept of a design effect is used. The design effect is the ratio of the variance under the sample design used to collect the data to the variance of a simple random sample selected with replacement of the same sample size. Symbolically, the design effect of the mean is $\mathrm{DEFF} = \mathrm{Var}(\bar{y})/(S^2/n)$, where $S^2$ is the population variance of the variable in question, and $\mathrm{Var}(\bar{y})$ and $n$ are the variance of the estimate and the sample size under the sample design used to collect the data.

The sample design feature that usually most affects the variance is multistage sampling. When clusters of observations are selected together, the variance of an estimate is usually increased because the observations within a cluster are most often positively correlated. In a two-stage sample design, where clusters are sampled first followed by individual observations within each cluster, the amount of increase in the variance of the estimated mean is approximately $\mathrm{DEFF} = 1 + (m - 1)\rho_y$, where $m$ is the average number of observations selected per cluster from the analysis domain and $\rho_y$ is the intracluster correlation between two observations in a cluster. In the hospital length of stay example, the clusters are the hospitals and it would be expected that the lengths of stay for discharges from the same hospital are positively correlated. For regression coefficients, the inflation, or possible deflation, in variance is approximately $\mathrm{DEFF} = 1 + (m - 1)\rho_y \rho_x$, where $\rho_y$ and $\rho_x$ are the intracluster correlation coefficients for the dependent variable and the independent variable, respectively. For certain designs and regression models it is possible for $\rho_x$ to be negative, resulting in a decrease in the variance of the estimated coefficient.

A related concept is the effective sample size, which is given by $n_e = n/\mathrm{DEFF}$. The effective sample size is the sample size for a simple random sample selected with replacement that yields the same variance of an estimate as that obtained from the sample design used to collect the data. An example for the mean estimated from a two-stage design illustrates the interpretation of the effective sample size. Consider a two-stage design where 10 ($= m$) sampling units are selected from each of 50 sampled clusters for a total sample size of 500. If $\rho_y = 1$, then DEFF = 10 and $n_e = 50$, the number of clusters. This is the situation where the observations within a cluster are perfectly related and no further information is gained by selecting more than one observation from each cluster. Thus, the effective sample size is the number of clusters. However, if $\rho_y = 0$, then the observations within each cluster are unrelated, DEFF = 1, and $n_e = 500$. This is the situation of independent observations, all of which contribute equal information to the estimate. In most situations, $\rho_y$ is between 0 and 1, and the effective sample size in this example is between 50 and 500.
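The two limiting cases in this example, and an intermediate one, can be reproduced with the approximate two-stage formula for the mean. The intermediate intracluster correlation of 0.05 is a hypothetical value chosen for illustration.

```python
def design_effect(m, rho):
    """Approximate design effect for a mean under two-stage sampling: 1 + (m - 1) * rho."""
    return 1 + (m - 1) * rho


def effective_sample_size(n, m, rho):
    """Effective sample size: total sample size divided by the design effect."""
    return n / design_effect(m, rho)


n, m = 500, 10  # 50 clusters with 10 observations each
for rho in (0.0, 0.05, 1.0):
    deff = design_effect(m, rho)
    n_e = effective_sample_size(n, m, rho)
    print(f"rho={rho:.2f}  DEFF={deff:.2f}  n_e={n_e:.1f}")
```

Even a small intracluster correlation such as 0.05 reduces the effective sample size from 500 to roughly 345 in this design.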

The effective sample size can be used to estimate power or precision when planning a survey. Information from previous studies can be used with the relationships described above to approximate DEFF and hence $n_e = n/\mathrm{DEFF}$. The effective sample size is then substituted for the sample size in a power or precision formula or software package to determine the approximate power or precision.
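As a rough planning sketch, the effective sample size can replace the nominal sample size in the usual normal-approximation confidence interval for a mean. The standard deviation of 6 days and the intracluster correlation of 0.05 are hypothetical planning values, and the function name is illustrative.

```python
import math
from statistics import NormalDist


def ci_half_width(n, m, rho, std_dev, confidence=0.95):
    """Approximate CI half-width for a mean, using n_e = n / DEFF in place of n."""
    deff = 1 + (m - 1) * rho
    n_eff = n / deff
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * std_dev / math.sqrt(n_eff)


# Hypothetical planning values: SD = 6 days, 500 discharges, 10 per hospital.
half_width_clustered = ci_half_width(n=500, m=10, rho=0.05, std_dev=6.0)
half_width_independent = ci_half_width(n=500, m=10, rho=0.0, std_dev=6.0)
print(round(half_width_clustered, 3), round(half_width_independent, 3))
```

Under these assumptions the clustered design gives a half-width of about 0.63 day against about 0.53 day for independent observations, so the nominal sample size would need to grow to meet a 0.5 day requirement.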

References:

  1. Kish, L. (1965). Survey sampling. New York: Wiley.
  2. Lavrakas, P. J. (ed.) (2008). Encyclopedia of survey research methods, vol. 2. Los Angeles: Sage.
  3. Levy, P. S. and Lemeshow, S. (2008). Sampling of populations: Methods and applications, 4th ed. New York: Wiley.
  4. Särndal, C. E., Swensson, B. and Wretman, J. (1992). Model assisted survey sampling. New York: Springer-Verlag.