This article examines empirical models in health economics and health services research that aim to support causal inference about the effect of a particular variable (the causal variable, X) on an outcome of interest (Y). Such models are typically used to explain past (or predict future) economic behavior, test an economic theory, or evaluate a past or prospective policy intervention. Common to all such applied contexts is the need to infer the effect of a counterfactual, ceteris paribus, exogenous change in X on Y, using statistical results obtained from survey data in which observed differences in X are neither ceteris paribus nor exogenous. The current article focuses on a particular survey context in which such lack of exogenous (and ceteris paribus) control in sampling can lead to bias in causal inference and prediction. In these cases, values of the outcome Y are not observable for all members of the relevant population, and exclusion from the sample is not random. Instead, it is governed by a systematic sample selection (SS) rule which is determined by both observable and unobservable factors. If the unobservable (and, therefore, uncontrollable) factors in the SS rule are common to (or correlated with) unobservable determinants of the outcome, then econometric methods that fail to take account of such correlation will likely produce biased estimates of causal effects. Selection bias will also plague predictions of the outcome obtained from such naive methods.

For example, suppose a particular prescription (Rx) drug (henceforth, the drug) is under consideration for future deregulation and over-the-counter (OTC) sale. It would be of interest (e.g., to the producer of the drug) to know the deregulated (OTC) price elasticity of demand for the drug. In this example, Y is OTC consumption (demand) and X denotes drug price (or out-of-pocket (OOP) per unit payment for the drug). Until the drug is cleared for OTC sales (deregulated), only data on Rx drug prices (or OOP payments) and Rx consumption can be obtained. Would-be OTC purchases cannot be observed because, before deregulation, the drug in question can only be purchased by prescription. In this case, the requirement that a prescription be obtained from a physician serves as a systematic selection rule precluding would-be OTC purchasers from consuming the drug. Suppose that

(a) physicians tend to prescribe the drug in question for the more severely ill,

(b) illness severity is unobserved and positively correlated with OTC demand for the drug, and

(c) price negatively affects physician prescribing behavior.

Under these conditions, applying a method that ignores these facts (e.g., ordinary least squares (OLS)) to a sample of patients for whom the drug has been prescribed, and who are, therefore, consuming the drug, will likely produce a price elasticity estimate that understates the true price response (i.e., is less negative than it should be).

As another example, suppose one seeks to predict the would-be utilization of a particular type of healthcare by currently uninsured individuals, if they were to become insured. One might consider OLS estimation of a health care utilization regression using a sample of insured individuals. The OLS results would then be used to predict utilization for the uninsured as if they were instead insured. Suppose, however, that the following are true:

(d) unobserved health status influences the probability of being insured (adverse selection or cream skimming),

(e) unobserved health status affects healthcare utilization, and

(f) the true predictor model differs between the insured and uninsured.

Under these conditions, a prediction method (e.g., best linear prediction via OLS) that ignores these facts, applied to a sample of insured individuals, will likely yield biased predictions of healthcare utilization for the uninsured if given coverage.

The remainder of the article is organized as follows. In the following section, a more formal discussion of SS bias is presented. In the section ‘Using Control Functions to Correct for Sample Selection Bias’, a commonly implemented remedy for such bias is discussed. The discussion therein begins with linear models. The role of instrumental variables (IVs) in this context is also discussed. Nonlinear models and estimation methods, which are more commonly encountered in health economics and health services research, are then considered. The final section summarizes and concludes.

## Sample Selection Bias Because Of Unobserved Confounders

At issue here is the presence of confounding variables that (1) mask the true causal effect of X on Y (TCE) or (2) bias predictions of Y that ignore them. Define a confounder as a variable that is correlated with Y and also with X, with sample inclusion (S = 1 if included, S = 0 if not), or with both. Confounders may be observable or unobservable (denoted C_{o} and C_{u}, respectively; in the current discussion both are assumed to be scalars, i.e., not vectors). Here it is also assumed that there are no unobservable confounders for X (or for C_{o}); that is, these variables are assumed to be exogenous. Observations on C_{o} can be obtained from the survey data, so its influence can be controlled in the modeling of Y and S and, therefore, in the estimation of the TCE. Clearly, C_{o} can be used directly in the prediction of Y. It is assumed, however, that the correlations between C_{u} and Y, and between C_{u} and S, cannot be ignored (e.g., unobserved factors that are correlated with OTC drug demand are also correlated with physician prescribing behavior). Moreover, one cannot directly control for C_{u} because it is unobservable. If left unaccounted for, the presence of C_{u} will likely cause bias in statistical inference regarding the TCE and in prediction. This happens because estimation methods that ignore the presence of C_{u} will spuriously attribute to X (and C_{o}) observed differences in Y that are, in fact, due to C_{u}. Such bias is referred to as SS bias (or bias because of SS on unobservable confounders).

SS bias can be formally characterized in a useful way. For simplicity of exposition, the true causal relationship between X and Y can be cast as linear and be written as

$$Y = X\beta + C_{o}\beta_{o} + C_{u}\beta_{u} + e \qquad [1]$$

where β is the parameter that captures the TCE, β_{o} and β_{u} are parametric coefficients for the confounders, and e is the random error term (without loss of generality, it may be assumed that the Y intercept is zero). The SS rule determining the observability of Y can be modeled as

$$S = I(Xa + C_{o}a_{o} + Wa_{W} + C_{u} > 0) \qquad [2]$$

where the a’s are parameters, W is an IV (i.e., an observable variable that is correlated with neither C_{u} nor Y; more on this later), and I(A) denotes the indicator function whose value is 1 if condition A holds and 0 otherwise. In the naive approach to prediction and the estimation of the TCE (ignoring the presence of C_{u}), the ordinary least squares method (OLS) would be applied to

$$Y = Xb + C_{o}b_{o} + e \qquad [3]$$

using only observations for which values of Y are observed, where the b’s are parameters and e is the random error term. The parameter b is taken to represent the TCE and Xb + C_{o}b_{o} is the relevant Y predictor. Correspondingly, $\hat{b}$ (the OLS estimate of b) estimates the TCE, and X$\hat{b}$ + C_{o}$\hat{b}_{o}$ (with $\hat{b}_{o}$ being the OLS estimate of b_{o}) is the estimated predictor. It can be shown that OLS will produce unbiased estimates of b and b_{o} (here and henceforth, unbiasedness is to be understood in the context of large samples). It is also easy to show, however, that under general conditions

$$b = \beta + b_{X\lambda}\beta_{u} \qquad [4]$$

and

$$b_{o} = \beta_{o} + b_{C_{o}\lambda}\beta_{u} \qquad [5]$$

where b_{Xλ} is a measure of the correlation between X and λ (a function of Xa + C_{o}a_{o} + Wa_{W}; more on this later), and b_{C_{o}λ} is similarly defined. As is clear from eqn [4], the bias of the naive OLS estimate of the TCE ($\hat{b}$) is b_{Xλ}β_{u}. Moreover, it can be shown that the sign of b_{Xλ} is opposite that of the parameter a in eqn [2]. Consider the OTC drug demand example discussed in the introduction. Here, the TCE of price on OTC demand is β and, under hypotheticals (a), (b), and (c), both β_{u} and b_{Xλ} are positive (the latter because a is negative). By the law of demand, β should be negative and, because b_{Xλ}β_{u} (the bias) is positive, the OLS estimate of the price effect on OTC demand will likely understate the true price effect (in absolute value). By a similar argument, in the insurance coverage and healthcare utilization example discussed earlier, hypotheticals (d), (e), and (f) imply that the bias of the OLS predictor X$\hat{b}$ + C_{o}$\hat{b}_{o}$ is X(b_{Xλ}β_{u}) + C_{o}(b_{C_{o}λ}β_{u}).
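The direction of this bias can be illustrated with a small simulation. The following is a minimal sketch, not part of the original analysis: it assumes a linear outcome equation and a threshold-crossing selection rule with a standard normal C_{u}, and all parameter values, the sample size, and the helper `ols` are hypothetical choices made only for illustration. A constant is included in each regression, which is also an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical parameter values, chosen only for illustration.
beta, beta_o, beta_u = -1.0, 0.5, 1.0   # assumed outcome equation
a, a_o, a_w = -1.0, 0.5, 1.0            # assumed selection rule

X = rng.normal(size=n)      # causal variable, e.g. price
C_o = rng.normal(size=n)    # observable confounder
W = rng.normal(size=n)      # instrument: enters selection only
C_u = rng.normal(size=n)    # unobservable confounder
e = rng.normal(size=n)      # outcome error

Y = X * beta + C_o * beta_o + C_u * beta_u + e
S = X * a + C_o * a_o + W * a_w + C_u > 0   # True where Y is observed


def ols(y, *columns):
    """Least-squares coefficients of y on the given columns."""
    Z = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef


const = np.ones(S.sum())

# Naive OLS on the selected sample, ignoring C_u.
b_naive = ols(Y[S], const, X[S], C_o[S])[1]

# Infeasible benchmark: the same regression if C_u were observable.
b_oracle = ols(Y[S], const, X[S], C_o[S], C_u[S])[1]

print(f"true beta       : {beta:.3f}")
print(f"naive OLS       : {b_naive:.3f}")
print(f"oracle with C_u : {b_oracle:.3f}")
```

With a negative a and a positive β_{u}, as in hypotheticals (a) through (c), the naive coefficient on X should come out less negative than the true β, whereas the infeasible regression that includes C_{u} should recover β closely.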

An approach to estimation is needed that, unlike OLS applied to eqn [3], does not ignore the presence of, and potential SS bias because of, C_{u}. In the following section, methods that correct for selection bias through the inclusion of a control function which accounts for C_{u} are discussed. Such control functions also exploit sample variation in the IV (W) to eliminate SS bias because of correlation between C_{u} and S (more on this later).

## Using Control Functions To Correct For Sample Selection Bias

As eqn [1] demonstrates, if C_{u} were observable then unbiased estimates of β and β_{o} could be obtained by applying simple OLS to eqn [1] using the selected sample (i.e., the subsample with observable data on Y). As it turns out, if C_{u} (albeit unobservable) is assumed to follow a given probability distribution then, based on eqn [2], it can be shown that the following is true for the subset of the population with observable data on Y

$$Y = X\beta + C_{o}\beta_{o} + \beta_{u}\lambda + v \qquad [6]$$

where λ is a function of Xa + C_{o}a_{o} + Wa_{W} and v is the random error term possessing all of the requisite properties for unbiased regression estimation. This control function, which may be more explicitly stated as λ(Xa + C_{o}a_{o} + Wa_{W}), is the λ which is referred to in eqn [4] and eqn [5]. Its direct inclusion as a regressor in eqn [6] would serve to eliminate the SS bias plaguing regression estimation based on eqn [3], which is made explicit in eqn [4] and eqn [5] for OLS. Strictly speaking, however, this is not feasible because λ involves the unknown parameters a, a_{o}, and a_{W}. The remainder of this section considers feasible linear and nonlinear estimators designed to circumvent the nonobservability of λ while producing unbiased estimates of the TCE of X and an accurate predictor for Y.

### Unbiased Estimation Of β And β_{o} In Linear Models

Despite the fact that λ is not directly observable, the parameters of linear models like eqn [1] can be estimated via the following two-stage method. First, estimate a, a_{o}, and a_{W} using the appropriate binary response model for eqn [2]. For example, if C_{u} is standard normally distributed, then estimates of the parameters of eqn [2] can be obtained by applying conventional probit analysis to a sample comprising observations with and without observable values of Y (i.e., both ‘selected’ and nonselected observations). The control function λ can then be estimated as $\hat{\lambda}$ = λ(X$\hat{a}$ + C_{o}$\hat{a}_{o}$ + W$\hat{a}_{W}$), where $\hat{a}$, $\hat{a}_{o}$, and $\hat{a}_{W}$ are the first-stage parameter estimates. In the second stage, unbiased estimates of β and β_{o} can be obtained by applying OLS to

$$Y = X\beta + C_{o}\beta_{o} + \beta_{u}\hat{\lambda} + v \qquad [7]$$

using the subsample of observations for whom Y is observable (S = 1), where $\hat{\lambda}$ is the first-stage estimated value of the control function λ. If C_{u} is assumed to be standard normally distributed, then λ will have the familiar inverse Mills ratio form and the two-stage estimator described here coincides with the Heckman-type SS model.
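The two-stage method can be sketched in code on simulated data. This is an illustrative sketch under stated assumptions rather than a reference implementation: C_{u} is taken to be standard normal, so the control function is the inverse Mills ratio φ(k)/Φ(k) evaluated at the estimated selection index k; the selection rule is assumed to be of threshold-crossing form; and the parameter values and the helpers `probit_fit`, `norm_pdf`, `norm_cdf`, and `ols` are hypothetical. The probit is fitted by a hand-written Fisher scoring loop so that only NumPy is needed, and a constant is added to both stages.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)


def norm_pdf(z):
    return np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))


def probit_fit(Z, s, max_iter=50, tol=1e-8):
    """Probit maximum likelihood via Fisher scoring (s is 0/1)."""
    g = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        k = Z @ g
        p = np.clip(norm_cdf(k), 1e-10, 1.0 - 1e-10)
        d = norm_pdf(k)
        w = d / (p * (1.0 - p))
        score = Z.T @ (w * (s - p))            # gradient of log-likelihood
        info = (Z * (w * d)[:, None]).T @ Z    # expected information
        step = np.linalg.solve(info, score)
        g += step
        if np.max(np.abs(step)) < tol:
            break
    return g


def ols(y, Z):
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef


# Simulated data with hypothetical parameter values.
rng = np.random.default_rng(1)
n = 100_000
beta, beta_o, beta_u = -1.0, 0.5, 1.0
a, a_o, a_w = -1.0, 0.5, 1.0
X, C_o, W, C_u, e = rng.normal(size=(5, n))
S = X * a + C_o * a_o + W * a_w + C_u > 0
Y = X * beta + C_o * beta_o + C_u * beta_u + e   # used only where S is True

# Stage 1: probit of S on (1, X, C_o, W) using ALL observations.
Z1 = np.column_stack([np.ones(n), X, C_o, W])
a_hat = probit_fit(Z1, S.astype(float))
k_hat = Z1[S] @ a_hat
lam_hat = norm_pdf(k_hat) / norm_cdf(k_hat)      # inverse Mills ratio

# Stage 2: OLS of Y on (1, X, C_o, lam_hat) using the selected sample only.
Z2 = np.column_stack([np.ones(S.sum()), X[S], C_o[S], lam_hat])
b_two_step = ols(Y[S], Z2)[1]
b_naive = ols(Y[S], Z2[:, :3])[1]                # same regression without lam_hat

print(f"true beta: {beta:.3f}  naive: {b_naive:.3f}  two-stage: {b_two_step:.3f}")
```

Note that the second-stage standard errors reported by a plain OLS routine would not account for the fact that the control function is itself estimated; the sketch above is concerned only with the point estimates.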

The inclusion of W (the IV) in eqn [2], and in the formulation of λ, warrants some discussion. Note that if the need to control for the unobservable (C_{u}) in eqn [1] could be eliminated, the main source of selection bias would be neutralized and unbiased estimates of β and β_{o} could be obtained by applying OLS to

$$Y = X\beta + C_{o}\beta_{o} + e_{*} \qquad [8]$$

where e_{*} is a random error term that fulfills the conditions for the unbiasedness of OLS. Note that it would not be required to control for C_{u} in eqn [1] if it were, indeed, not a confounder, in which case one could legitimately set β_{u} equal to zero. One way to break the confounding link between C_{u} and S would be to randomize the SS rule. Unfortunately, in applied health economics and health services research, as in other social sciences, explicit randomization (experimentation) is often prohibitively costly or ethically infeasible. A form of pseudorandomization is, however, possible in the context of survey (nonexperimental) data. If, for instance, a variable that is observed as one of the survey items is highly correlated with S but is correlated with neither Y nor C_{u}, then the sample variation (across observations) in the value of that variable can be viewed as providing variation in S that is not correlated with C_{u}: a kind of pseudorandomization for S. The IV W, which was included in eqn [2], serves this purpose.

In the context of the OTC example discussed earlier, any variable that affects physician prescribing behavior, but is not correlated with OTC demand for the drug would be an IV candidate. For example, measures of individual physician overall preference for prescribing the drug have been used as IVs in similar empirical contexts. Likewise in the insurance coverage healthcare utilization prediction example, any observable variable that influences the likelihood of coverage that is not directly correlated with the type of healthcare usage in question can be used as an IV. For example, the existence and features of state-level government programs aimed at facilitating the acquisition of health insurance coverage have been used for this purpose.

It should be noted here that the inclusion of W in eqn [2] (and by implication in λ) is not required for the technical legitimacy, feasibility, or unbiasedness of the two-stage estimator described earlier. Notwithstanding this fact, applications of the two-stage estimator that do not include an IV – so-called identification solely via functional form – are generally viewed as lacking.

### Unbiased Estimation In Nonlinear Models

The linear model (as specified in eqn [1]) does not conform to most empirical contexts in health economics. In most applied settings, the range of the outcome is limited in a way that makes a nonlinear specification more sensible. For example, the researcher is often interested in estimating the TCE of X on whether or not an individual will engage in a specified health-related behavior. In this case, the outcome of interest is binary, so a nonlinear specification of the true causal model would likely be more appropriate. In the OTC drug demand model discussed earlier, the outcome of interest (drug consumption) is nonnegative. An exponential regression specification of the true causal model is more in line with this feature of the data than is the linear specification in eqn [1].

Another common example of inherent nonlinearity in health economics and health services research is the modeling of healthcare expenditure or utilization (E/U). It is typical to observe a large proportion of zero values for the E/U outcome. In this and similar empirical contexts the two-part model (2PM) has been widely implemented. The 2PM allows the process governing observation at zero (e.g., whether or not the individual uses the healthcare service) to systematically differ from that which determines nonzero observations (e.g., the amount the individual uses (or spends on) the service conditional on at least some use). The former can be described as the hurdle component of the model, and the latter is often called the levels part of the model. Both of these components are nonlinear: a binary response model for the hurdle and a nonnegative regression for E/U levels given some utilization.
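The two-part structure can be summarized by a standard decomposition of the conditional mean of a nonnegative outcome. The expression below is a sketch that conditions only on the observables X and C_{o} and sets aside unobserved confounding; the first factor corresponds to the hurdle component and the second to the levels component.

```latex
E[Y \mid X, C_{o}]
  = \underbrace{\Pr(Y > 0 \mid X, C_{o})}_{\text{hurdle}}
    \times
    \underbrace{E[Y \mid Y > 0, X, C_{o}]}_{\text{levels}}
```

The identity holds because observations with Y = 0 contribute nothing to the mean, so each factor can be given its own (nonlinear) specification.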

To accommodate these and other cases, the generic nonlinear version of the true causal model in eqn [1] can be written as

$$Y = \mu(X, C_{o}, C_{u}; \theta) + e \qquad [9]$$

where μ(X, C_{o}, C_{u}; θ) is known except for the parameter vector θ. It is very often assumed that μ(X, C_{o}, C_{u}; θ) = M(Xβ + C_{o}β_{o} + C_{u}β_{u}), where M( ) is a known function and θ = [β β_{o} β_{u}]. In this linear index form the true causal models corresponding to binary and nonnegative outcomes are commonly written, respectively, as

$$Y = F(X\beta + C_{o}\beta_{o} + C_{u}\beta_{u}) + e \qquad [10]$$

and

$$Y = \exp(X\beta + C_{o}\beta_{o} + C_{u}\beta_{u}) + e \qquad [11]$$

where F( ) is a function whose range is the unit interval. It is to be noted here that for the generic nonlinear model characterized by eqn [9] the TCE is not embodied in any particular parameter (e.g., β) as in the linear models defined by eqn [1]. Instead, the TCE will be a nonlinear function of all parameters (θ) and all of the right-hand side variables (X, C_{o}, C_{u}) of the model. Moreover, the exact form of the TCE in nonlinear settings will differ depending on the researcher’s policy-relevant analytic objective(s). The current discussion focuses on estimation of the vector of parameters θ.

The nonlinear generality of eqn [9] brings with it considerable, though not insurmountable, complications in the formulation of the nonlinear analog to eqn [6]. In the generic nonlinear model, although the SS rule is still defined as in eqn [2], the relevant control function is implicit and does not have a closed form, as did λ(Xa + C_{o}a_{o} + Wa_{W}) in the linear case. In light of this, accounting for the presence of C_{u} in eqn [9] does not involve a simple substitution of the control function, as was the case in moving from eqn [1] to eqn [6]. With these issues in mind, the nonlinear analog to eqn [6] can be written as

$$Y = \mu^{*}(X, C_{o}, W; a^{*}, \theta) + v \qquad [12]$$

where μ*(X, C_{o}, W; a*, θ) is a known function derived from eqn [9] and a* = [a a_{o} a_{W}]. Unbiased estimates of the parameters of eqn [12] can be obtained via the following two-stage protocol. First, estimate a* as in the linear case. In the second stage, estimate θ by applying the nonlinear least squares method to the following version of eqn [12]

$$Y = \mu^{*}(X, C_{o}, W; \hat{a}^{*}, \theta) + v \qquad [13]$$

where $\hat{a}^{*}$ = [$\hat{a}$ $\hat{a}_{o}$ $\hat{a}_{W}$] is the first-stage estimate of a*. It should be noted that the only case in which eqn [12] has a closed form is when it is derived from eqn [11], the exponential regression model. In that case, eqn [13] can be written

$$Y = \exp(X\beta + C_{o}\beta_{o})\,\hat{\lambda}^{*} + v \qquad [14]$$

where the estimated control function $\hat{\lambda}^{*}$ is a function of X$\hat{a}$ + C_{o}$\hat{a}_{o}$ + W$\hat{a}_{W}$.
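To see why the exponential case admits a closed form, it may help to work through one set of assumptions under which it arises. The derivation below is a sketch, not necessarily the exact formulation used in the literature cited here: it assumes that C_{u} is standard normal and independent of (X, C_{o}, W), that the selection rule is S = I(k + C_{u} > 0) with index k = Xa + C_{o}a_{o} + Wa_{W}, and that the error in the exponential model has mean zero given all of these. Φ denotes the standard normal distribution function.

```latex
\begin{aligned}
E[Y \mid X, C_{o}, W, S = 1]
  &= \exp(X\beta + C_{o}\beta_{o})\,
     E\!\left[\exp(\beta_{u} C_{u}) \mid C_{u} > -k\right] \\
  &= \exp(X\beta + C_{o}\beta_{o})\,
     \frac{\exp(\beta_{u}^{2}/2)\,\Phi(k + \beta_{u})}{\Phi(k)},
\qquad k = Xa + C_{o}a_{o} + Wa_{W}.
\end{aligned}
```

The second line uses the fact that, for a standard normal variable, the integral of exp(tz)φ(z) over z > -k equals exp(t²/2)Φ(k + t). Under these assumptions the multiplicative control function depends on the selection index k and also on β_{u}, which would therefore be estimated along with β and β_{o} in the second stage.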

## Summary

Often sample inclusion is not random. Unobservable determinants of sample inclusion may also influence the outcome of interest. Naive regression estimates that ignore such latent correlation are subject to a kind of endogeneity bias, so-called SS bias. SS bias in regression parameter estimation is also manifested in corresponding causal inference and prediction. In this article the general circumstances in which SS bias is likely to be a problem are detailed and examples are given. Most empirical studies that confront potential SS bias are cast in a linear framework and implement a relatively simple two-stage method to correct for it. This method, and also the sources and implications of SS bias in linear models, are discussed in detail. Linear models and methods are not, however, compatible with most empirical contexts in health economics and health services research, which often involve outcomes that are qualitative or otherwise limited in range. For this reason, a general nonlinear framework for modeling potential SS is also discussed, and a recently developed two-stage estimation approach (details of which can be found in the references for Further Reading) is outlined.

**References:**

- Gronau, R. (1974). Wage comparisons – A selectivity bias. Journal of Political Economy 82, 1119–1143.
- Heckman, J. (1974). Shadow prices, market wages, and labor supply. Econometrica 42, 679–694.
- Heckman, J. (1976). The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. Annals of Economic and Social Measurement 5, 475–492.
- Heckman, J. (1979). Sample selection bias as a specification error. Econometrica 47, 153–161.
- Olsen, R. J. (1980). A least squares correction for selectivity bias. Econometrica 48, 1815–1820.
- Ray, S. C., Berk, R. A. and Bielby, W. T. (1980). Correcting sample selection bias for bivariate logistic distribution of disturbances. In: Proceedings of the Business and Economics Section of the American Statistical Association, pp. 456–459. Alexandria, Virginia: American Statistical Association.
- Terza, J. V. (1998). Estimating count data models with endogenous switching: Sample selection and endogenous treatment effects. Journal of Econometrics 84, 129–154.
- Terza, J. V. (2009). Parametric nonlinear regression with endogenous switching. Econometric Reviews 28, 555–580.
- Terza, J. V. and Tsai, W. (2006). Censored probit estimation with correlation near the boundary: A useful reparameterization. Review of Applied Economics 2, 1–12.