Cost-effectiveness analyses of health interventions and policies are often conducted using quality-adjusted life-years (QALYs) as the metric for quantifying health outcomes. A related metric called disability-adjusted life-years (DALYs) has been used to assess the burden of disease attributable to different causes as well as in cost-effectiveness analyses, especially those in low- and middle-income settings. Both QALYs and DALYs provide summary measures of health outcomes that (1) combine information on survivorship and the health experience among the living; and (2) accommodate comparisons across diverse types of health problems by expressing outcomes in a common ‘currency’. A critical feature of DALYs and QALYs is that they attach weights to time spent in different states of health to reflect the relative severity of these outcomes. These weights have been called, among other names, ‘health state valuations,’ ‘health-related quality of life weights,’ and ‘health utilities,’ with some associated variation in the interpretation of the meaning of the weights; in this article, these will be referred to as ‘health state valuations.’ Health state valuations are given on a scale that ranges from 0 to 1.0. For QALYs, 1.0 implies a state of optimal health and 0 implies a state equivalent to being dead, whereas for DALYs, the scale is reversed: 0 implies no health loss, whereas 1.0 implies severity equivalent to being dead. For use in QALY and DALY calculations, health state valuations must have interval scale properties, that is, differences between two values on the scale must be meaningful, with a given distance between two scale values having the same significance, no matter where the points are located on the scale. For instance, the difference between 0.4 and 0.6 must be understood as equal to the difference between 0.7 and 0.9.
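As an illustration of how such weights enter a QALY calculation, the following Python sketch sums the time spent in each state multiplied by its valuation. The function name, the list-of-pairs input, and the absence of discounting are assumptions made for this example and are not drawn from any particular study.

```python
from typing import Iterable, Tuple


def quality_adjusted_life_years(profile: Iterable[Tuple[float, float]]) -> float:
    """Sum valuation x duration over a health profile.

    `profile` is a hypothetical input format: (valuation, years) pairs,
    with valuations on the QALY scale (1.0 = optimal health, 0 = dead).
    No discounting is applied (a simplifying assumption).
    """
    total = 0.0
    for valuation, years in profile:
        if years < 0:
            raise ValueError("durations must be non-negative")
        total += valuation * years
    return total


# Five years in optimal health followed by ten years in a state valued at 0.5.
example_qalys = quality_adjusted_life_years([(1.0, 5.0), (0.5, 10.0)])
```

Because the weights are multiplied by durations and then added, the interval scale property described above is what makes the resulting sum interpretable.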
Various methodological and empirical issues relating to health state valuations have inspired a rich and growing literature. The focus of this article is on techniques for eliciting valuations. Where other relevant topics are mentioned, cross-references are provided to those articles in which these topics are treated in greater detail.
Overview Of Techniques
There are six types of techniques that have been used prominently in eliciting health state valuations.
The standard gamble is a method that has its theoretical basis in the von Neumann–Morgenstern axioms of expected utility theory. It aims at measuring the ‘disutility’ of a health state by observing the willingness to accept a certain risk of death in order to avoid the state. In a typical framing of the standard gamble, a respondent is asked to consider a choice between two alternatives. In alternative A, the person would live with a particular health problem (the one for which the valuation is needed) with certainty, for the remainder of his or her life. Alternative B is usually characterized as a risky treatment, with two possible outcomes: life in a state of optimal health, with probability p, or immediate death, with probability (1 - p). The measurement objective for the standard gamble is to identify the probability of optimal health, p, at which the respondent is ‘indifferent’ between alternatives A and B, in other words, the point at which the two alternatives seem equally attractive. Once this indifference point is identified, a health state valuation for the particular health problem of interest is equal to p. The logic of this inference derives from setting the utility of optimal health to 1.0 and that of death to 0 and assuming that at the point of indifference, the respondent considers the expected utility of alternatives A and B to be the same. In mathematical terms, the equality is stated as p × U(optimal health) + (1 - p) × U(death) = U(health outcome), or p × 1.0 + (1 - p) × 0 = U(health outcome), which simplifies finally to U(health outcome) = p.
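The search for the indifference probability can be sketched in Python as a simple bisection. This is only one possible search procedure; the callback `prefers_gamble`, the simulated respondent, and the stopping tolerance are hypothetical devices for this illustration, and actual elicitation protocols may vary p in other ways.

```python
from typing import Callable


def standard_gamble_value(prefers_gamble: Callable[[float], bool],
                          tolerance: float = 1e-4) -> float:
    """Locate the indifference probability p by bisection.

    `prefers_gamble(p)` is a hypothetical callback returning True when the
    respondent chooses the risky treatment (alternative B) at probability p
    of optimal health. The returned p is the health state valuation.
    """
    low, high = 0.0, 1.0
    while high - low > tolerance:
        p = (low + high) / 2.0
        if prefers_gamble(p):
            high = p  # gamble still chosen, so indifference lies at or below p
        else:
            low = p   # certain state chosen, so indifference lies above p
    return (low + high) / 2.0


def simulated_respondent(true_value: float) -> Callable[[float], bool]:
    """Assumed respondent who maximizes expected utility exactly."""
    def prefers_gamble(p: float) -> bool:
        # Expected utility of the gamble is p * 1.0 + (1 - p) * 0 = p.
        return p > true_value
    return prefers_gamble


estimate = standard_gamble_value(simulated_respondent(0.8))
```

For the simulated respondent whose underlying valuation is 0.8, the search converges to a value within the chosen tolerance of 0.8.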
The time trade-off is another of the most widely used methods for eliciting health state valuations. Like the standard gamble, it invokes the notion of willingness to sacrifice something that is valued in order to avoid an inferior health state. In the standard gamble, what is sacrificed is the certainty of survival, whereas in the time trade-off, what is sacrificed is the length of life. The time trade-off asks respondents to consider a choice between two alternatives. The first is to survive for a specified amount of time, t1, with a particular health problem, followed by death. Different time trade-off studies have taken different approaches to defining t1, including using an arbitrary duration such as 10 years or using the respondent’s estimated life expectancy (or some rough approximation to this). The second alternative in the time trade-off is to survive a (presumably) shorter amount of time, t2, but in optimal health. The measurement approach in the time trade-off is usually to hold t1 constant and vary the amount of time t2 until the indifference point is identified. A health state valuation may then be computed as the ratio t2/t1. The logic of this inference, as in the standard gamble, is to equate the overall value of the two alternatives at the respondent’s indifference point. In the case of the time trade-off, the value of an alternative is taken to be the product of its health valuation and its duration. Thus, the indifference point implies the equality U(health outcome) × t1 = U(optimal health) × t2. Again taking the valuation of optimal health to be 1.0, this simplifies to U(health outcome) = t2/t1.
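The ratio t2/t1 can be expressed as a short Python function. The function name and the input checks are assumptions made for this illustration, which covers only states regarded as no worse than being dead.

```python
def time_tradeoff_value(t1: float, t2: float) -> float:
    """Valuation implied by indifference between t1 years in the health
    state and t2 years in optimal health (U(optimal health) = 1.0)."""
    if t1 <= 0:
        raise ValueError("t1 must be positive")
    if not 0 <= t2 <= t1:
        raise ValueError("t2 must lie between 0 and t1")
    return t2 / t1


# A respondent indifferent between 10 years in the state and 7 years in
# optimal health implies a valuation of 7/10.
tto_example = time_tradeoff_value(t1=10.0, t2=7.0)
```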
The rating scale approach comes from psychometrics. In contrast to both the standard gamble and time trade-off, it consists in eliciting a numerical valuation for a health outcome directly, without invoking the notion of sacrifice. A series of health outcomes are often simultaneously located on a numerical scale such that a respondent evaluating outcomes A, B, and C must consider whether A is preferable to B, B preferable to C, and A preferable to C, and also to decide the strength of these preferences, in other words the distances between them on the numerical scale. A number of different ways of operationalizing a rating scale are possible but most feature a straight line with marked intervals (like a meter stick), with the endpoints marked with numbers (e.g., 1.0 and 0 or 100 and 0) and labels referring to the best outcome (e.g., as ‘perfect health’ or ‘best imaginable health state’) and the worst outcome (e.g., as ‘worst imaginable health state’ or ‘dead’). Sometimes the marked intervals are accompanied by numerical labels. Rating scales can also be constructed without marked intervals, in which case they are called ‘visual analogue scales.’ In practice, the latter term is sometimes used in a generic way that includes both marked and unmarked scales. There has also been variation in practice concerning the range of the scale. If researchers want to accommodate states that are regarded as worse than being dead, then the scale spans from the best to the worst imaginable outcomes, and respondents are asked to locate ‘dead’ on the scale amidst one or more nonfatal outcomes. Issues around states regarded as worse than being dead are mentioned in the Section on Key Conceptual and Methodological Issues. 
Typically, health state valuations are derived from rating scales by taking the ratio of the distance between a particular state and the point on the scale assigned to ‘dead’, divided by the distance between the upper endpoint of the scale (the best outcome) and the point assigned to ‘dead.’
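This rescaling can be written as a small Python function. The argument names and the example scale positions below are hypothetical, chosen only to show the arithmetic.

```python
def rating_scale_value(state: float, dead: float, best: float) -> float:
    """Rescale a rating scale position so that 'dead' maps to 0 and the
    best outcome maps to 1.0. Positions below 'dead' yield negative values."""
    if best <= dead:
        raise ValueError("the best outcome must be located above 'dead'")
    return (state - dead) / (best - dead)


# Hypothetical responses on a 0-100 scale where 'dead' was placed at 20.
moderate_state = rating_scale_value(state=60.0, dead=20.0, best=100.0)
severe_state = rating_scale_value(state=10.0, dead=20.0, best=100.0)
```

In this hypothetical case the first state receives a valuation of 0.5, whereas the second, placed below ‘dead’ on the scale, receives a negative valuation.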
Another technique that arose, like the rating scale, from the direct measurement tradition of psychometrics is magnitude estimation (sometimes called ‘ratio scaling’). In this approach, a respondent is given one health state as a reference benchmark, and then asked to indicate how many times better or worse some other states are compared with the reference state. Sometimes, the reference state has been defined as an endpoint of the scale (e.g., the most desirable outcome), although other studies have chosen an intermediate state as the reference. For example, a seminal magnitude estimation study by Patrick et al. (1973) anchored comparisons to a reference item describing ‘a day in the life of a person who was as healthy as possible on that day,’ which was assigned an arbitrary score of 1000. Other days were to be scored in relation to this reference; for instance, the instructions noted that a day that was regarded as ‘half as desirable as the standard’ should be scored at 500. Based on this scheme, results may be rescaled to the unit interval simply by dividing the scores by 1000. In fact, by translating the ratios into scores in this way, the operationalization of the task comes to bear a strong resemblance to the rating scale. Another influential magnitude estimation study, by Rosser and Kind (1978), anchored the comparison task with the second best outcome (no disability and mild distress) as the reference, and asked respondents to indicate how many times worse other states were compared with this reference. In this case, rescaling of the results depended on normalizing the scale, so that the best state had a value of 1.0 and death a value of 0. Thus, if death were regarded as 200 times worse than the reference state, this would result in the reference state having a value of 1 - (1/200) = 0.995; a state that was considered 10 times as bad as the reference state would then have a value of 1 - 10 × (1 - 0.995) = 0.95.
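The arithmetic of the second rescaling example can be reproduced with the following Python sketch. It follows only the worked numbers given above and is not claimed to reproduce the full procedure of the original study; the function and parameter names are hypothetical.

```python
def magnitude_estimation_value(times_worse: float,
                               death_times_worse: float) -> float:
    """Rescale a 'times worse than the reference state' response so that
    the best state has value 1.0 and death has value 0.

    The reference state itself corresponds to times_worse = 1.
    """
    if death_times_worse <= 0:
        raise ValueError("death_times_worse must be positive")
    if times_worse < 0:
        raise ValueError("times_worse must be non-negative")
    # Health loss attached to the reference state, e.g. 1/200 = 0.005.
    reference_loss = 1.0 / death_times_worse
    return 1.0 - times_worse * reference_loss


reference_value = magnitude_estimation_value(1.0, death_times_worse=200.0)
ten_times_worse = magnitude_estimation_value(10.0, death_times_worse=200.0)
```

With death judged 200 times worse than the reference state, these calls give approximately 0.995 and 0.95, matching the values in the text up to floating-point rounding.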
The person trade-off is a technique that has been used less commonly than many of the other techniques mentioned so far. Unlike these other techniques, the person trade-off asks respondents to answer from the perspective of a social decision maker considering alternative policy choices rather than as an individual making choices for himself or herself. The person trade-off has been framed in various ways, but a typical presentation asks respondents to consider two options, one that will result in longer survivorship for a group of people and the other that will result in prevention of a nonfatal, usually chronic, condition. For example, a respondent might be asked to weigh an option that would prevent x1 = 1000 deaths in a healthy population versus an alternative that would prevent x2 = 5000 cases of some particular chronic disease outcome. The measurement approach in the person trade-off would usually be to hold constant the number of averted deaths (x1) and vary the number of nonfatal outcomes averted (x2) to find the indifference point between the alternatives. A health state valuation for the nonfatal outcome being considered would then be computed as 1 - (x1/x2). For instance, with x1 = 1000, an indifference point at x2 = 10 000 would yield a value of 0.9.
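The person trade-off calculation can be sketched in Python as follows. The function name and the input checks are assumptions made for this illustration.

```python
def person_tradeoff_value(deaths_averted: float, cases_averted: float) -> float:
    """Valuation implied by indifference between averting `deaths_averted`
    deaths (x1) and averting `cases_averted` nonfatal cases (x2)."""
    if deaths_averted <= 0 or cases_averted <= 0:
        raise ValueError("both counts must be positive")
    return 1.0 - deaths_averted / cases_averted


# Indifference at x1 = 1000 and x2 = 10 000, as in the example above.
pto_example = person_tradeoff_value(deaths_averted=1000, cases_averted=10000)
```

The larger the number of nonfatal cases that must be averted to match a given number of averted deaths, the milder the condition is judged to be and the closer its valuation lies to 1.0.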
Ordinal Response Methods
Finally, there has been renewed interest recently in ordinal response methods. Over much of the history of measuring health state valuations, ordinal methods such as ranking have been deployed primarily as a ‘warm-up’ exercise, for example, as a preliminary step to eliciting rating scale values for a range of health states. However, there have been a number of examples of analyzing ordinal response data in order to infer latent cardinal values that are consistent with these responses, and these examples have grown numerous over the past several years. Methods for collecting ordinal responses fall into two main categories: (1) rank ordering of health states and (2) paired comparisons of health states, residing within the broader methodological tradition of discrete choice analysis. Analysis of ordinal response information has been based largely on a random utility framework operationalized using regression models for discrete outcomes. These models are based on the presumption that ordinal responses may be related to differences between values on an unobserved cardinal scale. Specifically, regression-based approaches formalize the intuitive notion that two states that are distant from each other on some underlying measurement scale are more likely to produce agreement in the pairwise ordering of the outcomes than will two states that are very near to each other. If distributions of values on the latent scale are assumed to be normal, then the ordinal responses can be modeled using probit regression; analogously, the assumption that the values follow an extreme value distribution leads to logit regression.
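The link between latent cardinal values and ordinal responses can be illustrated with the choice probabilities implied by the probit and logit models. The Python sketch below assumes independent errors with a common spread for both states (a simplification), and all function and parameter names are hypothetical; it computes probabilities for given latent values and does not perform the regression estimation itself.

```python
import math


def probit_choice_probability(value_a: float, value_b: float,
                              sigma: float = 1.0) -> float:
    """Probability that state A is judged better than state B when each
    perceived value is the latent value plus independent normal error
    with standard deviation sigma (an assumed simplification)."""
    # The difference of two independent errors has sd sigma * sqrt(2).
    z = (value_a - value_b) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF


def logit_choice_probability(value_a: float, value_b: float,
                             scale: float = 1.0) -> float:
    """Analogous probability when errors follow an extreme value
    distribution, which gives the logistic form."""
    return 1.0 / (1.0 + math.exp(-(value_a - value_b) / scale))


# Distant states yield more agreement in ordering than nearby states.
far_apart = probit_choice_probability(0.9, 0.2, sigma=0.3)
close_together = probit_choice_probability(0.6, 0.5, sigma=0.3)
```

In both models the probability is 0.5 when the two latent values coincide and approaches 1.0 as the distance between them grows, which is the intuition described above.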
Historical Development Of Health State Valuation Techniques
The development and adaptation of techniques for measuring health state valuations have occurred mostly since the early 1970s, but historical antecedents for this work may be found decades earlier. The history of the standard gamble is perhaps easiest to trace, as the technique debuted alongside the introduction of the expected utility theorem of von Neumann and Morgenstern (1944). Following this introduction, various approaches were proposed to assess von Neumann–Morgenstern utilities through specific types of standard gamble comparisons. The type of comparison that has been commonly adopted for use in health state valuations, in which the respondent chooses between a certain prospect of an intermediate outcome on the one hand and a gamble with the best and worst extreme outcomes on the other, was featured originally by von Neumann and Morgenstern and used subsequently in formulations for general utility assessment (i.e., not specific to health) by Frederick Mosteller, Duncan Luce, Howard Raiffa, and others. In 1968, Arnold Packer explicitly noted the applicability of the standard gamble to evaluation of health programs. Torrance et al. (1972) presented what may be the earliest published example of a comprehensive approach to measuring effectiveness of health programs with a utility assessment strategy based on the standard gamble. George Torrance had previously (in an unpublished dissertation, in 1971) undertaken a pilot test of the standard gamble technique, among others, in the context of a health care program evaluation.
The time trade-off approach, as currently implemented in valuing health states, appears to have been devised and named in the same study comprising Torrance’s dissertation work, although the basic approach was discussed around the same time by Fanshel and Bush (1970), who referred to the notion as ‘weighting through equivalence in time.’ Torrance himself described the time trade-off as evolving from the so-called ‘direct measurement technique’ attributed to the psychologist Stanley Smith Stevens (1959), although the final format of the time trade-off bears little resemblance to this earlier proposal to directly elicit ratio assessments for two quantities. The time trade-off was originally developed to assess values for states considered better than being dead; another important milestone in development of the time trade-off was the elaboration by Torrance in 1982 of the method to accommodate states regarded as worse than dead.
Rating scales in health state valuation draw on a long history of related scaling approaches used in psychology and attitude measurement, including work by Louis Leon Thurstone in the 1920s. Patrick et al. (1973) applied a ‘category scaling’ approach to health measurement based on the method of equal-appearing intervals attributed to work published by Warren Torgerson in 1958. Patrick et al. operationalized this approach by having respondents place cards labeled with various health outcomes into equally spaced slots in a desk file sorter, numbered between 0 and 16. In another study published in the same year, the same authors used a linear rating scale, which has become the conventional rating scale approach in health state valuation. Subsequently, George Torrance adapted the approach with a ‘desirability line’ representing 101 equal interval categories spanning the range between ‘Death, Least Desirable’ and ‘Healthy, Most Desirable.’ Following these early precedents, numerous applications of rating scales in health state valuation have introduced a number of variations on this basic theme.
Magnitude estimation was proposed by Stevens (1951), in part as a response to the chief limitation he saw in the use of rating scales, which is that responses on rating scales appear to be nonlinearly related to the actual underlying scale that is being measured. Patrick et al. (1973), citing earlier work from the field of criminology, presented what appears to be the first application of magnitude estimation in valuation of health states. Another prominent use of the technique was in the Rosser and Kind index in 1978.
The person trade-off approach was named by Erik Nord in 1992, but the technique itself was applied already by Patrick et al. (1973) under the name of the ‘equivalence’ method. A proposal for ‘weighting by equivalence in population’ had appeared in the work of Fanshel and Bush (1970), but that earlier study presented the concept without applying it in empirical study. The person trade-off gained prominence through the publication of a review and empirical study by Nord (1995), which summarized prior applied work using the person trade-off and related techniques and presented the first comprehensive assessment of the reliability and possible biases in the technique. The profile of the method was also raised by its adaptation in the measurement of disability weights for DALYs in the Global Burden of Disease Study, as described by Murray (1996). The DALY study used two variants of the person trade-off in a deliberative group exercise. One of these variants – which compared life extension among disabled and nondisabled groups and thus differed from the typical person trade-off format described above – inspired criticism from Trude Arnesen and Erik Nord in 1999 for its potential ethical implications.
One of the most recent trends in measuring health state valuations actually relates to one of the oldest methodological traditions, which concerns estimation of cardinal measures based on ordinal responses. In the 1920s, Louis Leon Thurstone developed the ‘law of comparative judgment’ that provides the conceptual foundation for most approaches to deriving cardinal values from ordinal assessments. Following Thurstone, Ralph Bradley and Milton Terry, Duncan Luce, and Daniel McFadden further developed the axiomatic basis for choice models and refined analytic approaches based on a random utility model. Kind (1982) presented the first application of the Bradley–Terry–Luce approach to health state valuation, and there has been a recent revival of interest in these methods due to the relative simplicity of eliciting ordinal responses and a widening range of analytic tools to accommodate these responses.
Key Conceptual And Methodological Issues
There have been various conceptual interpretations of health state valuations that have produced some amount of ambiguity in defining the basis for measuring and understanding these valuations. When valuations are measured with the standard gamble, some people refer to these valuations as ‘health utilities.’ In fact, some have suggested that the standard gamble is the only method that produces ‘utilities,’ according to the von Neumann–Morgenstern framework. Others are less restrictive in the use of this term. Richardson (1994) and others have questioned the primacy of the standard gamble and challenged the prevailing argument that the standard gamble is preferred because its inclusion of risk aligns the technique with the inherently uncertain nature of medical decision making.
There has also been variation in the use of terms like ‘quality of life’ or ‘health-related quality of life’ in reference to health state valuations. The term ‘quality of life’ has been used widely in various social science contexts to refer to the overall subjective appraisals of happiness or satisfaction experienced by individuals. In health, the term ‘quality of life’ has sometimes been used in a more particular way to refer to a multidimensional construct relating to symptoms, impairments, emotional states, and domains of functioning. Because this use of ‘quality of life’ diverges from more general uses of the term, health researchers often refer to the distinct construct of ‘health-related quality of life’. To the extent that an individual’s health-related quality of life is understood in terms of a vector of levels on ‘health-related’ dimensions of life, it is similar to the conceptual notion underlying health state valuation, which can be used to attach an overall scalar value to such a multidimensional profile. Where health-related quality of life is viewed in terms of the contribution of an individual’s health to his/her overall well-being, conceptual problems emerge from the fact that well-being is not clearly separable into independent health and nonhealth components (as, for instance, philosopher John Broome has argued).
In considering empirical differences between the different techniques for eliciting health state valuations, it is useful to recognize how the different constructs embodied in the techniques, for example, the ‘utility’ notion reflected in the standard gamble, may combine judgments about health with other values such as risk aversion. There has been a general consistency in the ordering of values (for the same state) produced by responses to the different valuation techniques, with rating scale values tending to be lowest (on a scale in which higher numbers imply better outcomes); standard gamble and person trade-off values highest; and time trade-off values tending to fall between these extremes. One interpretation of this typical finding is that the systematic variation across valuation techniques relates to the specific types of other values that are invoked by the particular framing of each technique. For example, a highly risk averse person will answer standard gamble questions in a way that produces values near 1.0, as the person will be unwilling to entertain even small probabilities of mortality. Several commentators have suggested that person trade-off responses are susceptible to an analogous set of values at the population level, which may be understood in terms of the ‘rule of rescue,’ by which respondents tend to choose a program that averts a relatively small number of deaths over a program that averts a very large number of nonfatal outcomes. Time trade-off responses may be influenced by a range of factors, such as discounting of future events, but the net effect of these factors may be relatively modest compared with the impact of risk aversion. Finally, various possible biases in rating scale responses have been considered, including a propensity to avoid values near the extreme ends of the scale, which is consistent with an overall downward shift in rating scale values.
Some health states are considered to be worse than being dead. Assignment of values to these has presented some challenges, especially in the use of the time trade-off. A typical approach to the time trade-off is first to ask whether a state is regarded as better than dead or worse than dead, as in the protocol developed by the Measurement and Valuation of Health (MVH) Group in 1994. For a worse-than-dead state, respondents in the MVH study were asked how many years spent in the health state (t) followed by a period of perfect health, summing to 10 years, would be equivalent to immediate death. By assigning values of 0 and 1.0 to dead and optimal health, respectively, valuations for a worse-than-dead outcome may be derived from the following equality: U(health outcome) × t + U(optimal health) × (10 - t) = 0, which simplifies to U(health outcome) = 1 - (10/t). In principle, this implies that the weight for a worse-than-dead state falls in the interval (-∞, 0). In practice, the lowest possible valuation using the MVH protocol is -39 (due to reporting of responses in quarter-year increments, so that the smallest t is 0.25 years). Several studies have observed that treating worse-than-dead responses as originally intended, although faithful to the conceptual development of the time trade-off question, can lead to a large number of health states having negative average valuations, challenging face validity. In response, George Torrance, Paul Dolan, Leida Lamers, and others have considered various transformations of the worse-than-dead responses, which have prompted some controversy and a range of alternative proposals.
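The untransformed worse-than-dead calculation can be illustrated with a short Python sketch. The function name and the default 10-year horizon parameter are assumptions for this example; none of the transformations proposed in the literature is applied.

```python
def tto_worse_than_dead_value(t: float, horizon: float = 10.0) -> float:
    """Valuation implied when t years in the health state followed by
    (horizon - t) years in optimal health is judged equivalent to
    immediate death: U * t + 1.0 * (horizon - t) = 0."""
    if not 0 < t <= horizon:
        raise ValueError("t must be greater than 0 and at most the horizon")
    return 1.0 - horizon / t


# Smallest response in quarter-year increments gives the lower bound.
lowest_value = tto_worse_than_dead_value(0.25)
# Five years in the state balanced by five years in optimal health.
midpoint_value = tto_worse_than_dead_value(5.0)
```

The strong asymmetry is visible here: responses are bounded above by 0 but extend down to -39, which helps explain why a few extreme responses can pull average valuations below 0.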
A large and growing literature on health state valuation has been directed toward a range of key issues including: the choice of technique for eliciting valuations; whose values to elicit; related issues around changing valuations over time (e.g., due to adaptation to decreased function); how to describe states for valuation; and the relevance of other values that may influence responses to health state valuation questions. This article has introduced six prominent techniques for eliciting valuations, discussed certain milestones in the historical development and evolution of these techniques, and mentioned some of the most salient conceptual and methodological issues relating to measurement of health state valuations.
- Fanshel, S. and Bush, J. W. (1970). A health-status index and its application to health services outcomes. Operations Research 18, 1021–1066.
- Kind, P. (1982). A comparison of two models for scaling health indicators. International Journal of Epidemiology 11, 271–275.
- Murray, C. J. L. (1996). Rethinking DALYs. In Murray, C. J. L. and Lopez, A. D. (eds.) The global burden of disease: A comprehensive assessment of mortality and disability from diseases, injuries, and risk factors in 1990 and projected to 2020, pp. 1–98. Boston: Harvard School of Public Health.
- von Neumann, J. V. and Morgenstern, O. (1944). Theory of games and economic behavior. Princeton, NJ: Princeton University Press.
- Nord, E. (1995). The person-trade-off approach to valuing health care programs. Medical Decision Making 15, 201–208.
- Patrick, D. L., Bush, J. W. and Chen, M. M. (1973). Methods for measuring levels of well-being for a health status index. Health Services Research 8, 228–245.
- Richardson, J. (1994). Cost utility analysis: What should be measured? Social Science & Medicine 39, 7–21.
- Rosser, R. and Kind, P. (1978). A scale of valuations of states of illness: Is there a social consensus? International Journal of Epidemiology 7, 347–358.
- Stevens, S. S. (1951). Mathematics, measurement and psychophysics. In Stevens, S. S. (ed.) Handbook of experimental psychology, pp. 1–49. New York: Wiley.
- Torrance, G. W., Thomas, W. H. and Sackett, D. L. (1972). A utility maximization model for evaluation of health care programs. Health Services Research 7, 118–133.
- Lamers, L. M. (2007). The transformation of utilities for health states worse than death: Consequences for the estimation of EQ-5D value sets. Medical Care 45, 238–244.
- McDowell, I. (2006). Measuring health. New York: Oxford University Press.
- Nord, E. (1992). Methods for quality adjustment of life years. Social Science & Medicine 34, 559–569.
- Salomon, J. A. (2003). Reconsidering the use of rankings in the valuation of health states: A model for estimating cardinal values from ordinal data. Population Health Metrics 1, 12.
- Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review 34, 273–286.
- Torrance, G. W. (1976). Social preferences for health states: an empirical evaluation of three measurement techniques. Socio-Economic Planning Sciences 10, 129–136.