J Am Coll Cardiol, 2003; 42:1896-1899, doi:10.1016/j.jacc.2003.09.008 © 2003 by the American College of Cardiology Foundation







Harvard Medical School, Boston, Massachusetts, USA
Beth Israel Deaconess Medical Center, Boston, Massachusetts, USA
Brigham and Women's Hospital, Boston, Massachusetts, USA.
* Reprint requests and correspondence: Dr. Donald S. Baim, Professor of Medicine, Harvard Medical School, and Director, Center for Integration of Medicine and Innovative Technology (CIMIT), Brigham and Women's Hospital, Boston, Massachusetts 02115, USA.
dbaim{at}partners.org
Even in the earliest years of coronary angioplasty, Andreas Gruntzig established the precedent of collecting detailed demographic and angiographic data as well as short- and long-term procedural outcomes of his patients. As the number of coronary interventions increased in the mid- and late 1980s, such data collection continued and was used to populate databases whose summary outcomes could be examined. The availability of personal computers and powerful statistical tools then allowed these databases to be examined in greater detail to identify correlates of particular outcomes of interest. In fact, the very evolution of modern percutaneous intervention has been driven by the information gained from these clinical databases, whether large single-center experiences, regional registries, medical society or government-sponsored initiatives, or multi-center clinical trials. The extent to which we are able to properly measure this type of data and use our analysis to drive improvements in practice will determine whether the quality of care and the outcome after percutaneous coronary intervention (PCI) will continue to improve.
The study by Qureshi et al. (1) in this issue of the Journal is the latest in that series of efforts that concentrate on defining the most important clinical variables for predicting the single most important adverse outcome: in-hospital death. The facts are familiar: the overall mortality is 1.3%, but there are some factors (acute myocardial infarction [MI], age, multi-vessel disease, and baseline renal dysfunction) whose presence separately or in combination increases the likelihood of death to 30% and whose absence reduces the risk of mortality to 0.2%. What uses can we expect to make of this model, and how does it differ from earlier models?
| The role of outcomes databases |
|---|
Of equal importance are the statistical tools used to analyze the dataset and a clear understanding of their robustness for the particular forecasting uses that are planned. Published estimates of risk for an individual patient may aid the patient and family in the consenting process or assist the operator in selecting or avoiding specific devices or adjunctive pharmacotherapy. The certainty of this type of prediction, however, is limited if there are differences between the model set and the patients to whom the model is being applied (some differences not being fully captured in standard angiographic and clinical variables) or if there are other statistical issues such as sampling variability and random variability in operator performance. These limitations are of particular concern when the model is going to be used to provide a performance "scorecard" for other operators.
| Developing a risk prediction model |
|---|
The currently available models have good utility, but each has some level of limitation (2-9). Several were developed within single centers (6,8,9), specialized centers (4), or particular geographic regions (2,5) and thus may have limited generalizability to other populations. Many years may elapse between the collection of data, analysis, and eventual publication, compromising applicability to contemporary practice by the time the results are available. Moreover, robust models require a large sample size to predict outcomes that occur infrequently. For in-hospital mortality after PCI, a 10,000-patient database with a 1.5% overall mortality has only 150 events, which limits the number of variables it can test (roughly 20 events are required for each variable tested). Most of the databases used for model development and validation have thus included far too few patients for complex models of mortality prediction.
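The sample-size arithmetic above can be made explicit with a minimal sketch. The function name `max_testable_variables` is hypothetical, and the default of 20 events per variable is simply the rule of thumb quoted in the text, not a fixed statistical law.

```python
def max_testable_variables(n_patients: int, event_rate: float,
                           events_per_variable: int = 20) -> int:
    """Rough upper bound on candidate predictors a database can support.

    Uses the rule of thumb quoted in the text (about 20 events per
    variable tested); other authors use different thresholds.
    """
    n_events = round(n_patients * event_rate)  # expected number of deaths
    return n_events // events_per_variable


# The example from the text: 10,000 patients with 1.5% mortality.
print(max_testable_variables(10_000, 0.015))  # 150 events -> 7 variables
```

Under this rule, even a 10,000-patient database supports only about seven candidate variables, which is why smaller databases are poorly suited to complex mortality models.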
The quality of any model must also be measured carefully. A model that is constructed using a given population (the test set) is then validated by testing the model either in another portion of the same database (by jack-knifing or bootstrapping) or in a separate external database (the validation set). These procedures reduce the chance that a detected predictor was due to a unique property of the test set rather than being a robust predictor. The statistical quality of the proposed models is judged by two measures: discrimination and goodness-of-fit. Discrimination is usually measured by the c statistic. This reflects the area under the receiver-operating characteristic curve and thus is a measure of the model's ability to assign true-positive outcomes as opposed to false positives. A model with a c statistic of 1 has perfect discrimination, with a false-positive rate of 0% and a true-positive rate (sensitivity) of 100%. Logistic regression models with c statistics in the range of 0.80 are usually considered to have high discriminatory ability, but this means that the model will still rank a survivor as higher risk than a patient with the adverse event in roughly 20% of such pairs. For reference, a model with no discriminatory ability has an area under the receiver-operating characteristic curve (the diagonal straight line) of 0.50. The goodness-of-fit is frequently assessed using the Hosmer-Lemeshow test, which compares the event rate predicted by the model with the observed rate. A p value >0.1 usually indicates that the model provides a good fit for the data and that the differences are not statistically significant, but it does not exclude potentially clinically significant differences between observed and predicted outcomes. Therefore, high scores for discrimination and goodness-of-fit do not necessarily mean that the model has high predictive accuracy for individual patients.
A mortality predictor model for a population of 10,000 patients may thus predict a mortality of 10% for the highest decile, but even with a perfect fit, we do not know which 100 of those 1,000 patients will die.
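To make the c statistic concrete, the following minimal sketch computes it directly as a concordance probability: the fraction of (death, survivor) pairs in which the patient who died was given the higher predicted risk. The function name `c_statistic` and the toy numbers are illustrative assumptions, not data from any of the cited models.

```python
from itertools import product


def c_statistic(predicted, died):
    """Concordance (area under the ROC curve) by pairwise comparison.

    predicted: predicted probabilities of death, one per patient
    died:      1/True if the patient died, 0/False otherwise
    Ties in predicted risk count as half-concordant.
    """
    events = [p for p, d in zip(predicted, died) if d]
    non_events = [p for p, d in zip(predicted, died) if not d]
    if not events or not non_events:
        raise ValueError("need at least one death and one survivor")
    score = 0.0
    for e, n in product(events, non_events):
        if e > n:
            score += 1.0
        elif e == n:
            score += 0.5
    return score / (len(events) * len(non_events))


# Toy example (hypothetical values): two deaths among five patients.
print(c_statistic([0.30, 0.10, 0.05, 0.02, 0.01], [1, 0, 1, 0, 0]))

# Highest-risk decile: everyone carries the same predicted risk of 10%,
# so the model cannot tell the one death from the nine survivors.
print(c_statistic([0.10] * 10, [1] + [0] * 9))  # 0.5
```

The second call mirrors the decile example: a model can be perfectly calibrated for a group and still offer no discrimination among the individuals within it.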
The good news is that the available PCI mortality models provide some reassuring features despite these inherent limitations. First, each of the models included in Table 1 has high discriminatory and goodness-of-fit scores. Second, even though the models represent patients from different eras and various populations, their strongest predictors are remarkably consistent and relate mostly to patient rather than technical variables. This has been substantiated in a recent report from the National Heart, Lung, and Blood Institute (NHLBI) Dynamic Registry, in which three of the five tested models developed in the pre-stent era (New York State, Northern New England Cooperative Group, and Cleveland Clinic Foundation) showed excellent correlation between predicted and observed mortality among patients in the NHLBI database treated between 1997 and 1999 (10). The consistency is somewhat less certain for angiographic lesion factors, however, because many of the lesion characteristics in the American College of Cardiology/American Heart Association classification scheme (e.g., lesion eccentricity) have been eliminated as technology has improved. Also, some of the remaining angiographic variables are actually surrogates for basic clinical variables (e.g., recent total occlusion is a surrogate for acute MI) (8).
But knowledge of the most reliable predictors should also allow comparison of outcomes observed for different operators or hospitals to the outcomes expected based on the predictive model. This would ideally allow appropriate and complete adjustment of the treated population for significant differences in baseline risk, and thus allow fair comparisons between operators or hospitals (the rainfall in July question). There are several concerns, however, with using this model for comparing different operators and hospitals in a scorecard fashion. The boundary selected by Qureshi et al. for each of the four variables is arbitrary and does not distinguish among various levels of increased risk. For example, although patients over age 65 are at higher risk, this risk is certainly higher for an 86-year-old than a 66-year-old patient. Likewise, no one would question a higher overall risk for patients with MI within 14 days, but the highest-risk patients would be those being treated for acute MI within 24 h, particularly if they have hemodynamic instability. Similar arguments can be made for the other two variables of creatinine >1.5 mg/dl and multi-vessel disease, which the model considers as unqualified binary variables. This considerable degree of smoothing of the overall risk curve by using these dichotomous cutoffs may lead to a systematic underestimation of the risk for the truly high-risk patient and a significant overestimation of the risk for many other patients, thus failing to adjust adequately for higher- or lower-risk cohorts across operators and hospitals for which the distribution of variable values may differ from the test set.
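The smoothing effect of a dichotomous cutoff can be illustrated with a small sketch. Everything numeric here is an assumption made for illustration: the logistic curve in `continuous_risk` is invented (it is not the model of Qureshi et al. or any published model), and ages are assumed to be spread evenly from 40 to 90 years.

```python
import math
from statistics import mean


def continuous_risk(age: float) -> float:
    """Hypothetical smooth age-risk curve (illustration only)."""
    return 1.0 / (1.0 + math.exp(-(-8.0 + 0.06 * age)))


# Assumed population: one patient at each age from 40 to 90.
AGES = range(40, 91)

# A binary "age > 65" variable can only assign each group its average risk.
RISK_OVER_65 = mean(continuous_risk(a) for a in AGES if a > 65)
RISK_65_OR_UNDER = mean(continuous_risk(a) for a in AGES if a <= 65)


def dichotomized_risk(age: float) -> float:
    """Risk as seen by a model that only knows whether age exceeds 65."""
    return RISK_OVER_65 if age > 65 else RISK_65_OR_UNDER


for age in (66, 86):
    print(age, round(continuous_risk(age), 4), round(dichotomized_risk(age), 4))
```

Under these assumptions the 66-year-old and the 86-year-old receive the same dichotomized estimate, which overstates the risk of the former and understates that of the latter. An operator whose patients cluster at either end of the age range would therefore be adjusted unfairly.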
| Risk adjustment models, scorecards, and quality improvement |
|---|
Unfortunately, even the best and most sophisticated multiple regression models developed for the purpose of risk adjustment have serious deficiencies and limitations, as discussed previously. Moreover, even the best models cannot compensate for the smaller sample sizes present at the institution or operator level and the associated statistical uncertainty. The wide resulting confidence intervals make it virtually impossible to provide any meaningful estimation of appropriateness of outcome for the low-volume operator or institution. A low-volume operator may look very good or very bad depending on how his or her last case went, and such models cannot fully correct for all confounders of risk in a small sample size. There are additional problems, such as failure to account for sampling variability, unmeasured confounding, and random variability (noise) between operators that are not fully correctable by any model, so that when resulting data are disclosed publicly, any risk-adjustment effort must be viewed as imperfect rather than as a true leveling of the playing field.
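The width of these confidence intervals is easy to demonstrate. The sketch below uses the Wilson score interval for a binomial proportion, which is one common choice and not necessarily the method used in any of the cited reports. The function name `wilson_interval` and the case counts are illustrative assumptions.

```python
import math


def wilson_interval(deaths: int, cases: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for an observed mortality rate."""
    if cases <= 0:
        raise ValueError("cases must be positive")
    p = deaths / cases
    denom = 1.0 + z * z / cases
    center = (p + z * z / (2 * cases)) / denom
    half = z * math.sqrt(p * (1 - p) / cases + z * z / (4 * cases * cases)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical operators: two low-volume records that differ by a single
# death, and a large registry with the same 2% observed mortality.
for deaths, cases in [(0, 50), (1, 50), (200, 10_000)]:
    low, high = wilson_interval(deaths, cases)
    print(f"{deaths}/{cases}: {low:.3%} to {high:.3%}")
```

For the 50-case operator, a single death shifts the observed rate from 0% to 2%, and either record remains compatible with a wide range of true mortality rates, whereas the large registry pins the rate down tightly.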
Given these significant problems with multiple regression models, Shahian et al. (11) have suggested the use of hierarchical or random-effect models for risk adjustment among cardiac surgery providers. The hierarchical models reduce the overly optimistic precision estimates by attempting to adjust for confounding by variances in treatment decisions between physicians and patients in the predictor dataset. Accounting for random operator effects dampens variability toward the mean and thereby provides more reliable estimates (12). Although these models are much more complex, they are not beyond the capacity of groups involved in the risk-adjustment exercise.
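The dampening toward the mean can be illustrated with a deliberately simplified stand-in for a hierarchical model: a beta-binomial style shrinkage estimate. This is not the method of Shahian et al. The function name `shrunken_rate`, the `prior_strength` value of 200 cases, and the operator counts are all hypothetical; in a real random-effects model the degree of shrinkage is estimated from the data rather than chosen by hand.

```python
def shrunken_rate(deaths: int, cases: int, overall_rate: float,
                  prior_strength: float = 200.0) -> float:
    """Pull an operator's observed mortality toward the overall rate.

    prior_strength acts like a number of "pseudo-cases" at the overall
    rate; it is a hypothetical tuning value chosen for illustration.
    """
    return (deaths + prior_strength * overall_rate) / (cases + prior_strength)


OVERALL = 0.013  # overall in-hospital mortality quoted in the text

# Hypothetical operators: low volume (2 deaths in 20 cases, 10% observed)
# and high volume (100 deaths in 2,000 cases, 5% observed).
low_volume = shrunken_rate(2, 20, OVERALL)
high_volume = shrunken_rate(100, 2_000, OVERALL)
print(f"low volume:  observed 10.0%, shrunken {low_volume:.1%}")
print(f"high volume: observed 5.0%, shrunken {high_volume:.1%}")
```

The low-volume operator's estimate moves most of the way toward the overall rate because 20 cases carry little information, while the high-volume operator's estimate barely changes.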
In summary, the objective for any risk-prediction or adjustment tool should be to foster continuous quality improvement. Although simple bedside scoring as proposed by Qureshi et al. (1) may be of some use for classifying patients into broad risk categories, the ramifications of bona-fide risk adjustment demand more complex systems. Public presentation of the results must be undertaken cautiously and with adequate explanation of limitations to avoid unnecessary punitive components that might lead to gaming of the system (e.g., by avoiding high-risk cases, which may deny benefit to the patients with the most to gain from a high-quality procedure). Although the tracking of performance scores within individual centers and the comparisons with regional or national standards are desirable, those centers should also implement the minimum volume standards that have been shown to be reasonable, if not perfect, surrogates for performance quality (4,13). Finally, it is not clear whether mortality is the appropriate outcome measure, given its low frequency and the increasing difficulty in predicting risk as the frequency of the studied event diminishes. Other ways of measuring the success of a procedure and sound judgment, rather than the natural history of an acute illness, may be more useful. Physician-led continuous quality improvement initiatives that include the reporting of specified measurements of the success of a procedure have been effective in cardiac surgery (14). Regardless of the statistical methods used, however, the goal of continuous quality improvement is essential to our delivering the brightest forecast for the safety of our interventional cardiology patients.