November 18, 2020 Colin Beam, PhD
The field of machine learning (ML), a subset of artificial intelligence in which computer algorithms use statistics to find patterns in data that can predict future outcomes, is undergoing rapid development. New methods, many of which are available through open source tools, are highly flexible in that they can automatically approximate complex, nonlinear functional relationships between predictors and response variables.
It’s easy to understand the excitement and optimism, particularly in data-intensive industries such as healthcare, where transformation is so badly needed. However, the reality is that predictive modeling projects are hard and fraught with opportunities for missteps. For example, the risk of highly flexible methods is that they could overfit the data by describing idiosyncrasies in the training sample, thus leading to poor predictive performance. An essential feature of ML methods is that model flexibility is adjusted so as to best capture the patterns that will generalize to new contexts.
Is there a more direct path from hype to high performance? The answer could lie in not overthinking your approach.
Looking across industries, ML’s track record for success is not entirely encouraging, as two recent studies suggest:
An Alegion survey of 277 data scientists working in nearly 20 industries found that 78 percent of ML projects stalled before they could be deployed into production. Ninety-six percent encountered data quality and labeling challenges, which in turn affect how successfully the algorithms can be trained.
In a survey of 500 professionals across industries conducted by Dotscience, over 80 percent reported that it took more than 6 months to move an ML project into production, with almost 10 percent reporting that it took 24 months or longer. Nearly a quarter of respondents reported that collaboration was a primary challenge for model development. The Dotscience authors posit that underinvestment in version control was one key impediment to collaboration across teams.
Given its complexity, healthcare can expect its ML projects to encounter the same difficulties, if not more, despite the steady news of impressive ML achievements. AI appears poised to transform the field of radiology, for example, although considerable challenges remain.
Importantly, however, most healthcare analytics projects do not rely on high-quality signal and image data. Instead, they use noisy measurements from heterogeneous data sources gathered from drifting populations. These contexts are especially challenging when attempting to detect subtle patterns, which could limit the application of advanced ML methods.
So, how do we overcome some of these challenges? A successful predictive analytics project starts with data that are ready, reliable, and clinically meaningful. Then, you must properly formulate the question that addresses the core business problem. These two steps must be complete to the degree that work can begin on the next phase: model development. In model development, you are selecting and training an algorithm, then determining whether it has performed well enough to be put into production.
The data science team has many options for its model development strategy. A sound strategy should both increase the chances of a successful implementation and limit losses if the project ultimately founders, which can happen despite the team’s best efforts.
On one end of the spectrum of model development is searching for the best possible method for a given problem, where “best” refers to the method with the lowest estimated prediction error on the relevant metric of interest. With this approach, search proceeds across algorithm classes, such as regularized regression, gradient boosting, artificial neural networks, and so on. Search also occurs within algorithms to find the best possible tuning hyperparameters for a given method. Thorough search is a time-intensive process that can result in project delays; software can help, although the cost may be prohibitive.
On the other end of the spectrum is the strategy of quickly developing a model that seeks to outperform the relevant baseline model, which depends on the application. For example, for time series problems, the baseline prediction may be the most recent observation or the historical average over some lookback period. A typical naïve baseline for classification is to always predict the majority class. And for some applications the baseline model is the performance of human experts. If the first step is successful, then the model is tested on acceptance criteria for safety and algorithmic bias. Once a minimally viable model is in place, then work can immediately begin on the implementation phase of the project.
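The naïve baselines described above are simple enough to write in a few lines. The following is a minimal sketch using only the Python standard library; the function names and the small data sets are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter
from statistics import mean


def last_value_baseline(history):
    """Predict the next value as the most recent observation."""
    return history[-1]


def lookback_mean_baseline(history, lookback):
    """Predict the next value as the mean of the last `lookback` observations."""
    return mean(history[-lookback:])


def majority_class_baseline(train_labels):
    """Return the most frequent training label, to be predicted for every case."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical data, for illustration only.
weekly_admissions = [120, 135, 128, 140, 150]
train_labels = [0, 0, 0, 1, 0, 1, 0, 0]
test_labels = [0, 1, 0, 0]

majority = majority_class_baseline(train_labels)
baseline_accuracy = accuracy(test_labels, [majority] * len(test_labels))

print(last_value_baseline(weekly_admissions))        # 150
print(lookback_mean_baseline(weekly_admissions, 3))  # mean of 128, 140, 150
print(majority, baseline_accuracy)                   # 0 0.75
```

A candidate model would then need to beat these numbers on the same held-out data before moving on to the safety and bias checks.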
We believe this “better model” strategy is better than the “best model” approach, for two primary reasons. The first is logical: As there is a non-negligible chance that either approach will fail, using the better-model strategy gets you to that point of failure faster, with fewer resources expended to achieve the valuable payoff of information that fuels subsequent attempts. The second requires further discussion: Using more refined models may produce diminishing (and sometimes illusory) performance returns.
The claim that there are diminishing returns to model refinement is both controversial and difficult to evaluate. British statistician David Hand makes several compelling arguments for why we should expect the law of diminishing returns for sophisticated classification methods. In one illustration, he compares a simple classifier to the current best method for a selection of ten real-world prediction problems. For most of these problems, the simple method captured over 90 percent of the improvement in predictive accuracy when compared to the best current method, with improvement defined as the proportion of the error reduced from using the default rule of assigning all cases to the majority class. (Others favor using the ratio of error rates, which gives a more optimistic assessment of the benefit from using complex methods.)
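To see how the two ways of scoring improvement can tell different stories, consider the sketch below. The error rates are hypothetical and are not taken from Hand's paper, and the function names are our own. The first function is one reading of the proportion-of-error-reduced measure described above; the second is the ratio of error rates.

```python
def share_of_improvement(default_error, simple_error, best_error):
    """Share of the best method's error reduction (relative to the
    majority-class default rule) that the simple method already achieves."""
    return (default_error - simple_error) / (default_error - best_error)


def error_ratio(simple_error, best_error):
    """Ratio of the simple method's error rate to the best method's."""
    return simple_error / best_error


# Hypothetical error rates, for illustration only.
default_error = 0.30  # always predict the majority class
simple_error = 0.12   # simple classifier
best_error = 0.10     # best current method

print(round(share_of_improvement(default_error, simple_error, best_error), 3))  # 0.9
print(round(error_ratio(simple_error, best_error), 3))                          # 1.2
```

With these numbers the simple method captures 90 percent of the available improvement, yet it still makes 20 percent more errors than the best method. Which view matters more depends on the cost of each additional error.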
These results may be overly optimistic about the improvement from the best methods. In many practical applications, a model’s training data is drawn from a population distribution that changes over time. For these types of situations, Hand argues, simple models are more likely to generalize to the prediction context since they characterize dominant features of distributions that are more likely to persist. Complex models, in contrast, capture smaller idiosyncrasies that are often less stable across drifting distributions.
We can also look to the healthcare literature, where numerous studies have compared newer, complex ML classifiers with older, simpler approaches such as logistic regression or naïve Bayes. Predicted health outcomes have included acute cardiac ischemia, severe outcomes for pneumonia patients, prediabetes, urinary tract infections, hospital and emergency department visits, heart failure, functional independence of stroke patients, intra-cranial complications, types of ovarian tumors, the risk of major chronic diseases, and surgical readmissions. Most studies compare classifiers using the area under the receiver operating characteristic curve (AUROC), a measure of discrimination performance. Some find small to moderate improvements from using ML [4-7], while others demonstrate no reliable difference, or even an advantage for the simpler methods [8-14].
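For readers less familiar with AUROC, it can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition. It is a plain-Python illustration with hypothetical data, not the code used in any of the cited studies, and the pairwise loop is only practical for small samples.

```python
def auroc(y_true, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted scores, for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]

print(auroc(y_true, scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why differences of a few hundredths between classifiers are often hard to distinguish from noise.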
A recent article reviewed 71 studies that compared logistic regression and ML algorithms used for clinical predictions. For the purposes of their article, “logistic regression” subsumed both standard and regularized logistic regression while “ML algorithms” referred to classification trees, random forests, support vector machines, artificial neural networks, and additional algorithms. The authors distinguished between comparisons that were at either low or high risk of performance estimation bias due to methodological issues concerning model validation and variable selection. On average, performance was no different between ML and logistic regression for the comparisons classified as having low risk for bias. The high-bias comparisons, in contrast, showed an advantage for ML methods.
The above studies focused solely on either accuracy or AUROC, but it is typically a mistake to emphasize a single performance metric when evaluating model performance. Calibration is another pivotal aspect of risk prediction, and newer methods often display an advantage on this dimension. Importantly, the benefits of new ML methods are often easily achieved since many are widely available and provide good out-of-the-box performance using the default tuning parameters. Thus, applying complex ML methods is generally consistent with the “better model” strategy so long as they can be easily and quickly applied.
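Calibration concerns whether predicted risks agree with observed event rates. One common, rough check is to group cases by predicted risk and compare the mean prediction with the observed rate in each group. The sketch below does this with equal-width bins; the function name, the bin count, and the data are hypothetical choices for illustration, and more formal assessments exist.

```python
def calibration_table(y_true, probs, n_bins=5):
    """Group cases into equal-width probability bins and return, for each
    non-empty bin, (count, mean predicted risk, observed event rate)."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        # A probability of exactly 1.0 is placed in the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))
    table = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            obs_rate = sum(y for y, _ in members) / len(members)
            table.append((len(members), mean_pred, obs_rate))
    return table


# Hypothetical outcomes and predicted risks, for illustration only.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
probs = [0.05, 0.20, 0.30, 0.45, 0.60, 0.70, 0.80, 0.95]

for count, mean_pred, obs_rate in calibration_table(y_true, probs, n_bins=2):
    print(count, round(mean_pred, 2), round(obs_rate, 2))
```

In a well-calibrated model the two printed columns track each other closely across bins. A model can rank patients well (high AUROC) and still be systematically over- or under-confident.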
An attractive compromise between simple and advanced methods is found with regularized regression methods such as ridge regression or the lasso. These models allow for an adaptable penalty for model complexity, which some have found to be the primary benefit of the newer methods, particularly when working with many features and relatively few cases. This option is especially attractive if one suspects population drift. Further, allowing for the nonlinear effects of continuous predictors allows these models to be competitive with more complex ML methods. Since a regularized regression can be expressed as a vector of coefficients, it is easier to interpret and validate, thus facilitating the important work of testing for safety and bias. And since predictions can be generated from a simple equation, it also means that these models can be implemented in virtually any production environment.
Finally, how do you decide when a “better” model is not good enough? For instance, perhaps the first attempt did not reliably improve over the baseline model. Should additional attempts be made? Or perhaps the initial model was successful and is now in production. Should researchers try to improve on that model? To answer these questions, examine the stakes involved. Even small improvements in accuracy can lead to large savings when working with very costly outcomes. The medical imaging example is again illustrative: The potential payoff from creating highly accurate diagnostic algorithms is enormous. At the same time, we know additional improvement is still possible since current AI cannot reliably beat the top human diagnosticians. In this instance, therefore, continuing to work on model development makes sense.
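To make the "simple equation" point concrete, the sketch below shows how a fitted regularized logistic regression produces a risk estimate from nothing more than an intercept and a coefficient vector, along with the penalized objective that ridge and lasso fitting minimize. The coefficient values, feature values, and function names are hypothetical; in practice the coefficients and the penalty strength `lam` would come from a fitting routine with cross-validation.

```python
import math


def predict_risk(intercept, coefficients, features):
    """Logistic prediction: inverse logit of the linear combination."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear))


def penalized_log_loss(intercept, coefficients, X, y, lam, penalty="ridge"):
    """Mean log loss plus a complexity penalty on the coefficients.

    Ridge penalizes the sum of squared coefficients, the lasso the sum of
    absolute coefficients. The intercept is left unpenalized.
    """
    loss = 0.0
    for row, label in zip(X, y):
        p = predict_risk(intercept, coefficients, row)
        loss -= label * math.log(p) + (1 - label) * math.log(1 - p)
    if penalty == "ridge":
        complexity = sum(b * b for b in coefficients)
    elif penalty == "lasso":
        complexity = sum(abs(b) for b in coefficients)
    else:
        raise ValueError("penalty must be 'ridge' or 'lasso'")
    return loss / len(y) + lam * complexity


# Hypothetical fitted model and patient features, for illustration only.
intercept = -2.0
coefficients = [0.8, -0.5]
patient = [1.5, 0.4]

print(round(predict_risk(intercept, coefficients, patient), 3))
```

Because scoring requires only multiplication, addition, and one exponential, the same equation can be reproduced in a database query, a spreadsheet, or an electronic health record rule without shipping the training software.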
Significant investment in a predictive model is not prudent until you are confident that the model can be successfully implemented. The “better model” strategy emphasizes quickly constructing a model so that more time may be devoted to the other tasks required for a successful healthcare analytics project. Regularized regression methods are an attractive starting point since they help guard against overfitting and are relatively easy to interpret and implement. Once they have deployed a model, data scientists can work on honing its performance, incorporating feedback from the end users who are often best positioned to identify points of weakness.
The promise of healthcare analytics is still largely unrealized, yet our optimism remains. We believe that this potential will become more fully realized as we continue to learn and improve upon the steps required to achieve success.
[1] Passi, Samir, Barocas, Solon (2019). Problem formulation and fairness. FAT* '19: Proceedings of the Conference on Fairness, Accountability, and Transparency. pp. 39-48.
[2] Hand, David J (2006). Rejoinder: Classifier technology and the illusion of progress. Statist Sci 21(1): 30-34.
[3] Friedman, Jerome H (2006). Comment: Classifier technology and the illusion of progress. Statist Sci 21(1): 15-18.
[4] Selker, HP, Griffith, JL, Patil, S, Long, WJ, D'Agostino, RB (1995). A comparison of performance of mathematical predictive methods for medical diagnosis: identifying acute cardiac ischemia among emergency department patients. J Investig Med 43(5): 468-476.
[5] Cooper, Gregory F, Abraham, Vijoy, Aliferis, Constantin F, Aronis, John M, Buchanan, Bruce G, Caruana, Richard, Fine, Michael J, Janosky, Janine E, Livingston, Gary, Mitchell, Tom, Monti, Stefano, Spirtes, Peter (2005). Predicting dire outcomes of patients with community acquired pneumonia. J Biomed Inform 38(5): 347-366.
[6] Choi, Soo Beom, Kim, Won Jae, Yoo, Tae Keun, Park, Jee Soo, Chung, Jai Won, Lee, Yong-ho, Kang, Eun Seok, Kim, Deok Won (2014). Screening for prediabetes using machine learning models. Comput Math Method M. Vol 2014. Article ID 618976.
[7] Taylor, R Andrew, Moore, Christopher L, Cheung, Kei-Hoi, Brandt, Cynthia (2018). Predicting urinary tract infections in the emergency department with machine learning. PLOS ONE 13(3): e0194085.
[8] Jones, Aaron, Costa, Andrew P, Pesevski, Angelina, McNicholas, Paul D (2018). Predicting hospital and emergency department utilization among community-dwelling older adults: statistical and machine learning approaches. PLOS ONE 13(11): e0206662.
[9] Austin, Peter C, Tu, Jack V, Ho, Jennifer E, Levy, Daniel, Lee, Douglas S (2013). Using methods from the data-mining and machine-learning literature for disease classification and prediction: a case study examining classification of heart failure subtypes. J Clin Epidemiol 66(4): 398-407.
[10] König, IR, Malley, JD, Weimar, C, Diener, HC, Ziegler, A (2007). Practical experiences on the necessity of external validation. Stat Med 26(30): 5499-5511.
[11] van der Ploeg, Tjeerd, Smits, Marion, Dippel, Diederik W, Hunink, Myriam, Steyerberg, Ewout W (2011). Prediction of intracranial findings on CT-scans by alternative modelling techniques. BMC Med Res Methodol 11: 143.
[12] Van Calster, Ben, Valentin, Lil, Van Holsbeke, Caroline, Testa, Antonia C, Bourne, Tom, Van Huffel, Sabine, Timmerman, Dirk (2010). Polytomous diagnosis of ovarian tumors as benign, borderline, primary invasive or metastatic: development and validation of standard and kernel-based risk prediction models. BMC Med Res Methodol 10: 96.
[13] Nusinovici, Simon, Tham, Yih Chung, Chak Yan, Marco Yu, Wei Ting, Daniel Shu, Li, Jialiang, Sabanayagam, Charumathi, Wong, Tien Yin, Cheng, Ching-Yu (2020). Logistic regression was as good as machine learning for predicting major chronic diseases. J Clin Epidemiol 122: 56-69.
[14] Mišić, Velibor V, Gabel, Eilon, Hofer, Ira, Rajaram, Kumar, Mahajan, Aman (2020). Machine learning prediction of postoperative emergency department hospital readmission. Anesthesiology 132: 968-980.
[15] Christodoulou, Evangelia, Ma, Jie, Collins, Gary S, Steyerberg, Ewout W, Verbakel, Jan Y, Van Calster, Ben (2019). A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol 110: 12-22.
[16] Van Calster, Ben, Nieboer, Daan, Vergouwe, Yvonne, De Cock, Bavo, Pencina, Michael J, Steyerberg, Ewout W (2016). A calibration hierarchy for risk models was defined: from utopia to empirical data. J Clin Epidemiol 74: 167-176.