From collaboration to validation: How a team of engineers, clinicians, programmers, and administrators evaluated one of the most widely implemented predictive models

In spring 2020—when the COVID pandemic took root, hospitals were overrun, resources were spread thin, and beds were scarce—doctors were using any tools at their disposal that could aid in clinical decision making. One such tool is a “deterioration index”: a predictive model that uses vital signs and other information to predict which patients are likely to deteriorate quickly and require a transfer to the intensive care unit (ICU).

Michigan Medicine was already running this type of predictive model—the proprietary Epic Deterioration Index (EDI)—before COVID hit. Dozens of other hospital networks were also running EDI and using it to inform decisions in COVID care. But no one really knew how effective EDI scores were in the context of COVID, and finding out was of vital importance.

A team of U-M researchers in operations, engineering, and clinical care—many of them Precision Health members—convened to study how well EDI actually worked in predicting which COVID patients were at highest risk for deteriorating and requiring ICU-level care, and which were at the lowest risk and, if necessary, could be safely transported to another location, such as a field hospital.

The results were published in the Annals of the American Thoracic Society (“Evaluating a Widely Implemented Proprietary Deterioration Index Model among Hospitalized COVID-19 Patients”).

As co-author Shengpu Tang, a PhD Candidate in Computer Science & Engineering, explains, “EDI is widely implemented in US hospitals, and clinicians are relying on these proprietary models to support their decision making, especially amidst the COVID-19 pandemic. However, the proprietary nature of such models makes them less transparent than a model we develop ourselves.”

Tom Valley, MD, MSc, an Assistant Professor of Pulmonary and Critical Care Medicine at U-M, and a co-author, stresses the potential value of a deterioration index: “In the ICU sphere, scores like these are really the Holy Grail; we want something that can tell us who’s sick and who’s not, and who needs the ICU and who doesn’t. Everybody wants something like this that works.”

But whether and how well these scores work is unclear. The research that led to this paper is a first, says lead author Karandeep Singh, MD, MMSc, Assistant Professor of Learning Health Sciences, Internal Medicine, Urology, and Information at U-M. “There’s been no validation ever published for EDI for any situation,” he explains.

“Every deterioration index is trying to get at which patients are about to get sicker, such that we might be able to intervene, and either prevent them from getting sicker, or transfer them to the ICU sooner, in a more stable fashion, rather than wait for something really bad to happen,” says Singh, adding, “I wouldn’t say any one model has achieved this.”

EDI: widely implemented, not widely studied

Hundreds of health systems use Epic to manage Electronic Health Records (EHR), and many of these, including U-M, had already implemented EDI before COVID hit. “It was so broadly implemented at that time, across hospitals, it was a natural fit to look at,” Singh says.

The fact that EDI was implemented at so many places when the pandemic began “is sort of amazing,” says Erkin Otles, a Medical Scientist Training Program Fellow (MD-PhD student) in Industrial & Operations Engineering at U-M, another author on the paper. “The level of work it would take to get a new model to work at all these places would be astronomical.” For all of EDI’s limitations, its ubiquity meant that comparable scoring standards were in use across hospitals.

“In the absence of peer review,” however, “no one really knows what any of these numbers mean,” Singh says. “There was an urgency to figure out if people were using it [EDI] appropriately, and how well it actually worked.”

Not a single peer-reviewed paper had been published about how EDI fares in any particular population, let alone the COVID population. Hospitals were employing the EDI model before having “official evidence” of its efficacy pertaining to COVID, Singh says. “For these proprietary models, where there is not very much published, they end up being widely adopted, widely used, before they’re widely understood,” he adds.

“The incentives in the way that they make [proprietary models] are not necessarily aligned with an academic or scientific endeavor; that leaves open this weird space where the model gets used and it doesn’t necessarily have the right systems to examine it,” says Otles.

The need for a predictive model

In spring 2020, when rising COVID cases were exceeding hospitals’ capacity, the operations team at Michigan Medicine was exploring options to meet the anticipated need for more beds, fast. Plans were in place to construct a field hospital that could house patients at very low risk of developing COVID complications and needing ICU care (in the end, a field hospital was not required). Michigan Medicine leadership asked researchers to evaluate EDI’s efficacy as an early warning system, says Singh, to establish which patients could safely go to a field hospital.

On the other end of the spectrum, doctors also needed a model to help identify declining patients—many patients hospitalized with COVID required ICU-level care, and the decline that precipitated a move to the ICU came on rapidly.

These are the needs that drew Valley to the project of evaluating EDI. “There was such a need for some sort of objectivity to identify which patients were at high risk, which patients were at low risk,” he says. “We knew very little about COVID; we knew very little about managing patients with COVID, but we had all these different tools that should theoretically be able to identify high- and low-risk patients…. When health systems have tools like Epic, when they’re investing large amounts of money into tools and technologies like that, you would hope that the tools available are useful in these circumstances.” And COVID put EDI to the test. “It’s really the first time,” says Valley, “that anyone’s tried to see whether these tools work, not taking these scores at face value, but actually seeing how well they work in an independent study.”

Trying to identify low-risk patients as well as high-risk ones “is really the flipside of most tools; most tools that are available try to find people at high risk of death; here we were trying to find people at low risk of death, who could safely be transferred to less acute situations, like a field hospital,” Valley says. “I think that’s a really original use of the tool.”

“We were uniquely positioned to look at this issue because we had activated the Epic Deterioration Index in the months leading up to the pandemic, so it was actually running,” says Singh. “It’s not a decision we could have made halfway into the pandemic. We would have had to have that score running already. That’s the only way you can look at a proprietary score, is to have it running, which we did.”

EDI was a model untested on COVID, but it was a deterioration index already in use at U-M, and in the unprecedented, early days of the pandemic, hospitals needed to employ whatever resources they had.

As Valley puts it: “I think our work is important to keep in the context of COVID and the clinical, social, and environmental context of when we studied it. It was absolute chaos. To be able to extend that to other environments and other times is difficult.”

Singh agrees. “We shouldn’t drive decisions purely based on this one score, but I think in a catastrophe/disaster scenario, the other alternatives also didn’t look very appealing at the time,” he says. “When we thought we were going to have 3,000 patients who might need to be hospitalized, when our hospital can only accommodate 1,000 patients, we were really thinking, Do we have the workforce to lay eyes on all 3,000 patients and take really excellent care of them? And if we don’t, what other information can guide decision making, in triage? That’s the mindset we were in,” Singh explains.

U-M’s collaborative approach

The need to test and validate the EDI model for COVID was urgent, but open lines of communication among the various stakeholders enabled a quick and coordinated response.

“In most institutions, the people tasked with overseeing models on the operational side are completely disconnected from the researchers who evaluate prediction models,” says Singh. “At Michigan Medicine, we are uniquely positioned to be able to do this kind of work because we have a research community of folks in prediction modeling [the Michigan Integrated Center for Health Analytics and Medical Prediction, or MiCHAMP],” as well as an Artificial Intelligence (AI) Lab. “We also have a translational arm, which is Precision Health, that can bring these pieces together,” Singh explains. The Clinical Intelligence Committee (CIC), of which Singh is a co-chair, “is essentially our operational arm for prediction models.” “All of these are connected,” says Singh, which has “allowed us to do this work in a rapid fashion.”

Jenna Wiens, PhD—Associate Professor of Computer Science and Engineering, Precision Health Co-Director, and study author—also credits this collaborative approach in expediting research: “Asking the right question is critical. This requires the right mix of expertise but also a dialogue, since rarely do you get it right the first time. The sense of community we have built in the AI Lab, MiCHAMP, and Precision Health has made it easier to rapidly iterate.”

The results

How did EDI fare in predicting which COVID patients were at low risk and high risk of decline? It has its uses, but it is by no means a silver bullet.

One positive impact of the research has been to streamline how risk thresholds are displayed to care providers. “We use a color scale to emphasize some of the differences,” which is more “visually helpful” than displaying a number, says Singh. “How we communicate that score visually, to end users, is based on thresholds that we came to during this study.”
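To make the idea concrete, here is a minimal sketch, in Python, of how a threshold-based color band for a deterioration score (assumed here to run 0–100) might be assigned. The cut points and the function are hypothetical placeholders for illustration, not the thresholds derived in the study.

    def risk_color(score: float) -> str:
        """Map a 0-100 deterioration-index score to a display color.

        The cut points below are illustrative only; the study derived its
        own thresholds from the Michigan Medicine COVID-19 cohort.
        """
        if score < 30:
            return "green"   # lower-risk band
        elif score < 60:
            return "yellow"  # intermediate band
        return "red"         # higher-risk band

    print(risk_color(45))  # -> "yellow"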

In the end, EDI is “pretty good at sorting people by their risk overall,” Singh says, “but when you get into specific thresholds in finding who needs ICU-level care, or who’s never going to need ICU-level care, it misses enough people both ways, that you wouldn’t want to solely rely on this score to make those decisions.”

He adds, “On the lower-risk spectrum, its negative predictive value is 90%. In other words, 10% of the people that you would send to a field hospital, if you were to solely follow the model, would ultimately need ICU-level care…. And I think the question is, Is that really acceptable?”
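For readers less familiar with the term, negative predictive value is simply the proportion of patients the model flags as low risk who truly never need ICU-level care. A minimal sketch of the arithmetic, using made-up counts rather than the study’s data:

    # Made-up counts for illustration; these are not the study's numbers.
    low_risk_flagged = 200   # patients the model labels low risk
    later_needed_icu = 20    # of those, patients who still required ICU-level care

    npv = (low_risk_flagged - later_needed_icu) / low_risk_flagged
    print(f"Negative predictive value: {npv:.0%}")                    # 90%
    print(f"Low-risk patients who still needed the ICU: {later_needed_icu}")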

“If you’re going to use the score, it needs to be in the context of clinical judgment or some other decision making,” says Otles. Can you use EDI to figure out “who’s going to go where”? asks Otles, and the answer, from the paper, is “probably not, for the COVID population.”

The next step: MCURES

While a team of researchers was evaluating EDI, some members of that team were simultaneously working on a COVID predictive model developed specifically for the Michigan Medicine population: the Michigan COVID-19 Utilization and Risk Evaluation System, or MCURES.

MCURES is also a deterioration index, and it came out of the same urgent needs created by COVID, says Otles. “When we were getting the initial evaluation back that Epic DI had these performance characteristics, we thought, Maybe we need to have something that’s going to work with our COVID patients…. And that’s where MCURES really started to take hold,” he says. Evaluating how MCURES performs, and comparing its performance to that of EDI, is “part of our model creation process,” Otles continues; “Epic DI creates a solid baseline” for MCURES work. EDI also serves as a benchmark for how to prospectively validate MCURES. “They’re very related to one another,” Otles says, but “MCURES is much more tailored to the specific problem of COVID.”

“EDI serves as a baseline on which we aim to improve,” says Wiens. “By tailoring the model to not only a population, but a specific use case, we can improve predictions and, in turn, patient care.”

“One of the things that Michigan is very innovative in,” says Otles, is having the expertise to both make and integrate models. “It’s allowed for a unique environment where we can start to talk about the models that are given to us by Epic, and also talk about the models that we’re building, and use both sets of information to learn from one another.”

“Every study we’re doing in deterioration indices now has the Epic score as a benchmark,” says Singh, adding, “The collaborative work that happened set us up to be able to evolve our thinking as an institution about how we’re going to identify deteriorating patients at Michigan Medicine so we can provide the best care to the sickest patients.”

“The MCURES timeline was incredibly accelerated relative to past projects,” says Wiens. “Previously, I would not have believed such progress was possible in such a short time frame. The team worked together to make it happen. We were able to build on a lot of prior experience and resources at U-M.”

More studies needed

One definite outcome of the EDI research was the recognition that more studies like this need to be done.

“What the pandemic brought out of the woodwork was that all of these models that we’re all using, but that have very little visibility in the research world, were widely in use,” says Singh. “Proprietary models are much more common than we would like to admit, and they’ve primarily been separated from the research literature because it’s a different set of people who have been looking at them.” The research into EDI “brought closer scrutiny to the presence and widespread use of proprietary prediction models, not just in COVID, but in medical care more generally.”

Valley says, “Studies like ours are really needed for things like this, particularly when these are tools that are implemented so widely. This is a risk-prediction tool that is just ubiquitous. To be able to say that it works everywhere is an incredibly large thing to say, particularly when the outcomes might vary so much from place to place. There’s a clear need for more studies like this.”

“It’s a paradox,” says Otles. “We have all of these widely implemented models that were developed in a proprietary manner, but they’re widely deployed, not well studied….We have a lot of models that are open—well studied, well understood in the research literature—and they’re not implemented.” This research may have provided the opportunity, he says, to get past the paradox.