Kapoor and Narayanan organized a workshop late last month to draw attention to what they call a “reproducibility crisis” in science that makes use of machine learning. They were hoping for 30 or so attendees but received registrations from over 1,500 people, a surprise that they say suggests issues with machine learning in science are widespread.
During the event, invited speakers recounted numerous examples of situations where AI had been misused, from fields including medicine and social science. Michael Roberts, a senior research associate at Cambridge University, discussed problems with dozens of papers claiming to use machine learning to fight Covid-19, including cases where data was skewed because it came from a variety of imaging machines. Jessica Hullman, an associate professor at Northwestern University, compared problems with studies using machine learning to the phenomenon of major results in psychology proving impossible to replicate. In both cases, Hullman says, researchers are prone to using too little data, and misreading the statistical significance of results.
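A minimal sketch (not from the article, and with made-up numbers) of the small-sample pitfall Hullman describes: with few samples and many candidate features, selecting features using the full dataset before cross-validating makes pure noise look predictive, a result a researcher could easily misread as statistically significant.

```python
# Assumed illustrative setup: 40 samples, 2,000 random features, coin-flip labels.
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 40, 2000, 10
X = rng.normal(size=(n, p))      # features are pure noise: no real signal
y = rng.integers(0, 2, size=n)   # labels are coin flips: nothing to learn

def correlation_scores(X, y):
    # |Pearson correlation| between each feature column and the labels.
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)

def cv_accuracy(X, y, k, leak):
    # 5-fold CV of a nearest-class-mean classifier on the top-k features.
    # leak=True reproduces the mistake: features chosen using ALL the data.
    folds = np.array_split(np.arange(len(y)), 5)
    keep = np.argsort(correlation_scores(X, y))[-k:] if leak else None
    hits = 0
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        sel = keep if leak else np.argsort(correlation_scores(X[train], y[train]))[-k:]
        mu0 = X[train][y[train] == 0][:, sel].mean(axis=0)
        mu1 = X[train][y[train] == 1][:, sel].mean(axis=0)
        closer_to_1 = ((X[test][:, sel] - mu1) ** 2).sum(axis=1) < \
                      ((X[test][:, sel] - mu0) ** 2).sum(axis=1)
        hits += (closer_to_1.astype(int) == y[test]).sum()
    return hits / len(y)

leaky_acc = cv_accuracy(X, y, k, leak=True)
honest_acc = cv_accuracy(X, y, k, leak=False)
print(f"leaky CV accuracy:  {leaky_acc:.2f}")   # typically inflated well above chance
print(f"honest CV accuracy: {honest_acc:.2f}")  # typically hovers near 0.5
```

Keeping the feature selection inside each training fold removes the leak, and the apparent signal vanishes.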
Momin Malik, a data scientist at the Mayo Clinic, was invited to speak about his own work tracking down problematic uses of machine learning in science. Besides common errors in implementing the technique, he says, researchers sometimes apply machine learning when it is the wrong tool for the job.
Malik points to a prominent example of machine learning producing misleading results: Google Flu Trends, a tool developed by the search company in 2008 that aimed to use machine learning to identify flu outbreaks more quickly from logs of search queries typed by web users. Google won positive publicity for the project, but it failed spectacularly to predict the course of the 2013 flu season. An independent study would later conclude that the model had latched onto seasonal terms that had nothing to do with the prevalence of influenza. “You couldn’t just throw it all into a big machine-learning model and see what comes out,” Malik says.
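The failure mode Malik describes can be sketched in a few lines. The data below is invented for illustration: a model fit to a purely seasonal query signal tracks flu perfectly in a typical year, then breaks down the moment a season behaves atypically, because it learned the calendar rather than the disease.

```python
# Illustrative toy data, not Google Flu Trends' actual inputs or model.
import numpy as np

weeks = np.arange(52)
seasonal_query = np.cos(2 * np.pi * weeks / 52)        # e.g. winter-related searches
typical_flu = np.cos(2 * np.pi * weeks / 52)           # flu activity in a normal year
atypical_flu = np.cos(2 * np.pi * (weeks - 10) / 52)   # a season whose peak shifts ~10 weeks

# Fit a simple linear model, flu ≈ a * query_volume + b, on the typical year.
a, b = np.polyfit(seasonal_query, typical_flu, 1)
pred = a * seasonal_query + b

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

print(rmse(typical_flu, pred))    # near zero: looks like a great model
print(rmse(atypical_flu, pred))   # large: the seasonal proxy collapses
```

The in-sample fit is flawless precisely because the query signal is a proxy for the season, not for influenza, which is why the model cannot follow a season that deviates from the usual pattern.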
Some workshop attendees say it may not be possible for all scientists to become masters of machine learning, especially given the complexity of some of the issues highlighted. Amy Winecoff, a data scientist at Princeton’s Center for Information Technology Policy, says that while it is important for scientists to learn good software engineering principles, master statistical techniques, and put time into maintaining data sets, this shouldn’t come at the expense of domain knowledge. “We do not, for example, want schizophrenia researchers knowing a lot about software engineering” but little about the causes of the disorder, she says. Winecoff suggests more collaboration between scientists and computer scientists could help strike the right balance.
While misuse of machine learning in science is a problem in itself, it also suggests that similar issues are likely common in corporate or government AI projects, which are less open to outside scrutiny.
Malik says he is most worried about the prospect of misapplied AI algorithms causing real-world consequences, such as unfairly denying someone medical care or unjustly advising against parole. “The general lesson is that it is not appropriate to approach everything with machine learning,” he says. “Despite the rhetoric, the hype, the successes and hopes, it is a limited approach.”
Kapoor of Princeton says it is vital that scientific communities start thinking about the issue. “Machine-learning-based science is still in its infancy,” he says. “But this is urgent—it can have really harmful, long-term consequences.”