The current boom in artificial intelligence can be traced back to 2012 and a breakthrough during a competition built around ImageNet, a set of 14 million labeled images.
In the competition, a method called deep learning, which involves feeding examples to a giant simulated neural network, proved dramatically better at identifying objects in images than other approaches. That kick-started interest in using AI to solve different problems.
But research revealed this week shows that ImageNet and nine other key AI data sets contain many errors. Researchers at MIT compared how an AI algorithm trained on the data interprets an image with the label that was applied to it. If, for instance, an algorithm decides that an image is 70 percent likely to be a cat but the label says “spoon,” then it’s likely that the image is wrongly labeled and actually shows a cat. To check, the researchers showed the images where the algorithm and the label disagreed to additional human reviewers.
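The researchers’ full method is more sophisticated, but the basic check described above can be sketched in a few lines of Python. Everything here (the function name, the 0.7 confidence threshold, the toy class indices) is an illustrative assumption, not the team’s actual code.

```python
import numpy as np

def flag_possible_label_errors(pred_probs, given_labels, threshold=0.7):
    """Flag examples where a trained model confidently disagrees with the given label."""
    predicted = pred_probs.argmax(axis=1)               # class the model thinks is most likely
    confidence = pred_probs.max(axis=1)                 # how strongly it thinks so
    disagrees = predicted != given_labels               # model and data-set label conflict
    suspicious = disagrees & (confidence >= threshold)  # conflict plus high confidence
    return np.where(suspicious)[0]                      # indices worth sending to human reviewers

# Toy case from the article: the model is 70 percent sure image 0 shows a cat (class 0),
# but the data-set label says "spoon" (class 2).
probs = np.array([[0.70, 0.10, 0.20]])
labels = np.array([2])
print(flag_possible_label_errors(probs, labels))  # -> [0]
```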
ImageNet and other big data sets are key to how AI systems, including those used in self-driving cars, medical imaging devices, and credit-scoring systems, are built and tested. But they can also be a weak link. The data is typically collected and labeled by low-paid workers, and research is piling up about the problems this method introduces.
Algorithms can exhibit bias in recognizing faces, for example, if they are trained on data that is overwhelmingly white and male. Labelers can also introduce biases if, for example, they decide that women shown in medical settings are more likely to be “nurses” while men are more likely to be “doctors.”
Recent research has also highlighted how basic errors lurking in the data used to train and test AI models (the trained programs that turn inputs into predictions) may disguise how good or bad those models really are.
“What this work is telling the world is that you need to clean the errors out,” says Curtis Northcutt, a PhD student at MIT who led the new work. “Otherwise the models that you think are the best for your real-world business problem could actually be wrong.”
Aleksander Madry, a professor at MIT, led another effort to identify problems in image data sets last year and was not involved with the new work. He says it highlights an important problem, though the methodology needs to be studied carefully to determine whether errors are as prevalent as the new work suggests.
Similar big data sets are used to develop algorithms for various industrial uses of AI. Millions of annotated images of road scenes, for example, are fed to algorithms that help autonomous vehicles perceive obstacles on the road. Vast collections of labeled medical records also help algorithms predict a person’s likelihood of developing a particular disease.
Errors in such data sets might lead machine learning engineers down the wrong path when choosing among different AI models. “They might actually choose the model that has worse performance in the real world,” Northcutt says.
Northcutt points to the algorithms used to identify objects on the road in front of self-driving cars as an example of a critical system that might not perform as well as its developers think.
It is hardly surprising that AI data sets contain errors, given that annotations and labels are typically applied by low-paid crowd workers. This is something of an open secret in AI research, but few researchers have tried to pinpoint how frequent such errors are, or to measure how they affect the performance of different AI models.
The MIT researchers examined the ImageNet test data set—the subset of images used to test a trained algorithm—and found incorrect labels on 6 percent of the images. They found a similar proportion of errors in data sets used to train AI programs to gauge how positive or negative movie reviews are, how many stars a product review will receive, or what a video shows, among others.
These AI data sets have been used to train algorithms and measure progress in areas including computer vision and natural language understanding. The work shows that the presence of these errors in the test data set makes it difficult to gauge how good one algorithm is compared with another. For instance, an algorithm designed to spot pedestrians might score slightly worse once the incorrect labels are removed. A small drop might not seem like much, but it could have big consequences for the performance of an autonomous vehicle.
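To see how test-set errors can tip such a comparison, here is a minimal, entirely hypothetical sketch: two candidate models scored on the same ten-example test set, once against the labels as shipped and once against corrected labels. The labels and predictions are invented to show how a ranking can flip; the MIT experiments are far larger.

```python
import numpy as np

def accuracy(preds, labels):
    """Fraction of predictions that match the given labels."""
    return float(np.mean(preds == labels))

# Toy benchmark with 10 test examples (class indices are arbitrary).
original_labels  = np.array([0, 1, 1, 2, 0, 2, 1, 0, 2, 1])  # labels as shipped; indices 7 and 8 are wrong
corrected_labels = np.array([0, 1, 1, 2, 0, 2, 1, 2, 0, 1])  # labels after human review

# Hypothetical predictions from two candidate models.
model_a = np.array([0, 1, 1, 2, 0, 2, 1, 0, 2, 0])  # happens to agree with the two bad labels
model_b = np.array([0, 1, 1, 2, 0, 2, 1, 2, 0, 0])  # agrees with the corrected labels instead

print("On original labels:  A =", accuracy(model_a, original_labels),
      "B =", accuracy(model_b, original_labels))   # A looks better (0.9 vs 0.7)
print("On corrected labels: A =", accuracy(model_a, corrected_labels),
      "B =", accuracy(model_b, corrected_labels))  # B looks better (0.9 vs 0.7)
# Which model seems "best" depends entirely on which labels you trust.
```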
After a period of intense hype following the 2012 ImageNet breakthrough, it has become increasingly clear that modern AI algorithms may suffer from problems as a result of the data they are fed. Some say the whole concept of data labeling is problematic too. “At the heart of supervised learning, especially in vision, lies this fuzzy idea of a label,” says Vinay Prabhu, a machine learning researcher who works for the company UnifyID.
Last June, Prabhu and Abeba Birhane, a PhD student at University College Dublin, combed through ImageNet and found errors, abusive language, and personally identifying information.
Prabhu points out that labels often cannot fully describe an image that contains multiple objects, for example. He also says it is problematic if labelers can add judgments about a person’s profession, nationality, or character, as was the case with ImageNet.