How do you measure artificial intelligence?
Since the idea first took hold in the 1950s, researchers have gauged the progress of AI by establishing benchmarks, such as the ability to recognize images, compose sentences and play games like chess. These benchmarks have proved a useful way to determine whether AI systems are improving at a widening range of tasks, and to drive researchers toward creating AI tools that are even more useful.
In the past few years, AI systems have surpassed many of the tests researchers have proposed, beating humans at a range of tasks. For researchers, the mission now is to create benchmarks that can capture the broader kinds of intelligence that would make AI truly useful: benchmarks, for instance, that reflect elusive skills such as reasoning, creativity and the ability to learn. Not to mention areas like emotional intelligence that are hard enough to measure in humans.
An AI system, for instance, can perform well enough that humans can’t always tell whether, say, an image or a paragraph was created by a human or a machine. Or ask an AI system who won the Oscar for best actress last year and it would have no problem. But ask why the actress won, and the AI would be stumped. It would lack the reasoning, the contextualizing, the emotional understanding that are needed to answer adequately.
“We’ve done the easy part,” says Jack Clark, co-chair of the AI Index, a Stanford University report that tracks AI development. “The big question is, what do really ambitious benchmarks look like in the future, and what do they measure?”