In the spring of 2007, I was one of four journalists anointed by Steve Jobs to review the iPhone. This was probably the most anticipated product in the history of tech. What would it be like? Was it a turning point for devices? Looking back at my review today, I am relieved to say it’s not an embarrassment: I recognized the device’s generational significance. But for all the praise I bestowed upon the iPhone, I failed to anticipate its mind-blowing secondary effects, such as the volcanic melding of hardware, operating system, and apps, or its hypnotic effect on our attention. (I did urge Apple to “encourage outside developers to create new uses” for the device.) Nor did I suggest we should expect the rise of services like Uber or TikTok or make any prediction that family dinners would turn into communal display-centric trances. Of course, my primary job was to help people decide whether to spend $500, which was super expensive for a phone back then, to buy the damn thing. But reading the review now, one might wonder why I spent time griping about AT&T’s network or the web browser’s inability to handle Flash content. That’s like quibbling over what sandals to wear just as a three-story tsunami is about to break.
I am reminded of my failure of foresight when reading about the experiences people are having with recent AI apps, like large language model chatbots and AI image generators. Quite rightly, people are obsessing over the impact of a sudden cavalcade of shockingly capable AI systems, though scientists often note that these seemingly rapid breakthroughs have been decades in the making. But as when I first pawed the iPhone in 2007, we risk failing to anticipate the potential trajectories of our AI-infused future by focusing too much on the current versions of products like Microsoft’s Bing chat, OpenAI’s ChatGPT, Anthropic’s Claude, and Google’s Bard.
This fallacy can be clearly observed in what has become a new and popular media genre, best described as prompt-and-pronounce. The modus operandi is to attempt some task formerly limited to humans and then, often disregarding the caveats provided by the inventors, take it to an extreme. The great sports journalist Red Smith once said that writing a column is easy—you just open a vein and bleed. But would-be pundits now promote a bloodless version: You just open a browser and prompt. (Note: this newsletter was produced the old-fashioned way, by opening a vein.)
Typically, prompt-and-pronounce columns involve sitting down with one of these way-early systems and seeing how well it replaces something previously limited to the realm of the human. In a typical example, a New York Times reporter used ChatGPT to answer all her work communications for an entire week. The Wall Street Journal’s product reviewer decided to clone her voice (hey, we did that first!) and appearance using AI to see if her algorithmic doppelgängers could trick people into mistaking the fake for the real thing. There are dozens of similar examples.
Generally, those who stage such stunts come to two conclusions: These models are amazing, but they fall miserably short of what humans do best. The emails fail to pick up workplace nuances. The clones have one foot dragging in the uncanny valley. Most damningly, these text generators make things up when asked for factual information, a phenomenon known as “hallucinations” that is the current bane of AI. And it’s a plain fact that the output of today’s models often has a soulless quality.
In one sense, it’s scary—will our future world be run by flawed “mind children,” as roboticist Hans Moravec calls our digital successors? But in another sense, the shortcomings are comforting. Sure, AIs can now perform a lot of low-level tasks and are unparalleled at suggesting plausible-looking Disneyland itineraries and gluten-free dinner party menus, but—the thinking goes—the bots will always need us to make corrections and jazz up the prose.