Some software developers are now letting artificial intelligence help write their code. They’re finding that AI is just as flawed as humans.
In June 2021, GitHub, a subsidiary of Microsoft that provides tools for hosting and collaborating on code, released a beta version of a program that uses AI to assist programmers. Start typing a command, a database query, or a request to an API, and the program, called Copilot, will guess your intent and write the rest.
Alex Naka, a data scientist at a biotech firm who signed up to test Copilot, says the program can be very helpful, and it has changed the way he works. “It lets me spend less time jumping to the browser to look up API docs or examples on Stack Overflow,” he says. “It does feel a little like my work has shifted from being a generator of code to being a discriminator of it.”
But Naka has found that the tool can introduce errors of its own into his code. “There have been times where I’ve missed some kind of subtle error when I accept one of its proposals,” he says. “And it can be really hard to track this down, perhaps because it seems like it makes errors that have a different flavor than the kind I would make.”
The risks of AI generating faulty code may be surprisingly high. Researchers at NYU recently analyzed code generated by Copilot and found that, for certain tasks where security is crucial, the code contains security flaws around 40 percent of the time.
The figure “is a little bit higher than I would have expected,” says Brendan Dolan-Gavitt, a professor at NYU involved with the analysis. “But the way Copilot was trained wasn’t actually to write good code—it was just to produce the kind of text that would follow a given prompt.”
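To give a rough sense of what a security flaw in suggested code can look like, here is a minimal Python sketch of a classic one: a database query built by pasting user input straight into the SQL text. It is an illustration only, not taken from the NYU analysis or from Copilot's output, and the table and the function names (`make_db`, `find_user_unsafe`, `find_user_safe`) are hypothetical.

```python
import sqlite3


def make_db():
    """Build a small in-memory database (hypothetical schema, for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def find_user_unsafe(conn, name):
    # Flawed: the caller's input becomes part of the SQL statement itself,
    # so a crafted name can change what the query means (SQL injection).
    query = f"SELECT name, secret FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Safer: the input is passed as a bound parameter and is treated
    # purely as data, never as SQL.
    query = "SELECT name, secret FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


conn = make_db()
payload = "nobody' OR '1'='1"
print(len(find_user_unsafe(conn, payload)))  # every row in the table leaks
print(len(find_user_safe(conn, payload)))    # no rows match the literal string
```

The two functions look almost identical at a glance, which is part of why this kind of error is easy to accept from an autocomplete suggestion without noticing.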
Despite such flaws, Copilot and similar AI-powered tools may herald a sea change in the way software developers write code. There’s growing interest in using AI to help automate more mundane work. But Copilot also highlights some of the pitfalls of today’s AI techniques.
While analyzing the code made available for a Copilot plugin, Dolan-Gavitt found that it included a list of restricted phrases. These were apparently introduced to prevent the system from blurting out offensive messages or copying well-known code written by someone else.
Oege de Moor, vice president of research at GitHub and one of the developers of Copilot, says security has been a concern from the start. He says the percentage of flawed code cited by the NYU researchers is relevant only for a subset of code where security flaws are more likely.
De Moor invented CodeQL, a tool used by the NYU researchers that automatically identifies bugs in code. He says GitHub recommends that developers use Copilot together with CodeQL to ensure their work is safe.
The GitHub program is built on top of an AI model developed by OpenAI, a prominent AI company doing cutting-edge work in machine learning. That model, called Codex, consists of a large artificial neural network trained to predict the next characters in both text and computer code. The algorithm ingested billions of lines of code stored on GitHub—not all of it perfect—in order to learn how to write code.
OpenAI has built its own AI coding tool on top of Codex that can perform some stunning coding tricks. It can turn a typed instruction, such as “Create an array of random variables between 1 and 100 and then return the largest of them,” into working code in several programming languages.
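For readers curious what "working code" for that instruction might look like, here is one hand-written Python sketch. It is not actual Codex output. The prompt does not say how many values to create, so the default of 10, the optional `seed` parameter, and the function name `largest_of_random` are all assumptions made for illustration.

```python
import random


def largest_of_random(count=10, low=1, high=100, seed=None):
    """Create `count` random integers in [low, high] and return them with the largest."""
    rng = random.Random(seed)  # a seed makes the run repeatable; None means unseeded
    values = [rng.randint(low, high) for _ in range(count)]  # randint includes both ends
    return values, max(values)


values, largest = largest_of_random()
print(values)
print(largest)
```

Even a request this small leaves choices open, such as how many values to generate and whether the endpoints 1 and 100 are included, which is one reason a developer still has to read what the tool produces.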
A related OpenAI model, called GPT-3, can generate coherent text on a given subject, but it can also regurgitate offensive or biased language learned from the darker corners of the web.
Copilot and Codex have led some developers to wonder if AI might automate them out of work. In fact, as Naka’s experience shows, developers need considerable skill to use the program, as they often must vet or tweak its suggestions.