Companies of all kinds use machine learning to analyze people’s desires, dislikes, or faces. Some researchers are now asking a different question: How can we make machines forget?
A nascent area of computer science dubbed machine unlearning seeks ways to induce selective amnesia in artificial intelligence software. The goal is to remove all trace of a particular person or data point from a machine learning system, without affecting its performance.
If made practical, the concept could give people more control over their data and the value derived from it. Although users can already ask some companies to delete personal data, they are generally in the dark about what algorithms their information helped tune or train. Machine unlearning could make it possible for a person to withdraw both their data and a company’s ability to profit from it.
Although intuitive to anyone who has rued what they shared online, that notion of artificial amnesia requires some new ideas in computer science. Companies spend millions of dollars training machine-learning algorithms to recognize faces or rank social posts, because the algorithms often can solve a problem more quickly than human coders alone. But once trained, a machine-learning system is not easily altered, or even understood. The conventional way to remove the influence of a particular data point is to rebuild a system from the beginning, a potentially costly exercise. “This research aims to find some middle ground,” says Aaron Roth, a professor at the University of Pennsylvania who is working on machine unlearning. “Can we remove all influence of someone’s data when they ask to delete it, but avoid the full cost of retraining from scratch?”
Work on machine unlearning is motivated in part by growing attention to the ways artificial intelligence can erode privacy. Data regulators around the world have long had the power to force companies to delete ill-gotten information. Citizens of some locales, like the EU and California, even have the right to request that a company delete their data if they have a change of heart about what they disclosed. More recently, US and European regulators have said the owners of AI systems must sometimes go a step further: deleting a system that was trained on sensitive data.
Last year, the UK’s data regulator warned companies that some machine-learning software could be subject to GDPR rights such as data deletion, because an AI system can contain personal data. Security researchers have shown that algorithms can sometimes be forced to leak sensitive data used in their creation. Early this year, the US Federal Trade Commission forced facial recognition startup Paravision to delete a collection of improperly obtained face photos and machine-learning algorithms trained with them. FTC commissioner Rohit Chopra praised that new enforcement tactic as a way to force a company breaching data rules to “forfeit the fruits of its deception.”
The small field of machine unlearning research grapples with some of the practical and mathematical questions raised by those regulatory shifts. Researchers have shown they can make machine-learning algorithms forget under certain conditions, but the technique is not yet ready for prime time. “As is common for a young field, there’s a gap between what this area aspires to do and what we know how to do now,” says Roth.
One promising approach, proposed in 2019 by researchers from the universities of Toronto and Wisconsin-Madison, involves segregating the source data for a new machine-learning project into multiple pieces. Each piece is then processed separately, before the results are combined into the final machine-learning model. If one data point later needs to be forgotten, only the piece containing it, a fraction of the original input data, needs to be reprocessed. The approach was shown to work on online purchase data and on a collection of more than a million photos.
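To make the idea concrete, here is a minimal sketch of that kind of sharded training in Python. It is an illustration under stated assumptions, not the researchers' actual system: the toy nearest-centroid model, the majority vote used to combine the pieces, and every name in it (`ShardedEnsemble`, `train_shard`, `forget`) are hypothetical choices made for this example.

```python
from collections import Counter


def train_shard(points):
    """Fit a toy model on one shard: the mean feature value per label.

    `points` is a list of (x, label) pairs with a single numeric feature.
    The model type is an assumption made for illustration only.
    """
    sums, counts = {}, Counter()
    for x, label in points:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def predict_shard(model, x):
    """Return the label whose mean is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


class ShardedEnsemble:
    """Hypothetical ensemble: one small model per shard of the data."""

    def __init__(self, data, num_shards):
        # data: list of (point_id, x, label); points are dealt out round-robin.
        self.shards = [dict() for _ in range(num_shards)]
        self.shard_of = {}
        for i, (point_id, x, label) in enumerate(data):
            shard_index = i % num_shards
            self.shards[shard_index][point_id] = (x, label)
            self.shard_of[point_id] = shard_index
        self.models = [train_shard(list(s.values())) for s in self.shards]

    def predict(self, x):
        # Combine the per-shard results by majority vote (an assumption).
        # Shards emptied by deletions are skipped.
        votes = Counter(predict_shard(m, x) for m in self.models if m)
        return votes.most_common(1)[0][0]

    def forget(self, point_id):
        """Drop one data point and retrain only the shard that held it."""
        shard_index = self.shard_of.pop(point_id)
        del self.shards[shard_index][point_id]
        self.models[shard_index] = train_shard(
            list(self.shards[shard_index].values())
        )
        return shard_index


# Twelve made-up data points split across three shards.
data = [(i, float(i), "low" if i < 6 else "high") for i in range(12)]
ensemble = ShardedEnsemble(data, num_shards=3)
retrained = ensemble.forget(3)  # only one of the three shard models is rebuilt
```

In this sketch a deletion request touches one shard's model and leaves the others exactly as they were, which is where the savings over retraining from scratch come from. A real system would also have to decide how many pieces to use and how to combine them without giving up too much accuracy.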
Roth and collaborators from Penn, Harvard, and Stanford recently demonstrated a flaw in that approach, showing that the unlearning system would break down if deletion requests arrived in a particular sequence, whether by chance or through a malicious actor. They also showed how the problem could be mitigated.
Gautam Kamath, a professor at the University of Waterloo also working on unlearning, says the problem that project found and fixed is an example of the many open questions remaining about how to make machine unlearning more than just a lab curiosity. His own research group has been exploring how much a system’s accuracy is reduced by making it successively unlearn multiple data points.
Kamath is also interested in finding ways for a company to prove—or a regulator to check—that a system really has forgotten what it was supposed to unlearn. “It feels like it’s a little way down the road, but maybe they’ll eventually have auditors for this sort of thing,” he says.