
Innovation Helps AI Models Forget Copyrighted and Sensitive Information

A team of computer scientists at the University of California, Riverside (UCR) has made a remarkable breakthrough in artificial intelligence (AI): they’ve developed a way to make AI models “forget” specific data without needing access to the original training dataset.

This is a game-changer in the field of AI, where deleting sensitive or copyrighted information has been one of the biggest challenges. The researchers presented their work in July at the International Conference on Machine Learning (ICML) in Vancouver, Canada—one of the most respected conferences in the world of AI and machine learning. Their study is also published on the popular arXiv preprint server, making it accessible to experts and researchers around the globe.

The Challenge of Data in AI Models

Modern AI models, like ChatGPT and other large language models (LLMs), are trained on huge datasets. These datasets often include:

  • Books and articles (many of them copyrighted),
  • Sensitive personal data (sometimes unintentionally collected),
  • Paywalled content that creators didn’t intend to make public.

Once a model is trained, this data becomes part of the model’s “knowledge.” The AI doesn’t store copies of documents but learns patterns from them, allowing it to recreate text, code, or ideas—sometimes even word-for-word.

Here’s the problem:

  • If someone wants their personal data removed, or a company wants to remove copyrighted content, it’s nearly impossible to delete it without retraining the entire AI model.
  • Retraining requires enormous amounts of money, time, and computing power.
  • In some cases, the original training data is simply unavailable or too large to manage.

This issue is at the heart of many current debates about AI ethics, privacy, and copyright. For example:

  • The New York Times is suing OpenAI and Microsoft over using its articles to train AI models without permission.
  • Privacy laws like the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA) in the U.S. require companies to honor users’ “right to be forgotten” and to delete their personal data on request.

Until now, there has been no practical, scalable way to do this in AI models.

The UCR Solution: “Source-Free Certified Unlearning”

To solve this problem, the UCR research team of Ümit Yiğit Başaran (a doctoral student), Professor Amit Roy-Chowdhury, and Assistant Professor Başak Güler developed a novel approach they call source-free certified unlearning.

Let’s break it down step-by-step:

  1. No Original Data Required
    Most existing “unlearning” methods need the original training data. The UCR method does not. This is critical because companies often lose access to data after training or cannot store massive datasets long-term.
  2. Use of Surrogate Datasets
    Instead of the original data, the method uses a smaller surrogate dataset that mimics the statistical patterns of the original data. Think of it as a “stand-in” that helps the model adjust without having to retrain from scratch.
  3. Targeted Forgetting
    The AI model’s internal parameters (the mathematical weights and structures that determine how it “thinks”) are carefully adjusted. Controlled random noise is added to ensure that unwanted information is wiped out and cannot be reconstructed (a toy sketch of this kind of update follows this list).
  4. Proven Privacy Guarantees
    Tests on synthetic (simulated) and real-world datasets show that this method achieves privacy protection close to full retraining, but at a fraction of the computational cost.
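
To make these steps more concrete, here is a minimal toy sketch in Python of what a source-free, noise-calibrated unlearning update could look like for a simple logistic-regression-style model. The function names, the Newton-style influence approximation, and the fixed noise scale are illustrative assumptions made for this article, not the UCR team’s actual algorithm or its certified guarantees.

```python
# Toy illustration only: "forgetting" a set of points from a trained
# logistic-regression model using a surrogate dataset in place of the
# original training data. The update rule and noise scale are assumptions
# made for illustration; the published method's details and proofs differ.
import numpy as np

def grad_logistic(w, X, y, lam=1e-2):
    """Gradient of the L2-regularized logistic loss over (X, y)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y) + lam * w

def hessian_logistic(w, X, lam=1e-2):
    """Hessian of the L2-regularized logistic loss over X (labels not needed)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    d = p * (1.0 - p)
    return (X.T * d) @ X / len(X) + lam * np.eye(X.shape[1])

def unlearn_without_source(w, X_forget, y_forget, X_surrogate,
                           noise_scale=1e-3, lam=1e-2, rng=None):
    """Approximately remove the influence of (X_forget, y_forget):
    1) estimate loss curvature on surrogate data (no source data needed),
    2) take a Newton-style step that undoes the forgotten points' pull,
    3) add calibrated Gaussian noise so residual traces cannot be recovered."""
    rng = np.random.default_rng() if rng is None else rng
    H_surr = hessian_logistic(w, X_surrogate, lam)        # surrogate curvature
    g_forget = grad_logistic(w, X_forget, y_forget, lam)  # pull of the forget set
    w_new = w + np.linalg.solve(H_surr, g_forget)         # Newton-style correction
    return w_new + rng.normal(scale=noise_scale, size=w.shape)

# Tiny usage example with random data, just to show the call pattern.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
X_forget = rng.normal(size=(10, 5))
y_forget = rng.integers(0, 2, 10).astype(float)
X_surr = rng.normal(size=(500, 5))
w_unlearned = unlearn_without_source(w, X_forget, y_forget, X_surr, rng=rng)
```

The key point the sketch tries to convey is that only a surrogate dataset and the forget set appear in the update; the original training data is never touched.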

As Başaran explains:

“In real-world situations, you can’t always go back and get the original data. We’ve created a certified framework that works even when that data is no longer available.”

Why It’s Practical and Scalable

The brilliance of this approach lies in its efficiency. Instead of retraining an entire model, the UCR team used advanced mathematical optimization techniques to simulate what retraining would have done—saving time and computing power.

They further refined the process with a noise-calibration system that compensates for differences between the surrogate dataset and the original one. This ensures accuracy and reliability, even though the original data is missing.
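
As a rough illustration of that calibration idea (and not the paper’s actual formula), one could imagine inflating the noise used in the sketch above in proportion to how far the surrogate data is assumed to deviate from the original:

```python
# Heuristic sketch only: scale the certification noise up as the assumed
# mismatch between surrogate and original data grows. The actual method
# derives its calibration from formal unlearning bounds, not this rule.
def calibrated_noise_scale(base_scale, surrogate_mismatch, sensitivity=1.0):
    """Larger assumed surrogate/original mismatch -> more masking noise."""
    return base_scale * (1.0 + sensitivity * surrogate_mismatch)
```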

Currently, this technique is best suited for simpler AI models, which are still widely used in industries like healthcare, journalism, and finance. But the researchers are confident they can scale it up to handle large-scale systems like ChatGPT.

According to Professor Roy-Chowdhury, co-director of UCR’s Riverside Artificial Intelligence Research and Education (RAISE) Institute:

“This breakthrough has the potential to make AI safer and more responsible by giving organizations a reliable way to erase sensitive data.”

Assistant Professor Güler highlighted its societal impact:

“People deserve to know their data can be erased from machine learning models not just in theory, but in provable, practical ways.”

What’s Next for This Technology

The team is already planning to:

  • Adapt the technique for larger models with billions of parameters.
  • Create developer-friendly tools so AI engineers worldwide can use this approach.
  • Help organizations meet global privacy and copyright regulations, giving individuals greater control over their digital footprint.

Their paper, “A Certified Unlearning Approach without Access to Source Data,” was co-authored with Sk Miraj Ahmed, a computational science researcher at Brookhaven National Laboratory, who earned his Ph.D. from UCR. Both Roy-Chowdhury and Güler are faculty members in UCR’s Department of Electrical and Computer Engineering, with additional roles in the Department of Computer Science and Engineering.

Why This Breakthrough Is a Big Deal

This research is a step toward ethical AI—where technology respects privacy, ownership, and fairness. Here’s why it matters:

  • For Individuals: It gives people power to request deletion of their personal data from AI models.
  • For Companies: It helps organizations meet legal requirements like GDPR without breaking the bank.
  • For Society: It opens the door for “machine forgetting,” an essential concept as AI becomes part of healthcare, education, journalism, and more.

Instead of fearing that sensitive or copyrighted information will live forever inside AI models, this breakthrough shows that AI can forget, and do so responsibly.

As AI continues to shape our daily lives, this kind of innovation builds trust and lays the foundation for a future where data ownership and privacy are protected, not compromised.

Written by Vivek Raman
