
SUPPORT POOL OF EXPERTS PROGRAMME

AI-Complex Algorithms and effective Data Protection Supervision - Effective implementation of data subjects’ rights by Dr. Kris SHRISHAK

INTRODUCTION

The General Data Protection Regulation (GDPR) empowers data subjects through a range of rights. A data subject has the right to information (Articles 12-14), the right of access (Article 15), the right to rectification (Article 16), the right to erasure (Article 17), the right to restrict processing (Article 18), the right to data portability (Article 20), the right to object (Article 21) and the right not to be subject to a decision based solely on automated processing (Article 22).

This report covers techniques and methods that can be used for effective implementation of data subject rights, specifically the right to rectification and the right to erasure when AI systems have been developed with personal data. This report addresses these rights together because rectification involves erasure followed by the inclusion of new data. These techniques and methods are the result of early-stage research by the academic community. Improvements and alternative approaches are expected to be developed in the coming years.

1 CHALLENGES

AI systems are trained on data that is often memorised by the models (Carlini et al., 2021). Machine learning models behave like lossy compressors of their training data, and the performance of deep learning models has been partly attributed to this behaviour (Schelter, 2020; Tishby & Zaslavsky, 2015). In other words, machine learning models are compressed versions of the training data. AI models are also susceptible to membership inference attacks, which assess whether data about a person is in the training dataset (Shokri et al., 2017). Thus, implementing the rights to erasure and rectification requires reversing the memorisation of personal data by the model. This involves deleting (1) the personal data used as input for training, and (2) the influence of those specific data points on the trained model.
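The memorisation risk can be probed with a membership inference test. The sketch below uses a simplified loss-threshold variant rather than the shadow-model attack of Shokri et al. (2017); the model, data and threshold choice are illustrative assumptions, and on such toy data the signal is weak, whereas overfitted models trained on personal data can leak considerably more.

```python
# Minimal loss-threshold membership inference sketch (illustrative assumptions;
# not the shadow-model attack of Shokri et al., 2017).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# "members" are used for training; "non-members" are held out
X_member, X_non = rng.normal(size=(200, 10)), rng.normal(size=(200, 10))
y_member, y_non = rng.integers(0, 2, 200), rng.integers(0, 2, 200)

model = LogisticRegression(max_iter=1000).fit(X_member, y_member)

def per_example_loss(X, y):
    # cross-entropy loss of the trained model on each individual record
    p = np.clip(model.predict_proba(X)[np.arange(len(y)), y], 1e-12, 1.0)
    return -np.log(p)

# Members tend to incur lower loss than non-members, so thresholding the
# per-record loss gives a (weak) signal that a person's data was in the
# training set.
threshold = np.median(per_example_loss(X_non, y_non))
flagged = per_example_loss(X_member, y_member) < threshold
print("fraction of training records flagged as members:", flagged.mean())
```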

There are several challenges to effectively implement these rights (Bourtoule et al., 2021):

  1. Limited understanding of how each data point impacts the model: This challenge is particularly prevalent with the use of deep neural networks. It is not known how specific input data points impact the parameters of a model. The best known methods rely on “influence functions” involving expensive estimations (by computing second-order derivatives of the training loss) (Cook & Weisberg, 1980; Koh & Liang, 2017); a small numerical sketch follows this list.
  2. Stochasticity of training: Training AI models usually involves random sampling of batches of data from the dataset, random ordering of how and when the batches are processed, and parallelisation without time-synchronisation. All of these make the training process probabilistic. As a result, training with the same algorithm and dataset can produce different trained models (Jagielski et al., 2023).
  3. Incremental training process: Models are trained incrementally, such that an update relying on a specific training data point affects all subsequent updates. In other words, updates in the training process depend on all previous updates. In the distributed training setting of federated learning, multiple clients keep their data and train a model locally before sending the updates to a central server. In such a setting, even when a client sends its update and contributes to the global model at the central server only once, the data and the contribution of this client influence all future updates to the global model.
  4. Stochasticity of learning: In addition to the training process, the learning algorithm is also probabilistic. The choice of the optimiser, for example for neural networks, can result in many different local minima (the result of the optimisation). This makes it difficult to correlate how a specific data point contributed to the “learning” in the trained model.
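To make the first challenge concrete, the sketch below estimates influence functions for a small ridge-regression model, in the spirit of Cook & Weisberg (1980) and Koh & Liang (2017). The toy data, the ridge term and the first-order approximation are illustrative assumptions; for deep neural networks the Hessian can only be approximated, which is what makes these estimates expensive.

```python
# Influence-function sketch for a small ridge regression (illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam = 1e-3                                                   # keeps the Hessian invertible
theta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)  # fitted parameters

def grad(x, y_true):
    # gradient of the per-example squared loss (x^T theta - y)^2
    return 2.0 * (x @ theta - y_true) * x

H = 2.0 * (X.T @ X) / n + lam * np.eye(d)                    # Hessian of the average loss

x_test, y_test = X[0], y[0]                                  # probe point whose loss we track

# Influence of up-weighting training point i on the probe loss:
#   I(i) = -grad(z_test)^T H^{-1} grad(z_i)
# Removing z_i shifts the probe loss by roughly -I(i)/n (first-order approximation).
H_inv_g_test = np.linalg.solve(H, grad(x_test, y_test))
influences = np.array([-H_inv_g_test @ grad(X[i], y[i]) for i in range(n)])

print("most influential training points:", np.argsort(-np.abs(influences))[:5])
```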

2 HOW TO DELETE AND UNLEARN

  1. Data Curation and Provenance: Essential elements to implement the rights in Articles 15-17 of the GDPR are data curation and provenance. However, these are necessary but not sufficient for implementing these rights completely, as they do not include information on how the data influenced the trained model. They are prerequisites for the other approaches in this section.
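    The sketch below illustrates the kind of provenance bookkeeping that data curation requires before any deletion or unlearning method can be applied: for every data subject, it records which datasets, training shards and model versions their data entered. The structure and field names are illustrative assumptions, not a prescribed format.

```python
# Illustrative provenance index mapping data subjects to the datasets, shards
# and model versions their personal data has influenced (assumed structure).
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    subject_id: str
    dataset_ids: set = field(default_factory=set)      # raw datasets holding the data
    shard_ids: set = field(default_factory=set)         # training shards it was assigned to
    model_versions: set = field(default_factory=set)    # trained models it influenced

class ProvenanceIndex:
    def __init__(self):
        self.records: dict[str, ProvenanceRecord] = {}

    def log(self, subject_id, dataset_id, shard_id, model_version):
        rec = self.records.setdefault(subject_id, ProvenanceRecord(subject_id))
        rec.dataset_ids.add(dataset_id)
        rec.shard_ids.add(shard_id)
        rec.model_versions.add(model_version)

    def affected_by_erasure(self, subject_id):
        # Everything that must be touched to honour an erasure request.
        return self.records.get(subject_id)
```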

  2. Retraining of models: Deleting the model, removing the personal data requested to be erased, and then retraining the model with the rest of the data is the method that implements the rights in Articles 16-17 of the GDPR. For small models, this method works well. However, for larger models, retraining is computationally expensive and alternative approaches are often required, especially when numerous deletion requests are expected. Furthermore, this approach, and many of the other approaches, assumes that the model developer is in possession of the training datasets when the requirement to delete and retrain arises.
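    As a baseline, the sketch below erases all records of one data subject and retrains from scratch; it assumes scikit-learn, a pandas DataFrame and column names ("subject_id", "label") that are purely illustrative.

```python
# "Delete and retrain" baseline (illustrative column names and model choice).
import pandas as pd
from sklearn.linear_model import LogisticRegression

def erase_and_retrain(df: pd.DataFrame, subject_id: str):
    """Remove all rows of one data subject, then retrain the model from scratch."""
    remaining = df[df["subject_id"] != subject_id]        # erase the personal data
    X = remaining.drop(columns=["subject_id", "label"])
    y = remaining["label"]
    model = LogisticRegression(max_iter=1000).fit(X, y)   # full retraining
    return remaining, model
```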

  3. Exact unlearning: To avoid retraining the entire model, approaches to unlearn the data have been proposed. Despite the growing literature, there are currently very few unlearning methods that are likely to be effective in practice.

  4. Model agnostic unlearning: This method is not dependent on the specific machine learning technique. It is the only approach which has been shown to work for deep neural networks. This approach either (1) relies on storing model gradients (Wu et al., 2020), or (2) relies on the measurement of sensitivity of model parameters to changes in datasets used in federated learning (Tao et al., 2024), or (3) modifies the learning process to be more conducive to unlearning (Bourtoule et al., 2021).

    The last of these, known as SISA (Sharded, Isolated, Sliced, and Aggregated), is currently the best known approach. It involves modifying the training process, but is independent of specific learning algorithms (Bourtoule et al., 2021). This approach presets the order in which the learning algorithm is queried to ease the unlearning process. The approach can be described as follows, with a code sketch after the steps:

    1. The training dataset is divided into multiple “shards” such that each training data point is present in only one “shard”. This allows for a non-overlapping partition of the dataset. It is also possible to further “slice” the “shards” so that the training is more modular and deletion is eased further.

    2. The model is then trained on each of these shards or slices. This limits the influence of the data points to these specific shards or slices.

    3. When a request for erasure or rectification arrives, unlearning is performed, not by retraining the entire model, but by retraining only the shard or slice that had included the data whose deletion was requested.

      This method is flexible. For instance, the shards can be chosen such that the most likely “delete request” data are in one shard. Then, fewer shards will need to be retrained, assuming that personal data and non-personal data are separated as part of data curation.
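      A minimal sketch of SISA-style sharding and unlearning is given below, assuming scikit-learn classifiers as the per-shard models; slicing and checkpointing are omitted, and the class, method and parameter names are illustrative assumptions rather than the authors' implementation.

```python
# SISA-style sharded training and unlearning (simplified sketch; slicing omitted).
import numpy as np
from sklearn.linear_model import SGDClassifier

class SISAEnsemble:
    def __init__(self, n_shards=5, seed=0):
        self.n_shards = n_shards
        self.rng = np.random.default_rng(seed)
        self.shards = {}   # shard id -> list of training indices (non-overlapping)
        self.models = {}   # shard id -> model trained only on that shard

    def fit(self, X, y):
        self._X, self._y = X, y   # kept so affected shards can be retrained later
        indices = self.rng.permutation(len(X))
        for s, part in enumerate(np.array_split(indices, self.n_shards)):
            self.shards[s] = list(part)
            self.models[s] = self._train_shard(s)

    def _train_shard(self, s):
        idx = self.shards[s]
        return SGDClassifier(random_state=s).fit(self._X[idx], self._y[idx])

    def unlearn(self, index):
        # Erase one training point: only the shard that contained it is retrained.
        for s, idx in self.shards.items():
            if index in idx:
                idx.remove(index)
                self.models[s] = self._train_shard(s)
                return s

    def predict(self, X):
        # Aggregate the per-shard models by majority vote.
        votes = np.stack([m.predict(X) for m in self.models.values()])
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```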

  5. Model intrinsic unlearning: These methods are developed for specific AI techniques. For instance, methods suitable for decision trees and random forests have been shown to be effective (Brophy & Lowd, 2021). They use a new approach to building decision trees that relies on strategic thresholding at decision nodes for continuous attributes and on high-level random nodes. The necessary statistics are then cached at all the nodes to facilitate removal of specific training instances without having to retrain the entire decision tree.
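    The caching idea can be illustrated with a single decision stump that keeps, for every candidate threshold, the class counts on each side of the split; deleting a training instance then only decrements these cached counts and re-selects the best threshold, without re-scanning the training data. This toy stump is an illustrative assumption, not the published algorithm, which handles full random forests.

```python
# Toy decision stump with cached split statistics to illustrate tree unlearning.
import numpy as np

class UnlearnableStump:
    def __init__(self, thresholds):
        self.thresholds = list(thresholds)
        # counts[t][side][cls]: instances of class `cls` on `side` (0=left, 1=right) of threshold `t`
        self.counts = {t: np.zeros((2, 2), dtype=int) for t in self.thresholds}

    def fit(self, x, y):
        for xi, yi in zip(x, y):
            for t in self.thresholds:
                self.counts[t][int(xi > t), int(yi)] += 1
        self._choose_split()

    def delete(self, xi, yi):
        # Unlearn one instance by decrementing cached counts; no re-scan of the data.
        for t in self.thresholds:
            self.counts[t][int(xi > t), int(yi)] -= 1
        self._choose_split()

    @staticmethod
    def _gini(c):
        n = c.sum()
        if n == 0:
            return 0.0
        p = c / n
        return 1.0 - float(np.sum(p ** 2))

    def _choose_split(self):
        def weighted_gini(t):
            c = self.counts[t]
            n = c.sum()
            if n == 0:
                return 0.0
            return sum(c[s].sum() / n * self._gini(c[s]) for s in (0, 1))
        self.best_threshold = min(self.thresholds, key=weighted_gini)
```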

  6. Application specific unlearning: While exact unlearning is generally expensive in terms of computation and storage, some applications and their algorithms are more suitable to exact unlearning. Specifically, recommender systems based on k-nearest neighbour models are well suited due to their use of sparse interaction data. Such models are widely used in many techniques including collaborative filtering and recent recommender system approaches such as next-basket recommendation. Using efficient data structures, sparse data and parallel updates, personal data can be removed from recommender systems (Schelter et al., 2023).
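    As an illustration, the sketch below removes one user's interactions from a sparse user-item matrix and refreshes an item-based nearest-neighbour similarity index. The full recomputation shown here stands in for the incremental index updates of Schelter et al. (2023), and the data and function names are illustrative assumptions.

```python
# Erasing a user's interactions from a sparse k-NN recommender index (illustrative).
import numpy as np
from scipy.sparse import csr_matrix

# Illustrative user-item interaction matrix (rows: users, columns: items).
interactions = csr_matrix(np.array([
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
]))

def item_similarities(m):
    # Cosine similarity between item columns; the co-occurrence matrix m.T @ m
    # stays sparse, which keeps recomputation cheap.
    norms = np.sqrt(m.multiply(m).sum(axis=0)).A.ravel()
    norms[norms == 0] = 1.0
    sims = np.asarray((m.T @ m).todense()) / np.outer(norms, norms)
    np.fill_diagonal(sims, 0.0)
    return sims

def erase_user(m, user_index):
    # Honour an erasure request by dropping the user's row; the similarity
    # index is then refreshed from the remaining interactions.
    keep = [i for i in range(m.shape[0]) if i != user_index]
    return m[keep]

interactions = erase_user(interactions, user_index=0)
print(item_similarities(interactions))
```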