---
title: 'Mastering sensitive data handling and GDPR compliant secure data removal with event sourcing'
date: '2024-11-07'
author: 'daniel-badura'
tags: [ 'PHP', 'EventSourcing', 'GDPR' ]
contentPreview: 'This post explores techniques for handling data removal in event-sourced applications, focusing on compliance and privacy needs, like GDPR. By using methods such as Crypto Shredding and Tokenization for anticipated data deletions, and History Rewriting for unexpected cases, developers can securely remove data while maintaining an immutable event store.'
---

In one of our recent blog posts, we showed the benefits that event sourcing can bring to your project and business. One
key takeaway was that we no longer lose any data. However, this can also lead to problems in specific situations, such
as when removing data sets. What do we do if we have data that needs to be removed? Is this possible with event
sourcing? How can this be done with an immutable event store? I will explain how you can overcome this problem and
how [our PHP event sourcing library](/docs/event-sourcing/latest) handles it.

## Expected data removal

First, let us begin with the simpler case: we already know that the data will need to be removed in the future. A good
example is personal data from our users. Under the EU's GDPR, we need to ensure that we can delete
all personal data from a user. Let's take a simple example: we have an aggregate where we save the name of a user, and
this name can also be changed, which is handled by a dedicated event called `NameChanged`.

```php
use Patchlevel\EventSourcing\Aggregate\Uuid;
use Patchlevel\EventSourcing\Attribute\Event;

#[Event('name_changed')]
final class NameChanged
{
    public function __construct(
        public Uuid $id,
        public string $name,
    ) {
    }
}
```

Now, if we save this event, it will be part of our immutable event store, so we cannot update it afterward. However,
since we already know that this could cause problems due to GDPR, we can take action to prevent issues. The solution is
to avoid saving sensitive data directly in our event store. We will discuss two options here: Crypto Shredding and
Tokenization.

### Crypto Shredding

With Crypto Shredding, we save the data encrypted in our event store. This way, we don’t have the name in plaintext in
our database but instead as an encrypted string. The key used to encrypt the data is saved separately. This could be in
the same database in a different table, a separate database, or even on the filesystem. Why are we doing this? With this
setup, we can “delete” the data at any time. As soon as the user requests removal from our system, we delete the
encryption key for that user. When this happens, the encrypted data in our event store becomes unreadable: et voilà,
problem solved.
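
To make the idea more tangible, here is a minimal, library-independent sketch of the principle. It is not how our library implements it: `InMemoryKeyStore`, `encryptFor()`, and `decryptFor()` are hypothetical names invented for this illustration, and the key store is just an array, whereas a real one would be a separate table or database.

```php
<?php

declare(strict_types=1);

// Hypothetical in-memory key store, one key per data subject (assumption for this sketch).
final class InMemoryKeyStore
{
    /** @var array<string, string> subject id => raw encryption key */
    private array $keys = [];

    public function getOrCreate(string $subjectId): string
    {
        return $this->keys[$subjectId] ??= random_bytes(32);
    }

    public function find(string $subjectId): ?string
    {
        return $this->keys[$subjectId] ?? null;
    }

    public function remove(string $subjectId): void
    {
        unset($this->keys[$subjectId]);
    }
}

// Encrypts a value with the subject's key; the result is what would be stored in the event.
function encryptFor(InMemoryKeyStore $keys, string $subjectId, string $plaintext): string
{
    $iv = random_bytes(12); // 96-bit nonce for GCM
    $tag = '';
    $ciphertext = openssl_encrypt(
        $plaintext,
        'aes-256-gcm',
        $keys->getOrCreate($subjectId),
        OPENSSL_RAW_DATA,
        $iv,
        $tag,
    );

    if ($ciphertext === false) {
        throw new RuntimeException('Encryption failed.');
    }

    return base64_encode($iv . $tag . $ciphertext);
}

// Decrypts a stored value; returns the fallback if the key is gone or decryption fails.
function decryptFor(InMemoryKeyStore $keys, string $subjectId, string $payload, string $fallback): string
{
    $key = $keys->find($subjectId);
    $raw = base64_decode($payload, true);

    if ($key === null || $raw === false || strlen($raw) < 28) {
        return $fallback;
    }

    // Layout produced by encryptFor(): 12 bytes nonce, 16 bytes tag, then the ciphertext.
    $plaintext = openssl_decrypt(
        substr($raw, 28),
        'aes-256-gcm',
        $key,
        OPENSSL_RAW_DATA,
        substr($raw, 0, 12),
        substr($raw, 12, 16),
    );

    return $plaintext === false ? $fallback : $plaintext;
}

$keys = new InMemoryKeyStore();
$stored = encryptFor($keys, 'user-1', 'Daniel'); // this string would end up in the event store

echo decryptFor($keys, 'user-1', $stored, 'anon'), PHP_EOL; // prints "Daniel"

$keys->remove('user-1'); // the user asked to be forgotten: only the key is deleted

echo decryptFor($keys, 'user-1', $stored, 'anon'), PHP_EOL; // prints "anon"
```

Note that the stored event is never touched. Only the key disappears, and with it the ability to read the value.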

Our library supports [Crypto Shredding](/docs/event-sourcing/latest/personal_data) and is easy to use.
For implementation, we provide some attributes: `#[PersonalData]` to mark sensitive data and `#[DataSubjectId]` to
identify the data subject whose encryption key is used. For encrypted data, there is also the option to provide fallback data if desired.

```php
use Patchlevel\EventSourcing\Aggregate\Uuid;
use Patchlevel\EventSourcing\Attribute\Event;
use Patchlevel\Hydrator\Attribute\DataSubjectId;
use Patchlevel\Hydrator\Attribute\PersonalData;

#[Event('name_changed')]
final class NameChanged
{
    public function __construct(
        #[DataSubjectId]
        public Uuid $id,
        #[PersonalData(fallback: 'anon')]
        public string $name,
    ) {
    }
}
```

Here, we would get `anon` for our `name` property if decryption is unsuccessful. With this fallback, our application can
still function as expected without crashing due to missing data. Next, we have the configuration for the type of
encryption to use. If you are
using [the Symfony bundle](/docs/event-sourcing-bundle/latest/configuration#cryptography), the
configuration is a breeze:

```yaml
patchlevel_event_sourcing:
  cryptography:
    enabled: true
    algorithm: 'aes-256-gcm'
```

Now, our `DoctrineCipherKeyStore` will be used to store the encryption keys. Since it's based
on [doctrine/dbal](https://github.com/doctrine/dbal), a wide range of databases is already supported. The algorithm is
used by our `openssl`-based implementation to encrypt and decrypt the data.

### Tokenization

Tokenization is another technique that can be used to prevent saving sensitive data in the event store. With this
approach, we don't encrypt the data and save it to the store. Instead, we send the data to
a [vault](https://www.piiano.com/) and receive a token in return. This token is then passed around in our application
and stored in the event store. Whenever we need the **real** data, we query it from the vault, which returns the data we
need.

One advantage of this solution is that we are now only handling tokens in our domain instead of sensitive data. This
reduces the potential issues you could encounter. What do I mean by that? Well, unintentionally leaking sensitive data
becomes highly unlikely, since you need to explicitly access the vault for that. Retrieving this data often requires a
valid reason, which also increases the auditability of the data.
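
As a rough illustration of the flow, consider the following sketch. `InMemoryVault` and its methods are hypothetical and only exist for this example. A real vault is an external service with its own API, access control, and audit log, none of which is modeled faithfully here.

```php
<?php

declare(strict_types=1);

// Hypothetical vault kept in memory (assumption for this sketch); in production this is an external service.
final class InMemoryVault
{
    /** @var array<string, string> token => sensitive value */
    private array $values = [];

    /** @var list<string> simple audit trail of every read access */
    private array $accessLog = [];

    // Stores the sensitive value and hands back a random, meaningless token.
    public function tokenize(string $value): string
    {
        $token = 'tok_' . bin2hex(random_bytes(16));
        $this->values[$token] = $value;

        return $token;
    }

    // Returns the real value, or null if it has been removed. Every access is logged with a reason.
    public function detokenize(string $token, string $reason): ?string
    {
        $this->accessLog[] = sprintf('%s: %s', $token, $reason);

        return $this->values[$token] ?? null;
    }

    // Removes the sensitive value; tokens that are already stored elsewhere become useless.
    public function forget(string $token): void
    {
        unset($this->values[$token]);
    }

    /** @return list<string> */
    public function accessLog(): array
    {
        return $this->accessLog;
    }
}

$vault = new InMemoryVault();
$token = $vault->tokenize('Daniel');

// Only the token is passed around in the domain and ends up in the event payload.
$eventPayload = ['id' => 'user-1', 'name' => $token];

var_dump($vault->detokenize($eventPayload['name'], 'render profile page')); // string(6) "Daniel"

$vault->forget($token); // the user asked to be forgotten

var_dump($vault->detokenize($eventPayload['name'], 'render profile page')); // NULL
```

The event store only ever sees `tok_...` values, so removing the entry from the vault is enough to make the stored events harmless.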

Unfortunately, we don't provide a solution for tokenization in our library, as we believe that tokenization should be
done before the data touches the persistence layer. Therefore, tokenization is out of scope for the library. However, if
you disagree, don't hesitate
to [open an issue or even submit a PR on GitHub](https://github.com/patchlevel/event-sourcing)!

## Unexpected data removal

Now, let's talk about the more challenging part: deleting data that we did not anticipate needing to remove. The
event store is immutable, and this case is no exception, so manipulating the event store is still a no-go. This may seem
like an impossible task, but don't worry, there is a solution.

### Rewrite History

We cannot update the events in our store, **but** we can recreate our store. What do I mean by that? I mean reading all
of our events and writing them into a new store. Between these two operations, we can perform whatever changes we need.
This could involve dropping a complete stream, replacing values with placeholders, or applying one of the previously
described solutions. The result will be a cleaned-up new event store without the data we needed to remove. We could also
get rid of some [upcasters](/docs/event-sourcing/latest/upcasting) in this process if we change the
events in the same way our upcasters did.

We are working on [a new feature](https://github.com/patchlevel/event-sourcing/pull/643) that will simplify the task
considerably. With this, you can read the current store and execute a list
of [translators](/docs/event-sourcing/latest/message#translator) on the messages. These can include
things like renaming, updating, filtering, or even creating new events. After that, the new message stream is written
into our new store. Once the new stream has been tested, we can switch our application to use the new event store.
Creating the new event store may take some time if the old store has already grown large.

```php
$oldStore; // the store currently in use, containing the events to be cleaned up
$newStore; // the new, still empty store that will replace it

$pipeline = new Pipe(
    $oldStore->load(), // load all events of the old store
    new ChainTranslator([
        new AnonymizeUserInformationTranslator(), // you can update sensitive values of events
        new MapProfileAdressToProfileLocationTranslator(), // or map events to different ones without sensitive data
        new ExcludeEventTranslator([ProfileNameUpdated::class]), // or even drop whole events
        new RecalculatePlayheadTranslator(), // we need to recalculate the playhead if we are dropping or adding new events
    ])
);

$newStore->save(...$pipeline);
```

The example above shows how you could create a one-time command to test the process and migrate the old store to the new
one. I included multiple translators to demonstrate that there are many possible ways to handle these situations. One
option could be to anonymize the data using our crypto-shredding feature to remove plaintext data from the store.
Another solution is to map the event to a different event that excludes the sensitive data. The final approach is to
drop entire events. For each case, you should thoroughly test the application afterward to prevent failures.
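
If you want to see the mechanics without the library, the following plain-PHP sketch mimics the idea. Everything in it is an assumption made for illustration: events are simple arrays, translators are closures that return a list of events (empty to drop an event), and `rewrite()` is a hypothetical helper that also recalculates the playhead per stream. The real translators in our library work on `Message` objects instead.

```php
<?php

declare(strict_types=1);

/**
 * Runs every event through all translators and renumbers the playhead per stream.
 *
 * @param iterable<array{stream: string, playhead: int, type: string, payload: array<string, mixed>}> $events
 * @param list<callable(array): list<array>> $translators
 *
 * @return Generator<int, array>
 */
function rewrite(iterable $events, array $translators): Generator
{
    $playheads = []; // last playhead written per stream in the new store

    foreach ($events as $event) {
        $current = [$event];

        // Each translator may return zero, one, or several events for one input event.
        foreach ($translators as $translator) {
            $next = [];

            foreach ($current as $item) {
                foreach ($translator($item) as $translated) {
                    $next[] = $translated;
                }
            }

            $current = $next;
        }

        // Dropped events leave gaps, so the playhead is recalculated for the new store.
        foreach ($current as $result) {
            $stream = $result['stream'];
            $playheads[$stream] = ($playheads[$stream] ?? 0) + 1;
            $result['playhead'] = $playheads[$stream];

            yield $result;
        }
    }
}

// Drops whole events of a given type.
$dropNameUpdates = static fn (array $e): array => $e['type'] === 'profile.name_updated' ? [] : [$e];

// Replaces a sensitive value with a placeholder.
$anonymizeEmail = static function (array $e): array {
    if (array_key_exists('email', $e['payload'])) {
        $e['payload']['email'] = 'anon@example.invalid';
    }

    return [$e];
};

$oldStore = [
    ['stream' => 'profile-1', 'playhead' => 1, 'type' => 'profile.created', 'payload' => ['email' => 'daniel@example.com']],
    ['stream' => 'profile-1', 'playhead' => 2, 'type' => 'profile.name_updated', 'payload' => ['name' => 'Daniel']],
    ['stream' => 'profile-1', 'playhead' => 3, 'type' => 'profile.visited', 'payload' => []],
];

$newStore = iterator_to_array(rewrite($oldStore, [$dropNameUpdates, $anonymizeEmail]), false);

foreach ($newStore as $event) {
    echo $event['playhead'], ' ', $event['type'], PHP_EOL;
}
// prints:
// 1 profile.created
// 2 profile.visited
```

The old store stays untouched the whole time, which is what allows us to test the new store before switching over.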

For a more sustainable solution, we recommend using
the [subscription engine](/docs/event-sourcing/latest/subscription) to execute the migration for
several reasons. First, this allows us to batch saves to the new `Store` easily if we use the `BatchableSubscriber`.
Second, we can run the migration in parallel with our application and rerun it easily if anything goes wrong. Lastly, schema
creation is also handled automatically.

```php
#[Subscriber('migrate', RunMode::Once)]
final class MigrateStoreSubscriber implements BatchableSubscriber
{
    private readonly SchemaDirector $schemaDirector;

    /** @var list<Message> */
    private array $messages = [];

    /** @var list<Translator> */
    private readonly array $translators;

    public function __construct(
        private readonly Store $targetStore,
    ) {
        $this->schemaDirector = new DoctrineSchemaDirector(
            $targetStore->connection(),
            new ChainDoctrineSchemaConfigurator([$targetStore]),
        );

        // same translators as above
        $this->translators = [
            new AnonymizeUserInformationTranslator(),
            new MapProfileAdressToProfileLocationTranslator(),
            new ExcludeEventTranslator([ProfileNameUpdated::class]),
            new RecalculatePlayheadTranslator(),
        ];
    }

    #[Subscribe('*')]
    public function handle(Message $message): void
    {
        $this->messages[] = $message;
    }

    public function beginBatch(): void
    {
        $this->messages = [];
    }

    public function commitBatch(): void
    {
        $pipeline = new Pipe($this->messages, new ChainTranslator($this->translators));
        $this->messages = [];

        $this->targetStore->save(...$pipeline);
    }

    public function rollbackBatch(): void
    {
        $this->messages = [];
    }

    public function forceCommit(): bool
    {
        return count($this->messages) >= 10_000;
    }

    #[Setup]
    public function setup(): void
    {
        $this->schemaDirector->create();
    }

    #[Teardown]
    public function teardown(): void
    {
        $this->schemaDirector->drop();
    }
}
```

## Conclusion

In this post, we explored how to address the challenges of data removal in event-sourced applications, focusing on cases
where compliance and privacy laws require specific data to be deletable. With techniques like **Crypto Shredding** and
**Tokenization**, we can handle personal data securely, either by encrypting sensitive information with removable keys
or by storing tokens instead of actual data. These approaches support GDPR compliance because sensitive data can be
effectively deleted without modifying the immutable event store.

When unexpected data deletion is needed, we can **Rewrite History** to re-create the event store without sensitive data.
By reading, modifying, and then writing back events into a new store, developers can meet legal or business requirements
without altering the integrity of the event-based architecture. Together, these solutions allow applications based
on [event sourcing](/docs/event-sourcing/latest) to handle data removal securely and flexibly, ensuring
both regulatory compliance and system resilience.
