GDPR-Compliant Test Data: A Practical Guide
Why fictional data matters under the GDPR, where the law actually says so, real fines, and a flowchart for picking the right strategy.
Almost every team that handles personal data in production eventually faces the same question: can we just copy that into staging? The honest, lawyer-tested answer is “usually no, and certainly not without a written basis”. This article explains where the GDPR puts the line, what the regulators have done in practice, and how synthetic test data fits into a defensible engineering process.
What the GDPR actually says
Article 5(1)(b) — Purpose limitation
Personal data must be “collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes”. When a customer enters their BSN to file a tax return, they provide it for that purpose and no other. Reusing that same BSN to debug a UI rendering bug in staging is a different purpose — arguably an incompatible one — unless you can point to a written analysis showing it is compatible (recital 50 factors).
Article 5(1)(c) — Data minimisation
Data must be “adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed”. If your test only needs “a string that looks like a BSN”, then a real BSN is by definition more than necessary. Synthetic data is the canonical way to satisfy minimisation in the testing context.
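To make “a string that looks like a BSN” concrete: the Dutch BSN carries an “11-proef” checksum (9·d1 + 8·d2 + … + 2·d8 − d9 must be divisible by 11), so a validator-passing value can be generated without it belonging to anyone. A minimal sketch in plain JavaScript, no libraries:

```javascript
// Generate a checksum-valid but fictional Dutch BSN (nine digits).
// The "11-proef": 9*d1 + 8*d2 + ... + 2*d8 - 1*d9 divisible by 11.
function randomBsn() {
  for (;;) {
    const digits = Array.from({ length: 8 }, () => Math.floor(Math.random() * 10));
    // Weighted sum of the first eight digits (weights 9 down to 2).
    const sum = digits.reduce((acc, d, i) => acc + d * (9 - i), 0);
    const check = sum % 11;     // the ninth digit must equal sum mod 11 ...
    if (check === 10) continue; // ... which no single digit can be, so retry
    digits.push(check);
    return digits.join('');
  }
}

function isValidBsn(bsn) {
  if (!/^\d{9}$/.test(bsn)) return false;
  const sum = [...bsn].reduce(
    (acc, ch, i) => acc + Number(ch) * (i === 8 ? -1 : 9 - i), 0);
  return sum % 11 === 0;
}
```

The well-known fictional value 111222333 passes this check, which is exactly why it turns up in so many Dutch test suites: realistic exercise for validators, zero personal data.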
Article 25 — Privacy by design and by default
Controllers must “implement appropriate technical and organisational measures” to apply data-protection principles effectively, including in the design phase. Choosing synthetic test data over copies of production is a classic Article-25 control: it eliminates the risk class of “developer accidentally exfiltrates a staging database” entirely, rather than mitigating it.
Article 32 — Security of processing
The same article that mandates “encryption of personal data” and “the ability to ensure ongoing confidentiality” also applies to test environments. A staging database with the same data as production but only half the access controls is a textbook Article-32 failure.
Real fines and what they teach us
H&M — €35.3 million (Hamburg, October 2020)
The Hamburg DPA fined H&M Hennes & Mauritz over the secret recording of personal employee details — family circumstances, religious beliefs, illnesses — on a network share accessible to managers. While the headline cause was unlawful collection, the investigation surfaced the broader fact that production-grade personal data was being copied into ad-hoc internal tools without purpose limitation or access controls. The case is routinely cited as a warning that “internal use” is not a get-out-of-jail card.
Replika — €5 million (Italy)
The Italian Garante first ordered Replika's operator, Luka Inc., to stop processing Italian users' data in early 2023 and later imposed a €5 million fine, in part because the service lacked effective age verification: real users' data, including children's, was processed while that safeguard was still being built and tested. The decision emphasised that experimentation on real personal data, especially of vulnerable groups, requires a far higher legal bar than experimentation on synthetic data.
Smaller cases that matter more in practice
The Dutch AP, the French CNIL and the Spanish AEPD have all issued fines (typically between EUR 50,000 and EUR 500,000) against mid-sized companies for sharing production exports with development contractors, leaving test backups on unsecured S3 buckets, or forgetting to disable email-sending in QA environments and spamming real customers. The pattern: the fine size rarely exceeds the cost of building a synthetic-data pipeline in the first place.
Do's
- Generate synthetic data on the fly in unit, integration and component tests. Tools like the kanedias.com dataset generator, Faker.js or domain-specific synthetic-data libraries cover the majority of use cases.
- Use checksum-valid fictional values (BSN, IBAN, VAT) so your validators get realistic exercise without needing real people's data.
- Document a written test-data policy that names which environments may contain personal data, who may access them, and for how long. The policy is one of the first things a regulator asks for.
- Strip data on copy-down if you absolutely need a production snapshot. Replace personal fields with synthetic equivalents before writing the snapshot to disk, not after.
- Set retention on test environments so old datasets cannot accumulate indefinitely. A daily reset is cheap and sleep-restoring.
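The “strip data on copy-down” rule can be as small as a map over rows before anything reaches disk. A hypothetical sketch (the field names and replacement rules are illustrative, not a real schema):

```javascript
// Replace personal fields with synthetic equivalents during copy-down,
// before the snapshot is written anywhere. Field names are illustrative.
const PERSONAL_FIELDS = {
  name:  (i) => `Test User ${i}`,
  email: (i) => `user${i}@example.test`,
  bsn:   ()  => '111222333', // well-known checksum-valid fictional BSN
};

function sanitiseRow(row, index) {
  const clean = { ...row };
  for (const [field, makeFake] of Object.entries(PERSONAL_FIELDS)) {
    if (field in clean) clean[field] = makeFake(index);
  }
  return clean; // non-personal fields pass through untouched
}

// Usage: rows.map(sanitiseRow) — Array.map supplies the index for free.
```

The important design point is ordering: the replacement happens in the copy pipeline itself, so an unsanitised snapshot never exists, not even briefly.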
Don'ts
- Don't copy production into staging “just to reproduce a bug”. Take a minimal, anonymised slice or ask the DPO for a one-off authorisation with a written purpose statement.
- Don't pseudonymise and call it anonymised. If the lookup table exists somewhere, the data is still personal data (Article 4(5) GDPR).
- Don't email production exports to contractors. Even if it is “just one CSV”, it triggers Article 28 processor-contract obligations and a transfer assessment.
- Don't reuse last year's test fixture if it contains real names. Old fixtures are a classic hidden source of personal data; rotate to synthetic.
- Don't skip the test environment in your DPIA. If processing in production needs a Data Protection Impact Assessment, the same processing in staging needs one too.
Decision flowchart
Use this flowchart at the start of a new feature to decide what test data you may use:
- Does the feature touch personal data in production? If no, real data is irrelevant; use synthetic data.
- If yes, can the feature be tested with synthetic data that respects the same format (BSN checksum, IBAN modulo 97, email RFC 5322, etc.)? Almost always yes. Use synthetic.
- If the test absolutely requires the statistical properties of production (load testing, ML training), use a properly anonymised extract. “Properly” means k-anonymity ≥ 5 and re-identification risk < 5%, signed off by the DPO.
- If even that is impossible, use real production data, in production-equivalent environments only, with the same access controls and retention. Document the legal basis (typically legitimate interests with a balancing test).
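The format checks mentioned in the second step are cheap to implement. As an example, here is a sketch of the IBAN mod-97 rule in plain JavaScript (it checks structure and checksum only; real IBAN validation also enforces country-specific lengths):

```javascript
// ISO 7064 mod-97 check used by IBANs: move the first four characters
// to the end, map letters to numbers (A=10 ... Z=35), and the resulting
// integer mod 97 must equal 1.
function isValidIban(iban) {
  const s = iban.replace(/\s+/g, '').toUpperCase();
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]+$/.test(s)) return false;
  const rearranged = s.slice(4) + s.slice(0, 4);
  const digits = [...rearranged]
    .map((ch) => (ch >= 'A' ? (ch.charCodeAt(0) - 55).toString() : ch))
    .join('');
  // Compute mod 97 digit by digit to avoid overflowing Number precision.
  let rem = 0;
  for (const ch of digits) rem = (rem * 10 + Number(ch)) % 97;
  return rem === 1;
}
```

A synthetic-data generator runs the same arithmetic in reverse: pick a random account body, then solve for the two check digits that make the whole string pass.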
Where kanedias.com fits
Every generator on this site produces values that are either mathematically random within a checksum constraint (BSN, IBAN, credit card), drawn from a public pool of fictional names (name generator), or constructed from format rules (email, UUID). Generation runs entirely in your browser, so no input you type and no value the generator produces ever reaches our servers. That is the simplest possible synthetic-data pipeline for small datasets, and it slots into a CI run via curl && jq on the JSON export, or a Node script that imports the generator modules directly.
Closing thought
The cheapest GDPR control is the one you build into the development workflow rather than the legal review. A test that uses synthetic data cannot leak real users' information, regardless of how badly the test environment is misconfigured. A test that copies production data can, even if the rest of the controls are perfect. Pick the option where the worst case is boring.
Frequently asked questions
Is it OK to use a sanitised copy of production data in staging?
No, not in general. The Dutch DPA (Autoriteit Persoonsgegevens) and the EDPB have repeatedly stressed that pseudonymisation is not anonymisation: a re-identifiable dataset is still personal data. If staging is accessible to engineers who would not be authorised to view production records, you have an Article 5(1)(b) purpose-limitation problem. Use synthetic data instead, or run a proper k-anonymity / differential-privacy transformation that has been signed off by your DPO.
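As a rough illustration of what a DPO sign-off could verify mechanically, here is a minimal k-anonymity measurement over chosen quasi-identifier columns (the column names are hypothetical):

```javascript
// Measure the effective k of a dataset: every combination of
// quasi-identifier values must occur at least k times.
function kAnonymity(rows, quasiIdentifiers) {
  const counts = new Map();
  for (const row of rows) {
    const key = quasiIdentifiers.map((f) => row[f]).join('|');
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return Math.min(...counts.values()); // smallest group = effective k
}

// Gate an extract before it leaves production (threshold from policy):
// if (kAnonymity(extract, ['zip', 'birthYear', 'gender']) < 5) abort();
```

This is only the measurement step; a real anonymisation pipeline also generalises or suppresses values until the threshold is met, and considers sensitive-attribute diversity within each group.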
Are fictional BSN or IBAN numbers personal data under GDPR?
No, as long as they are not linked to a real natural person. A randomly generated BSN that happens to coincide with a real one is still not personal data in your hands because you have no way to identify the person it belongs to and you have no intent to do so. The risk is contextual: if you store the same number alongside a real name and date of birth, you have created a re-identifiable record.
Has anyone been fined for using real data in tests?
Yes. The €35.3 million H&M fine (Hamburg DPA, 2020) was triggered partly because employee personal data was processed beyond its original purpose, including in ad-hoc internal tools. The Italian Replika case (enforcement from 2023, culminating in a €5 million fine) involved real users' data being processed without effective age verification. In the Netherlands, the AP has fined organisations for sharing production data with development contractors without a proper basis.
What is the cheapest way to be GDPR-safe in CI?
Generate test data on the fly inside your CI pipeline using deterministic seeded generators. The data exists only for the duration of the test run, never touches production storage, and contains no personal data by construction. This is the model used by tools like Faker.js, the kanedias.com generators, and most synthetic-data platforms.
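A minimal sketch of that model, using the small public-domain mulberry32 PRNG (the user shape and field rules are illustrative):

```javascript
// Deterministic, seeded generation for CI: the same seed always yields
// the same fixture, so failing tests are reproducible across runs.
// mulberry32 is a tiny, well-known 32-bit PRNG; not cryptographic.
function mulberry32(seed) {
  return function () {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // float in [0, 1)
  };
}

// Illustrative record shape; contains no personal data by construction.
function makeUser(rand, i) {
  return {
    id: i,
    name: `Test User ${Math.floor(rand() * 1e6)}`,
    signupYear: 2015 + Math.floor(rand() * 10),
  };
}

// Same seed, same dataset: byte-for-byte reproducible fixtures.
const rand = mulberry32(42);
const users = Array.from({ length: 5 }, (_, i) => makeUser(rand, i));
```

Put the seed in the test name or CI log and any failure can be replayed exactly, which is the practical payoff of determinism beyond compliance.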
This article is general engineering guidance, not legal advice. For specific cases, consult your DPO or qualified counsel.