What Synthetic Data Means to Information Privacy

January 13, 2021 | 6 minutes read


Improving the capacity to share data without compromising personal privacy is a growing trend in data analytics. Synthetic data is an emerging tool being considered as an option for privacy protection in data science. What is synthetic data? According to the McGraw-Hill Dictionary of Scientific and Technical Terms, synthetic data is “any production data applicable to a given situation that is not obtained by direct measurement.” For privacy purposes, synthetic data is data that is not based on any real-world individuals or events; instead, it is generated by a computer program to simulate the real information.

In the field of data management, synthetic data is defined in relation to production data. Production data is defined as “information that is persistently stored and used by professionals to conduct business processes.” Synthetic data is not real information; it is generated, often by AI, to simulate an equivalent of the ‘actual data’ so that businesses can use it for research or other studies. Since this type of data does not include the ‘actual data,’ it provides personal data protection for those in the data set.

Imagine a data set:

Actual percentage – changed to – synthetic percentage

Actual 55.7 – Synthetic 54.6
Actual 58.4 – Synthetic 59.5
Actual 60.1 – Synthetic 59.9
Actual 53.7 – Synthetic 53.9

Totals: Actual 227.9 – Synthetic 227.9

Average: Actual 56.975 – Synthetic 56.975

With this simplified example, you can see how data can be simulated to give the same results over the entire set while hiding the original values. Generally speaking, any data generated by a computer simulation is considered synthetic data. The generated data can be used in physical modeling, medical research, or even community health needs assessments. It makes model analysis possible with accurate data sets that do not point to any individual’s data.
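The arithmetic in the example can be checked with a few lines of Python. This is only a sketch that reuses the eight values from the example; the helper name summarize is made up for illustration, and rounding is applied to keep floating-point noise out of the comparison.

```python
# Values copied from the example above.
actual = [55.7, 58.4, 60.1, 53.7]
synthetic = [54.6, 59.5, 59.9, 53.9]


def summarize(values):
    """Return (total, mean), rounded to hide floating-point noise."""
    total = round(sum(values), 3)
    mean = round(sum(values) / len(values), 3)
    return total, mean


actual_total, actual_mean = summarize(actual)
synthetic_total, synthetic_mean = summarize(synthetic)

# The aggregates agree...
print(actual_total == synthetic_total, actual_mean == synthetic_mean)

# ...even though no individual value survives unchanged.
values_shared = any(a == s for a, s in zip(actual, synthetic))
print(values_shared)
```

The point of the check is that an analyst who only needs totals or averages gets the same answer from either column, while no original value appears in the synthetic one.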

To provide privacy protection, synthetic data is created through a multi-step process that begins with data anonymization. You start with a data set, anonymize it, and then convert the anonymized data to synthetic data. In this breakdown, synthetic data is derived from the anonymized data set.
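The three stages described above (data set, anonymized data, synthetic data) can be sketched in Python as follows. Everything here is an assumption made for illustration: the records, the field names, and the choice of a normal distribution as the generator. Real synthetic data tools use far richer models, and a fit on four points is only a toy.

```python
from statistics import NormalDist

# Hypothetical source records; names and scores are invented.
records = [
    {"name": "Person A", "score": 55.7},
    {"name": "Person B", "score": 58.4},
    {"name": "Person C", "score": 60.1},
    {"name": "Person D", "score": 53.7},
]

# Fields treated as direct identifiers in this sketch.
DIRECT_IDENTIFIERS = {"name"}


def anonymize(rows):
    """Stage 2: drop the fields that directly identify a person."""
    return [
        {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}
        for row in rows
    ]


def synthesize(rows, field, n, seed):
    """Stage 3: fit a simple model to one field and sample new values."""
    model = NormalDist.from_samples([row[field] for row in rows])
    return [{field: round(v, 1)} for v in model.samples(n, seed=seed)]


anonymized = anonymize(records)
synthetic = synthesize(anonymized, "score", n=4, seed=0)
print(synthetic)
```

The synthetic rows are drawn from the fitted model rather than copied from any record, which is the property the rest of this article relies on.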

Various fields and business types use synthetic data as a filter layer that helps protect the privacy and confidentiality of data subjects, who might otherwise be compromised. Many data sets used in research include synthesized data that protects the fields that reveal personal identity, including name, home address, IP address, credit data, or Social Security number; in other words, the data that points to a particular individual.

Today, the data collected about individuals’ daily lives allows for myriad ways to match data sets and pinpoint a specific subject. In a 2016 study, artificial intelligence monitored driver braking patterns and, within 15 minutes, identified the driver with 87% accuracy. So much of our daily data, including a seemingly insignificant action like the way you brake while driving, is unique to the individual. This is why there is such a need for synthetic data.

Privacy Protection with Synthetic Data

How does synthetic data impact privacy protection? Synthetic data improves privacy protection because it is artificially generated rather than real-world information. Data that is simply anonymized can be re-identified when compared with other, similar data sets, whereas synthetic data points to no one specifically. In this sense, synthetic data offers superior privacy protection.
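To make the re-identification risk concrete, here is a minimal Python sketch of matching an anonymized release against a second data set. All records, names, and field choices are hypothetical; the sketch assumes that ZIP code, birth year, and sex were left in the release after names were removed.

```python
# Hypothetical "anonymized" release: names removed, other fields kept.
anonymized = [
    {"zip": "02138", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
]

# Hypothetical second data set that still carries names.
public = [
    {"name": "Alice Example", "zip": "02138", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1982, "sex": "M"},
    {"name": "Carol Example", "zip": "02139", "birth_year": 1990, "sex": "F"},
]

# Fields shared by both data sets and used for matching.
SHARED_FIELDS = ("zip", "birth_year", "sex")


def link(released, known):
    """Return (name, diagnosis) pairs where the shared fields match exactly one person."""
    index = {}
    for person in known:
        key = tuple(person[f] for f in SHARED_FIELDS)
        index.setdefault(key, []).append(person["name"])

    matches = []
    for row in released:
        names = index.get(tuple(row[f] for f in SHARED_FIELDS), [])
        if len(names) == 1:  # a unique match re-identifies the row
            matches.append((names[0], row["diagnosis"]))
    return matches


reidentified = link(anonymized, public)
print(reidentified)
```

A synthetic row has no real person behind it, so this kind of exact match does not name an actual individual in the same direct way.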

Companies often make synthetic data available for processing when there are concerns that releasing the original data may violate privacy regulations. Processing consumer data requires strict compliance. Under regulations like the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), companies can face huge fines and penalties for releasing private consumer data.

In instances where consumer privacy is an issue, synthetic data is used. It is a form of anonymization that gives companies more agility to use, process, analyze, or share the data in a safe and compliant manner. Synthetic data is used explicitly for the preservation of privacy. “Synthetic data is described as artificially generated data that contains properties of the original data without disclosing the actual original data.”

Everything Has Limitations

Companies have been turning to synthetic data as a viable option to balance data privacy with the need for quality data. Synthetic data has often been described as the answer for providing complete data values while ensuring privacy protection. Calling it perfect, though, oversimplifies matters.

It may seem that synthetic data is the ‘answer’ to the need for quality research data without compromising privacy, but nothing is that easy. Synthetic data has limitations that stem from fundamental mathematical constraints. A perfect solution, one that delivers both full privacy and data values identical to the original in a single dataset, is mathematically impossible.

Treating what may be a band-aid solution as a perfect fix for privacy is like believing in perpetual motion: the idea is known to be scientifically false. Giving synthetic data the status of a Star Trek replicator is deceptive. Without further study, to say that it is a perfect solution is a misrepresentation. Companies should be aware that there are consequences for adopting newer solutions that may require more study. Experts in the field are noting some shortcomings. These limitations could lead to breaches of customer privacy and penalties for violations of current privacy laws.

To be sure, it is impossible for a synthetic dataset to be statistically identical to the original data and also perfectly preserve privacy. Synthetic data does have its benefits in providing highly accurate statistics for study and research. It can also claim to be differentially private.
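The trade-off between accuracy and privacy can be illustrated with the Laplace mechanism, a standard building block of differential privacy. The article does not describe any particular mechanism, so the sketch below is an assumption: the function names, the bounds of 0 to 100, and the epsilon values are all chosen for illustration only.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Return a noisy mean of values clipped to [lower, upper].

    Replacing one record can move the clipped mean by at most
    (upper - lower) / n, so that is the sensitivity used for the noise.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # seeded only to make the sketch repeatable
values = [55.7, 58.4, 60.1, 53.7]

# Smaller epsilon means stronger privacy and a noisier answer.
for epsilon in (0.1, 1.0, 10.0):
    print(epsilon, round(private_mean(values, 0.0, 100.0, epsilon, rng), 2))
```

With only four records the noise is large at any meaningful privacy level, which is one way to see why exact statistics and strong privacy cannot both hold for the same dataset.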

For the average person, perhaps, synthetic data comes close to perfect privacy. It is not perfect, however: just as AI and ML can use smart algorithms to simulate data, they can also be combined with other data sets to unravel and match specific data.

The truth is that synthetic data does provide a much higher level of data privacy than other means currently available. Any claim that an application provides or guarantees 100% protection will always be proven false. This level of security and accuracy, from a scientific perspective, will not be achieved with any technology. This applies to future innovations as well, even ones beyond Star Trek replicators and teleportation devices. To claim synthetic data is a perfect solution is not credible, but understanding the shortcomings and using the information sets with care can give much greater protection than any other anonymization solution currently available.

Privacy and AI – The Future

Many companies are using synthetic data in training artificial intelligence (AI) and machine learning (ML) applications. Real-world data can be expensive to collect, while an equivalent volume of synthetic data is more easily acquired. One central area in which privacy is protected while AI is developed for a specific purpose is autonomous vehicles. Creating the software that allows for safe autonomous driving requires volumes of data so the system can learn and react to driving conditions.

For these types of data applications, synthetic data allows AI and ML models to react to a wide variety of situations that even real-world data may not demonstrate.

Synthetic data is a viable option. Corporations often use it to evaluate vendors. When choosing a vendor that may need to handle consumer or private data, a company can assess the risks without releasing the actual data. Any situation in which data is exchanged increases the chance of a data breach, which could significantly damage a business’s reputation and lead to fines, legal costs, and loss of revenue.