Using Differential Privacy to Conceal Data

December 31, 2020 | 7 minutes read

Methods to Conceal Data

Advanced technology, and the fact that people use it every day, leaves a trail of data behind every one of us. That data can be gathered and analyzed by unknown parties. These data crunchers can infer your health problems, track your movements throughout the day, and even estimate whether you are experiencing depression.

No one wants their personal data or health information exposed, much less have someone find data that points to mental distress. Identity theft, bank fraud, and many other crimes are committed by bad actors who steal people’s personally identifiable data. A company that handles large amounts of customer data must keep its customers’ trust to maintain a good relationship with them. Releasing personal data, or losing it through a breach, could mean extensive losses for a business.

Redaction is one method of concealing data. It is a form of editing in which confidential information is covered with a black box, so its presence is indicated but its content is masked. The practice is also called sanitization. When data is sanitized, all personally identifiable information has been concealed or removed.

Anonymization is another method used to conceal data while keeping some information intact. Its purpose is privacy protection, and it is also a form of sanitization. Data that should not be released publicly, such as names, social security numbers, or home addresses, is removed, leaving the remaining data available for research and other purposes.

There is some controversy over how well anonymization conceals identifiable information. Technology is advancing by leaps and bounds. With artificial intelligence and the right algorithms, data sets can be compared and the missing data inferred. When data sets are queried together and a record is matched to a real identity, the result is de-anonymization. The remedy that many are now considering is termed ‘differential privacy.’
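To make the comparison of data sets concrete, here is a minimal Python sketch of how two data sets might be linked. Every record, field name, and the `link_records` helper is made up for illustration. It assumes the "anonymized" set still carries a few indirect attributes (ZIP code, birth year, sex) that also appear in a second, public set.

```python
# Toy linkage sketch: every record below is invented for illustration.
# "Anonymized" health records: names removed, indirect attributes kept.
anonymized_health = [
    {"zip": "02138", "birth_year": 1975, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1982, "sex": "M", "diagnosis": "asthma"},
    {"zip": "02140", "birth_year": 1990, "sex": "F", "diagnosis": "depression"},
]

# A second, hypothetical public data set that still contains names.
public_roll = [
    {"name": "Alice Example", "zip": "02138", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1982, "sex": "M"},
    {"name": "Carol Example", "zip": "02141", "birth_year": 1990, "sex": "F"},
]

# Attributes shared by both data sets (an assumption of this sketch).
SHARED_KEYS = ("zip", "birth_year", "sex")


def link_records(anonymized, public, keys=SHARED_KEYS):
    """Return (name, diagnosis) pairs where exactly one public record matches."""
    matches = []
    for record in anonymized:
        fingerprint = tuple(record[k] for k in keys)
        candidates = [p for p in public if tuple(p[k] for k in keys) == fingerprint]
        if len(candidates) == 1:  # a unique match re-identifies the subject
            matches.append((candidates[0]["name"], record["diagnosis"]))
    return matches


reidentified = link_records(anonymized_health, public_roll)
for name, diagnosis in reidentified:
    print(name, "->", diagnosis)
```

In this toy data, two of the three "anonymous" records match exactly one named record, which is enough to attach a diagnosis to a name even though no name was ever published with the health data.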

Differential Privacy

As big data corporations continue to soak up data sets like a dry sponge soaks up water, privacy activists are rethinking anonymization. With the realization that de-identification can be reversed, proponents of a new cybersecurity model known as differential privacy have come forward. With the advent of big data, machine learning, and advances in data science, it has become apparent that previous privacy methods need to be reconsidered.

Cybersecurity specialists now claim that differential privacy (DP) methods can protect personal data better than traditional methods. DP is a relatively recent, state-of-the-art concept based on mathematical algorithms. Confidence in this new privacy model is pushing larger companies to turn to DP methods to protect privacy.

It is already being used by organizations such as Apple, Uber, Google, and the US federal government (Census Bureau). The primary requirement of differential privacy is that data subjects are not harmed by having their personal data entered into the database. It also aims to maximize the utility and accuracy of the results.

Organizations that use DP share data publicly by describing patterns in the data set while withholding the subjects’ personal data. The concept relies on the idea that a small substitution in the data set should have almost no effect on the published result, which makes it nearly impossible to infer details about those in the study. Since data subjects are never identified, it provides a better approach to protecting privacy. It can also be described as a restriction on the algorithms used to publish large data sets, limiting disclosure of any personal or private data within the collection.

Data meets the standard for differential privacy when the output cannot be used to identify a particular subject’s personal data. When facing data breaches and re-identification attacks, DP output is likely to resist such an invasion or loss of sensitive data. Since DP was developed by cryptographers, it is often closely linked to cryptography, and much of the language used in algorithm development comes from that field.
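The standard described in this section is usually written as an inequality. As a sketch of the common formal statement, the notation here is introduced for illustration and does not appear in the text above: M is the randomized algorithm that publishes results, x and x′ are two data sets that differ in one subject’s data, S is any set of possible outputs, and ε is a small positive privacy parameter.

```latex
\Pr[\,M(x) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(x') \in S\,]
\quad \text{for all neighboring } x, x' \text{ and all output sets } S
```

Read informally, any published result is almost as likely whether or not one particular subject’s data is in the data set, so the result reveals very little about that subject. The smaller ε is, the closer the two probabilities must be.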

Random Noise

Implementing differential privacy can be a matter of adding random noise to the data. Suppose you want to publish how many people in the data set satisfy a given condition. An adversary may hold nearly the same data that you do and could compare it against the published results to re-identify individuals. Since this is what you are trying to avoid, take a moment to understand how to add noise, and never publish the exact answers.

If your data is attacked, you should assume that the attacker has similar data sets but does not have an exact identity or target. It is like trying to hit the bullseye while playing darts. Each ring, from the outside in, gets you closer to the answer. Because the noise is small, a dart can land a fractional distance from the center and can occasionally hit the center itself. The average of many throws points to the exact center, but no single answer is precise enough to match it to any existing subject.

In reality we can compute the exact answer, but we add noise to prevent identifying an actual individual. The noise is drawn from a probability distribution, commonly the Laplace distribution. The distribution has a scale parameter that controls how far the published value is likely to fall from the exact one, so the value may not be exact but can still give researchers the results they need for analysis.
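As a rough illustration of the idea, the following Python sketch publishes a noisy count instead of the exact one. The helper names, the `scale` value, and the list of ages are all assumptions made for this example. The Laplace sample is built from the standard library by subtracting two exponential samples, which is one common way to draw from that distribution.

```python
import random


def laplace_noise(scale, rng=random):
    """Draw one sample from a Laplace distribution centered at 0.

    The difference of two independent exponential samples, each with
    mean `scale`, follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, scale, rng=random):
    """Count the records that satisfy `predicate`, then add Laplace noise."""
    exact = sum(1 for r in records if predicate(r))
    return exact + laplace_noise(scale, rng)


# Made-up ages, purely for illustration.
ages = [23, 35, 41, 52, 67, 29, 71, 38]

rng = random.Random(0)  # seeded only so the sketch is repeatable
published = noisy_count(ages, lambda age: age >= 40, scale=1.0, rng=rng)
print(f"published count: {published:.2f}")
```

The exact count here is 4, but only the noisy value would be released. A larger `scale` hides individuals better and makes each published number less accurate.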

Balancing Utility and Privacy

Data scientists like to assign a numerical value to everything they see. Every part of your day is a data point. The brand of shampoo you use, the coffee you drink, the distance you drive to work – literally everything you do is a data point. While some of us understand this, we often don’t consider the details that can be obtained from this data. Corporations or governments can use it to make inferences about your health, behavior, and lifestyle. The point of using differential privacy is to use data for studies, such as health data for diabetes, without the price of subjects’ private information being exposed or exploited. It is about striking a balance between utility and privacy expectations.
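One way to see this balance is to measure how much accuracy is lost as more noise is added. The Python sketch below is only an illustration: the function names, the trial count, and the three noise scales are arbitrary choices for this example. It estimates the average error introduced by Laplace noise at each scale.

```python
import random


def laplace_noise(scale, rng):
    """Laplace sample centered at 0, built from two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def mean_absolute_error(scale, trials=5000, seed=0):
    """Estimate the average distance between a noisy answer and the exact one."""
    rng = random.Random(seed)
    return sum(abs(laplace_noise(scale, rng)) for _ in range(trials)) / trials


# Larger scale: stronger privacy protection, less accurate answers.
errors = {scale: mean_absolute_error(scale) for scale in (0.5, 1.0, 5.0)}
for scale, error in errors.items():
    print(f"scale={scale}: average error is about {error:.2f}")
```

The average error grows in step with the noise scale, so choosing the scale is a direct choice about how much utility to trade for privacy.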

Sensitivity

In differential privacy, ‘sensitivity’ is a parameter of the query. It defines how much noise the differential privacy functions must add to get good results while preventing re-identification of the data.

To determine the sensitivity, the maximum change or possible range for results needs to be calculated. It refers to the impact a single change in the data set could have on query results. For example:

Let $x_A$ and $x_B$ be any two data sets, drawn from all the possible data in database $X$, that differ by a single element.
In this case, the equation would look something like this:
\[ \text{Sensitivity} = \max_{x_A,\, x_B \subseteq X} \left| q(x_A) - q(x_B) \right| \]

The queried results are fractionally close to the actual answer. Understanding the maximum and minimum values helps researchers learn more about the effects of their query.
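For small cases, sensitivity can be checked by brute force. The Python sketch below is a hedged illustration, not an efficient method: it assumes each record is a small integer from 0 to 5 and that "differing by a single element" means one record is replaced by another value. The helper names and the two queries are hypothetical.

```python
from itertools import product


def neighbors(dataset, universe):
    """Yield every data set that differs from `dataset` in exactly one element."""
    for i, original in enumerate(dataset):
        for value in universe:
            if value != original:
                yield dataset[:i] + (value,) + dataset[i + 1:]


def sensitivity(query, universe, size):
    """Brute-force the maximum |q(xA) - q(xB)| over all neighboring pairs."""
    worst = 0
    for x_a in product(universe, repeat=size):
        for x_b in neighbors(x_a, universe):
            worst = max(worst, abs(query(x_a) - query(x_b)))
    return worst


# Hypothetical universe: each record is an integer between 0 and 5.
universe = range(6)


def count_over_3(data):
    """Counting query: how many records are greater than 3."""
    return sum(1 for v in data if v > 3)


def total(data):
    """Sum query: the total of all records."""
    return sum(data)


count_sensitivity = sensitivity(count_over_3, universe, size=3)
sum_sensitivity = sensitivity(total, universe, size=3)
print(count_sensitivity, sum_sensitivity)
```

Changing one record can move a count by at most 1, while it can move a sum by the full width of the allowed range. This is why a sum query needs more noise than a counting query over the same data.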

Laplace Mechanism

The Laplace mechanism is a mathematical tool for implementing differential privacy on some query or function f to be executed on a given database. It works by adding noise to the output of f, with the amount of noise governed by a given parameter.

Mathematically speaking, a function computing the average or the standard deviation would look much like this:

Let f(x1, x2, …, xn) be the function used on data within a database or data set.
‘f’ can be considered the function that computes and returns the average or the standard deviation for a set of values.
Let x and x′ be two databases that are nearly identical but differ in precisely one piece of data.
Let $\Delta f = \max_{x,\, x'} \left| f(x) - f(x') \right|$.
∆f is the ‘sensitivity’ of the function: the maximum difference between the results of f when it is executed on x and on x′.

The output of the mechanism on some database x is f(x) + b, where b is the noise value.

The Laplace mechanism achieves the overall aim of adding noise values to satisfy differential privacy. The algorithm still computes f accurately enough that the output stays close to the best result we could extract from the query.
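Putting the pieces together, a minimal Python sketch of the mechanism for an average might look like the following. Several things here are assumptions rather than part of the description above: the privacy parameter `epsilon`, the common choice of noise scale ∆f / epsilon, the clamping bounds `lower` and `upper`, the function names, and the made-up readings. The sketch also assumes the number of records is public and that neighboring databases differ by replacing one value.

```python
import random


def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Return true_value + b, with b drawn from Laplace(0, sensitivity / epsilon)."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    scale = sensitivity / epsilon
    # Difference of two exponential samples gives a Laplace sample.
    b = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_value + b


def private_mean(values, lower, upper, epsilon, rng=random):
    """Noisy average of values clamped to [lower, upper].

    Replacing one of n clamped values moves the mean by at most
    (upper - lower) / n, which is used as the sensitivity here.
    """
    clamped = [min(max(v, lower), upper) for v in values]
    n = len(clamped)
    exact = sum(clamped) / n
    return laplace_mechanism(exact, (upper - lower) / n, epsilon, rng)


# Made-up readings, purely for illustration.
readings = [92, 105, 130, 88, 150, 97, 110, 121]

released = private_mean(readings, lower=50, upper=250, epsilon=1.0,
                        rng=random.Random(42))
print(f"released mean: {released:.1f}")
```

With only eight records the noise is large relative to the true mean, which is expected: the sensitivity of an average shrinks as the number of records grows, so the same privacy setting costs far less accuracy on a large data set.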

Keeping Data Under Lock and Key

There are a variety of processes used to help protect sensitive data. Improving everyone’s lives through health discoveries and more starts with researching data sets on large groups of people. These methods allow the data to be used while still doing all that is necessary to protect private data. As technology advances, we will have to develop better strategies and more advanced algorithms to keep data safe.