Protecting Against Re-Identification, New Techniques

December 30, 2020 | 7 minutes read

Data Anonymization

Many businesses are legally required to protect the data they collect on their customers and employees. Personally identifiable information (PII), such as a name, address, or social security number, can create a risk of fraud, theft, and other harm in the wrong hands. Businesses that store the data they collect from customers carry that risk for as long as they keep and handle it.

For instance, businesses spend a great deal on cybersecurity. However, if a hacker or other individual breaches the server and makes off with the customer data, the company is still responsible. The business’s reputation can suffer, and the cost of recovering from a data breach can run into the millions.

One way many companies store data with less risk is data anonymization. Similar to redaction, it is a type of data sanitization: PII is removed from the stored data, or altered, in order to protect privacy. This is done so that the individuals behind the data become ‘anonymous.’

The International Organization for Standardization (ISO) defines data anonymization as a “process by which personal data is irreversibly altered in such a way that a data subject can no longer be identified directly or indirectly, either by the data controller alone or in collaboration with any other party.”

The process of data anonymization allows for the transfer of information across boundaries with less risk of privacy violations or unintended disclosures. Companies may sell anonymized data sets to third parties. While any PII is removed or altered, the remaining data, such as health information or purchasing habits, is shared for evaluation and analytics. Research on large groups of individuals for health studies can be done with anonymized data.

Data Re-Identification

Data re-identification is also known as de-anonymization. The process involves taking several data sets, comparing that data with other publicly available information, and determining a subject’s personal identity.

The goal of re-identification is to determine the identity of the individual to whom the data belongs. This is a concern because many companies, health care providers, and financial businesses promise privacy to their consumers in their privacy policies, yet often release data once it has gone through a de-identification process.

The de-identification process uses several methods to remove or alter the PII. Widely used terms in describing these methods are:

Masking – also known as data obfuscation, is the process of hiding personal data with modified content.
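Generalizing – also known as ‘rolling up’ data, is the process of creating successive layers of summarized data, such as replacing an exact age with an age range.
Deleting – is the process of removing a field or record from the data set entirely.
Direct Identifiers – are fields that identify a person on their own, such as a name or social security number.
Indirect Identifiers – are fields that do not identify a person alone but can in combination, such as ZIP code, birth date, and gender.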
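
To make the terms above concrete, here is a minimal Python sketch of masking, generalizing, and deleting applied to a single record. The record layout, field names, and helper functions are hypothetical and chosen only for illustration; a real de-identification pipeline would be driven by the applicable regulation and a risk assessment.

```python
# Hypothetical record; every value here is made up for illustration.
RECORD = {
    "name": "Jane Doe",          # direct identifier
    "ssn": "000-00-1234",        # direct identifier
    "zip": "02139",              # indirect identifier
    "age": 37,                   # indirect identifier
    "diagnosis": "diabetes",     # the attribute analysts care about
}


def mask_ssn(ssn: str) -> str:
    """Masking: hide all but the last four digits behind modified content."""
    return "***-**-" + ssn[-4:]


def generalize_age(age: int, width: int = 10) -> str:
    """Generalizing: roll an exact age up into a range of `width` years."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_zip(zip_code: str, keep: int = 3) -> str:
    """Generalizing: keep only the leading digits of a ZIP code."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def de_identify(record: dict) -> dict:
    """Return a de-identified copy; the input record is left unchanged."""
    out = dict(record)
    del out["name"]                        # Deleting: drop the field entirely
    out["ssn"] = mask_ssn(out["ssn"])
    out["age"] = generalize_age(out["age"])
    out["zip"] = generalize_zip(out["zip"])
    return out


print(de_identify(RECORD))
# {'ssn': '***-**-1234', 'zip': '021**', 'age': '30-39', 'diagnosis': 'diabetes'}
```

Note that the indirect identifiers are only coarsened, not removed, which is why the resulting record can still be useful for analysis and can still carry some re-identification risk.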

There are no universal standards defining how these methods must be applied. When data is released publicly, even anonymized data may be re-identified when compared with other available data sets. Matching algorithms, including ones that use artificial intelligence, can compare several large data sets and produce a new data set in which the records are re-identified.
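
The comparison described above is often called a linkage attack. The sketch below shows the basic idea under simple assumptions: a released data set with names removed, a separate public list that includes names, and a shared set of indirect identifiers. All records, field names, and the `link` function are hypothetical.

```python
# Released "anonymized" data: names removed, indirect identifiers kept.
released = [
    {"zip": "02139", "birth_year": 1983, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1990, "sex": "M", "diagnosis": "asthma"},
    {"zip": "02141", "birth_year": 1983, "sex": "F", "diagnosis": "flu"},
]

# A separate, publicly available list that includes names.
public = [
    {"name": "Jane Doe", "zip": "02139", "birth_year": 1983, "sex": "F"},
    {"name": "John Roe", "zip": "02139", "birth_year": 1990, "sex": "M"},
    {"name": "Sam Poe", "zip": "02139", "birth_year": 1990, "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link(released_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs where exactly one public record matches."""
    index = {}
    for person in public_rows:
        key = tuple(person[k] for k in keys)
        index.setdefault(key, []).append(person["name"])

    matches = []
    for row in released_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the row
            matches.append((names[0], row["diagnosis"]))
    return matches


print(link(released, public))
# [('Jane Doe', 'diabetes')]
```

The second released row matches two people and stays ambiguous, and the third matches nobody. Only the row whose combination of indirect identifiers is unique in the public list is re-identified, which is the situation the prevention techniques later in this article try to avoid.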

Companies may do this type of research to target individuals for sales. For example, if a data set on diabetic patients is re-identified with the help of additional data, a company can then target those individuals for business purposes. It can send advertisements for diabetic health products and know with near certainty that they are reaching a diabetic patient.

The US Department of Health and Human Services and other federal agencies have predicted that re-identification is becoming easier. With the growth of ‘big data,’ available data sets can be continuously collected, compared, and analyzed. This could soon make re-identification as easy as the push of a button.

Although re-identification has been shown to work, some still claim that de-identification is a safe data-sharing tool and do not see re-identification as a concern. The problem is that these are often the same companies that profit from the sale of data and that re-identify data for their own use and profit.

Does Anonymization Really Work?

If data sets can be re-identified, does anonymization really work? Only partly. So why do we continue to use it? Privacy legislation requires companies to keep data safe, and anonymization is one way to comply. Businesses also use data anonymization to maximize the value of the data they collect. This way, the business can use the data, lower the risk of a breach while storing it, and profit from the sale of anonymized data sets.

Regulations vary from country to country and state to state; generally speaking, however, data that meets de-identification or anonymization requirements is no longer considered ‘personal data.’ This means that in most cases, including under the California Consumer Privacy Act (CCPA) and the European General Data Protection Regulation (GDPR), anonymized data carries no restrictions on a company’s ability to collect, use, retain, sell, or even publicly disclose it. Because the data can be re-identified, this can be seen as a gap in privacy protection. On the other hand, a company that fails to de-identify or anonymize its data may violate the CCPA or GDPR, which could lead to severe penalties and losses.

Preventing Re-Identification

Businesses use de-identification and anonymization as data privacy techniques. As the sections above show, because systems exist that can re-identify data, there is still a substantial risk to both the data and the individuals it represents.

A few suggested approaches can help reduce the risk and may even prevent most re-identification attempts. Used together, they form a multi-faceted approach that can better protect against potential privacy threats. Three privacy-enhancing techniques that may improve the de-identification process and reduce re-identification risk are as follows:

K-Anonymization – Latanya Sweeney and Pierangela Samarati introduced k-anonymity in a paper published in 1998 as an attempt to solve the following problem: “Given person-specific field-structured data, produce a release of the data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful.” A set of data is said to have the k-anonymity property if the data for each individual in the collection cannot be distinguished from that of at least k - 1 other individuals within the set.
Randomized Response – first introduced as a research method for surveys. It allows survey respondents to answer sensitive questions while maintaining confidentiality: each respondent randomizes the answer, for example with a coin flip, so no single answer can be taken at face value, yet the overall rate can still be estimated.
Sampling – data sampling is a well-known statistical analysis technique in which a subset of data points is selected, manipulated, and analyzed. Generally, sampling is used to identify patterns and trends in a larger data set. It enables analysts to work with small, manageable amounts of data while still building and running analytical models that produce accurate findings about the full data set.
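
As a rough illustration of the k-anonymity property, the sketch below computes k for a small table: it groups rows by their indirect identifiers and reports the size of the smallest group. The rows, column names, and the `k_anonymity` helper are hypothetical, and this only checks the property; it does not generalize the data to achieve it.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    values for all quasi-identifier columns (0 for an empty table)."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


# Made-up rows that have already been generalized.
rows = [
    {"zip": "021**", "age": "30-39", "diagnosis": "diabetes"},
    {"zip": "021**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "021**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "021**", "age": "40-49", "diagnosis": "diabetes"},
    {"zip": "021**", "age": "40-49", "diagnosis": "asthma"},
]

print(k_anonymity(rows, ("zip", "age")))
# 2
```

Here every combination of ZIP prefix and age range is shared by at least two rows, so the table is 2-anonymous with respect to those two columns. A single row with a unique combination would drop the result to 1.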

The idea is that applying all three methods to a particular data set dramatically reduces the risk of re-identification.
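
Of the three, randomized response is probably the least intuitive, so a small simulation may help. The sketch below assumes one common variant: each respondent answers truthfully with probability `p_truth` and otherwise gives a random yes or no. The function names, the 30% true rate, and the seed are illustrative assumptions.

```python
import random


def randomized_answer(truth: bool, rng: random.Random, p_truth: float = 0.5) -> bool:
    """Answer truthfully with probability p_truth, otherwise flip a fair coin."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(answers, p_truth: float = 0.5) -> float:
    """Recover the population 'yes' rate from the noisy answers.

    P(reported yes) = p_truth * rate + (1 - p_truth) * 0.5, solved for rate.
    """
    observed = sum(answers) / len(answers)
    return (observed - (1 - p_truth) * 0.5) / p_truth


rng = random.Random(0)                         # fixed seed for repeatability
truths = [i < 3000 for i in range(10_000)]     # 30% of respondents are "yes"
answers = [randomized_answer(t, rng) for t in truths]
print(round(estimate_true_rate(answers), 2))   # close to 0.30
```

No individual answer reveals the respondent's true status, because any "yes" may have come from the coin flip, yet the aggregate estimate stays close to the real rate when the sample is large.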

Differential Privacy

As artificial intelligence, machine learning, and other computing disciplines advance, the idea that anonymization protects personal data may be fading. Now that it is becoming easier to re-identify data sets, new methods have emerged to reduce the risk to individual privacy.

Privacy experts are turning to an advanced approach known as differential privacy. Differential privacy is a method of sharing data by describing the patterns of groups within the data while withholding information about individuals. Differential privacy can be defined as “a constraint on the algorithms used to publish aggregate information about a statistical database which limits the disclosure of private information of records whose details are in the database.”

An algorithm is considered differentially private if an observer seeing its output cannot tell whether a particular subject’s information was used in the computation. Government agencies use differentially private algorithms to publish statistics and other demographic data while keeping survey responses confidential. The data released is precisely controlled, including the data visible to internal analysts.
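
One standard way to build such an algorithm for counting queries is the Laplace mechanism, which adds random noise scaled to a privacy parameter usually called epsilon. The article does not name a specific mechanism, so its use here is an assumption, and the function names, sample data, and seed below are illustrative. This is a teaching sketch only; production systems should rely on a vetted differential privacy library rather than Python's general-purpose random module.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the values that satisfy predicate.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the Laplace noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(1)                      # fixed seed for repeatability
ages = [23, 35, 41, 52, 67, 38, 29, 45]     # made-up data
noisy = private_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng)
print(noisy)  # the true count is 4; the printed value is 4 plus random noise
```

A smaller epsilon means more noise and stronger privacy, while a larger epsilon gives more accurate answers with a weaker guarantee. Because the output would look much the same with or without any one person's record, an observer cannot confidently infer whether that person was in the data.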

Although no data is completely safe from a breach or from re-identification attempts, differential privacy offers stronger guarantees. It was initially developed by cryptographers and draws much of its language and terminology directly from cryptography. Because its protection comes from a mathematical limit on what any output can reveal about one individual, differentially private computations are designed to resist re-identification attacks. Fighting and preventing data breaches and the fraudulent use of personally identifiable information may continue to become harder, and significant advances in computational mathematics may be in our future.