Re-Identification: Manage Your Risks
Data Privacy and Anonymization
Data privacy impacts everyone. It makes no difference if you are an individual, a small business, a large enterprise, or even a government agency. Many regulations must be followed when handling data. Protecting a person’s private or personally identifiable information, or even trade secrets, is a complex process. Loss of data can be very costly. As an individual, you could be impacted by identity theft. As a company, you could lose your customer base and go out of business. Even governments must comply; they are not immune to lawsuits and penalties for loss of data.
Data should go through a sanitization process before it is stored. This could be redaction, encryption, or data anonymization. Sanitization simply means removing or altering the sensitive data so that, if a data breach occurs, the loss to the company and the individual is minimized.
One type of information sanitization that is commonly used is called data anonymization. This process intends to protect privacy. The process involves removing personally identifiable information from data sets so that the individuals remain anonymous.
The method of data anonymization is defined as a “process by which personal data is irreversibly altered in such a way that a data subject can no longer be identified directly or indirectly, either by the data controller alone or in collaboration with any other party.”
Once data has been through the anonymization process, the data set can be transferred across boundaries, such as between two departments, two agencies, or even other companies. Using anonymized data sets reduces the risk of unintended data disclosures.
When handling certain types of data, legal restrictions may dictate how the anonymization process is carried out. For instance, HIPAA, the Health Insurance Portability and Accountability Act, requires that the patient cannot be identified when working with medical data. Any personally identifiable data must be removed. This includes name, address, zip code, birth date, or any other data that could be used to identify the patient.
There are five methods of data anonymization:
- Generalization – A technique that replaces values with generic data.
- Suppression – A technique that removes specific values from the data set and replaces them with a specified placeholder such as ‘*’.
- Anatomization – A process that disassociates the relationship between quasi-identifiers (such as a zip code) and sensitive attributes (such as a medical diagnosis).
- Permutation – An approach that disassociates the relationship between a quasi-identifier and a sensitive attribute by dividing the data records into groups and shuffling the sensitive values within each group.
- Perturbation – An action that replaces the original values with new ones by interchanging, adding noise, or creating synthetic data.
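The five methods above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the sample records, the field names, and the helper functions (generalize_age, suppress, anatomize, permute, perturb) are all hypothetical and not taken from any particular anonymization tool.

```python
import random

def generalize_age(age, width=10):
    """Generalization: replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def suppress(record, fields):
    """Suppression: replace the named fields with the placeholder '*'."""
    return {k: ("*" if k in fields else v) for k, v in record.items()}

def anatomize(records, quasi, sensitive, group_size, rng):
    """Anatomization: publish quasi-identifiers and sensitive attributes
    as two separate tables that share only a group id."""
    qi_table, sens_table = [], []
    for i, r in enumerate(records):
        gid = i // group_size
        qi_table.append({"group": gid, **{k: r[k] for k in quasi}})
        sens_table.append({"group": gid, **{k: r[k] for k in sensitive}})
    rng.shuffle(sens_table)  # row order must not reveal the original pairing
    return qi_table, sens_table

def permute(records, field, group_size, rng):
    """Permutation: shuffle one sensitive field within each group of records."""
    out = [dict(r) for r in records]  # copy so the input is left untouched
    for start in range(0, len(out), group_size):
        group = out[start:start + group_size]
        values = [r[field] for r in group]
        rng.shuffle(values)
        for r, v in zip(group, values):
            r[field] = v
    return out

def perturb(value, rng, max_noise=2):
    """Perturbation: add bounded random integer noise to a numeric value."""
    return value + rng.randint(-max_noise, max_noise)

# Hypothetical sample data, invented for this sketch.
records = [
    {"name": "Alice", "zip": "90210", "age": 34, "diagnosis": "flu"},
    {"name": "Bob", "zip": "90211", "age": 37, "diagnosis": "asthma"},
    {"name": "Carol", "zip": "10001", "age": 52, "diagnosis": "diabetes"},
    {"name": "Dan", "zip": "10002", "age": 58, "diagnosis": "flu"},
]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

print(generalize_age(34))  # 30-39
print(suppress(records[0], {"name"}))
print(anatomize(records, ["zip"], ["diagnosis"], 2, rng))
print(permute(records, "diagnosis", 2, rng))
print(perturb(34, rng))
```

In practice these techniques are usually combined, and the group size and noise level are tuned against how useful the data must remain.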
This process is far from perfect. There are always risks. When sharing data, there is a possibility that anonymized data sets may not stay unidentified. When comparing multiple anonymized data sets, smart algorithms can match data so that it becomes de-anonymized. The data set is no longer secret.
Re-Identification
Re-identification, also known as de-anonymization, is the practice of cross-referencing data sets to re-identify the individual subjects. It is a problem for many reasons, because data loss always reflects negatively on whoever held the data. The cost can come as fines, penalties, and lost consumer trust, so the overall goal is to prevent it.
Re-identification matches anonymous data sets with other available data sets, public information, and auxiliary data, much like assembling a puzzle, to discover to whom the data belongs. Many companies have strict privacy policies; some have legal obligations under specific laws to keep consumer and patient data confidential. When you add smart algorithms, artificial intelligence, and machine learning to the mix, the matching becomes automated and can run while you sleep.
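To make the matching concrete, here is a minimal sketch of a linkage between an "anonymized" data set and a public one. All of the data, the field names, and the link function are hypothetical and invented for illustration. The sketch reports a match only when a combination of quasi-identifiers is unique in both data sets, since an ambiguous combination does not pin a record to one person.

```python
from collections import Counter

# Hypothetical "anonymized" records: names removed, quasi-identifiers kept.
anonymized = [
    {"zip": "02138", "birth_year": 1965, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1980, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1980, "sex": "M", "diagnosis": "flu"},
]

# Hypothetical public records, such as a directory or registry.
public = [
    {"name": "Jane Roe", "zip": "02138", "birth_year": 1965, "sex": "F"},
    {"name": "John Doe", "zip": "02139", "birth_year": 1980, "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

def link(anonymized, public, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs for rows whose quasi-identifier
    combination is unique in both data sets."""
    def key_of(row):
        return tuple(row[k] for k in keys)

    anon_counts = Counter(key_of(r) for r in anonymized)
    names_by_key = {}
    for person in public:
        names_by_key.setdefault(key_of(person), []).append(person["name"])

    matches = []
    for row in anonymized:
        k = key_of(row)
        if anon_counts[k] == 1 and len(names_by_key.get(k, [])) == 1:
            matches.append((names_by_key[k][0], row["diagnosis"]))
    return matches

print(link(anonymized, public))  # [('Jane Roe', 'asthma')]
```

The second person is not re-identified here only because two anonymized rows share his quasi-identifiers. Generalizing or suppressing those fields until no combination is unique is what blocks this kind of match.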
The US Department of Health and Human Services has noted that “re-identification is becoming gradually easier because of ‘big data’ – the abundance and constant collection and analysis of information along with the evolution of technologies and the advances of algorithms.” Some believe that de-identification is a viable solution to privacy protection. Others disagree. Either way, data breaches do occur, and companies are, in the end, responsible for the data they collect and what happens to that data. Anonymization may be adequate for retaining specific data sets for in-house research, but the surest way to remove identifying information is redaction.
Options for Preventing Data Loss
The concern for privacy and data loss has become one of the foremost issues facing corporations and governments today. Many laws and regulations affect how an agency handles personal data. Some rules apply globally if your business intends to sell to residents of the jurisdiction that issued them. Others may be local or even part of a contractual service agreement.
Consumer privacy allows for the trust that mediates the flow of transactions that help the economy, its industries, and its citizens. Its importance to our daily lives has brought the Federal Trade Commission (FTC) to issue its statements and recommendations. “There is significant evidence demonstrating that technological advances and the ability to combine disparate pieces of data can lead to the identification of a consumer, computer, or device even if the individual pieces of data do not constitute personally identifiable information (PII). Moreover, not only is it possible to re-identify non-PII data through various means, businesses have strong incentives to actually do so.”
The FTC has developed a privacy framework or guidelines for organizations and businesses to follow. It asks companies to implement three specific yet significant protections for all data sets to minimize risks. The established guidelines have become a best practice list that is widely accepted.
- The organization or business must “take reasonable measures to ensure that the data is de-identified. This means that the organization must achieve a reasonable level of justified confidence that the data cannot reasonably be used to infer information about, or otherwise be linked to, a particular individual, computer, or other devices.”
- To develop accountability, companies must “publicly commit” to maintaining and using only anonymized data, and to not attempting to re-identify it.
- Businesses that share or sell data to third parties should provide only anonymized data, under a contractual prohibition against attempting to re-identify it.
HIPAA is just one of many privacy laws. It specifies two mechanisms for assessing whether electronic health data meets its privacy standards. One of them, the Safe Harbor standard, states that health records can be considered de-identified if they contain none of the 18 specified direct or indirect (quasi) identifiers. It lists the following types of data that should be removed:
- Name
- Address – This includes “all geographic subdivisions smaller than a state, including street address, city, county, and zip code.”
- Dates that relate to an individual (except years) – This includes “birthdate, admission date, discharge date, date of death, and exact age if over 89.”
- Telephone Numbers
- Fax Number
- Email Address
- Social Security Number
- Medical Identification Number
- Health Plan Beneficiary Number
- Account Number
- Certificate or License Number
- Vehicle Identifiers – This can include serial numbers (VINs) and license plate numbers.
- Device Identifiers or Serial Numbers
- Web URL
- Internet Protocol (IP) Address
- Fingerprint or Voice
- Photographic Image – This is not limited to facial images.
- Any other characteristic that could be used to identify an individual uniquely.
Experts have determined that data sets that have these 18 identifiers removed have a minimal chance of being re-identified.
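A field-based pass over a record, in the spirit of the list above, might look like the sketch below. It is a rough illustration rather than a compliance tool. The field names, the two field sets, and the strip_identifiers function are hypothetical, dates are assumed to be ISO strings such as "1965-07-04", and identifiers hidden in free text or covered by the final catch-all item are not handled.

```python
from datetime import date

# Hypothetical field names standing in for the identifier types listed above.
IDENTIFIER_FIELDS = {
    "name", "street_address", "city", "county", "zip", "phone", "fax",
    "email", "ssn", "medical_id", "health_plan_id", "account_number",
    "license_number", "vehicle_id", "device_id", "url", "ip_address",
    "fingerprint", "voice_print", "photo",
}

# Dates tied to an individual: only the year is kept.
DATE_FIELDS = {"birth_date", "admission_date", "discharge_date", "death_date"}

def strip_identifiers(record):
    """Return a copy of the record with identifier fields dropped, dates
    reduced to the year, and exact ages over 89 replaced by '90+'."""
    cleaned = {}
    for field, value in record.items():
        if field in IDENTIFIER_FIELDS:
            continue  # drop the identifier entirely
        if field in DATE_FIELDS:
            # Assumes an ISO-formatted date string; keep the year only.
            year_field = field.replace("_date", "_year")
            cleaned[year_field] = date.fromisoformat(value).year
        elif field == "age":
            cleaned["age"] = "90+" if value > 89 else value
        else:
            cleaned[field] = value
    return cleaned

patient = {
    "name": "Jane Roe",
    "zip": "02138",
    "state": "MA",
    "birth_date": "1965-07-04",
    "age": 58,
    "diagnosis": "asthma",
}
print(strip_identifiers(patient))
# {'state': 'MA', 'birth_year': 1965, 'age': 58, 'diagnosis': 'asthma'}
```

A list of field names like this only works when the schema is known in advance, so notes, attachments, and other unstructured fields would need separate review.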
These are two sets of federal-level recommendations for reducing the risk of having your data sets re-identified. Companies must juggle compliance with many laws and regulations around the globe, along with a variety of rules aimed specifically at preventing re-identification. It can get complicated and expensive to meet the standards of every law and regulation. Failure to comply, though, could be costly, so what do you do?
To minimize costs, labor, and stress, set your goals high. Work with privacy experts to understand the legislation your company must comply with and how to demonstrate compliance. Instead of trying to meet each standard individually, go a step higher. Take the most stringent regulation as your benchmark. If it exceeds the requirements of the other rules, work towards meeting that one goal. There is no failure in exceeding expectations.
This approach can also work when you handle your data and consider anonymization processes. Take several recommended best practices and work to exceed them. To remove the stress of wondering if you and your company did “enough” to protect your data or that of your consumers, go further. You will see that in the end, it will save time, money, and headaches.