How to Redact Documents Properly

At some point you will need to share documents, whether for business or personal reasons, and that may be the moment you realize, “I should probably remove some of this information before sending this out!” Sensitive information in the wrong hands can lead to identity theft, criminal activity, and many other security issues. This is why document redaction is a highly recommended way for individuals and organizations of any size to secure their important and sensitive data.

Why Redact Documents

Document redaction is the task of removing confidential information from a document without compromising the rest of its content. It should be no surprise that document redaction is widely used and frequently requested. An early relative of the practice can be traced back to the 13th century, when a 10th-century manuscript was palimpsested, meaning its text was replaced because parchment was scarce. Even if redaction didn’t exist today, you would have discovered it yourself. Think of passing notes in class: when you realized you had shared a bit too much, you scratched that portion out to make sure it was unreadable, right? You just redacted some sensitive information.

Of course, at a much more consequential scale, there are serious repercussions if you don’t redact sensitive data properly. Take, for example, the document redaction mishap that occurred with Paul Manafort and his legal team in 2019. Although the publicly displayed document looked redacted, it was redacted poorly.

For example, highlight the following sentence with your mouse and you will be able to see that though it looks redacted, the sensitive information is available for anyone to view. The simple layer of “redaction” that is applied is not enough to secure data.

Health Record – Patient: John Smith Date of Birth: 03/01/1950.

People were able to use document viewing tools to see what was underneath the black bars covering up the sensitive information. Needless to say, the legal team realized the redaction mistake too late, causing an uproar in the press.

There are laws that help you obtain the information you have a right to receive.

Laws and regulations such as the Freedom of Information Act make redaction necessary and lay out specific steps and procedures that must be followed. One of the most recent examples circulating in the media comes from the Arizona state legislature. Pending legislation would amend Arizona’s revised statutes on public records requests to include specific requirements for the government to follow when complying with a request.

As of February 27th, 2023, the pending legislation includes House Bill 2808, which would be a significant amendment to Arizona’s existing Public Records Law. Agencies may be required to fulfill a public records request (or deny it with a categorized list of reasons for the denial) within 5 days of the request, or potentially face a civil fine of up to $5,000.

The Process of Document Redaction

If you’ve used one manual document redaction tool before, you will more than likely know how to use the manual redaction feature in almost every other tool out there. The saving grace of modern document redaction is its AI capabilities. Some redaction software offers an automatic AI redaction feature that runs through the entire document and pinpoints all of the Personally Identifiable Information (PII) to redact. To the end user, it’s just the click of a button, but have you ever wondered how the software is able to read the document and know that the highlighted words are actually PII?

To Start Redacting

To begin, it’s important to understand that not every document needing redaction is text-searchable from the start. If you receive hundreds of scanned pages that contain PII, you won’t be able to press Ctrl+F and search for the word you see in front of you. The machine sees the scanned PII as an image, not text. To fix that, you first have to run OCR on the documents.


Optical character recognition, known as OCR, is the process of converting scanned texts into something that is machine-readable. It is a descendant of the optophone, a machine from the early 1900s that worked to assist the blind in reading by using different tones based on dark and light spaces detected on paper.

Today’s OCR programs work in a similar spirit, going character by character to detect and recognize the written information in the scanned image and transform it into text. An OCR program first has to “pre-process,” or clean up, the image of the text by realigning the letters, resizing them, and removing unnecessary marks so it can attempt to “read” the letters. The transformation itself usually relies on different kinds of algorithms, two of them being pattern recognition and feature extraction.
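To make the clean-up step more concrete, here is a minimal sketch of two simple pre-processing operations: turning a grayscale scan into black-and-white, and removing stray marks. It assumes the image is a small grid of brightness values from 0 (black) to 255 (white). The function names and the threshold of 128 are illustrative assumptions, not taken from any particular OCR product.

```python
def binarize(gray, threshold=128):
    """Turn a grayscale grid (0-255) into 1 for ink and 0 for background.

    The threshold value is an assumption; real tools often pick it per image.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def remove_specks(binary):
    """Clear ink pixels that have no neighbouring ink (likely scanner noise)."""
    height, width = len(binary), len(binary[0])
    cleaned = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if binary[y][x] == 0:
                continue
            # Count ink pixels in the 8 cells surrounding (y, x).
            neighbours = sum(
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                cleaned[y][x] = 0
    return cleaned


scan = [
    [255, 255, 255, 255],
    [255,   0,   0, 255],  # two dark pixels next to each other: part of a letter
    [255, 255, 255, 255],
    [255, 255, 255,  10],  # one isolated dark pixel: a stray mark
]
clean = remove_specks(binarize(scan))
```

Real pre-processing also straightens and resizes the text, which this sketch leaves out.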

With pattern recognition, the method is to look at each individual character and compare it to the character library the program already has stored. Once it has searched through the many fonts and sizes of each letter and found a match, it can label the character appropriately.
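The comparison against a stored character library can be sketched in a few lines. This is a toy illustration, assuming each character has already been reduced to a tiny 3x3 black-and-white grid. The `TEMPLATES` library and the function names are hypothetical, and a real library would hold many fonts and sizes per letter.

```python
# Hypothetical character library: each letter is a 3x3 grid, "1" meaning ink.
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "T": ["111", "010", "010"],
}


def match_score(glyph, template):
    """Count how many pixels agree between the glyph and a template."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the label of the template that agrees with the glyph the most."""
    return max(TEMPLATES, key=lambda label: match_score(glyph, TEMPLATES[label]))


# An "L" with one pixel missing still matches "L" better than the other letters.
noisy_l = ["100", "100", "110"]
label = recognize(noisy_l)
```

Picking the highest score rather than demanding an exact match is what lets this approach tolerate small scanning defects.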

Feature extraction methods rely on the specific lines in a character, their direction, and how those lines interact to conclude what the character could be. Machines are able to OCR handwritten documents as well, which is almost as impressive as a pharmacist being able to read your doctor’s handwriting on a prescription. Machines process handwritten text using different models, such as a sequence model called the Hidden Markov Model.

Once the cleanup is done, the next step is to get as much detail as possible. Words are sometimes divided up so the machine can scan small portions in different directions and be as specific as possible when extracting details. This scanning is done by a model known as a Multi-dimensional Recurrent Neural Network, which repeats the step to build on its previous findings and ultimately produces an output based on multiple layers of information.

After the layers of information are gathered, they are passed on to Connectionist Temporal Classification (CTC), an algorithm that uses spacing, position, and probability to conclude which letter each part of the input represents. Finally, the text is extracted and searchable, ending the text recognition pipeline. This is pretty impressive, especially considering that it can be done for writing in multiple languages.
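The way CTC output is turned into text can be illustrated with its simplest decoding strategy, often called greedy decoding. This is a sketch under assumptions: the network is taken to output one list of probabilities per slice of the image, the alphabet here is a made-up three-symbol one, and real systems often use more elaborate decoding. The idea is to pick the most likely symbol for each slice, merge repeated symbols, and drop the special "blank" symbol that marks spacing between letters.

```python
BLANK = "-"  # special CTC symbol meaning "no letter in this slice"


def ctc_greedy_decode(frames, alphabet):
    """Decode per-slice probabilities into text.

    frames:   one list of probabilities per image slice
    alphabet: the symbol for each probability position, including BLANK
    """
    # Step 1: keep only the most probable symbol for each slice.
    best = [
        alphabet[max(range(len(probs)), key=lambda i: probs[i])]
        for probs in frames
    ]
    # Step 2: merge consecutive repeats, then drop blanks.
    decoded = []
    previous = None
    for symbol in best:
        if symbol != previous and symbol != BLANK:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


alphabet = [BLANK, "a", "l"]
frames = [
    [0.10, 0.80, 0.10],  # "a"
    [0.10, 0.10, 0.80],  # "l"
    [0.20, 0.10, 0.70],  # "l" again: same letter spread over two slices
    [0.90, 0.05, 0.05],  # blank separates the two letters "l"
    [0.10, 0.20, 0.70],  # "l"
]
text = ctc_greedy_decode(frames, alphabet)
```

The blank symbol is what allows a genuinely doubled letter, as in "all", to survive the merging of repeats.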

Identifying PII

Of course, your document does not always need to be “OCRed.” If it is already machine-readable, the process of redaction begins with your program finding and detecting PII.

For PII recognition specifically, there are multiple techniques and algorithms that can get the job done, such as rule-based matching, machine learning, and natural language processing. Rule-based matching is self-explanatory: it uses a set of preselected rules based on patterns to indicate whether something is PII. For example, we can’t guess what “123456789” may represent, but if it’s written as “123-45-6789”, we can predict that it is a Social Security number.
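The Social Security number example above can be written as a single rule. This is a minimal sketch using a regular expression. The pattern only checks the 3-2-4 digit shape with hyphens, and the names `SSN_PATTERN` and `find_ssn_candidates` are illustrative.

```python
import re

# Rule: three digits, hyphen, two digits, hyphen, four digits.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_candidates(text):
    """Return (start, end, match) for every SSN-shaped string in the text."""
    return [(m.start(), m.end(), m.group()) for m in SSN_PATTERN.finditer(text)]


hits = find_ssn_candidates("SSN: 123-45-6789, id 123456789")
```

Here only the hyphenated number is reported. The unformatted “123456789” is skipped, which shows both the precision of a rule and how easily it misses PII written in an unexpected way.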

This method can be accurate, but it can only identify the specific types of PII its rules cover. It is also less likely to identify PII that is written incorrectly, which is likely to happen, so it is best to use this method alongside a machine-learning technique.

There is a pattern when it comes to software using machine learning: its performance is a direct result of how the machine learning portion was trained. For the software to do anything, especially anything human-like, it has to be trained on a vast amount of data and through a powerful pipeline. The training can use labeled data sets, so the model has specific examples to learn from, or it can use clustering algorithms to self-teach. Regardless of the method, it allows more PII to be recognized, even when the PII is informally written or contains minor errors, because the model was trained to handle those cases.

The final method mentioned is natural language processing (NLP), another advanced method that uses the context of a sentence to anticipate PII with a pre-trained language model like BERT (Bidirectional Encoder Representations from Transformers). With this method, PII can be extracted simply because the text before it is “You can contact me at…”. Recognizing PII while also using the surrounding text to improve predictions reduces false positives and helps ensure proper detection of the PII in your documents.
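To show the idea of using context, here is a deliberately simplified stand-in. It is not a language model and does not use BERT. It is a hand-written heuristic that flags whatever follows a few cue phrases, and the cue list and function name are assumptions made up for illustration. A trained model learns such cues from data instead of having them listed by hand.

```python
import re

# Hypothetical cue phrases that often introduce contact details.
CUE_PATTERN = re.compile(r"(?:contact|call|reach) me at\s+(\S+)", re.IGNORECASE)


def find_pii_after_cue(text):
    """Return the word that follows each cue phrase, without trailing punctuation.

    Limitation: only a single word is captured, so a phone number written
    with spaces would be cut short.
    """
    return [m.group(1).rstrip(".,;!?") for m in CUE_PATTERN.finditer(text)]


found = find_pii_after_cue("You can contact me at jane@example.com. Thanks!")
```

Notice that the rule never looks at what the flagged word looks like. It is flagged purely because of the words in front of it, which is the intuition behind context-based detection.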

Let’s Redact

Now that all the PII is detected, the most important step is to redact it. Redacting documents means obscuring information to the point where it is unrecognizable while preserving the meaning of the rest of the sentence. This involves blacking out or removing certain words or phrases. A good redaction tool detects the words you are looking to redact and erases the information before adding the layer of solid color that shows it has been blocked off. This way, if the block of color is edited off the page, or a mischievous individual uses tools to change the shading to see what’s behind it, they will only find a blank space. You have now gone through the process of OCR, text detection, text recognition, and finally, text redaction!
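The difference between covering text and erasing it can be shown on a plain string. This is a minimal sketch, assuming the detection step has already produced the character positions (start, end) of each piece of PII. The `redact` function is illustrative and works on plain text only, whereas real tools must also remove the data from the file format itself (for example, the text layer of a PDF).

```python
def redact(text, spans, mask="\u2588"):
    """Replace each (start, end) span with mask characters.

    The original characters are removed from the output entirely,
    so there is nothing left underneath the mask to recover.
    The default mask is the full block character.
    """
    parts = []
    cursor = 0
    for start, end in sorted(spans):
        start = max(start, cursor)  # skip any part already redacted
        if start >= end:
            continue
        parts.append(text[cursor:start])
        parts.append(mask * (end - start))
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


record = "Patient: John Smith"
safe = redact(record, [(9, 19)])
```

Because the output string no longer contains the name at all, highlighting or copying it reveals only the mask characters.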

The Different Features in Document Redaction

While automating your document redactions is helpful, there are also different features that further help improve your document redaction experience. Here is a list of some of the more common features for document redactions.

Document redaction is an unavoidable and critical part of ensuring the security and confidentiality of our private data. It is important for individuals and businesses to prioritize proper document redaction, not only to avoid negative consequences but also to consistently protect the people who would be affected by potential threats.
