Image Data Sets, Technology, and Newfound Privacy Concerns

October 27, 2022

Machine learning and artificial intelligence have allowed software engineers and technology companies to create products and services that have changed how doctors view medical images, how newspaper websites generate paywalls, and how fast food companies cook their food. This technology has also ushered in a new wave of personal privacy concerns. The algorithms that power the world’s most prominent technology products and services must be trained on large sets of data, often including personal information, be it in the form of words and sentences, photographs, or mathematical equations, among other things.

While some multinational technology companies have the resources and staff necessary to create a new dataset from scratch, many software developers instead have to rely on data that is already available via the internet. As a result, a person’s personally identifiable data could be present within a dataset without their knowledge, as the U.S. has yet to enact comprehensive federal privacy legislation that protects this information. To illustrate this point, an article published by Vice Magazine last month reported that “a user found a medical image in the LAION dataset, which was used to train Stable Diffusion and Google’s Imagen.”

The LAION dataset

For reference, the LAION dataset is described as an entirely open and freely accessible dataset that was “built for research purposes to enable testing model training on larger scale for broad researcher and other interested communities, and is not meant for any real-world production or application.” Nevertheless, neither this public disclaimer nor the billions of dollars available to multinational technology company Google prevented the company from using images contained within the LAION dataset to train its AI tool Imagen.

Described as “a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding”, Google Imagen was not created with public consumption in mind. Even so, the data used to train such models can include images of private individuals. According to Vice, “On the LAION Discord channel, a user expressed their concern for their friend, who found herself in the dataset through Have I Been Trained, a site that allows people to search the dataset.” For context, the friend in question had submitted her photograph to a doctor 10 years prior “as part of clinical documentation and shared written proof that she only gave consent to her doctor to have the image, not share it.”

LAION’s response

When the user who raised these concerns on the LAION Discord channel began inquiring about having their friend’s images removed from the massive dataset, Romain Beaumont, one of the leading developers involved in the creation of the dataset, as well as a Google employee, responded by asserting that “The best way to remove an image from the internet is to ask for the hosting website to stop hosting it. We are not hosting any of these images.” While this response may be correct from a technical standpoint, it sidesteps the fact that the personal privacy of an individual was violated without repercussion.

What’s more, when reporters for Vice Magazine asked their own questions regarding LAION’s policy for removing personally identifiable information from its public dataset, a spokesperson for the organization stated that “We would honestly be very happy to hear from them e.g. via [email protected] or our Discord server. We are very actively working on an improved system for handling takedown request.” In other words, LAION does not yet have a concrete process that people can follow should their personal information be included within its dataset without their consent, which sits uneasily with the disclaimer that is visible on the website for the dataset.

LAION and data privacy concerns

On top of the medical images that were included in LAION without the subject’s consent, there have also been reports that various other non-consensual images were present within the dataset. While the developers who worked to curate LAION have not explicitly broken any laws, the entire fiasco highlights the minimal amount of privacy that consumers within the U.S. have with regard to their personal information. A person who discovers that their personal information has been compromised for any reason has few avenues for recourse aside from filing a lawsuit against the alleged perpetrator.

The advancements that have been made in the fields of artificial intelligence and machine learning in the past decade must be weighed against the impact that this technology has on individuals. The case of LAION is just one example of a team of software developers failing to do their due diligence when creating a massive dataset containing personal information, and there have been many other similar cases that have not been as widely publicized. For this reason, the U.S. federal government will have to step in and enact legislation that can serve to protect the personal data of American citizens.