Data Pre-Processing in AI and ML: Quality Over Quantity

While data in the context of consumerism and privacy is an expansive and far-reaching term, the data that software engineers use to build machine learning and deep learning models must be fine-tuned and refined. Because machines lack the inferential faculties of the human mind, a developer who trains a model on a particular dataset must ensure that the data is as accurate, precise, diverse, and relevant to the task at hand as possible. In this context, data pre-processing refers to the practice of encoding or transforming a dataset so that a machine can parse it more easily. Given the prohibitive costs associated with training machine learning models, data pre-processing is crucial: it is far cheaper to fix a dataset up front than to discover its flaws after an expensive training run.

To illustrate this point further, while multinational technology companies such as Amazon, Google, and Microsoft have the resources and facilities to create their own large-scale datasets, the average team of software developers will instead need to rely on free datasets that have already been created. These datasets are often labeled or categorized by ordinary people and, as such, frequently contain missing, inconsistent, or inaccurate entries. For these reasons, a team of software engineers would first need to process the data obtained from a publicly available dataset before feeding it into a machine learning model for training. This pre-processing is generally implemented in four main stages.

Data cleaning

The first stage in data pre-processing is data cleaning. As the name implies, data cleaning involves a basic refinement of the data contained within a dataset: filling in missing values, correcting erroneous ones, removing major outliers, and resolving any other inconsistencies that arise during the process. Data cleaning can be accomplished in a number of ways depending on the scope and scale of the data at hand. For example, a developer training a machine learning algorithm to identify financial details within a dataset of 100 checks could fill in any missing or incorrect values by hand. For a dataset comprising thousands of images, sounds, or text records, however, manual correction would be an extremely arduous task, so an automated data mining technique such as regression-based imputation, which predicts each missing value from the other attributes of a record, can be used instead.
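To make this concrete, here is a minimal sketch using pandas and scikit-learn's IterativeImputer, a regression-based imputer; the check records and column names are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical check records: one missing amount and one obvious outlier.
checks = pd.DataFrame({
    "amount":  [120.00, np.nan, 87.50, 95000.00, 110.25, 101.00],
    "n_items": [3, 2, 1, 4, 2, 2],
})

# 1. Drop gross outliers first, using the interquartile-range (IQR) rule
#    on the observed values, so they cannot skew the imputation model.
q1, q3 = checks["amount"].quantile([0.25, 0.75])
fence = 1.5 * (q3 - q1)
in_range = checks["amount"].isna() | checks["amount"].between(q1 - fence, q3 + fence)
checks = checks[in_range].copy()

# 2. Regression-based imputation: each column with gaps is modeled as a
#    function of the others, so missing values are predicted rather than guessed.
imputer = IterativeImputer(random_state=0)
checks[["amount", "n_items"]] = imputer.fit_transform(checks)

print(checks)
```

Note the ordering: the outlier is removed before the imputer is fitted, since an extreme value would otherwise skew the regression used to predict the missing entries.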

Data integration

The second stage in data pre-processing is data integration. Data integration involves merging data from multiple sources into a single data store. For example, a software developer creating an AI algorithm for use in medical imaging software would likely need to integrate images obtained from multiple sources, given the fragmented nature of healthcare records. During this integration process, the developer might also remove any redundant data, such as exact copies of medical images, as well as any mismatched or out-of-place records, in an effort to refine the dataset even further. On top of this, a dataset may contain data value conflicts that must be resolved before the data can be used to train an AI model, such as two date formats that order the month and day differently, i.e. "MM/DD/YYYY" versus "DD/MM/YYYY".
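As a rough sketch, the snippet below merges two hypothetical clinic logs with pandas, normalizing each source's date convention before concatenating them and dropping the exact duplicate that results; all identifiers and dates are invented:

```python
import pandas as pd

# Two hypothetical scan logs that use different date conventions
# and share one duplicated record (A-002).
clinic_a = pd.DataFrame({
    "scan_id": ["A-001", "A-002"],
    "scan_date": ["03/14/2024", "04/02/2024"],  # MM/DD/YYYY
})
clinic_b = pd.DataFrame({
    "scan_id": ["B-001", "A-002"],
    "scan_date": ["14/03/2024", "02/04/2024"],  # DD/MM/YYYY
})

# Resolve the date-format conflict before merging.
clinic_a["scan_date"] = pd.to_datetime(clinic_a["scan_date"], format="%m/%d/%Y")
clinic_b["scan_date"] = pd.to_datetime(clinic_b["scan_date"], format="%d/%m/%Y")

# Merge into a single store, then drop exact duplicates.
merged = pd.concat([clinic_a, clinic_b], ignore_index=True).drop_duplicates()
print(merged)
```

Normalizing the formats first is what allows the duplicate to be detected at all: until both sources agree on a date convention, the two copies of A-002 look like different records.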

Data transformation

The third stage in data pre-processing is data transformation. During this stage, all of the data that has been cleaned and integrated is converted into a cohesive format that can be used to analyze it. Data transformation can be implemented in a number of different ways. For example, data aggregation can be used to summarize data obtained from multiple sources, such as rolling precinct-level voter turnout figures from a particular U.S. state up to county or state totals during a presidential election. Conversely, generalization replaces specific values with coarser ones, as when the address of a street within a major metropolitan area is transformed into data about the country in which that area is located.
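The sketch below illustrates both techniques with pandas on invented data: a groupby aggregation that rolls precinct-level turnout up to the county level, and a simple generalization that maps each city up to its country:

```python
import pandas as pd

# Hypothetical precinct-level turnout records.
turnout = pd.DataFrame({
    "county":   ["Adams", "Adams", "Blaine", "Blaine"],
    "precinct": [1, 2, 1, 2],
    "votes":    [1200, 950, 800, 1100],
})

# Aggregation: summarize precinct counts at the county level.
by_county = turnout.groupby("county", as_index=False)["votes"].sum()
print(by_county)

# Generalization: replace a specific attribute with a coarser one,
# here climbing the location hierarchy from city to country.
addresses = pd.DataFrame({
    "street": ["350 5th Ave", "1 Yonge St"],
    "city":   ["New York", "Toronto"],
})
city_to_country = {"New York": "United States", "Toronto": "Canada"}
addresses["country"] = addresses["city"].map(city_to_country)
print(addresses[["country"]])
```

Both operations trade detail for uniformity, which is usually exactly what downstream analysis needs.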

Data reduction

The final stage in data pre-processing is data reduction. Because some datasets remain extremely hard to analyze even after the data has been cleaned, integrated, and transformed, the data reduction stage is used to formulate a reduced representation of a dataset that preserves the same analytical quality in a smaller volume. For example, a software developer refining a dataset for use in a text analysis program could remove superfluous words from the set, as the flexible nature of human communication means that ideas and concepts do not have to be expressed using a set number of words or phrases. As such, the data reduction stage makes the data more tractable while simultaneously cutting down on the amount of storage needed to retain it.
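As a minimal illustration of reduction for text, the sketch below strips common "filler" words from a sentence; the stop-word list here is a tiny invented sample rather than a standard one:

```python
# A tiny, invented stop-word list; real pipelines use larger standard lists.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "that"}

def reduce_text(sentence: str) -> list[str]:
    """Lower-case, tokenize on whitespace, and drop stop words."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

raw = "The goal of the reduction stage is to keep the signal and drop the rest"
print(reduce_text(raw))
# ['goal', 'reduction', 'stage', 'keep', 'signal', 'drop', 'rest']
```

The reduced token list carries the same analytical signal for many tasks while occupying roughly half the storage of the original sentence.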

It cannot be overstated that the effectiveness of a machine learning or artificial intelligence model depends largely on the quality of the data used to train it. While human beings can learn things on the fly, make inferences, and ask other people questions should any complications occur, machines obviously do not have the same luxuries. To this point, software developers must be rigorous when preparing the datasets they will use to train their respective algorithms. Without data pre-processing, many of the landmark advancements made in the fields of computer vision, artificial intelligence, and machine learning during the past 20 years would have been far more difficult to achieve.