Data Augmentation, ML Algorithms, and New Technology

While machine learning algorithms have enabled the creation of state-of-the-art software products and services in recent years, the datasets needed to build such technologies can be prohibitively expensive and time-consuming to create. Because the accuracy of a machine learning model depends largely on the breadth of the training data used to create it, the ability to leverage large datasets is pivotal to building models that perform both efficiently and accurately. For this reason, many popular technology products that use machine learning algorithms come from major technology companies such as Apple and Microsoft, as these corporations have the money, reach, and data necessary to train their machine learning models accordingly.

Despite this, the practice of data augmentation is one method that smaller companies and businesses can use to train a machine learning algorithm in a more cost-effective fashion. As the name suggests, data augmentation involves making slight alterations to a set of labeled data with the goal of increasing the diversity of the dataset. While these alterations might seem simple and unremarkable from a human perspective, machines do not perceive data with the understanding and nuance of the human mind, and as such, software developers can change the data within a dataset to achieve objectives that would otherwise incur major costs using other techniques or methods.

How does data augmentation work?

To give a basic example of how data augmentation works, consider a dataset that contains 20 images depicting the faces of cats. A software developer looking to augment this data could first create copies of these 20 images and then flip the copies horizontally. To a machine learning algorithm, these flipped copies represent 20 new images, effectively doubling the amount of training data the developer can use to train their model. Flipping a group of images horizontally is just one example of how data augmentation can be implemented; datasets can be altered in a number of other ways.
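A minimal sketch of this flip-and-copy step is shown below, using the Pillow imaging library and assuming the 20 original images sit in a hypothetical cats/ folder (both folder names here are illustrative):

```python
import os
from PIL import Image, ImageOps  # Pillow

SOURCE_DIR = "cats"          # hypothetical folder holding the 20 original images
OUTPUT_DIR = "cats_flipped"  # the horizontally flipped copies go here
os.makedirs(OUTPUT_DIR, exist_ok=True)

for filename in os.listdir(SOURCE_DIR):
    if not filename.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    image = Image.open(os.path.join(SOURCE_DIR, filename))
    # Mirror the image left to right; to the model this counts as a new example,
    # while the label ("cat") stays the same.
    flipped = ImageOps.mirror(image)
    flipped.save(os.path.join(OUTPUT_DIR, "flipped_" + filename))
```

Training on the original folder plus the flipped copies gives the model 40 labeled images instead of 20, at essentially no extra labeling cost.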

In keeping with the example of a software developer working with a group of 20 images depicting the faces of cats, this developer could also crop the images, rather than flipping them horizontally, to double the amount of usable training data. What's more, data augmentation is not limited to manipulating the geometry of an image: adding noise to an image, or zooming in to change its framing, can also serve the same purpose.
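A sketch along the same lines, again assuming a hypothetical Pillow and NumPy setup, shows how random cropping and added noise can generate further variations of each image:

```python
import numpy as np
from PIL import Image

def random_crop(image, crop_frac=0.9):
    """Crop a random region covering crop_frac of each dimension, then resize back."""
    width, height = image.size
    crop_w, crop_h = int(width * crop_frac), int(height * crop_frac)
    left = np.random.randint(0, width - crop_w + 1)
    top = np.random.randint(0, height - crop_h + 1)
    cropped = image.crop((left, top, left + crop_w, top + crop_h))
    return cropped.resize((width, height))  # resizing back mimics a zoomed-in shot

def add_gaussian_noise(image, sigma=10.0):
    """Add mild pixel-level noise; a human still sees the same cat."""
    pixels = np.asarray(image).astype(np.float32)
    noisy = pixels + np.random.normal(0.0, sigma, pixels.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))
```

Because each transformation can be applied to every original image, a 20-image dataset can yield many more labeled examples than the 40 produced by flipping alone.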

With all this being said, data augmentation has very much become a standard practice for software engineers who are creating machine and deep learning algorithms for computer vision applications. This is particularly true for supervised machine learning models, as that approach already depends on labeled training data. On top of this, data augmentation can also be applied to other types of datasets. For instance, a software developer looking to create a speech recognition program could replace some of the nouns, adjectives, and verbs within their training data with synonyms for those words, giving them a way to increase their training data without having to spend additional resources.
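As an illustration of that idea, the sketch below swaps words for synonyms drawn from a tiny hand-written table. Both the table and the swap probability are purely hypothetical; a real system would more likely rely on a lexical resource such as WordNet:

```python
import random

# Illustrative, hand-written synonym table (hypothetical; not a real lexicon).
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "said": ["stated", "remarked"],
    "car": ["automobile", "vehicle"],
}

def synonym_augment(sentence, swap_prob=0.3):
    """Return a new training sentence with some words replaced by synonyms."""
    augmented = []
    for word in sentence.split():
        key = word.lower()
        if key in SYNONYMS and random.random() < swap_prob:
            augmented.append(random.choice(SYNONYMS[key]))
        else:
            augmented.append(word)
    return " ".join(augmented)

# One original transcript line can yield several distinct training examples.
print(synonym_augment("She said the quick car had already left"))
```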

The limitations of data augmentation

In spite of the flexibility that the data augmentation approach provides to software developers and engineers, the inherent nature of machine learning means there are certain limitations to such techniques. To illustrate this point, consider a software developer looking to create a machine learning algorithm that can automatically identify checks. Because checks within the U.S. must conform to standards established by the U.S. banking and financial industries, flipping a group of check images vertically would not be an efficient way to train the algorithm, as real checks are formatted horizontally in almost all cases.

Likewise, the data augmentation approach cannot address the innate bias that may be contained within a particular dataset. For example, consider a machine learning model created to help a law firm hire new employees. If the dataset used to train the model contained an overwhelming number of names associated with men, such as Mike or Mark, augmenting the data with similar names such as Maxwell or Miguel would not change the fact that very few names associated with women appear in the dataset. For this reason, data augmentation is best used to enhance the performance of a machine learning algorithm that has already been trained on a diverse set of data.

When used correctly, data augmentation can greatly improve the ability of a machine learning algorithm to detect and identify objects, words, or other information across a wide range of mediums. It is also a technique that can lower the barrier to entry into the fields of deep learning, machine learning, artificial intelligence, and computer vision, as augmenting data is not as expensive or tedious as manually labeling thousands of individual images within a dataset. Finally, while data augmentation is just one method that software engineers can use to extract additional value from the datasets they are working with, new methods will surely be developed in the near future.
