Text Augmentation Techniques: A New Portal of Possibilities
In an era where machines can, to an extent, ‘see’ and ‘understand’, applying data and text augmentation techniques to boost a model’s performance has become essential.
Data augmentation, widely used in computer vision, refers to generating augmented images so that a model learns to classify an image correctly even when it is noisy or cropped. For example, while working with images, new augmented images can be created from existing ones through operations such as flipping, rotation, translation, and scaling up or down.
However, applying augmentation in NLP (Natural Language Processing) is quite complex and challenging, because changing a single word may change the meaning of the entire sentence. Also, not every word has a synonym. Hence its scope is quite limited when it comes to implementation.
Nevertheless, text augmentation is both possible and in active use. The following are some commonly used techniques for generating new data for model training.
Shuffling
This is one of the simplest methods of text data augmentation: the order of words in a sentence is shuffled to create a new sentence. Note, however, that this only works when the classification algorithm is not sensitive to word order.
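A minimal sketch of word shuffling, assuming an order-insensitive downstream model (the function name and example sentence are illustrative):

```python
import random

def shuffle_augment(sentence, seed=None):
    """Create a new training example by shuffling word order.

    Only appropriate when the downstream classifier is not
    sensitive to word order (e.g. bag-of-words models).
    """
    words = sentence.split()
    rng = random.Random(seed)  # seedable for reproducible augmentation
    rng.shuffle(words)
    return " ".join(words)

augmented = shuffle_augment("the movie was surprisingly good", seed=42)
```

The augmented sentence contains exactly the original words, just in a new order, so it inherits the original label.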
Synonym Replacement
This method makes use of paraphrasing: an algorithm identifies the most similar word and substitutes it for the original word.
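A toy sketch of synonym replacement. The hard-coded synonym table is an assumption for illustration; a real system would look up synonyms in a lexical resource such as WordNet or pick the nearest neighbour in an embedding space:

```python
# Toy synonym table (illustrative assumption; real systems would
# consult WordNet or word embeddings to find the most similar word).
SYNONYMS = {
    "quick": "fast",
    "movie": "film",
    "good": "great",
}

def synonym_augment(sentence):
    """Replace every word that has a known synonym with that synonym."""
    return " ".join(SYNONYMS.get(word, word) for word in sentence.split())

print(synonym_augment("a quick good movie"))  # a fast great film
```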
Transforming Syntax Tree
In this method, syntactic paraphrases of the source sentence are generated. Different kinds of syntactic transformations, such as converting active voice to passive, using a noun instead of a pronoun, or adding adjectives and adverbs, can be used to create paraphrases. For example:
- The smile broke his composure.
- His composure was broken by the smile.
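The active-to-passive transformation above can be sketched for trivially simple subject–verb–object sentences. The participle table and the assumption that the sentence is already split into three parts are both simplifications; a real implementation would operate on a full syntax tree produced by a parser:

```python
# Toy past-participle table (illustrative; a real system would use
# a morphological lexicon or a trained model).
PAST_PARTICIPLES = {"broke": "broken", "wrote": "written", "ate": "eaten"}

def to_passive(subject, verb, obj):
    """Turn a simple 'SUBJECT VERB OBJECT' clause into passive voice.

    Assumes the clause is pre-split and the verb is in the table;
    real syntax-tree transformation is far more involved.
    """
    participle = PAST_PARTICIPLES[verb]
    return f"{obj.capitalize()} was {participle} by {subject.lower()}"

print(to_passive("The smile", "broke", "his composure"))
# His composure was broken by the smile
```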
Query Expansion
Query expansion finds new sentences similar to a given sentence by treating that sentence as a query and searching for matches, using a generic or domain-specific search engine. The most relevant search results are then selected as new sentences, which inherit the same label as the original sentence (the query).
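A minimal sketch of the idea, with a small in-memory corpus standing in for the search engine and Jaccard word overlap standing in for its relevance ranking (both are assumptions made here; in practice you would issue the query to a real search backend):

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def expand_query(query, corpus, top_k=2):
    """Rank corpus sentences by overlap with the query and return the
    top matches; each match would inherit the query's label."""
    ranked = sorted(corpus, key=lambda s: jaccard(query, s), reverse=True)
    return ranked[:top_k]

corpus = [
    "the film was great",
    "stock prices fell today",
    "a great movie indeed",
]
print(expand_query("the movie was great", corpus))
# ['the film was great', 'a great movie indeed']
```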
Spelling Errors Injection
The main objective of this technique is to generate texts containing common misspellings, so that a model trained on them becomes more robust to this type of textual noise. The algorithm is based on a list of the most common misspellings in English.
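A minimal sketch of misspelling injection. The three-entry table is an illustrative assumption; a production list would be much larger (e.g. a published list of common English misspellings):

```python
import random

# Small sample of common English misspellings (illustrative subset).
MISSPELLINGS = {
    "received": "recieved",
    "definitely": "definately",
    "separate": "seperate",
}

def inject_misspellings(sentence, prob=1.0, seed=None):
    """Swap each listed word for its misspelling with probability `prob`."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        if word in MISSPELLINGS and rng.random() < prob:
            out.append(MISSPELLINGS[word])
        else:
            out.append(word)
    return " ".join(out)

print(inject_misspellings("we definitely received it"))
# we definately recieved it
```

Setting `prob` below 1.0 injects noise into only a fraction of occurrences, which keeps the augmented corpus varied.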
Back-Translation
Back-translation translates a sentence into a target language and then back into the source language; the original sentences and the back-translated ones are combined for training. The back-translated results are filtered to recover paraphrases: if a back-translation is identical to the original sentence, it is immediately rejected. Otherwise, a similarity measure between the back-translated text and the original text determines whether the back-translation can be considered a paraphrase of the original.
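The filtering step can be sketched as follows. The two stub "translators" are assumptions standing in for real machine-translation models (in practice you would call a neural MT system); only the reject-identical / keep-if-similar logic mirrors the description above:

```python
from difflib import SequenceMatcher

# Stub translators (illustrative assumption; real back-translation
# uses trained machine-translation models or a translation API).
def translate_en_to_fr(text):
    return {"the movie was very good": "le film était très bon"}[text]

def translate_fr_to_en(text):
    return {"le film était très bon": "the film was very good"}[text]

def back_translate(sentence, min_sim=0.5):
    """Round-trip translate; keep the result only if it is a
    non-identical but sufficiently similar paraphrase."""
    result = translate_fr_to_en(translate_en_to_fr(sentence))
    if result == sentence:
        return None  # identical to the original: rejected
    sim = SequenceMatcher(None, sentence, result).ratio()
    return result if sim >= min_sim else None

print(back_translate("the movie was very good"))
# the film was very good
```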
To leverage the full potential of these techniques, the data augmentation design must be based on a deep understanding of both the document’s structure and its content.
Although text data augmentation techniques are still undergoing rapid development, it cannot be denied that adopting and experimenting with them can substantially boost the performance of the trained model.