Text Preprocessing in Natural Language Processing (NLP)
What is Text Preprocessing in NLP
Data preprocessing is an essential step in building a machine learning model, and the quality of the results depends on how well the data has been preprocessed.
Natural Language Processing (NLP) is a branch of data science that deals with text data. Alongside numerical data, text data is available in large quantities and is used to analyze and solve business problems. Before the data can be used for analysis or prediction, however, it must be processed.
The various text preprocessing steps are:
- Tokenization
- Lower casing
- Stop words removal
- Stemming
- Lemmatization
These steps are widely used for dimensionality reduction: they shrink the vocabulary, and with it the number of dimensions in the vector space model.
Tokenization
Tokenization splits a sentence into individual words (tokens).
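As a minimal sketch of the idea, the snippet below tokenizes a sentence with Python's standard re module. The function name tokenize and the sample sentence are illustrative assumptions, and the simple \w+ pattern is only one possible tokenization rule; libraries such as NLTK ship more careful tokenizers.

```python
import re


def tokenize(sentence):
    """Split a sentence into word tokens (hypothetical helper, not an NLTK API)."""
    # \w+ matches runs of letters, digits, and underscores,
    # so spaces and punctuation are dropped.
    return re.findall(r"\w+", sentence)


tokens = tokenize("NLP deals with text data.")
print(tokens)  # ['NLP', 'deals', 'with', 'text', 'data']
```

A pattern this simple also splits contractions such as "don't" into two tokens, which is one reason dedicated tokenizers are usually preferred on real text.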
Lower casing
Converting a word to lower case (NLP -> nlp).
Words like Book and book mean the same thing, but unless they are converted to lower case they are represented as two different words in the vector space model, which results in more dimensions.
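The effect on vocabulary size can be illustrated with a small sketch. The token list here is made up for the example; each distinct string would become its own dimension in a bag-of-words style vector space model.

```python
# Illustrative tokens (an assumption, not data from the article).
tokens = ["Book", "book", "BOOK", "NLP", "nlp"]

# Without lower casing, every spelling variant is a separate vocabulary entry.
vocab_before = set(tokens)

# str.lower() maps all variants onto one form.
lowered = [token.lower() for token in tokens]
vocab_after = set(lowered)

print(len(vocab_before))  # 5
print(len(vocab_after))   # 2
```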
Remove Stopwords
Stopwords are the most commonly occurring words in a text and carry little useful information; examples include they, there, this, and where. NLTK is a commonly used library for removing stopwords, and its English list contains approximately 180 words (the exact count depends on the NLTK version). To treat additional words as stopwords, keep the list in a Python set and add the new words with the set's add method.
In our example, we want to remove the subject words from every mail, so we add them to the stopwords, along with HTTP to help remove web links.
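The following sketch shows the filtering step with a small hand-written stopword set, so that it runs without downloading any corpus. The set contents, the function name remove_stopwords, and the sample tokens are assumptions made for illustration; in practice the set would typically be built from NLTK's English stopword list.

```python
# A tiny stand-in for a real stopword list (assumption: hand-picked words).
stop_words = {"they", "there", "this", "where", "the", "is", "a", "to"}

# Custom stopwords are added with the set's add method,
# as with the mail subject word and web-link prefix described above.
stop_words.add("subject")
stop_words.add("http")


def remove_stopwords(tokens, stop_words):
    """Keep only tokens whose lower-cased form is not a stopword."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = ["Subject", "this", "is", "the", "meeting", "agenda", "http"]
filtered = remove_stopwords(tokens, stop_words)
print(filtered)  # ['meeting', 'agenda']
```

Comparing the lower-cased token against the set means that Subject and subject are both removed, which is why lower casing is usually applied before this step.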
Stemming and Lemmatization
Stemming reduces a word to its root stem; for example, run, running, and runs all derive from the same word, run. It works by stripping affixes, mostly suffixes such as ing, s, and es. There are various stemming algorithms, such as the Porter stemmer and the Snowball stemmer; the Porter stemmer is widely used and is available in the NLTK library.
Stemming is often avoided in production because it is a crude technique: it frequently stems words that should be left alone and can produce stems that are not real words. Lemmatization was introduced to address this problem by mapping each word to its dictionary form (lemma) instead of simply cutting off its ending.
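To make the suffix-stripping idea concrete, here is a deliberately naive sketch. It is not the Porter algorithm, and the names SUFFIXES and naive_stem are hypothetical; it only illustrates why blind suffix removal can produce unwanted stems.

```python
# Suffixes to strip, checked in this order (an illustrative assumption).
SUFFIXES = ("ing", "es", "s")


def naive_stem(word):
    """Strip one common suffix; a toy stand-in for a real stemmer."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Require at least three remaining characters so very short
        # words such as "is" are left untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["run", "runs", "running", "boxes", "this"]:
    print(word, "->", naive_stem(word))
# run -> run
# runs -> run
# running -> runn
# boxes -> box
# this -> thi
```

The results for running and this show the kind of unwanted stemming described above. For real work, NLTK's PorterStemmer class and its stem method apply a much more careful rule set.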