Text Preprocessing in Natural Language Processing (NLP)

What is Text Preprocessing in NLP?

Data preprocessing is an essential step in building a Machine Learning model, and the quality of the results depends heavily on how well the data has been preprocessed.

Natural Language Processing (NLP) is a branch of Data Science that deals with text data. Apart from numerical data, text data is available in large volumes and is widely used to analyze and solve business problems. But before this data can be used for analysis or prediction, it needs to be processed.

The various text preprocessing steps are:

  • Tokenization
  • Lower casing
  • Stop words removal
  • Stemming
  • Lemmatization

These various text preprocessing steps are widely used for dimensionality reduction.

Real "AI Buzz" | AI Updates | Blogs | Education

Tokenization

Tokenization is the process of splitting a sentence into individual words (tokens).

from nltk.tokenize import word_tokenize

# The punkt tokenizer data may need to be downloaded once:
# import nltk; nltk.download('punkt')

sentence = "Books are on the table"

words = word_tokenize(sentence)
print(words)  # ['Books', 'are', 'on', 'the', 'table']

Lower casing

Converting a word to lower case (NLP -> nlp).
Words like Book and book mean the same thing, but when they are not converted to lower case, they are represented as two different words in the vector space model (resulting in more dimensions).

sentence = "Books are on the table."
sentence = sentence.lower()
print(sentence)  # books are on the table.
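
To make the dimensionality point concrete, here is a minimal sketch (not from the original post) using scikit-learn's CountVectorizer; the sample documents are just illustrations:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Book on the table", "book on the shelf"]

# Without lowercasing, "Book" and "book" become two separate features
vec = CountVectorizer(lowercase=False)
vec.fit(docs)
print(vec.get_feature_names_out())
# ['Book' 'book' 'on' 'shelf' 'table' 'the']  -> 6 dimensions

# With lowercasing (the default), they collapse into one feature
vec_lower = CountVectorizer()
vec_lower.fit(docs)
print(vec_lower.get_feature_names_out())
# ['book' 'on' 'shelf' 'table' 'the']  -> 5 dimensions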

Remove Stopwords

Stopwords are the most commonly occurring words in a text that do not provide any valuable information. Words like they, there, this, and where are examples of stopwords. The NLTK library is commonly used to remove stopwords and ships with a list of approximately 180 English stopwords. If we want to add a new word to the set of stopwords, it is easy to do with the add method.

In our example, we want to remove the subject words from every mail, so we add subject to the stopwords, along with http to remove web links.

#remove stopwords

from nltk.corpus import stopwords

# The stopwords corpus may need to be downloaded once:
# import nltk; nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
stop_words.add('subject')  # drop the subject words from every mail
stop_words.add('http')     # drop web links

def remove_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in stop_words])

data['text'] = data['text'].apply(lambda x: remove_stopwords(x))

Stemming and Lemmatization

Stemming is a process that reduces a word to its root stem: for example, run, running, and runs all derive from the same root word, run. Basically, what stemming does is remove prefixes or suffixes from a word, such as ing, s, es, etc. The NLTK library is used to stem the words. The stemming technique is usually not used for production purposes because it is not very precise and often stems words in unwanted ways. To solve this problem, another technique, Lemmatization, is used. There are various stemming algorithms, such as the Porter stemmer and the Snowball stemmer. The Porter stemmer, available in the NLTK library, is widely used.

#stemming

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text.split()])

df["text"] = df["text"].apply(lambda x: stem_words(x))
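
The post mentions lemmatization as the more reliable alternative but does not show code for it. Here is a minimal sketch using NLTK's WordNetLemmatizer; the wordnet download and the df DataFrame mirror the stemming example above and are assumptions, not part of the original post:

#lemmatization

from nltk.stem import WordNetLemmatizer

# The WordNet data may need to be downloaded once:
# import nltk; nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def lemmatize_words(text):
    # lemmatize() maps each word to a valid dictionary form (e.g. "books" -> "book")
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

df["text"] = df["text"].apply(lambda x: lemmatize_words(x))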
