Text Representation Techniques in Natural Language Processing (NLP)

What is Text Representation in NLP?

Text representation is a crucial aspect of NLP that involves converting raw text data into a numerical, machine-readable form.

In this article, we will explore the different text representation techniques, starting from traditional approaches such as bag-of-words and n-grams to modern techniques like word embeddings.

By the end of this article, you will have a fair understanding of the different text representation techniques along with their strengths and weaknesses.

Real "AI Buzz" | AI Updates | Blogs | Education

Bag of Words (BoW) Model:

This is the simplest way to convert unstructured text data into a structured numeric format that can be processed by machine learning algorithms. Each word in the text is treated as a feature, and the number of times a particular word appears in the text is used to represent the importance of that word. The model disregards grammar and word order but keeps track of the frequency of each word.

Despite its simplicity, the BoW model provides a foundation for various NLP tasks such as text classification and sentiment analysis.

Example:

Let us consider three sentences:

  1. The cat in the hat
  2. The dog in the house
  3. The Bird in the Sky

The code below builds the bag-of-words representation for these sentences:

from sklearn.feature_extraction.text import CountVectorizer

# Sample sentences
sentences = ["The cat in the hat",
             "The dog in the house",
             "The bird in the sky"]

# Create a CountVectorizer object
vectorizer = CountVectorizer()

# Use the fit_transform method to transform the sentences into a bag of words
bow = vectorizer.fit_transform(sentences)

# Print the vocabulary (features) of the bag of words
# (get_feature_names_out replaces the older get_feature_names in recent scikit-learn)
print(vectorizer.get_feature_names_out())

# Print the bag of words
print(bow.toarray())
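
Running this prints the vocabulary (sorted alphabetically by CountVectorizer) followed by the count matrix, one row per sentence and one column per word. With the default settings the output should look approximately like this:

['bird' 'cat' 'dog' 'hat' 'house' 'in' 'sky' 'the']
[[0 1 0 1 0 1 0 2]
 [0 0 1 0 1 1 0 2]
 [1 0 0 0 0 1 1 2]]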

Term Frequency-Inverse Document Frequency (TF-IDF):

TF-IDF is a more refined text representation technique that takes into account not only the frequency of words in a document but also their importance in the entire corpus. It weights a word by its frequency within a document and down-weights it in proportion to how often it occurs across all documents in the corpus. TF-IDF addresses a limitation of BoW by giving higher weight to terms that are rare in the corpus overall yet frequent in specific documents, making it suitable for information retrieval and document similarity tasks.
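
As a minimal sketch, the same three sentences from the BoW example can be weighted with scikit-learn's TfidfVectorizer (which applies its own smoothed variant of the classic TF-IDF formula by default):

from sklearn.feature_extraction.text import TfidfVectorizer

# Same toy sentences as in the BoW example
sentences = ["The cat in the hat",
             "The dog in the house",
             "The bird in the sky"]

# TfidfVectorizer combines counting and TF-IDF weighting in one step
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(sentences)

# Words shared by every sentence (e.g. "the", "in") receive lower weights
# than words that appear in only one sentence (e.g. "cat", "dog", "bird")
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))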

N-gram

An N-gram is a traditional text representation technique that involves breaking the text into contiguous sequences of n words. A uni-gram gives all the individual words in a sentence, a bi-gram gives sets of two consecutive words, a tri-gram gives sets of three consecutive words, and so on.

Example: The dog in the house

  Uni-gram: "The", "dog", "in", "the", "house"
  Bi-gram: "The dog", "dog in", "in the", "the house"
  Tri-gram: "The dog in", "dog in the", "in the house"
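
As a small sketch, scikit-learn's CountVectorizer can extract these n-grams directly through its ngram_range parameter (note that it lowercases the text by default):

from sklearn.feature_extraction.text import CountVectorizer

sentence = ["The dog in the house"]

# ngram_range=(1, 3) extracts uni-grams, bi-grams, and tri-grams together
vectorizer = CountVectorizer(ngram_range=(1, 3))
ngrams = vectorizer.fit_transform(sentence)

# Print all extracted n-grams
print(vectorizer.get_feature_names_out())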

Word Embeddings: Capturing Semantic Meaning

Word embedding represents each word as a dense vector of real numbers, such that similar or closely related words are nearer to each other in the vector space. This is achieved by training a neural network model on a large corpus of text, where each word is represented as a unique input and the network learns to predict the surrounding words in the text. In this way, the semantic meaning of each word is captured. The dimension of these vectors can range from a few hundred (GloVe, Word2Vec) to thousands (language models).

The code below trains a small Word2Vec model with Gensim:

from gensim.models import Word2Vec

# Define the corpus (list of sentences)
corpus = ["The cat jumped",
          "The white tiger roared",
          "Bird flying in the sky"]

# Tokenize each sentence into a list of words
corpus = [sent.split(" ") for sent in corpus]

# Train the Word2Vec model on the corpus
# (vector_size is the parameter name in Gensim 4.x; older versions used size)
model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, workers=2)

# Get the vector representation of a word
vector = model.wv["cat"]

# Get the top-N most similar words to a given word
similar_words = model.wv.most_similar("cat", topn=5)

Sentence Embedding:

Sentence embedding is similar to word embedding, except that a whole sentence, rather than a single word, is represented as a numerical vector in a high-dimensional space. The goal of sentence embedding is to capture the meaning and semantic relationships between the words in a sentence, as well as the context in which the sentence is used.

The code below trains a Doc2Vec model with Gensim and infers an embedding for one sentence:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Define a list of sentences to be embedded
sentences = ["The cat jumped",
             "The white tiger roared",
             "Bird flying in the sky"]

# Convert the sentences to TaggedDocuments
tagged_data = [TaggedDocument(words=sentence.split(), tags=[str(i)])
               for i, sentence in enumerate(sentences)]

# Train a Doc2Vec model on the TaggedDocuments
model = Doc2Vec(tagged_data, vector_size=50, min_count=1, epochs=10)

# Infer the embedding for a sentence
embedding = model.infer_vector("The white tiger roared".split())

# Print the resulting embedding
print(embedding)

Text representation in NLP is a captivating journey through the intricacies of human language. From traditional Bag of Words models to state-of-the-art contextual embeddings like BERT and GPT, the evolution of text representation techniques has paved the way for groundbreaking applications.
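
As a closing sketch of such contextual embeddings (a minimal example assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; mean-pooling the token vectors is just one simple way to obtain a sentence-level vector):

from transformers import AutoTokenizer, AutoModel
import torch

# Load a pretrained BERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Encode a sentence; the same word gets different vectors in different contexts
inputs = tokenizer("The white tiger roared", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (1, number_of_tokens, 768): one contextual vector per token
token_embeddings = outputs.last_hidden_state

# Mean-pool the token vectors to get a single sentence embedding
sentence_embedding = token_embeddings.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])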
