Learning Outcomes

  • Traditional Text Representation Models
  • Hands-on examples of Text Representation Models (Statistical)
  • Word Embedding Models with hands-on examples

Learning Resources

Video Transcript

So the next part is representation of text data. This is where we need to transform the text data into some kind of numeric format. The most common methods include bag of words, where every document is represented by a vector, each column of the vector typically corresponds to one word, and the values of the vector are the number of times each word occurs in that document. We'll see a short example of this soon. The bag of n-grams is similar to the bag of words in that we again have a vector of numbers; the only difference is that instead of just single words, we can also have collections of words that occur in sequence. For example, if we take every two words occurring in sequence, that is known as a two-gram or bigram model. So in bag of n-grams, the vector representation contains not only single words but also n-grams.
TF-IDF is basically a normalized version of the bag of words, or term frequency, model. Instead of just recording the number of times every word occurs in each document, we scale the term frequency by the inverse document frequency, which is derived from the number of documents a word occurs in. You can imagine that if a word occurs in almost all the documents, it essentially behaves like a stop word and is not very useful. When you multiply the term frequency by the inverse document frequency, the IDF for such a word is very small, so it ends up reducing the weight of words which occur in almost every document. That is why TF-IDF is used to downplay the effect of frequently occurring words. The last one is more of a derived feature built on top of bag of words or similar features, like embeddings later on, where you assign a score to each document with regard to how similar it is to the other documents. So based on the base bag-of-words or TF-IDF vectors, you can assign scores to every document as to how similar it is to each other document, also known as pairwise similarity.
Let's quickly look at an example of this. Again, this is just a sample corpus. Considering the time at hand, I won't walk through each and every line of code, but feel free to refer to it. This just builds a standard data frame containing these documents. Now the whole idea is: how do I represent each of these documents in a numeric format? For that, I do some basic text preprocessing, as you can see: all the text in the documents is lowercased, the stop words are removed, and so on. Then I apply the bag-of-words model, which is called CountVectorizer in the scikit-learn API. If I apply the CountVectorizer to every document, this is what it looks like. Every word is part of my vector, and this first row is the vector representation of my first document. As you can see, the first document was about "the sky is blue and beautiful", which got preprocessed into sky, blue, beautiful. These are the three words which occur once, so in the vector representation, beautiful, blue, and sky have a value of one and all the others are zero. So basically it is a sparse matrix where only the words which occur in a document are marked with a non-zero value, depending on the number of times that word occurs, and all the other words are zero.
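Below is a minimal sketch of the preprocessing plus bag-of-words step described above. The sample documents, the tiny stop word list, and the cleaning regex are illustrative assumptions, not the exact corpus or code used in the video.

import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus, loosely based on the "sky is blue and beautiful" example
docs = [
    "The sky is blue and beautiful.",
    "Love this blue and beautiful sky!",
    "The quick brown fox jumps over the lazy dog.",
]
stop_words = {"the", "is", "and", "this", "over"}  # illustrative stop word list

def preprocess(doc):
    doc = re.sub(r"[^a-zA-Z\s]", "", doc).lower()   # keep letters only, lowercase
    return " ".join(w for w in doc.split() if w not in stop_words)

norm_docs = [preprocess(d) for d in docs]

cv = CountVectorizer()                 # bag-of-words model
bow = cv.fit_transform(norm_docs)      # sparse document-term count matrix

# column order follows the vocabulary indices learned by the vectorizer
features = [w for w, i in sorted(cv.vocabulary_.items(), key=lambda x: x[1])]
print(pd.DataFrame(bow.toarray(), columns=features))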
So this essentially becomes a numeric matrix, which you can then feed into a downstream machine learning application. For the bag of n-grams model, instead of looking at just words, we can look at sequences of words. "Beautiful sky" occurs in sequence in the second document; that is why it is marked as one. For every n-gram, we put the number of times it occurs in the document in its position, and we get similar vectors again. Something to remember is that as you increase the size of the n-grams, the dimensionality of your vectors also starts increasing. Just something to keep in mind, because you don't want to be affected by the curse of dimensionality, which can cause your models to overfit.
And like I said, for TF-IDF, we multiply the term frequency by the inverse document frequency, which is the log transform of the total number of documents in the corpus divided by the document frequency of the word. So obviously, if a word occurs in each and every document, the IDF ends up being a really small number, and if you multiply it by the term frequency, as you can see, we start getting floating point numbers which are essentially normalized versions of the counts. So TF-IDF is useful when you want to downplay the effect of words which occur very frequently in the corpus.
Document similarity is where we take these vectors and ask: how similar is document 0 to document 1 based on these vectors? Once you do that, you typically get a pairwise similarity matrix, where for document 0 you can see how similar all the other documents are. In our case, we have around eight documents. You can see here that document 0 is most similar to itself, because it is exactly the same; what we are concerned about is all the others. From this, we can easily get the insight that documents 0, 1, and 6 are very similar to each other, and if you look, documents 0, 1, and 6 are all related to weather. Here I have used a basic clustering algorithm called k-means, and if you pass this similarity matrix to k-means, it helps you group the documents. So even if I didn't have the category labels, I would know which documents are close to each other. This is where you can start using these features in downstream machine learning applications like clustering, classification, and so on. We'll see an example of this soon in the hands-on tutorial.
Going into the next one, this is something you can dive into in more detail when you get time. This is where we leverage word embedding models, which use neural network back-end models to generate text representations: instead of just counting every word, we look at each word and its surrounding words to generate dense vector representations of each and every word. There are various variants of this. The most popular ones include Word2Vec, which was invented by Google a while back, and GloVe, or Global Vectors, from Stanford, which uses a similar but different mechanism, a matrix factorization technique, to obtain dense word representations.
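As a rough sketch of the TF-IDF weighting, pairwise document similarity, and k-means grouping described above (the toy corpus, the number of clusters, and the choice of TfidfVectorizer here are illustrative assumptions, not the exact notebook code):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

docs = [
    "the sky is blue and beautiful",
    "love this blue and beautiful sky",
    "the quick brown fox jumps over the lazy dog",
    "a brown fox is quick and a dog is lazy",
]

# TF-IDF scales each count by a log-scaled inverse document frequency
tfidf = TfidfVectorizer(stop_words="english")
vecs = tfidf.fit_transform(docs)

# pairwise similarity matrix: entry (i, j) = cosine similarity of docs i and j
sim = cosine_similarity(vecs)
print(pd.DataFrame(sim).round(2))

# group documents by feeding the similarity matrix to k-means
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(sim)
print(km.labels_)   # e.g. sky-related docs and animal-related docs typically separate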
And FastText basically uses a bag of character n-grams to represent every word as a dense vector. In Word2Vec, every word is represented as a single entity, but FastText takes that one step further by treating a word as a combination of sub-words. So the vector for "where" would be a combination of all of its character n-grams averaged together. Each of these character n-grams has its own vector representation, and you can combine them to get the representation of "where". This really helps when we have out-of-vocabulary words, because you can compose a vector representation for such a word from its character n-grams. I'm not covering this in too much detail due to its complexity, and also to keep in line with the time I have left. But if you are interested, I would recommend you go through this notebook, where, again, I'm using the same documents. You will see that I use the Gensim package here to generate a Word2Vec model. Essentially, since I can generate a vector representation of each word, you can see that the adjectives, the animals, all the things related to breakfast, and all the things related to the sky, like beautiful and blue, are grouped together. And the reason is that these are the vectors which are the output of the Word2Vec model. In my case, I specified that the vector representation for every word should be of size 15, so every word here is represented by a vector of numbers of size 15. The whole concept of Word2Vec is that similar words, like sky, blue, and beautiful, will have vectors which are very similar to each other. That is where you can use similarity again, like we saw before, and compute the similarity of each word to the other words. You can see that sky, blue, and beautiful have a really high similarity compared to the other words. The whole reason is that these numeric representations have been trained using a neural network in the back end. Something to check out in the future, once you read up more on feature representations using embedding models.
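If you want to experiment with the embedding side yourself, here is a minimal sketch of training a Word2Vec model with the Gensim package, assuming Gensim 4.x (where the dimensionality argument is vector_size) and a toy pre-tokenized corpus; the hyperparameters simply mirror the vector size of 15 mentioned above and are not the notebook's exact settings.

from gensim.models import Word2Vec

# toy pre-tokenized corpus; in practice this comes from the preprocessed documents
tokenized_docs = [
    ["sky", "blue", "beautiful"],
    ["love", "blue", "beautiful", "sky"],
    ["quick", "brown", "fox", "jumps", "lazy", "dog"],
    ["brown", "fox", "quick", "dog", "lazy"],
]

# every word gets a dense vector of size 15, learned from its context window
w2v = Word2Vec(sentences=tokenized_docs, vector_size=15, window=3,
               min_count=1, epochs=100, seed=42)

print(w2v.wv["sky"])               # the 15-dimensional vector for "sky"
print(w2v.wv.most_similar("sky"))  # words whose vectors are closest to "sky"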

Text Vectorization

textve.JPG

Sample code illustrating vector representation of text using unigrams (BoW)

from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "basketball is a team sport where teams shoot a basketball",
    "football is a sport where teams score goals"
]

# fit vectorizer on texts
vectorizer = CountVectorizer(ngram_range=(1, 1))
vectorizer.fit(texts)  # build ngram dictionary

# vectorize texts into bag of words
ngrams = vectorizer.transform(texts)
ngrams.todense()

## Output
# matrix([[2, 0, 0, 1, 0, 1, 1, 1, 1, 1],
#         [0, 1, 1, 1, 1, 0, 1, 0, 1, 1]])
  • ngram_range=(1, 1) means that only unigrams (single words) are counted

Once the vectorizer is fit, it can be used to transform text into vectors with the transform method. The result is a scipy sparse matrix, which we can easily visualize by converting it into a dense matrix with its todense method.

dense.JPG

# show the vocabulary learned by the vectorizer
vectorizer.vocabulary_

Output

outputve.JPG
Using the vectorizer's vocabulary, we can build a pandas DataFrame that shows how many times each word has been counted in each text.

Create a pandas DataFrame that shows the unigram counts

import pandas as pd

sorted_cols = sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1])
features = [col[0] for col in sorted_cols]
df = pd.DataFrame(ngrams.todense(), columns=features)
print(df)

df.JPG

Bigram

A bigram is made of two consecutive words, such as “score goals” in the sentence “football is a sport where teams score goals”. Bigrams may be necessary to grasp concepts expressed by multiple consecutive words like “New York” or “American football”. To do so, we simply create a CountVectorizer with ngram_range=(1,2), which means that both unigrams and bigrams will be counted.

# fit vectorizer on texts
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(texts)  # build ngram dictionary

# vectorize texts into bag of words
ngrams = vectorizer.transform(texts)
ngrams.todense()

# Output
'''
matrix([[2, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
        [0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1]])
'''

# show the vocabulary learned by the vectorizer
vectorizer.vocabulary_

Extracted Bigrams

bigram.JPG

Create a pandas DataFrame that shows the bigram counts

sorted_cols = sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1])
features = [col[0] for col in sorted_cols]
df = pd.DataFrame(ngrams.todense(), columns=features)
print(df)

bidf.JPG
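TF-IDF

The transcript above also discusses TF-IDF, which rescales these raw counts by the inverse document frequency. As a short sketch (the exact weights depend on scikit-learn's default smoothing and L2 normalization), the same two texts can be vectorized with TfidfVectorizer:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# same two texts used in the count-based examples above
texts = [
    "basketball is a team sport where teams shoot a basketball",
    "football is a sport where teams score goals"
]

tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1))
tfidf = tfidf_vectorizer.fit_transform(texts)

# words shared by both texts (e.g. "is", "sport", "teams", "where") receive
# lower weights than words unique to a single text
sorted_cols = sorted(tfidf_vectorizer.vocabulary_.items(), key=lambda x: x[1])
features = [col[0] for col in sorted_cols]
print(pd.DataFrame(tfidf.todense(), columns=features).round(2))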

Additional Resources

We recommend you go through the following article, which gives a comprehensive overview of Text Representation for NLP: