Learning Objectives
- Downstream ML/DL Tasks after Feature Engineering
- Real-world challenges in the industry
- Hands-on example of Movie Recommender System
- Hands-on example of Product Rating Prediction
Learning Resources
Video Transcript
Let's talk about the most important aspect: once I convert text data into numeric representations, how do I use it? That is where we can build classification models, recommender systems, clustering, topic modeling, sentiment analysis, semantic analysis, and so on. Often you start with a very vague problem statement, and that is where you need to understand how to frame the task properly, whether it is classification, NER, information extraction, and so on. Understanding how to frame the problem is one of the most important challenges, and only then do you go into model training, debugging, and so forth. One of the challenges with real-world data is that you may not get 90% accuracy, so you need to think about when you should stop training or tuning your models and when they might be ready for deployment. You don't know unless you really try it out, so you need to iterate across different versions of features, models and so on, see which model turns out best, and even deploy it and see how it works out. I will give you a couple of hands-on examples, which will give you a lot more perspective.

We will look at two use cases here, and hopefully we have enough time to cover both of them. One is a movie recommender system where we have movie names and movie descriptions, and the idea is: given an input movie, can I use its description to recommend similar movies? The other one is a classification problem where I try to predict whether a product is a good product or not, given the review text.

The workflow for the movie recommender system is to get the movie dataset, clean the movie descriptions, build TF-IDF features for each description (as we discussed, we need to convert each description into a vector), compute document similarity using pairwise cosine similarity, which tells us how similar every movie description is to every other movie description, and then recommend similar movies based on the input movie's description. You can read up about content-based recommenders and these aspects in more detail; that is why the notebooks are there. Let me go into the meat of the problem. We load up the dataset as a pandas DataFrame. The columns of interest are the title, the tagline and the overview, and we combine the tagline and the overview to generate the description column, which is what we work with. So the main question is: given a movie title and description, can I find similar movies based on their descriptions? The first step is to preprocess the documents. This is where I use some standard regular expressions, as we discussed before, to keep only the numbers and the text, lowercase everything, remove extra spaces, and so on. Once we preprocess the text data, we have around 4,800 movies. The next step is TF-IDF feature extraction on these movie descriptions. Here we tell the vectorizer that we want single words as well as bigrams as part of our feature vectors, and we set min_df so that we remove all the terms which occur in only one movie description.
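As a rough sketch of this step (the file path and column names such as `tagline` and `overview` follow the transcript's description of the dataset; the exact cleaning rules in the notebook may differ), the preprocessing and TF-IDF extraction could look like this:

```python
import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# load the movies dataset (hypothetical file name)
movies = pd.read_csv('tmdb_movies.csv')

# combine tagline and overview into a single description column
movies['description'] = movies['tagline'].fillna('') + ' ' + movies['overview'].fillna('')
movies = movies[movies['description'].str.strip() != ''].reset_index(drop=True)

def preprocess(text):
    # keep only letters, digits and spaces, lowercase, collapse extra whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    text = text.lower()
    return re.sub(r'\s+', ' ', text).strip()

movies['clean_description'] = movies['description'].apply(preprocess)

# unigrams + bigrams; min_df=2 drops terms that occur in only one description
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
tfidf_matrix = tfidf.fit_transform(movies['clean_description'])
print(tfidf_matrix.shape)  # roughly (4800, ~20000)
```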
So basically we are removing terms which occur only once, which means they are not very useful. As you can see, every movie description ends up being a vector of almost 20,000 dimensions. Now the idea is that once we have a numeric vector representation for each movie description, we can find out how similar every movie is to every other movie, and that is where we use the cosine similarity function. You can see that we get a 4,800 x 4,800 matrix. Here I am only showing you five rows, but you get the idea: I know how similar every movie is to every other movie.

Once I know this, I can use this information. Let's say I take a movie called Minions. The first thing I need to do is find out where in the matrix Minions occurs. As we saw before, the movie Minions occurs in the first row; you can ignore these indices, by the way, because once you reset the index, Minions is in the first row of our data frame and of our matrix. That row therefore holds the similarity of every other movie to Minions, so once I know where Minions occurs, I extract that similarity vector, which just means extracting that one row from the whole matrix. To get the movie indices, or row numbers, ranked by similarity you can use the handy argsort function. You put a minus in front of the similarity scores because argsort sorts in ascending order by default, and I want the most similar movies first, which means the movies with the highest similarity scores. Once I do that, it tells me which movies are most similar, and you can see here that Despicable Me, Despicable Me 2 and so on come out as the most similar movies, which makes sense because they are part of the same franchise.

So what you can do is build a generic function where you find the movie ID, extract the movie similarities from your matrix, find the top five similar movies and print them out. In this way you can experiment with other popular movies in this dataset, and you can see that many of the results make sense, for example recommendations that are part of the same franchise, and so on. So with a simple methodology of converting the text description into a TF-IDF vector, computing a pairwise cosine similarity matrix across every movie, sorting by the highest similarity scores and showing the most similar movies, you can easily build a content-based recommendation system. In the future, instead of TF-IDF you can substitute, let's say, a CountVectorizer, or try some other feature engineering techniques like a fastText model, and so on.
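A minimal sketch of this recommendation step, assuming the `tfidf_matrix` and `movies` DataFrame from the previous snippet (the `title` column name is an assumption), could look like this:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# pairwise similarity between every pair of movie descriptions
doc_sim = cosine_similarity(tfidf_matrix)   # shape: (n_movies, n_movies)
movie_titles = movies['title'].values

def recommend_similar_movies(title, top_n=5):
    # locate the row of the input movie in the similarity matrix
    idx = np.where(movie_titles == title)[0][0]
    # extract its similarity vector against all other movies
    sims = doc_sim[idx]
    # argsort sorts ascending, so negate the scores to get the highest first;
    # skip position 0 because every movie is most similar to itself
    top_idx = np.argsort(-sims)[1:top_n + 1]
    return movie_titles[top_idx]

print(recommend_similar_movies('Minions'))
# expected to surface franchise neighbours such as the Despicable Me movies
```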
Going into the next one, this is a standard classification problem where I take an e-commerce review ratings dataset, do some basic text preprocessing, and then do some iterative machine learning: I build separate feature sets each time, using count-based features, sentiment features and bag-of-words features, and I try to build a classification model to predict whether a product is a good product or a bad product and see whether the model gives good performance. A lot of the concepts will be similar to last time. This is my e-commerce reviews dataset, and I have converted it into a simple binary classification problem: a label of one means a good product and a label of zero means a bad product, just like sentiment analysis. I am removing the rows where we don't have any review content, and as you can see, it is a highly imbalanced problem, where the number of good products is much higher and the number of bad products is much lower.

We follow a standard machine learning workflow here, where we build a training dataset and a test dataset with a standard train-test split. Our first feature set for converting the text data into numeric representations uses counts. We count the number of words in each review, the number of characters in each review, the word density, which is the character count divided by the total number of words, the number of punctuation characters, the number of words beginning with an uppercase character, and so on. Once we do this, you can see that for every review we have a numeric representation of different properties of the review in terms of the number of words, punctuation and so forth. Now the idea is: can I build a classification model using these numeric representations, and can they help me learn good enough patterns to tell a good review from a bad review? This is where we use a logistic regression model as the classifier and then test its performance. To test the performance, we first fit the model on our training data, then predict on the test data with the trained model, and then check the confusion matrix and some key metrics like precision, recall and F1 score. As you can see here, our model is essentially predicting everything as class one, which is why the precision, the recall and consequently the F1 score for class zero are all zero. The model hasn't really learned anything: give it any review and it will predict class one, so it is not at all useful.

What can we do next? Since we are dealing with text reviews, customers tend to be emotional or express some kind of sentiment in their text. This is where we use an unsupervised, lexicon-based sentiment analysis library, TextBlob, which I am sure a lot of you have heard about. TextBlob looks at a dictionary and, based on the different types of adjectives and other words carrying a positive or negative sentiment, it scores each document in terms of polarity, which is whether it has a positive or negative sentiment, and subjectivity, which is the level of emotion exhibited in that document.
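A small sketch of how these two feature groups could be computed (the feature names and the exact word-density formula are assumptions; the notebook may define them slightly differently) is shown below:

```python
import string
from textblob import TextBlob

def count_features(review):
    words = review.split()
    word_count = len(words)
    char_count = len(review)
    return {
        'word_count': word_count,
        'char_count': char_count,
        # word density: characters per word (+1 guards against empty reviews)
        'word_density': char_count / (word_count + 1),
        'punctuation_count': sum(1 for ch in review if ch in string.punctuation),
        'upper_start_word_count': sum(1 for w in words if w[:1].isupper()),
    }

def sentiment_features(review):
    blob = TextBlob(review)
    return {
        'polarity': blob.sentiment.polarity,          # -1 (negative) to +1 (positive)
        'subjectivity': blob.sentiment.subjectivity,  # 0 (objective) to 1 (subjective)
    }

example = "Absolutely loved this product, works great!"
print(count_features(example))
print(sentiment_features(example))
```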
So what we do now is append these two new features, polarity and subjectivity. Positively emotional reviews get a higher polarity score and negative-sentiment reviews get a lower polarity score. Using these two new features, we rebuild the same logistic regression model, and now we are able to classify a fair number of reviews as bad reviews. Our recall is now around 27%, meaning we correctly identify 27% of the negatively rated products, and the precision is also quite decent at around 69%. The performance on good reviews is obviously better because we have more data for them, but our main focus is whether we can improve the predictions for the bad reviews, since there is less data for those.

Next we apply the standard workflow we just discussed: we keep only the alphabetic text, remove the extra spaces, and, as we covered during data preprocessing and wrangling, stem every word in each document and then remove the stop words. So we apply some standard preprocessing steps to every document. Once I have the clean reviews, I use bag-of-words features. Here I use a simple unigram model where every word becomes a feature, so these are my feature vectors, just like with TF-IDF before but using a CountVectorizer, and you can see that the feature space explodes from only six to eight features to almost 8,500 features. What I do next is append this to my existing features. That means I have the count and density features describing the properties of each review, the sentiment features, and now also the counts of every word across the corpus: bag-of-words plus sentiment plus count and density based properties. Considering all these features, I build a model again.

As you can see, once I build the logistic regression model on the training data and predict on the test data with this combination of feature sets, the performance really improves. Earlier I was able to predict only around 27% of the bad reviews (sorry for the scrolling). With this new model it has actually reversed: I am able to predict 70% of the bad reviews, and my precision is also quite decent at 76%. The performance on good reviews is boosted as well, so my overall precision comes to 88% and the F1 score also comes to 88%. This is an example of iterating and rebuilding models: starting with simple text count and density based features, then adding sentiment based features, and finally bag-of-words features, first building a baseline model and then improving on top of it by adding more features and checking whether that improves performance. This is the standard workflow all of you should typically follow when you are building an NLP application, be it supervised or unsupervised, as sketched below.
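As a rough sketch of this final step (names such as `train_reviews_clean`, `test_reviews_clean`, `train_meta`, `test_meta`, `y_train` and `y_test` are hypothetical placeholders for the cleaned review text, the earlier count/sentiment feature arrays and the labels that the notebook would have prepared), combining bag-of-words with the other features and retraining the logistic regression model could look like this:

```python
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# bag-of-words (unigram) features on the cleaned review text
cv = CountVectorizer(min_df=2)
train_bow = cv.fit_transform(train_reviews_clean)
test_bow = cv.transform(test_reviews_clean)

# stack bag-of-words with the count-based and sentiment features
# (train_meta / test_meta are dense arrays of the earlier feature columns)
X_train = hstack([train_bow, csr_matrix(train_meta)])
X_test = hstack([test_bow, csr_matrix(test_meta)])

# retrain the classifier on the combined feature set and evaluate
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```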
- Links to the notebook for hands-on practice: