Learning Outcomes

  • Standard NLP Workflow
  • Text Wrangling/Pre-processing
  • Hands-on example of Text Wrangling

Video Transcript

Let's look at the natural language processing workflow which we follow here. This is a workflow I put together a while back; it is not something I invented, but something adapted from a regular machine learning workflow. We have text documents first coming into our enterprise, our systems, or our servers as inputs. One of the first steps is to pre-process these text documents: we need to do some level of cleaning, because there will be a fair bit of noise in the text data. Text pre-processing is all about cleaning that noisy text data. Then we may do some level of parsing or exploratory data analysis. This is an optional step where we explore how the documents are structured, parse through them to extract the key entities, and so on.

The next step is probably one of the most important in the workflow: representing the text and building feature vectors out of the text data. The reason is that text is natural language, and any machine learning or deep learning model is, at heart, an optimization or approximation model based on principles of mathematics. It can only work on vectors, matrices, or tensors, in short, numeric data. That is why you cannot just dump text data directly into a downstream machine learning or deep learning model; you have to use some representation to transform the text into a numeric format, which can then be fed into a downstream model. So this block is all about representing text data in a numeric format.

Once that is done, we go into modeling or mining for patterns. If it is a supervised machine learning task, we can use classification or regression models; if it is an unsupervised learning task, we can use something like similarity or clustering, and so on. Obviously, we will not be building just one model but multiple models here. Once we build all these machine learning or deep learning models, we want to see which model is performing best, just like in a standard machine learning workflow, and that is where model evaluation comes into the picture. We iterate with different feature sets, build different types of models, or tune the models, and see which gives the best performance metrics based on what really matters to the domain or to the business, whether that is decreasing the number of false positives, focusing more on the true positives, or aspects like that.

Once we finalize a model, we go into what is known as the push to production, where we take our model and our data transformation pipelines (ETL, extract-transform-load), build a whole system around them, and deploy it in production. Of course, things don't end here; there are many more aspects, like retraining models over time and checking whether new features are coming into our data. But overall, these are the key steps followed as part of any natural language processing application workflow. One more thing to remember, although it is not going to be covered as part of today's scope since this is more of a fundamentals session.
For folks who are more interested in deep learning for natural language processing, you often do not need to do a whole lot of text pre-processing or extensive data cleaning before you go into the feature representation or modeling phase. You can go directly into feature representation and modeling when working with more complex models like transformers, RNNs, LSTMs, and so on. Just something to keep in mind.

So let's talk about text wrangling or pre-processing, which is typically one of the first steps once we get our text data. There are a number of steps here which you can follow. Again, you don't have to follow all of them; you can restrict them based on the problem you're trying to solve. Some of the main steps include removing HTML tags, removing extra white spaces, new lines, and special character symbols. If you are dealing with accented characters, you may want to convert them to ASCII, which is a standard format. Then there is stemming or lemmatization. Something to remember here is that we don't want to do both of them; sometimes people make the mistake of first doing stemming and then lemmatization. Keep in mind we should go with one of them, and although you may need to try both, if you want to keep the semantics of the words you should go with lemmatization. Removing stop words is another common step, where we remove words which are not contributing much overall in terms of meaning. Then there is tokenization, if it is really needed, and spelling and grammar checks, which are optional depending on the type of application you're trying to build.

Before we go into text representation, let me show you some basics of text wrangling. For everyone, this is the link: bit.ly/NLP101_DS. If you go to this URL, it should redirect you to this GitHub repository, so feel free to bookmark it so that you can check out the code and the presentation.

Now, moving into my first notebook, try to understand the basic principles behind it. You don't need to understand each and every line of code; it's all available. The idea is to see what is my input, what is my output, and what is the process I'm following. I'm just installing a bunch of dependencies here. One of the first things is case conversion. It is usually recommended to transform all the text to lowercase, so if we have a sentence in a variable called text, we can just call .lower() on it and it will transform it to lowercase. Similarly, we have uppercase and title case, where title case capitalizes the first letter of each word.

Now, let's look at tokenization. Assume we have a big document here; as you can see, it's a pretty big document with multiple sentences. One thing could be that I want to analyze or look at each sentence separately. You can use the sentence tokenizer from NLTK to tokenize the document and get each of the sentences separately: if you run the sentence tokenizer on a document, you will get back a Python list with multiple sentences. If you want to tokenize into individual words, you can use the word_tokenize function and you will get a list of all the words across all the sentences.
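As a rough sketch of those case conversion and NLTK tokenization steps (the sample text here is my own, not the one from the workshop notebook):

```python
# Minimal sketch of case conversion and NLTK tokenization; sample text is made up.
import nltk

nltk.download('punkt')  # tokenizer models used by sent_tokenize / word_tokenize

text = "Python is great for NLP. Tokenization splits text into sentences and words!"

# Case conversion
print(text.lower())   # everything lowercase
print(text.upper())   # everything uppercase
print(text.title())   # first letter of every word capitalized

# Sentence and word tokenization
print(nltk.sent_tokenize(text))   # ['Python is great for NLP.', 'Tokenization splits ...']
print(nltk.word_tokenize(text))   # list of word tokens across all sentences
```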
Similarly, for folks interested in using spaCy, you can create an nlp object and run it on your text data. If you take the .text of every object in the document's sentences (I'm using the .sents attribute here), you will get the list of sentences pertaining to the document. And if you want all the words, just like with word_tokenize, you can drop the .sents attribute and iterate over the document directly, and you will end up getting all the words as tokens.

The next important thing is removing HTML tags and noise. Assuming, let's say, we are scraping data from the web, this is where it comes in handy, because often when we extract data from web pages, unnecessary HTML tags come into the picture: paragraph tags, emphasis, bold, line breaks, and so on. The idea is, how can I extract the useful text information from this? That is where we can use Beautiful Soup, where you specify that you want to extract only the text content and remove anything related to script or iframe tags. Once you run this function, as you can see, from this huge block of HTML tags and text you get only the relevant text content. So this is where we strip out the HTML tags. And this is a standard thing where I'm removing extra new lines and keeping just one new line per line.

The next thing is removing accented characters. I'll zoom in slightly in case someone is not able to see, but basically some text documents may contain accented letters. The idea is that when you do feature representation later, a word spelled in plain English letters and the same word with an accented 'e' may be treated differently. If you don't want that to happen, and you want to maintain some level of standardization, you can do an ASCII encoding here. After you normalize the text using the canonical format, you get it back in plain English letters and the accents are removed from the characters.

The other part is removing special characters, numbers, and symbols. One of the simplest ways is to just use a regular expression. I'm sure some of you are aware of this, but this is where we build a pattern saying that any character between small a and small z, or capital A and capital Z, or the digits zero to nine, or a space, is kept as it is, and everything else is removed. If we want to remove the digits as well, we say that we only keep the letters and, of course, the spaces. So as you can see, even if I have emojis or special characters and so on, I can call this function and remove everything. Of course, if I want to keep the numbers (in some cases numbers are important), I can run the appropriate regular expression, which is basically this one, so that all the numbers are also kept.
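Here is a rough sketch of the spaCy tokenization and these cleaning steps. The function names and sample strings are illustrative rather than the workshop notebook's exact code, and the spaCy en_core_web_sm model is assumed to be installed:

```python
# spaCy tokenization plus HTML, accent, and special-character cleanup.
# Assumes: pip install spacy beautifulsoup4 && python -m spacy download en_core_web_sm
import re
import unicodedata

import spacy
from bs4 import BeautifulSoup

nlp = spacy.load('en_core_web_sm')

doc = nlp("NLP is fun. Cleaning text is a big part of it.")
print([sent.text for sent in doc.sents])    # sentences via the .sents attribute
print([token.text for token in doc])        # word tokens straight from the doc

def strip_html_tags(text):
    """Keep only the visible text, dropping script/iframe blocks and extra blank lines."""
    soup = BeautifulSoup(text, 'html.parser')
    for tag in soup(['script', 'iframe']):
        tag.decompose()
    stripped = soup.get_text()
    return re.sub(r'\n\s*\n', '\n', stripped)

def remove_accented_chars(text):
    """Normalize accented letters (an accented 'e', for example) down to plain ASCII."""
    return (unicodedata.normalize('NFKD', text)
                       .encode('ascii', 'ignore')
                       .decode('utf-8', 'ignore'))

def remove_special_characters(text, remove_digits=False):
    """Drop everything that is not a letter, digit (optionally), or whitespace."""
    pattern = r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
    return re.sub(pattern, '', text)

print(strip_html_tags('<p>Some <b>important</b> text</p><script>alert(1)</script>'))
print(remove_accented_chars('Sómê Áccěntěd těxt'))
print(remove_special_characters('Hello!! 123 world :) #nlp', remove_digits=True))
```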
The next thing is contractions. Contractions are words with some of the syllables removed, so you have a contracted or shortened format of the word; as you can see, "I'd" is actually "I would", "wouldn't" is actually "would not", and so on. The idea is to standardize this so that the contraction is expanded into the words it actually represents. That is where the contractions package comes in, which in short is just a dictionary mapping every contraction to its expanded format. It is recommended that you fix your text by expanding all the contractions. As you can see, "wouldn't" gets expanded to "would not" based on simple string or token matching, and in this way you standardize your text by expanding the contractions away.

The next part is stemming. Stemming is concerned only with syntax; it doesn't care about the semantics or the meaning of the word. In this case we have used the popular Porter stemmer, and there are many other stemmers too, like the Snowball stemmer and the Lancaster stemmer, so feel free to check them out. The basic premise of stemming is that there are different forms of words due to inflections: jumping, jumps, and jumped are all inflections of the same root stem, which is jump. We want to get back to that root stem so that, even if there are different forms of the same word, we ultimately standardize them to just one word. But stemming is based on if-else rules; it checks whether the word ends in "ed" or ends in "e" and so on, and then either chops the ending off or makes some transformation. As you can see in this case, "lying" gets transformed to "lie", which is the base root stem, while "strange" gets chopped down to "strang" (s-t-r-a-n-g). So with stemming we get the root stem, but it may not be a dictionary word, a lexicographically correct word; it may not have a meaning. That is something to be careful about if you're trying to build models where you want to do interpretable machine learning, where you want to explain which key features led the model to predict X or Y, and so on.

That is where lemmatization comes into the picture, which takes care of the semantics as well. Lemmatization ultimately looks at a dictionary to make sure that the word we shorten to, known as the base form or the lemma, is available in the dictionary, which means it has a meaning. As you can see here, NLTK uses the WordNet dictionary to lemmatize every inflected form of a word into its lemma or base form. The part-of-speech tag (by parts of speech I mean things like nouns, verbs, conjunctions, prepositions, adjectives, and so on) is essential for lemmatization to produce the correct shortened form. As you can see here, "cars" and "boxes" are nouns, so if you pass in the right part of speech, they become "car" and "box", the shortened versions of these inflected forms. Similarly, "running" and "ate" are verbs, and as you can see, they have correctly been turned into "run" and "eat", standardizing the different versions of the words.
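A small, hedged illustration of contraction expansion, stemming, and lemmatization along those lines (assuming the contractions and nltk packages are installed; the word lists are my own examples):

```python
# Contraction expansion, stemming, and lemmatization in miniature.
import contractions
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # dictionary used by the lemmatizer

# Dictionary-based contraction expansion
print(contractions.fix("I'd say you wouldn't know"))  # I would say you would not know

# Stemming: rule-based suffix chopping; the output may not be a real word
ps = PorterStemmer()
print([ps.stem(w) for w in ['jumping', 'jumps', 'jumped', 'lying', 'strange']])
# typically -> ['jump', 'jump', 'jump', 'lie', 'strang']

# Lemmatization: WordNet lookup, which needs the part of speech ('n' noun, 'v' verb)
wnl = WordNetLemmatizer()
print(wnl.lemmatize('cars', 'n'), wnl.lemmatize('boxes', 'n'))    # car box
print(wnl.lemmatize('running', 'v'), wnl.lemmatize('ate', 'v'))   # run eat
```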
Hi, Anisha. Hello, I'm sorry to interrupt you. I was actually confused between lemmatization and stemming; can you give some examples of the cases in which we use stemming and in which we use lemmatization?

Yeah, I'll give an example shortly, but in general, what you need to remember is that with stemming we don't care about the base form, whether it has a meaning or not; it's all about syntax. In short, stemming is based on a bunch of if-else rules which go through the text based on the kind of inflection that is there, be it ending in "ed", ending in "s", things like that, and sometimes it will chop the ending off and sometimes it will do a transformation. Lemmatization, however, will always ensure that, as long as you pass in the right part of speech, you get back a base form which carries a meaning. As you can see here, "strang" (s-t-r-a-n-g) doesn't really have a meaning, but with lemmatization, as long as you use the right part of speech, you will end up with a base form which has a semantic meaning at the end of the day. And one thing also... yeah, I'll cover that, don't worry, I'll cover it.

Hey, Deepankar, one question as well. Yeah. We saw that with stemming, "strange" became "strang". What would the result of lemmatization be for, let's say, "stranger" in this case? Can we see the result?

It's not running here, and because we have only one hour there is so much to cover, but feel free to run this in Colab. If you run it with the right part-of-speech tag, it will become "strange", basically. That's the whole point: as you can see, "fancier" becomes "fancy", so similarly "stranger" would become "strange", because the base form, which has a semantic meaning, comes into the picture. So that is something to remember.

One more thing to remember here, which I will just cover quickly: if you have, let's say, a full sentence, you would first need to transform it into part-of-speech tags, where DT is a determiner, JJ is an adjective, these are nouns, and so on. Then you would need to map these into the WordNet POS tags, because the NLTK lemmatizer works on the WordNet nomenclature, so as you can see here, every part-of-speech tag needs to be shortened to the right mapping. Now let's see what happens if we do it without this mapping. If I just call the WordNet lemmatizer on a token without a tag, the tag is taken as a noun by default, and as you can see, "sleeping" and "jumping" are not getting lemmatized. The reason is that if you don't pass the right part-of-speech tag, it will assume the word is a noun and will not lemmatize it correctly. But if you extract the part of speech for each and every word and then lemmatize with that tag, you can see that "sleeping", "jumping", and the rest are all getting lemmatized. So just something to remember, and this is a function you can use in the future as well: whenever you're lemmatizing, the part-of-speech tags are essential; otherwise the lemmatization won't be effective.
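The sentence-level lemmatization with part-of-speech tags described here might look roughly like the sketch below; the helper name, the sample sentence, and the exact NLTK resource names are my own assumptions rather than the notebook's code:

```python
# Lemmatizing a full sentence with part-of-speech tags mapped to WordNet tags.
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Resource names can differ slightly across NLTK versions
for pkg in ['punkt', 'averaged_perceptron_tagger', 'wordnet']:
    nltk.download(pkg)

def to_wordnet_tag(treebank_tag):
    """Map Penn Treebank tags (JJ, VB, NN, RB, ...) to WordNet POS constants."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # the lemmatizer's own default

wnl = WordNetLemmatizer()
tokens = nltk.word_tokenize("The strange foxes were jumping and sleeping near the boxes")

# Without tags everything is treated as a noun: 'jumping' and 'sleeping' stay unchanged
print([wnl.lemmatize(tok) for tok in tokens])

# With mapped tags the verbs are handled correctly: 'jump', 'sleep', 'be'
tagged = nltk.pos_tag(tokens)
print([wnl.lemmatize(tok, to_wordnet_tag(tag)) for tok, tag in tagged])
```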
And to answer the question again as to where we use lemmatization versus stemming: in general, if you have a lot of text data and you don't care about the meanings of the words, only about standardization, as in a typical text classification task, then standardizing with stemming is fine, because we don't care about explaining the meanings of, say, the most important features. But if you do want to explain your classification models with the most important features, and you want those features to be meaningful, then you need to go for lemmatization. Remember, though, that lemmatization is a more expensive operation, because it needs to look words up in the dictionary, and of course it needs the part-of-speech tags.

Now, moving on to stop word removal. The idea is that you have a standard set of stop words, as you can see, which don't carry much meaning: pronouns, connecting words, conjunctions, things like that. You remove these as necessary by doing a lookup for each and every word and dropping the matches; that is all stop word removal is. You can also add new stop words or remove existing ones, because at the end of the day all it is doing is going through each token in my list of tokens in a sentence and matching it against the stop word list. So if you want to drop one of the standard English stop words from the list, or add in a new stop word of your own, you can do that.
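A minimal sketch of stop word removal with NLTK's standard English list, including adding and dropping entries; the sample sentence and the chosen words are just my own illustration:

```python
# Stop word removal with a customizable stop word list.
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))
stop_words.add('etc')        # treat an extra word as a stop word
stop_words.discard('not')    # keep 'not', e.g. for sentiment-related tasks

text = "The quick brown fox is not jumping over the lazy dog etc"
tokens = nltk.word_tokenize(text.lower())

# keep only tokens that are not in the stop word list
filtered = [tok for tok in tokens if tok not in stop_words]
print(filtered)   # ['quick', 'brown', 'fox', 'not', 'jumping', 'lazy', 'dog']
```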