Logistic Regression
- Logistic regression is one of the most basic and popular algorithms for solving binary classification problems
- For each input, logistic regression outputs the probability that the input belongs to one of the two classes
- A probability threshold then determines which class the input is assigned to
- Binary classification problems (2 classes):
- Emails (Spam / Not Spam)
- Credit Card Transactions (Fraudulent / Not Fraudulent)
- Loan Default (Yes / No)
Now, you may ask, why don't we use Linear Regression? Why do we need a new algorithm?
Well, you will find all the answers in the video below. It is a must-watch: the instructor explains logistic regression brilliantly!
Linear vs. Logistic Regression
- Linear regression is used to solve regression problems with continuous values
- Logistic regression is used to solve classification problems with discrete categories
- Binary classification (Classes 0 and 1)
- Examples:
- Emails (Spam / Not Spam)
- Credit Card Transactions (Fraudulent / Not Fraudulent)
- Loan Default (Yes / No)
- Let's say a data scientist named John wants to predict whether a customer will buy insurance or not
- Remember that linear regression predicts a continuous value, so its output (y) can range anywhere from -∞ (negative infinity) to +∞ (positive infinity). In contrast, the target variable (y) here takes only two discrete values: 0 (did not buy insurance) and 1 (bought insurance).
- John decides to extend the concepts of linear regression to fulfill his requirement. One approach is to take the linear regression output and map it between 0 and 1. If the mapped output is below a certain threshold (say 0.5), classify the customer as No (did not buy insurance); if it is above the threshold, classify the customer as Yes (bought insurance)
- We then plot a simple linear regression line and set the threshold as 0.5
- Negative class (Insurance = No) – ages on the left side
- Positive class (Insurance = Yes) – ages on the right side
Imagine there is an outlier towards the right
- As we can see, a single outlier in the data distorts the whole linear regression line
- With this shifted fit, the straight line can no longer separate the two classes correctly
- The decision boundary should instead sit at the vertical yellow line in the plot, which divides the positive and negative classes, i.e., yes or no for insurance
Well, life would be much simpler if we had an algorithm that fit the points with an S-shaped curve instead, right? Such a curve is a much better fit than the straight regression line!
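The outlier effect described above can be sketched numerically. This is a minimal NumPy sketch with made-up ages and labels (all numbers, and the helper `decision_age`, are hypothetical and only for illustration): fitting a straight line and reading off where it crosses 0.5 shows how one extreme point shifts the implied boundary.

```python
import numpy as np

# Hypothetical data: younger customers did not buy insurance (0), older ones did (1)
ages = np.array([22, 25, 30, 35, 47, 52, 56, 60], dtype=float)
bought = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

def decision_age(x, y):
    """Fit y = m*x + b by least squares and return the age where the line crosses 0.5."""
    m, b = np.polyfit(x, y, 1)
    return (0.5 - b) / m

boundary = decision_age(ages, bought)  # falls between the two age groups

# Add a single far-right outlier: a 90-year-old who bought insurance
ages_out = np.append(ages, 90.0)
bought_out = np.append(bought, 1.0)
boundary_out = decision_age(ages_out, bought_out)

# The outlier flattens the line and drags the 0.5 crossing to the right,
# even though the separation between the classes has not changed
print(boundary, boundary_out)
```

Note the outlier is not even a mislabeled point; it is a perfectly valid positive example, yet it still moves the linear fit's threshold crossing.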
Solution
- Solution – Transform linear regression to a logistic regression curve
- Logistic regression applies the sigmoid function to the output of a linear model
- Now, what does this sigmoid function do?
- Sigmoid function takes in any real value and gives an output probability between 0 and 1
What are we doing in Logistic Regression?
- We take the real-valued output of a linear regression model, map it to the range 0 to 1, and classify a new example based on a threshold value. The function that performs this mapping is the sigmoid function
- The sigmoid function is represented by the formula: σ(z) = 1 / (1 + e^(-z))
- There's no need to go into the depth of how this formula is derived right now.
Sigmoid Function (Logistic Function/ Logit)
- Take the linear regression function and put it into the Sigmoid function
- Sigmoid function outputs a probability between 0 and 1 (y-axis)
- The probability threshold is typically set at 0.5 by default
- Class 0 – Below 0.5
- Class 1 – Above 0.5
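The steps above (linear output into the sigmoid, then a threshold) can be sketched in plain Python. The weights `w` and `b` below are made-up numbers for illustration, not fitted values:

```python
import math

def sigmoid(z):
    """Squash any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b, threshold=0.5):
    """Logistic regression prediction: sigmoid of the linear output w*x + b."""
    p = sigmoid(w * x + b)  # probability of class 1
    return 1 if p >= threshold else 0

# Hypothetical weights, for illustration only
w, b = 0.8, -2.0

print(sigmoid(0))           # 0.5, the midpoint of the S-curve
print(predict(1.0, w, b))   # linear output -1.2, sigmoid ≈ 0.23, so class 0
print(predict(4.0, w, b))   # linear output  1.2, sigmoid ≈ 0.77, so class 1
```

Because the sigmoid is monotonic, thresholding the probability at 0.5 is the same as checking whether the linear output w*x + b is above or below zero.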
Types of Logistic Regression
The logistic regression model can be classified into three groups based on the target variable categories:
- Binary Logistic Regression
- The target variable has two possible categories.
- Common examples: 0 or 1, yes or no, true or false, spam or not spam, pass or fail, transactions (fraudulent / not fraudulent), medical condition (diseased / not diseased)
- Multinomial Logistic Regression (Multi-Class)
- The target variable has three or more categories that are not in any particular order, i.e., three or more nominal categories.
- Examples: Fruits (apple, mango, orange, and banana), profession (e.g., with five groups: surgeon, doctor, nurse, dentist, therapist)
- Ordinal Logistic Regression
- The target variable has three or more ordinal categories, i.e., the categories have an intrinsic order.
- Example: student performance categorized as poor, average, good, or excellent.
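As a sketch of the multinomial case, scikit-learn's `LogisticRegression` (assuming scikit-learn is installed) handles three or more classes directly. The iris dataset is a convenient example because its three species are unordered, nominal categories:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Iris: three unordered species labels, so a multinomial problem
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# LogisticRegression fits a multinomial model for 3+ classes out of the box
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print(clf.predict(X_test[:5]))        # predicted species labels (0, 1, or 2)
print(clf.predict_proba(X_test[:1]))  # one probability per class; each row sums to 1
print(clf.score(X_test, y_test))      # classification accuracy on the held-out split
```

For binary problems the same class works unchanged with two labels; ordinal targets, by contrast, are not covered by this API and need a dedicated ordinal model.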
Notebooks for practice
- https://aiplanet.com/notebooks/899/manish_kc_06/basic_logistic_model
- https://aiplanet.com/notebooks/861/manish_kc_06/logistic-regression-advertisement
- https://aiplanet.com/notebooks/862/manish_kc_06/logistic-regression-heart-disease
- https://aiplanet.com/notebooks/891/manish_kc_06/logistic_regression_insurance
Slide Download Link
You can download the slides for this topic from here.