Learning Objectives

  • Scatter Plot
  • Outliers
  • Correlation in Data Science

Scatter Plot

Scatter plots are used to interpret trends in data. Below is an example of a scatter plot of ice cream sales (in dollars) against temperature. What trend does it show? Roughly, as temperature increases, ice cream sales increase.

image.png

Outlier

In statistics, an outlier is a data point that differs significantly from other observations.

image.png
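
The definition above does not say how far is "significantly" different, so any concrete rule is a convention. As a minimal sketch, the code below assumes the common 1.5 × IQR rule: a point is flagged if it lies more than 1.5 interquartile ranges below the first quartile or above the third. The function name `iqr_outliers`, the factor `k`, and the sales figures are all illustrative assumptions, not part of the lesson's dataset.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up daily ice cream sales in dollars; 950 is deliberately extreme.
sales = [210, 225, 240, 255, 260, 275, 290, 305, 950]
outliers = iqr_outliers(sales)
print(outliers)
```

Other rules (for example, a z-score cutoff) would flag a different set of points, so the choice should be stated whenever outliers are reported.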

Correlation

Correlation is a statistical measure.

It measures the strength of a linear relationship between two quantitative variables.
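
The lesson does not name a specific coefficient. A common choice for measuring a linear relationship is the Pearson correlation coefficient, which ranges from -1 to +1. The sketch below assumes that is the measure meant here and computes it from its definition. The function name `pearson_r` and the temperature and sales values are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only: temperature in degrees Celsius, sales in dollars.
temperature_c = [14, 16, 19, 22, 25, 28]
sales_usd = [215, 325, 410, 445, 520, 610]
r = pearson_r(temperature_c, sales_usd)
print(round(r, 3))
```

A value close to +1 indicates a strong increasing linear relationship, a value close to -1 a strong decreasing one, and a value near 0 little or no linear relationship.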

Now you may ask, what is a variable? If we go back to the scatter plot example, temperature and ice cream sales are variables. The term "variable" is often used interchangeably with "feature".

Target variable: in data science, the "target variable" is the variable whose values are to be modeled and predicted from the other variables in the dataset.

Importance of Correlation

Every successful data science project revolves around finding accurate correlations between the input and target variables. However, more often than not, we overlook how crucial correlation analysis is.

It is recommended to perform correlation analysis both before and after the data gathering and transformation phases of a data science project.

Positive Correlation

Two features (variables) can be positively correlated with each other. This means that when the value of one variable increases, the value of the other tends to increase as well, and when one decreases, the other tends to decrease.

image.png

Real-life examples:

  • The more time you spend running on a treadmill, the more calories you burn.
  • As the temperature goes up, ice cream sales also go up.
  • As the water level decreases in a fish tank, the fish's habitat volume decreases.

Negative Correlation

Two features (variables) can be negatively correlated with each other. This occurs when the value of one variable tends to decrease as the value of the other increases, so the two move in opposite directions.

image.png

Real-life examples:

  • As the temperature drops, heating costs rise.
  • The more vitamins one takes, the less likely one is to have a deficiency.
  • The more one works, the less free time one has.

Zero/No Correlation

Two features might not have any relationship with each other. This happens when a change in the value of one variable has no effect on the value of the other.

image.png

Real-life examples:

  • There is no relationship between the amount of tea drunk and the level of intelligence.
  • Whether it rained this morning has no bearing on whether the grocery store ran out of bananas.
  • The temperature on Mars and stock market prices have almost zero correlation, because stock prices do not depend on the temperature on Mars.

Notebook and Dataset

Dataset Description

The dataset contains data from 99 standard metropolitan areas in the US. It provides information on ten variables for each area from 1976 to 1977. The areas are divided into four geographic regions: 1 = North-East, 2 = North-Central, 3 = South, 4 = West. The variables provided are listed in the table below:

image.png
