

Correlation:

Correlation is a statistical measure of how two variables move in relation to each other. In data analysis, it is a way to understand the relationships between the values or features in your dataset.

Many successful data science projects depend on finding reliable correlations between the input and target variables. However, more often than not, we overlook how crucial correlation analysis is.

It is recommended to perform correlation analysis both before and after the data gathering and transformation phases of a data science project.

There are three different types of correlation:

  1. Positive Correlation: Two features (variables) can be positively correlated with each other. This means that when the value of one variable increases, the value of the other also increases (and when one decreases, the other decreases too).
    Eg. The more time you spend running on a treadmill, the more calories you will burn.
  2. Negative Correlation: Two features (variables) can be negatively correlated with each other. This occurs when the value of one variable increases while the value of the other decreases (they move in opposite directions).
    Eg. As the weather gets colder, heating costs increase.
  3. No Correlation: Two features might not have any relationship with each other. This happens when a change in the value of one variable has no impact on the value of the other.
    Eg. There is no relationship between the amount of tea drunk and level of intelligence.
  • Correlation strength is measured on a scale from -1 to +1. Values such as 0.5 or 0.7 indicate a moderate to high positive correlation.
  • A correlation score of around 0.9 indicates a very strong positive correlation, and a score of 1 indicates a perfect one.
  • A strong negative correlation is represented by a value close to -0.9, and a perfect negative correlation by -1. Values close to zero indicate little or no linear correlation.
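To make the three cases concrete, here is a minimal sketch that computes a correlation coefficient for each one. It assumes pandas is available and uses small made-up values (the names `minutes`, `calories`, `remaining` and `unrelated` are hypothetical, chosen only for illustration).

```python
import pandas as pd

# Hypothetical toy values, chosen so each case is easy to check by hand.
minutes = pd.Series([10, 20, 30, 40, 50])   # time spent on a treadmill
calories = minutes * 8 + 5                  # rises as minutes rise
remaining = 60 - minutes                    # falls as minutes rise
unrelated = pd.Series([2, 1, 3, 1, 2])      # no linear trend with minutes

# Series.corr() uses the Pearson coefficient by default.
r_pos = minutes.corr(calories)    # perfect positive relationship
r_neg = minutes.corr(remaining)   # perfect negative relationship
r_none = minutes.corr(unrelated)  # essentially zero

print(f"positive: {r_pos:.2f}")
print(f"negative: {r_neg:.2f}")
print(f"none:     {abs(r_none):.2f}")
```

Real data is rarely this clean, so in practice the coefficients fall somewhere between these extremes.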

We can check how each feature is related to the others using the corr() function, which produces a correlation matrix.
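As a rough sketch, and assuming the data is held in a pandas DataFrame, the call might look like the following. The column names and values here are invented for illustration and are not taken from the wine dataset used in this lesson.

```python
import pandas as pd

# Hypothetical toy data: values are made up for illustration only.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    "density": [0.9980, 0.9972, 0.9968, 0.9959, 0.9951, 0.9940],
    "quality": [5, 5, 6, 6, 7, 7],
})

# DataFrame.corr() returns a square matrix of pairwise coefficients.
# The default method is Pearson; "spearman" and "kendall" are also accepted.
corr_matrix = df.corr()
print(corr_matrix.round(2))
```

Each cell holds the coefficient for one pair of columns, so the matrix is symmetric and its diagonal is always 1.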

 

 

Visualising the correlation matrix above as a heatmap makes it easier to interpret. We can do that using Seaborn's heatmap() function.
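A minimal sketch of such a heatmap is shown here, assuming seaborn, matplotlib and pandas are installed. The DataFrame is again hypothetical toy data, and the colour map and figure size are arbitrary choices rather than the lesson's own settings.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data: values are made up for illustration only.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    "density": [0.9980, 0.9972, 0.9968, 0.9959, 0.9951, 0.9940],
    "quality": [5, 5, 6, 6, 7, 7],
})
corr_matrix = df.corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(
    corr_matrix,
    annot=True,       # write the coefficient inside each cell
    fmt=".2f",        # two decimal places
    cmap="coolwarm",  # diverging palette: one colour per sign
    vmin=-1,          # pin the colour scale to the full
    vmax=1,           # correlation range so 0 sits in the middle
    square=True,
    ax=ax,
)
ax.set_title("Correlation matrix (toy data)")
fig.tight_layout()
plt.show()
```

Fixing `vmin` and `vmax` to -1 and +1 keeps the colours comparable across different datasets, since the scale no longer depends on the values that happen to be present.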

Observations:

  • Alcohol has the highest positive correlation with wine quality, followed by other variables such as acidity, sulphates, density and chlorides.
  • There is a relatively high positive correlation between fixed_acidity and citric_acid, and between fixed_acidity and density.
  • There is a relatively high negative correlation between fixed_acidity and pH.
  • Density has a strong positive correlation with fixed_acidity, whereas it has a strong negative correlation with alcohol.
  • Citric acid and volatile acidity are negatively correlated.
  • Free sulphur dioxide and total sulphur dioxide are positively correlated.
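Observations like these can also be read off programmatically instead of by eye. The sketch below ranks features by their correlation with a target column. It assumes a pandas DataFrame with a target named `quality`, and the data is hypothetical toy data, so the resulting ranking says nothing about the real wine dataset.

```python
import pandas as pd

# Hypothetical toy data: values are made up for illustration only.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    "density": [0.9980, 0.9972, 0.9968, 0.9959, 0.9951, 0.9940],
    "chlorides": [0.080, 0.070, 0.090, 0.070, 0.080, 0.075],
    "quality": [5, 5, 6, 6, 7, 7],
})

target = "quality"  # assumed name of the target column

# Take the target's column of the matrix, drop its self-correlation,
# and sort from most positive to most negative.
ranked = df.corr()[target].drop(target).sort_values(ascending=False)
print(ranked.round(2))

# Sorting by absolute value shows strength regardless of sign.
by_strength = ranked.abs().sort_values(ascending=False)
print(by_strength.round(2))
```

The first listing separates positive from negative relationships, while the second is handy when only the strength of the relationship matters.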