Covariance and Correlation: Understanding with a Use Case
“Covariance is a measure of how two random variables are related”
It helps us identify three types of relationships:
1. Positive covariance
2. Negative covariance
3. Zero/No covariance
Let us analyze each of these relationships with an example.
Positive covariance
Positive covariance indicates that two random variables increase (or decrease) together.
Consider the example of distance covered on a treadmill vs. the number of calories burnt. As the distance we walk on a treadmill increases, the number of calories burnt also increases, and vice versa.
Below are the sample observations for the above scenario.
The plot of these observations follows a positive trend; hence they have positive covariance.
Negative Covariance
Negative covariance indicates that one random variable moves in the direction opposite to the other.
Consider the example of speed on a treadmill vs. time spent. As we increase the speed on the treadmill, the time taken to cover each kilometre decreases.
Below are the sample observations for the above example.
The plot of these observations follows a negative trend; hence they have negative covariance.
Zero Covariance
If two random variables are independent, the covariance will be zero; there will be no trend between them.
Consider the example of people in different geographical locations vs. the number of calories burnt on a treadmill for the same distance and speed.
The calories burnt have no relation to geographical location.
We shall move forward by analyzing the formula and the drawbacks of covariance.
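For reference, the covariance formula discussed below, in its population form (dividing by n, which matches the values computed later in this article), can be written as:

```latex
\operatorname{cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}\,(x_i - \bar{x})(y_i - \bar{y})
```

The sample version divides by n - 1 instead; either way, the sign and relative magnitude behave the same.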
The numerator is the sum of the products of each x's deviation from its mean and each y's deviation from its mean.
Let’s visualize this graphically by considering both positive and negative covariance.
For the positive covariance plot, we can see the data points lie in two regions.
1. In the first region, every xi is less than the mean of x and every yi is less than the mean of y. A negative value multiplied by a negative value gives a positive product.
2. In the second region, both xi and yi are greater than their means. A positive value multiplied by a positive value also gives a positive product.
Finally, the summation of these individual terms across the two regions results in a large positive covariance.
This same logic is applicable for negative covariance as well.
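This region argument can be checked numerically. Here is a small sketch using the treadmill observations that appear in the code later in the article:

```python
# Treadmill data: distance covered vs calories burnt
a = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
b = [20, 28, 37, 46, 57, 66, 75, 84, 93]

mean_a = sum(a) / len(a)
mean_b = sum(b) / len(b)

# Product of deviations for each observation
products = [(x - mean_a) * (y - mean_b) for x, y in zip(a, b)]

# Every product is non-negative: each point lies in the "both below the mean"
# region, the "both above the mean" region, or exactly on a mean line
print(all(p >= 0 for p in products))  # True for this positively trending data

# Their average is the (population) covariance
cov = sum(products) / len(products)
print(round(cov, 3))  # 30.889
```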
Everything looks fine so far, but here arises a problem:
Covariance is sensitive to scale.
Let us take the above positive-trend data of distance covered on a treadmill vs. the number of calories burnt and scale it by a factor of 2.
The relationship remains linear in both plots, but the covariance value changes drastically:
Covariance of the original data is 30.889
Covariance of the scaled data (by a factor of 2) is 123.556
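A quick sketch of this scale sensitivity using NumPy (np.cov with bias=True gives the population covariance, which is the convention the numbers above follow):

```python
import numpy as np

a = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
b = [20, 28, 37, 46, 57, 66, 75, 84, 93]

# Population covariance (divide by n) of the original data
cov_original = np.cov(a, b, bias=True)[0, 1]

# Scale both variables by a factor of 2: covariance scales by 2 * 2 = 4
a2 = [2 * x for x in a]
b2 = [2 * y for y in b]
cov_scaled = np.cov(a2, b2, bias=True)[0, 1]

print(round(cov_original, 3))  # 30.889
print(round(cov_scaled, 3))    # 123.556
```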
Hence, the magnitude of the covariance is hard to interpret: it is not normalized and depends entirely on the scale of the variables.
Covariance also gives no idea of how far the original observations are from an ideal linear relationship.
This is where the correlation coefficient comes in: a normalized version of covariance.
Correlation Coefficient
Commonly used correlation coefficients:
1. Pearson Correlation coefficient
2. Spearman rank correlation coefficient
The Pearson correlation coefficient measures the linear relationship between two random variables.
It takes a value between -1 and +1:
1. +1 indicates a complete positive linear correlation,
2. 0 indicates no linear correlation,
3. -1 indicates a complete negative linear correlation.
Below are a few plots illustrating correlation values ranging from -1 to +1.
The formula for the Pearson correlation coefficient is as follows:
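The standard form of that formula (reproduced here in case the original image does not render) is:

```latex
r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}
```

That is, the covariance divided by the product of the two standard deviations. Scaling either variable multiplies the numerator and the denominator by the same factor, which is why the result is scale invariant.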
Computing the Pearson correlation coefficient with the above formula for both the original and the scaled data gives the same value in both cases: 0.9997.
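A sketch verifying this with scipy.stats.pearsonr on the treadmill data:

```python
from scipy.stats import pearsonr

a = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
b = [20, 28, 37, 46, 57, 66, 75, 84, 93]

r_original, _ = pearsonr(a, b)
# Scaling both variables by 2 leaves the correlation unchanged
r_scaled, _ = pearsonr([2 * x for x in a], [2 * y for y in b])

print(round(r_original, 4))  # 0.9997
print(round(r_scaled, 4))    # 0.9997
```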
The major drawback of the Pearson correlation coefficient is that it measures only linear relationships; it can report a correlation close to zero even when a strong non-linear pattern exists.
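A sketch of that failure mode: a perfect but non-monotonic relationship (y = x²) over symmetric x values yields a Pearson correlation of zero, even though y is fully determined by x.

```python
from scipy.stats import pearsonr

x = [-4, -3, -2, -1, 0, 1, 2, 3, 4]
y = [v ** 2 for v in x]  # perfect quadratic relationship

r, _ = pearsonr(x, y)
print(r)  # effectively 0: Pearson sees no *linear* relationship
```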
Spearman rank correlation coefficient: it assesses how well the relationship between two random variables can be described by a monotonic function.
Like the Pearson correlation coefficient, the Spearman rank correlation coefficient ranges from -1 to +1.
The Spearman correlation is obtained by first converting the observations to ranks and then computing the Pearson correlation on those ranks.
For our above observations, the Spearman correlation coefficient is equal to 1.
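A sketch showing that the Spearman coefficient is just the Pearson coefficient computed on ranks, using the same treadmill data:

```python
from scipy.stats import spearmanr, pearsonr, rankdata

a = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
b = [20, 28, 37, 46, 57, 66, 75, 84, 93]

# Spearman = Pearson applied to the ranks of the observations
r_on_ranks, _ = pearsonr(rankdata(a), rankdata(b))
rho, _ = spearmanr(a, b)

print(rho)  # 1.0 -- calories burnt increase whenever distance increases
```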
Python code to calculate correlation
import pandas as pd
import seaborn as sns

# Sample observations: distance covered on treadmill vs calories burnt
a = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
b = [20, 28, 37, 46, 57, 66, 75, 84, 93]
df = pd.DataFrame(list(zip(a, b)), columns=["Distance covered on treadmill", "Calories Burnt"])

# Pairwise correlation matrix (the pandas default method is "pearson")
cor = df.corr()
Visualizing the correlation through a heatmap:
sns.heatmap(cor, annot=True)  # annot=True writes the coefficient in each cell
from scipy.stats import spearmanr

# spearmanr returns the correlation together with its p-value
corr_sp = spearmanr(df["Distance covered on treadmill"], df["Calories Burnt"])
For a better understanding, let's analyze the Spearman and Pearson correlation coefficients for the relation below.
Here the Pearson correlation coefficient measures only the linear part of the relationship, while Spearman measures the monotonic relationship.
Hence the Spearman rank correlation performs better than the Pearson correlation coefficient for monotonic but non-linear relationships.
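As a sketch of this difference, consider a monotonic but non-linear relation such as y = x³: Spearman still reports a perfect monotonic relationship, while Pearson reports a less-than-perfect linear one.

```python
from scipy.stats import pearsonr, spearmanr

x = list(range(1, 10))
y = [v ** 3 for v in x]  # monotonic but non-linear

r, _ = pearsonr(x, y)
rho, _ = spearmanr(x, y)

print(round(r, 2))  # about 0.93 -- the linearity is imperfect
print(rho)          # 1.0 -- the relationship is perfectly monotonic
```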