Covariance and Correlation: Understanding with a Use Case
“Covariance is a measure of how two random variables are related”
It helps us identify three types of relationships:
1. Positive covariance
2. Negative covariance
3. Zero/No covariance
Let us analyze each of these relationships with an example.
Positive covariance
Positive covariance indicates that two random variables increase (or decrease) together.
Consider the example of distance covered on a treadmill vs. the number of calories burnt. As the distance we walk on a treadmill increases, the number of calories burnt also increases, and vice versa.
Below are the sample observations for the above scenario.
The plot of these observations follows a positive trend; hence they have positive covariance.
Negative Covariance
Negative covariance indicates that one random variable moves in the direction opposite to the other.
Consider the example of speed on a treadmill vs. time spent. As we increase the speed on the treadmill, the time taken to cover each kilometre decreases.
Below are the sample observations for the above example.
The plot of these observations follows a negative trend; hence they have negative covariance.
Zero Covariance
If two random variables are independent, the covariance will be zero; there will be no trend between them.
Consider the example of people in different geographical locations vs. the number of calories burnt on a treadmill for the same distance and speed.
The calories burnt have no relation to geographical location.
We shall move forward by analyzing the formula and the drawbacks of covariance.
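For reference, the covariance formula discussed below, in its population form (dividing by n, which matches the values computed later in this article), can be written as:

```latex
\operatorname{cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}\,(x_i - \bar{x})(y_i - \bar{y})
```

The sample version divides by n - 1 instead; either way, the sign and relative magnitude behave the same.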
The numerator is the sum of the products of each x's deviation from its mean and each y's deviation from its mean.
Let’s visualize this graphically by considering both positive and negative covariance.
For the positive covariance plot, we can see the data points lie in two regions.
1. In the first region, every xi is less than the mean of x and every yi is less than the mean of y. A negative value multiplied by a negative value gives a positive product.
2. In the second region, both xi and yi are greater than their means. A positive value multiplied by a positive value also gives a positive product.
Finally, the summation of these individual terms across the two regions results in a large positive covariance.
This same logic is applicable for negative covariance as well.
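This region argument can be checked numerically. Here is a small sketch using the treadmill observations that appear in the code later in the article:

```python
# Treadmill data: distance covered vs calories burnt
a = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
b = [20, 28, 37, 46, 57, 66, 75, 84, 93]

mean_a = sum(a) / len(a)
mean_b = sum(b) / len(b)

# Product of deviations for each observation
products = [(x - mean_a) * (y - mean_b) for x, y in zip(a, b)]

# Every product is non-negative: each point lies in the "both below the mean"
# region, the "both above the mean" region, or exactly on a mean line
print(all(p >= 0 for p in products))  # True for this positively trending data

# Their average is the (population) covariance
cov = sum(products) / len(products)
print(round(cov, 3))  # 30.889
```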
Everything looks fine so far, but here arises a problem:
Covariance is sensitive to scale.
Let us take the above positive-trend data of distance covered on a treadmill vs. the number of calories burnt and scale it by a factor of 2.
The relationship remains linear in both plots, but the covariance value changes drastically:
Covariance of the original data is 30.889
Covariance of the scaled data (by a factor of 2) is 123.556
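A quick sketch of this scale sensitivity using NumPy (np.cov with bias=True gives the population covariance, which is the convention the numbers above follow):

```python
import numpy as np

a = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
b = [20, 28, 37, 46, 57, 66, 75, 84, 93]

# Population covariance (divide by n) of the original data
cov_original = np.cov(a, b, bias=True)[0, 1]

# Scale both variables by a factor of 2: covariance scales by 2 * 2 = 4
a2 = [2 * x for x in a]
b2 = [2 * y for y in b]
cov_scaled = np.cov(a2, b2, bias=True)[0, 1]

print(round(cov_original, 3))  # 30.889
print(round(cov_scaled, 3))    # 123.556
```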
Hence, the magnitude of the covariance is hard to interpret: it is not normalized and depends entirely on the scale of the variables.
Covariance also gives no idea of how far the original observations are from an ideal linear relationship.
This is where the correlation coefficient comes in: a normalized version of covariance.
Correlation Coefficient
Commonly used correlation coefficients:
1. Pearson Correlation coefficient
2. Spearman rank correlation coefficient
The Pearson correlation coefficient measures the linear relationship between two random variables.
It takes a value between -1 and +1:
1. +1 indicates a complete positive linear correlation,
2. 0 indicates no linear correlation,
3. -1 indicates a complete negative linear correlation.
Below are a few plots illustrating correlation values ranging from -1 to +1.
The formula for the Pearson correlation coefficient is as follows:
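The standard form of that formula (reproduced here in case the original image does not render) is:

```latex
r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}
```

That is, the covariance divided by the product of the two standard deviations. Scaling either variable multiplies the numerator and the denominator by the same factor, which is why the result is scale invariant.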
Computing the Pearson correlation coefficient with the above formula for both the original and the scaled data gives the same value in both cases: 0.9997.
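A sketch verifying this with scipy.stats.pearsonr on the treadmill data:

```python
from scipy.stats import pearsonr

a = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
b = [20, 28, 37, 46, 57, 66, 75, 84, 93]

r_original, _ = pearsonr(a, b)
# Scaling both variables by 2 leaves the correlation unchanged
r_scaled, _ = pearsonr([2 * x for x in a], [2 * y for y in b])

print(round(r_original, 4))  # 0.9997
print(round(r_scaled, 4))    # 0.9997
```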
The major drawback of the Pearson correlation coefficient is that it measures only linear relationships; it can report a correlation close to zero even when a strong non-linear pattern exists.
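A sketch of that failure mode: a perfect but non-monotonic relationship (y = x²) over symmetric x values yields a Pearson correlation of zero, even though y is fully determined by x.

```python
from scipy.stats import pearsonr

x = [-4, -3, -2, -1, 0, 1, 2, 3, 4]
y = [v ** 2 for v in x]  # perfect quadratic relationship

r, _ = pearsonr(x, y)
print(r)  # effectively 0: Pearson sees no *linear* relationship
```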
Spearman rank correlation coefficient: it assesses how well the relationship between two random variables can be described by a monotonic function.
Like the Pearson correlation coefficient, the Spearman rank correlation coefficient ranges from -1 to +1.
The Spearman correlation is obtained by first converting the observations to ranks and then computing the Pearson correlation on those ranks.
For our above observations, the Spearman correlation coefficient is equal to 1.
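A sketch showing that the Spearman coefficient is just the Pearson coefficient computed on ranks, using the same treadmill data:

```python
from scipy.stats import spearmanr, pearsonr, rankdata

a = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
b = [20, 28, 37, 46, 57, 66, 75, 84, 93]

# Spearman = Pearson applied to the ranks of the observations
r_on_ranks, _ = pearsonr(rankdata(a), rankdata(b))
rho, _ = spearmanr(a, b)

print(rho)  # 1.0 -- calories burnt increase whenever distance increases
```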
Python code to calculate correlation
import pandas as pd
import seaborn as sns

# Sample observations: distance covered on treadmill vs calories burnt
a = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
b = [20, 28, 37, 46, 57, 66, 75, 84, 93]
df = pd.DataFrame(list(zip(a, b)), columns=["Distance covered on treadmill", "Calories Burnt"])

# Pairwise correlation matrix (the pandas default method is "pearson")
cor = df.corr()
Visualizing the correlation through a heatmap:
sns.heatmap(cor, annot=True)  # annot=True writes the coefficient in each cell
from scipy.stats import spearmanr

# spearmanr returns the correlation together with its p-value
corr_sp = spearmanr(df["Distance covered on treadmill"], df["Calories Burnt"])
For a better understanding, let's analyze the Spearman and Pearson correlation coefficients for the relation below.
Here the Pearson correlation coefficient measures only the linear part of the relationship, while Spearman measures the monotonic relationship.
Hence the Spearman rank correlation performs better than the Pearson correlation coefficient for monotonic but non-linear relationships.
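As a sketch of this difference, consider a monotonic but non-linear relation such as y = x³: Spearman still reports a perfect monotonic relationship, while Pearson reports a less-than-perfect linear one.

```python
from scipy.stats import pearsonr, spearmanr

x = list(range(1, 10))
y = [v ** 3 for v in x]  # monotonic but non-linear

r, _ = pearsonr(x, y)
rho, _ = spearmanr(x, y)

print(round(r, 2))  # about 0.93 -- the linearity is imperfect
print(rho)          # 1.0 -- the relationship is perfectly monotonic
```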