Calculating Pearson correlation and significance in Python

Question

I am looking for a function that takes as input two lists, and returns the Pearson correlation, and the significance of the correlation.

score 216 · Answer 1 · edited Jun 26 '23 at 12:49

You can have a look at scipy.stats:

from pydoc import help
from scipy.stats.stats import pearsonr
help(pearsonr)

Output:

>>>
Help on function pearsonr in module scipy.stats.stats:

pearsonr(x, y)
 Calculates a Pearson correlation coefficient and the p-value for testing
 non-correlation.

 The Pearson correlation coefficient measures the linear relationship
 between two datasets. Strictly speaking, Pearson's correlation requires
 that each dataset be normally distributed. Like other correlation
 coefficients, this one varies between -1 and +1 with 0 implying no
 correlation. Correlations of -1 or +1 imply an exact linear
 relationship. Positive correlations imply that as x increases, so does
 y. Negative correlations imply that as x increases, y decreases.

 The p-value roughly indicates the probability of an uncorrelated system
 producing datasets that have a Pearson correlation at least as extreme
 as the one computed from these datasets. The p-values are not entirely
 reliable but are probably reasonable for datasets larger than 500 or so.

 Parameters
 ----------
 x : 1D array
 y : 1D array the same length as x

 Returns
 -------
 (Pearson's correlation coefficient,
  2-tailed p-value)

 References
 ----------
 http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation

@user702846 Pearson correlation is defined on a 2xN matrix. There is no generally applicable method that converts two dictionaries into a 2xN matrix, but you might use the array of pairs of dictionary values corresponding to the keys of the intersection of the keys of your dictionaries. — winerd, Aug 18 '15 at 09:23

score 121 · Answer 2 · edited Jun 26 '23 at 12:39

121

The Pearson correlation can be calculated with NumPy's corrcoef.

import numpy
numpy.corrcoef(list1, list2)[0, 1]

edited Jun 26 '23 at 12:39

Peter Mortensen

30,738
21
105
131

answered Apr 16 '13 at 00:17

winerd

1,813
1
14
12

1

the output is confusing, but actually very simple. check this explanation https://stackoverflow.com/a/3425548/1245622 – linqu Dec 20 '17 at 09:21
7

This does not produce the requested significance of the correlation right? – Olivier Aug 06 '20 at 07:30

score 61 · Answer 3 · edited Jun 26 '23 at 12:53

61

An alternative can be a native SciPy function from linregress which calculates:

slope : slope of the regression line

intercept : intercept of the regression line

r-value : correlation coefficient

p-value : two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero

stderr : Standard error of the estimate

And here is an example:

a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
from scipy.stats import linregress
linregress(a, b)

will return you:

LinregressResult(slope=0.20833333333333337, intercept=13.375, rvalue=0.14499815458068521, pvalue=0.68940144811669501, stderr=0.50261704627083648)

edited Jun 26 '23 at 12:53

Peter Mortensen

30,738
21
105
131

answered Nov 21 '15 at 13:10

Salvador Dali

214,103
147
703
753

2

Great answer - by far the most informative. Also works with a two-row pandas.DataFrame: `lineregress(two_row_df)` – dmeu Mar 09 '16 at 10:09
Brilliant answer. Very intuitive too, if you think of it – pseudomonas Nov 07 '19 at 19:40
Watch out: `stderr` is the standard error for the slope, not the Pearson correlation coefficient. – MRule Feb 26 '23 at 19:06

score 39 · Answer 4 · edited Jun 26 '23 at 12:37

39

If you don't feel like installing SciPy, I've used this quick hack, slightly modified from Programming Collective Intelligence:

def pearsonr(x, y):
  # Assume len(x) == len(y)
  n = len(x)
  sum_x = float(sum(x))
  sum_y = float(sum(y))
  sum_x_sq = sum(xi*xi for xi in x)
  sum_y_sq = sum(yi*yi for yi in y)
  psum = sum(xi*yi for xi, yi in zip(x, y))
  num = psum - (sum_x * sum_y/n)
  den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
  if den == 0: return 0
  return num / den

edited Jun 26 '23 at 12:37

Peter Mortensen

30,738
21
105
131

answered Apr 19 '11 at 08:52

Jeff Hammerbacher

4,226
2
29
36

2

I was surprised to discover this disagrees with Excel, NumPy, and R. See http://stackoverflow.com/questions/3949226/calculating-pearson-correlation-and-significance-in-python/7939259#7939259. – dfrankow Oct 29 '11 at 13:48
2

As another commenter pointed out, this has a float/int bug. I think sum_y/n is integer division for ints. If you use sum_x = float(sum(x)) and sum_y = float(sum(y)), it works. – dfrankow Oct 30 '11 at 20:05
@dfrankow I think it's because imap cannot handle float. python gives an `TypeError: unsupported operand type(s) for -: 'itertools.imap' and 'float'` at `num = psum - (sum_x * sum_y/n)` – alvas Jan 24 '13 at 14:17
4

As a style note Python frowns on this unnecessary usage of map (in favor of list comprehensions) – Maxim Khesin Nov 22 '13 at 20:58
16

Just as a comment, consider that libraries as scipy et al are developed by people knowing a lot of numerical analysis. This may avoid you a lot of common pitfalls (for instance, having very large and very little numbers in X or Y may result in catastrofic cancellation) – finiteautomata Sep 30 '14 at 22:28

score 31 · Answer 5 · edited Mar 09 '14 at 03:04

31

The following code is a straight-up interpretation of the definition:

import math

def average(x):
    assert len(x) > 0
    return float(sum(x)) / len(x)

def pearson_def(x, y):
    assert len(x) == len(y)
    n = len(x)
    assert n > 0
    avg_x = average(x)
    avg_y = average(y)
    diffprod = 0
    xdiff2 = 0
    ydiff2 = 0
    for idx in range(n):
        xdiff = x[idx] - avg_x
        ydiff = y[idx] - avg_y
        diffprod += xdiff * ydiff
        xdiff2 += xdiff * xdiff
        ydiff2 += ydiff * ydiff

    return diffprod / math.sqrt(xdiff2 * ydiff2)

Test:

print pearson_def([1,2,3], [1,5,7])

returns

0.981980506062

This agrees with Excel, this calculator, SciPy (also NumPy), which return 0.981980506 and 0.9819805060619657, and 0.98198050606196574, respectively.

R:

> cor( c(1,2,3), c(1,5,7))
[1] 0.9819805

EDIT: Fixed a bug pointed out by a commenter.

edited Mar 09 '14 at 03:04

drevicko

14,382
15
75
97

answered Oct 29 '11 at 13:42

dfrankow

20,191
41
152
214

4

Beware of the type of the variables! You have encountered an int/float problem. In `sum(x) / len(x)` you divide ints, not floats. So `sum([1,5,7]) / len([1,5,7]) = 13 / 3 = 4`, according to integer division (whereas you want `13. / 3. = 4.33...`). To fix it rewrite this line as `float(sum(x)) / float(len(x))` (one float suffices, as Python converts it automatically). – Piotr Migdal Oct 30 '11 at 19:46
Your code won't work for cases like: [10,10,10],[0,0,0] or [10,10],[10,0]. or even [10,10],[10,10] – madCode May 18 '12 at 16:56
4

The correlation coefficient is not defined for any of those cases. Putting them into R returns "NA" for all three. – dfrankow May 18 '12 at 17:55

score 28 · Answer 6 · answered Jan 29 '17 at 11:28

28

You can do this with pandas.DataFrame.corr, too:

import pandas as pd
a = [[1, 2, 3],
     [5, 6, 9],
     [5, 6, 11],
     [5, 6, 13],
     [5, 3, 13]]
df = pd.DataFrame(data=a)
df.corr()

This gives

          0         1         2
0  1.000000  0.745601  0.916579
1  0.745601  1.000000  0.544248
2  0.916579  0.544248  1.000000

answered Jan 29 '17 at 11:28

Martin Thoma

124,992
159
614
958

11

This is just correlation without significance – Ivelin Apr 28 '17 at 20:17

score 11 · Answer 7 · edited Jun 26 '23 at 12:41

Rather than rely on NumPy or SciPy, I think my answer should be the easiest to code and understand the steps in calculating the Pearson correlation coefficient (PCC).

import math

# Calculates the mean
def mean(x):
    sum = 0.0
    for i in x:
         sum += i
    return sum / len(x)

# Calculates the sample standard deviation
def sampleStandardDeviation(x):
    sumv = 0.0
    for i in x:
         sumv += (i - mean(x))**2
    return math.sqrt(sumv/(len(x)-1))

# Calculates the PCC using both the two functions above
def pearson(x,y):
    scorex = []
    scorey = []

    for i in x:
        scorex.append((i - mean(x))/sampleStandardDeviation(x))

    for j in y:
        scorey.append((j - mean(y))/sampleStandardDeviation(y))

# Multiplies both lists together into 1 list (hence zip) and sums the whole list
    return (sum([i*j for i,j in zip(scorex,scorey)]))/(len(x)-1)

The significance of PCC is basically to show you how strongly correlated the two variables/lists are.

It is important to note that the PCC value ranges from -1 to 1. A value between 0 to 1 denotes a positive correlation. Value of 0 = highest variation (no correlation whatsoever). A value between -1 to 0 denotes a negative correlation.

It has amazing complexity and slow performance on 2 lists with 500+ values. — Nikolay Fominyh, Feb 07 '16 at 13:58

score 11 · Answer 8 · edited Jun 26 '23 at 12:54

Pearson coefficient calculation using Pandas in Python:

I would suggest trying this approach since your data contains lists. It will be easy to interact with your data and manipulate it from the console since you can visualise your data structure and update it as you wish. You can also export the data set and save it and add new data out of the Python console for later analysis. This code is simpler and contains less lines of code. I am assuming you need a few quick lines of code to screen your data for further analysis

Example:

data = {'list 1':[2,4,6,8],'list 2':[4,16,36,64]}

import pandas as pd #To Convert your lists to pandas data frames convert your lists into pandas dataframes

df = pd.DataFrame(data, columns = ['list 1','list 2'])

from scipy import stats # For in-built method to get PCC

pearson_coef, p_value = stats.pearsonr(df["list 1"], df["list 2"]) #define the columns to perform calculations on
print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", p_value) # Results

However, you did not post your data for me to see the size of the data set or the transformations that might be needed before the analysis.

Hello, welcome to StackOverflow! Try adding a short description of why you chose this code and how it applies in this case at the start of your answer! — Tristo, Sep 09 '18 at 15:19

score 7 · Answer 9 · edited Jun 26 '23 at 12:44

7

Hmm, many of these responses have long and hard-to-read code...

I'd suggest using NumPy with its nifty features when working with arrays:

import numpy as np

def pcc(X, Y):
   ''' Compute Pearson Correlation Coefficient. '''
   # Normalise X and Y
   X -= X.mean(0)
   Y -= Y.mean(0)
   # Standardise X and Y
   X /= X.std(0)
   Y /= Y.std(0)
   # Compute mean product
   return np.mean(X*Y)

# Using it on a random example
from random import random
X = np.array([random() for x in xrange(100)])
Y = np.array([random() for x in xrange(100)])
pcc(X, Y)

edited Jun 26 '23 at 12:44

Peter Mortensen

30,738
21
105
131

answered Oct 31 '13 at 15:28

Martin F Thomsen

109
1
3

Although I like very much this answer, I would advise to copy/clone both X and Y inside the function. Otherwise both are altered, which might not be a desired behaviour. – antonimmo Jun 27 '17 at 17:56

score 6 · Answer 10 · edited Jun 26 '23 at 12:50

Here's a variant on mkh's answer that runs much faster than it, and scipy.stats.pearsonr, using Numba.

import numba

@numba.jit
def corr(data1, data2):
    M = data1.size

    sum1 = 0.
    sum2 = 0.
    for i in range(M):
        sum1 += data1[i]
        sum2 += data2[i]
    mean1 = sum1 / M
    mean2 = sum2 / M

    var_sum1 = 0.
    var_sum2 = 0.
    cross_sum = 0.
    for i in range(M):
        var_sum1 += (data1[i] - mean1) ** 2
        var_sum2 += (data2[i] - mean2) ** 2
        cross_sum += (data1[i] * data2[i])

    std1 = (var_sum1 / M) ** .5
    std2 = (var_sum2 / M) ** .5
    cross_mean = cross_sum / M

    return (cross_mean - mean1 * mean2) / (std1 * std2)

There isn't anyone by the user name "mkh'" here. What answer does it refer to? — Peter Mortensen, Jun 26 '23 at 12:49

score 5 · Answer 11 · edited Jun 26 '23 at 12:46

5

This is an implementation of the Pearson correlation function using NumPy:

def corr(data1, data2):
    "data1 & data2 should be NumPy arrays."
    mean1 = data1.mean()
    mean2 = data2.mean()
    std1 = data1.std()
    std2 = data2.std()

    #corr = ((data1-mean1)*(data2-mean2)).mean()/(std1*std2)
    corr = ((data1*data2).mean()-mean1*mean2)/(std1*std2)
    return corr

edited Jun 26 '23 at 12:46

Peter Mortensen

30,738
21
105
131

answered Mar 02 '15 at 21:13

MojiProg

1,992
1
16
8

score 3 · Answer 12 · edited Jun 26 '23 at 12:42

Here is an implementation for Pearson correlation based on a sparse vector. The vectors here are expressed as a list of tuples expressed as (index, value). The two sparse vectors can be of different length, but over all vector sizes, it will have to be same. This is useful for text mining applications where the vector size is extremely large due to most features being bag of words and hence calculations are usually performed using sparse vectors.

def get_pearson_corelation(self, first_feature_vector=[], second_feature_vector=[], length_of_featureset=0):
    indexed_feature_dict = {}
    if first_feature_vector == [] or second_feature_vector == [] or length_of_featureset == 0:
        raise ValueError("Empty feature vectors or zero length of featureset in get_pearson_corelation")

    sum_a = sum(value for index, value in first_feature_vector)
    sum_b = sum(value for index, value in second_feature_vector)

    avg_a = float(sum_a) / length_of_featureset
    avg_b = float(sum_b) / length_of_featureset

    mean_sq_error_a = sqrt((sum((value - avg_a) ** 2 for index, value in first_feature_vector)) + ((
        length_of_featureset - len(first_feature_vector)) * ((0 - avg_a) ** 2)))
    mean_sq_error_b = sqrt((sum((value - avg_b) ** 2 for index, value in second_feature_vector)) + ((
        length_of_featureset - len(second_feature_vector)) * ((0 - avg_b) ** 2)))

    covariance_a_b = 0

    # Calculate covariance for the sparse vectors
    for tuple in first_feature_vector:
        if len(tuple) != 2:
            raise ValueError("Invalid feature frequency tuple in featureVector: %s") % (tuple,)
        indexed_feature_dict[tuple[0]] = tuple[1]
    count_of_features = 0
    for tuple in second_feature_vector:
        count_of_features += 1
        if len(tuple) != 2:
            raise ValueError("Invalid feature frequency tuple in featureVector: %s") % (tuple,)
        if tuple[0] in indexed_feature_dict:
            covariance_a_b += ((indexed_feature_dict[tuple[0]] - avg_a) * (tuple[1] - avg_b))
            del (indexed_feature_dict[tuple[0]])
        else:
            covariance_a_b += (0 - avg_a) * (tuple[1] - avg_b)

    for index in indexed_feature_dict:
        count_of_features += 1
        covariance_a_b += (indexed_feature_dict[index] - avg_a) * (0 - avg_b)

    # Adjust covariance with rest of vector with 0 value
    covariance_a_b += (length_of_featureset - count_of_features) * -avg_a * -avg_b

    if mean_sq_error_a == 0 or mean_sq_error_b == 0:
        return -1
    else:
        return float(covariance_a_b) / (mean_sq_error_a * mean_sq_error_b)

Unit tests:

def test_get_get_pearson_corelation(self):
    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7)]
    self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 3), 0.981980506062, 3, None, None)

    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7), (4, 14)]
    self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 5), -0.0137089240555, 3, None, None)

aumpen · Answer 13 · 2018-06-22T22:21:37.850

I have a very simple and easy to understand solution for this. For two arrays of equal length, Pearson coefficient can be easily computed as follows:

def manual_pearson(a,b):
"""
Accepts two arrays of equal length, and computes correlation coefficient. 
Numerator is the sum of product of (a - a_avg) and (b - b_avg), 
while denominator is the product of a_std and b_std multiplied by 
length of array. 
"""
  a_avg, b_avg = np.average(a), np.average(b)
  a_stdev, b_stdev = np.std(a), np.std(b)
  n = len(a)
  denominator = a_stdev * b_stdev * n
  numerator = np.sum(np.multiply(a-a_avg, b-b_avg))
  p_coef = numerator/denominator
  return p_coef

score 3 · Answer 14 · edited Jun 26 '23 at 12:55

3

Starting in Python 3.10, the Pearson’s correlation coefficient (statistics.correlation) is directly available in the standard library:

from statistics import correlation

# a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
# b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
correlation(a, b)
# 0.1449981545806852

edited Jun 26 '23 at 12:55

Peter Mortensen

30,738
21
105
131

answered May 13 '21 at 16:47

Xavier Guihot

54,987
21
291
190

score 1 · Answer 15 · answered Jan 09 '13 at 02:18

You may wonder how to interpret your probability in the context of looking for a correlation in a particular direction (negative or positive correlation.) Here is a function I wrote to help with that. It might even be right!

It's based on info I gleaned from http://www.vassarstats.net/rsig.html and http://en.wikipedia.org/wiki/Student%27s_t_distribution, thanks to other answers posted here.

# Given (possibly random) variables, X and Y, and a correlation direction,
# returns:
#  (r, p),
# where r is the Pearson correlation coefficient, and p is the probability
# that there is no correlation in the given direction.
#
# direction:
#  if positive, p is the probability that there is no positive correlation in
#    the population sampled by X and Y
#  if negative, p is the probability that there is no negative correlation
#  if 0, p is the probability that there is no correlation in either direction
def probabilityNotCorrelated(X, Y, direction=0):
    x = len(X)
    if x != len(Y):
        raise ValueError("variables not same len: " + str(x) + ", and " + \
                         str(len(Y)))
    if x < 6:
        raise ValueError("must have at least 6 samples, but have " + str(x))
    (corr, prb_2_tail) = stats.pearsonr(X, Y)

    if not direction:
        return (corr, prb_2_tail)

    prb_1_tail = prb_2_tail / 2
    if corr * direction > 0:
        return (corr, prb_1_tail)

    return (corr, 1 - prb_1_tail)

score 0 · Answer 16 · edited Jun 26 '23 at 12:56

Calculating Correlation:

Correlation - measures similarity of two different variables

Using Pearson correlation

from scipy.stats import pearsonr

# final_data is the dataframe with n set of columns
pearson_correlation = final_data.corr(method='pearson')
pearson_correlation
# Print correlation of n*n column

Using Spearman correlation

from scipy.stats import spearmanr

# final_data is the dataframe with n set of columns
spearman_correlation = final_data.corr(method='spearman')
spearman_correlation
# Print correlation of n*n column

Using Kendall correlation

kendall_correlation=final_data.corr(method='kendall')
kendall_correlation

score -1 · Answer 17 · answered Nov 28 '22 at 01:27

def correlation_score(y_true, y_pred):
    """Scores the predictions according to the competition rules. 
    
    It is assumed that the predictions are not constant.
    
    Returns the average of each sample's Pearson correlation coefficient"""
    
    y2 = y_pred.copy()
    y2 -= y2.mean(axis=0);    y2 /= y2.std(axis=0) 
    y1 = y_true.copy(); 
    y1 -= y1.mean(axis=0);    y1 /= y1.std(axis=0) 
        
    c = (y1*y2).mean().mean()# Correlation for rescaled matrices is just matrix product and average 
        
    return c

score -2 · Answer 18 · answered Dec 20 '17 at 11:19

-2

def pearson(x,y):
  n=len(x)
  vals=range(n)

  sumx=sum([float(x[i]) for i in vals])
  sumy=sum([float(y[i]) for i in vals])

  sumxSq=sum([x[i]**2.0 for i in vals])
  sumySq=sum([y[i]**2.0 for i in vals])

  pSum=sum([x[i]*y[i] for i in vals])
  # Calculating Pearson correlation
  num=pSum-(sumx*sumy/n)
  den=((sumxSq-pow(sumx,2)/n)*(sumySq-pow(sumy,2)/n))**.5
  if den==0: return 0
  r=num/den
  return r

answered Dec 20 '17 at 11:19

Source

1

1

Code-only answers are not considered good practice. Please consider adding a few words to explain how your code addresses the question. (read the help page on how to answer a question on SO) – Yannis Dec 20 '17 at 11:28

Calculating Pearson correlation and significance in Python

18 Answers18

Calculating Correlation:

Using Pearson correlation

Using Spearman correlation

Using Kendall correlation

Linked

Related