223

I am looking for a function that takes as input two lists, and returns the Pearson correlation, and the significance of the correlation.

Salvador Dali
  • 214,103
  • 147
  • 703
  • 753
ariel
  • 2,239
  • 2
  • 14
  • 3

18 Answers18

216

You can have a look at scipy.stats:

from pydoc import help
from scipy.stats.stats import pearsonr
help(pearsonr)

Output:

>>>
Help on function pearsonr in module scipy.stats.stats:

pearsonr(x, y)
 Calculates a Pearson correlation coefficient and the p-value for testing
 non-correlation.

 The Pearson correlation coefficient measures the linear relationship
 between two datasets. Strictly speaking, Pearson's correlation requires
 that each dataset be normally distributed. Like other correlation
 coefficients, this one varies between -1 and +1 with 0 implying no
 correlation. Correlations of -1 or +1 imply an exact linear
 relationship. Positive correlations imply that as x increases, so does
 y. Negative correlations imply that as x increases, y decreases.

 The p-value roughly indicates the probability of an uncorrelated system
 producing datasets that have a Pearson correlation at least as extreme
 as the one computed from these datasets. The p-values are not entirely
 reliable but are probably reasonable for datasets larger than 500 or so.

 Parameters
 ----------
 x : 1D array
 y : 1D array the same length as x

 Returns
 -------
 (Pearson's correlation coefficient,
  2-tailed p-value)

 References
 ----------
 http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Sacha
  • 3,818
  • 2
  • 16
  • 10
  • 2
    How about correlation coefficient of two dictionaries ?! – Areza May 22 '13 at 23:49
  • 2
    @user702846 Pearson correlation is defined on a 2xN matrix. There is no generally applicable method that converts two dictionaries into a 2xN matrix, but you might use the array of pairs of dictionary values corresponding to the keys of the intersection of the keys of your dictionaries. – winerd Aug 18 '15 at 09:23
121

The Pearson correlation can be calculated with NumPy's corrcoef.

import numpy
numpy.corrcoef(list1, list2)[0, 1]
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
winerd
  • 1,813
  • 1
  • 14
  • 12
  • 1
    the output is confusing, but actually very simple. check this explanation https://stackoverflow.com/a/3425548/1245622 – linqu Dec 20 '17 at 09:21
  • 7
    This does not produce the requested significance of the correlation right? – Olivier Aug 06 '20 at 07:30
61

An alternative can be a native SciPy function from linregress which calculates:

slope : slope of the regression line

intercept : intercept of the regression line

r-value : correlation coefficient

p-value : two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero

stderr : Standard error of the estimate

And here is an example:

a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
from scipy.stats import linregress
linregress(a, b)

will return you:

LinregressResult(slope=0.20833333333333337, intercept=13.375, rvalue=0.14499815458068521, pvalue=0.68940144811669501, stderr=0.50261704627083648)
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Salvador Dali
  • 214,103
  • 147
  • 703
  • 753
39

If you don't feel like installing SciPy, I've used this quick hack, slightly modified from Programming Collective Intelligence:

def pearsonr(x, y):
  # Assume len(x) == len(y)
  n = len(x)
  sum_x = float(sum(x))
  sum_y = float(sum(y))
  sum_x_sq = sum(xi*xi for xi in x)
  sum_y_sq = sum(yi*yi for yi in y)
  psum = sum(xi*yi for xi, yi in zip(x, y))
  num = psum - (sum_x * sum_y/n)
  den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
  if den == 0: return 0
  return num / den
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Jeff Hammerbacher
  • 4,226
  • 2
  • 29
  • 36
  • 2
    I was surprised to discover this disagrees with Excel, NumPy, and R. See http://stackoverflow.com/questions/3949226/calculating-pearson-correlation-and-significance-in-python/7939259#7939259. – dfrankow Oct 29 '11 at 13:48
  • 2
    As another commenter pointed out, this has a float/int bug. I think sum_y/n is integer division for ints. If you use sum_x = float(sum(x)) and sum_y = float(sum(y)), it works. – dfrankow Oct 30 '11 at 20:05
  • @dfrankow I think it's because imap cannot handle float. python gives an `TypeError: unsupported operand type(s) for -: 'itertools.imap' and 'float'` at `num = psum - (sum_x * sum_y/n)` – alvas Jan 24 '13 at 14:17
  • 4
    As a style note Python frowns on this unnecessary usage of map (in favor of list comprehensions) – Maxim Khesin Nov 22 '13 at 20:58
  • 16
    Just as a comment, consider that libraries as scipy et al are developed by people knowing a lot of numerical analysis. This may avoid you a lot of common pitfalls (for instance, having very large and very little numbers in X or Y may result in catastrofic cancellation) – finiteautomata Sep 30 '14 at 22:28
31

The following code is a straight-up interpretation of the definition:

import math

def average(x):
    assert len(x) > 0
    return float(sum(x)) / len(x)

def pearson_def(x, y):
    assert len(x) == len(y)
    n = len(x)
    assert n > 0
    avg_x = average(x)
    avg_y = average(y)
    diffprod = 0
    xdiff2 = 0
    ydiff2 = 0
    for idx in range(n):
        xdiff = x[idx] - avg_x
        ydiff = y[idx] - avg_y
        diffprod += xdiff * ydiff
        xdiff2 += xdiff * xdiff
        ydiff2 += ydiff * ydiff

    return diffprod / math.sqrt(xdiff2 * ydiff2)

Test:

print pearson_def([1,2,3], [1,5,7])

returns

0.981980506062

This agrees with Excel, this calculator, SciPy (also NumPy), which return 0.981980506 and 0.9819805060619657, and 0.98198050606196574, respectively.

R:

> cor( c(1,2,3), c(1,5,7))
[1] 0.9819805

EDIT: Fixed a bug pointed out by a commenter.

drevicko
  • 14,382
  • 15
  • 75
  • 97
dfrankow
  • 20,191
  • 41
  • 152
  • 214
  • 4
    Beware of the type of the variables! You have encountered an int/float problem. In `sum(x) / len(x)` you divide ints, not floats. So `sum([1,5,7]) / len([1,5,7]) = 13 / 3 = 4`, according to integer division (whereas you want `13. / 3. = 4.33...`). To fix it rewrite this line as `float(sum(x)) / float(len(x))` (one float suffices, as Python converts it automatically). – Piotr Migdal Oct 30 '11 at 19:46
  • Your code won't work for cases like: [10,10,10],[0,0,0] or [10,10],[10,0]. or even [10,10],[10,10] – madCode May 18 '12 at 16:56
  • 4
    The correlation coefficient is not defined for any of those cases. Putting them into R returns "NA" for all three. – dfrankow May 18 '12 at 17:55
28

You can do this with pandas.DataFrame.corr, too:

import pandas as pd
a = [[1, 2, 3],
     [5, 6, 9],
     [5, 6, 11],
     [5, 6, 13],
     [5, 3, 13]]
df = pd.DataFrame(data=a)
df.corr()

This gives

          0         1         2
0  1.000000  0.745601  0.916579
1  0.745601  1.000000  0.544248
2  0.916579  0.544248  1.000000
Martin Thoma
  • 124,992
  • 159
  • 614
  • 958
11

Rather than rely on NumPy or SciPy, I think my answer should be the easiest to code and understand the steps in calculating the Pearson correlation coefficient (PCC).

import math

# Calculates the mean
def mean(x):
    sum = 0.0
    for i in x:
         sum += i
    return sum / len(x)

# Calculates the sample standard deviation
def sampleStandardDeviation(x):
    sumv = 0.0
    for i in x:
         sumv += (i - mean(x))**2
    return math.sqrt(sumv/(len(x)-1))

# Calculates the PCC using both the two functions above
def pearson(x,y):
    scorex = []
    scorey = []

    for i in x:
        scorex.append((i - mean(x))/sampleStandardDeviation(x))

    for j in y:
        scorey.append((j - mean(y))/sampleStandardDeviation(y))

# Multiplies both lists together into 1 list (hence zip) and sums the whole list
    return (sum([i*j for i,j in zip(scorex,scorey)]))/(len(x)-1)

The significance of PCC is basically to show you how strongly correlated the two variables/lists are.

It is important to note that the PCC value ranges from -1 to 1. A value between 0 to 1 denotes a positive correlation. Value of 0 = highest variation (no correlation whatsoever). A value between -1 to 0 denotes a negative correlation.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
compski
  • 709
  • 3
  • 13
  • 28
11

Pearson coefficient calculation using Pandas in Python:

I would suggest trying this approach since your data contains lists. It will be easy to interact with your data and manipulate it from the console since you can visualise your data structure and update it as you wish. You can also export the data set and save it and add new data out of the Python console for later analysis. This code is simpler and contains less lines of code. I am assuming you need a few quick lines of code to screen your data for further analysis

Example:

data = {'list 1':[2,4,6,8],'list 2':[4,16,36,64]}

import pandas as pd #To Convert your lists to pandas data frames convert your lists into pandas dataframes

df = pd.DataFrame(data, columns = ['list 1','list 2'])

from scipy import stats # For in-built method to get PCC

pearson_coef, p_value = stats.pearsonr(df["list 1"], df["list 2"]) #define the columns to perform calculations on
print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", p_value) # Results

However, you did not post your data for me to see the size of the data set or the transformations that might be needed before the analysis.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Web Ster
  • 119
  • 1
  • 4
  • Hello, welcome to StackOverflow! Try adding a short description of why you chose this code and how it applies in this case at the start of your answer! – Tristo Sep 09 '18 at 15:19
7

Hmm, many of these responses have long and hard-to-read code...

I'd suggest using NumPy with its nifty features when working with arrays:

import numpy as np

def pcc(X, Y):
   ''' Compute Pearson Correlation Coefficient. '''
   # Normalise X and Y
   X -= X.mean(0)
   Y -= Y.mean(0)
   # Standardise X and Y
   X /= X.std(0)
   Y /= Y.std(0)
   # Compute mean product
   return np.mean(X*Y)

# Using it on a random example
from random import random
X = np.array([random() for x in xrange(100)])
Y = np.array([random() for x in xrange(100)])
pcc(X, Y)
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
  • Although I like very much this answer, I would advise to copy/clone both X and Y inside the function. Otherwise both are altered, which might not be a desired behaviour. – antonimmo Jun 27 '17 at 17:56
6

Here's a variant on mkh's answer that runs much faster than it, and scipy.stats.pearsonr, using Numba.

import numba

@numba.jit
def corr(data1, data2):
    M = data1.size

    sum1 = 0.
    sum2 = 0.
    for i in range(M):
        sum1 += data1[i]
        sum2 += data2[i]
    mean1 = sum1 / M
    mean2 = sum2 / M

    var_sum1 = 0.
    var_sum2 = 0.
    cross_sum = 0.
    for i in range(M):
        var_sum1 += (data1[i] - mean1) ** 2
        var_sum2 += (data2[i] - mean2) ** 2
        cross_sum += (data1[i] * data2[i])

    std1 = (var_sum1 / M) ** .5
    std2 = (var_sum2 / M) ** .5
    cross_mean = cross_sum / M

    return (cross_mean - mean1 * mean2) / (std1 * std2)
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Turtles Are Cute
  • 3,200
  • 6
  • 30
  • 38
5

This is an implementation of the Pearson correlation function using NumPy:

def corr(data1, data2):
    "data1 & data2 should be NumPy arrays."
    mean1 = data1.mean()
    mean2 = data2.mean()
    std1 = data1.std()
    std2 = data2.std()

    #corr = ((data1-mean1)*(data2-mean2)).mean()/(std1*std2)
    corr = ((data1*data2).mean()-mean1*mean2)/(std1*std2)
    return corr
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
MojiProg
  • 1,992
  • 1
  • 16
  • 8
3

Here is an implementation for Pearson correlation based on a sparse vector. The vectors here are expressed as a list of tuples expressed as (index, value). The two sparse vectors can be of different length, but over all vector sizes, it will have to be same. This is useful for text mining applications where the vector size is extremely large due to most features being bag of words and hence calculations are usually performed using sparse vectors.

def get_pearson_corelation(self, first_feature_vector=[], second_feature_vector=[], length_of_featureset=0):
    indexed_feature_dict = {}
    if first_feature_vector == [] or second_feature_vector == [] or length_of_featureset == 0:
        raise ValueError("Empty feature vectors or zero length of featureset in get_pearson_corelation")

    sum_a = sum(value for index, value in first_feature_vector)
    sum_b = sum(value for index, value in second_feature_vector)

    avg_a = float(sum_a) / length_of_featureset
    avg_b = float(sum_b) / length_of_featureset

    mean_sq_error_a = sqrt((sum((value - avg_a) ** 2 for index, value in first_feature_vector)) + ((
        length_of_featureset - len(first_feature_vector)) * ((0 - avg_a) ** 2)))
    mean_sq_error_b = sqrt((sum((value - avg_b) ** 2 for index, value in second_feature_vector)) + ((
        length_of_featureset - len(second_feature_vector)) * ((0 - avg_b) ** 2)))

    covariance_a_b = 0

    # Calculate covariance for the sparse vectors
    for tuple in first_feature_vector:
        if len(tuple) != 2:
            raise ValueError("Invalid feature frequency tuple in featureVector: %s") % (tuple,)
        indexed_feature_dict[tuple[0]] = tuple[1]
    count_of_features = 0
    for tuple in second_feature_vector:
        count_of_features += 1
        if len(tuple) != 2:
            raise ValueError("Invalid feature frequency tuple in featureVector: %s") % (tuple,)
        if tuple[0] in indexed_feature_dict:
            covariance_a_b += ((indexed_feature_dict[tuple[0]] - avg_a) * (tuple[1] - avg_b))
            del (indexed_feature_dict[tuple[0]])
        else:
            covariance_a_b += (0 - avg_a) * (tuple[1] - avg_b)

    for index in indexed_feature_dict:
        count_of_features += 1
        covariance_a_b += (indexed_feature_dict[index] - avg_a) * (0 - avg_b)

    # Adjust covariance with rest of vector with 0 value
    covariance_a_b += (length_of_featureset - count_of_features) * -avg_a * -avg_b

    if mean_sq_error_a == 0 or mean_sq_error_b == 0:
        return -1
    else:
        return float(covariance_a_b) / (mean_sq_error_a * mean_sq_error_b)

Unit tests:

def test_get_get_pearson_corelation(self):
    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7)]
    self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 3), 0.981980506062, 3, None, None)

    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7), (4, 14)]
    self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 5), -0.0137089240555, 3, None, None)
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Vipul Sharma
  • 69
  • 1
  • 2
3

I have a very simple and easy to understand solution for this. For two arrays of equal length, Pearson coefficient can be easily computed as follows:

def manual_pearson(a,b):
"""
Accepts two arrays of equal length, and computes correlation coefficient. 
Numerator is the sum of product of (a - a_avg) and (b - b_avg), 
while denominator is the product of a_std and b_std multiplied by 
length of array. 
"""
  a_avg, b_avg = np.average(a), np.average(b)
  a_stdev, b_stdev = np.std(a), np.std(b)
  n = len(a)
  denominator = a_stdev * b_stdev * n
  numerator = np.sum(np.multiply(a-a_avg, b-b_avg))
  p_coef = numerator/denominator
  return p_coef
aumpen
  • 77
  • 2
3

Starting in Python 3.10, the Pearson’s correlation coefficient (statistics.correlation) is directly available in the standard library:

from statistics import correlation

# a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
# b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
correlation(a, b)
# 0.1449981545806852
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Xavier Guihot
  • 54,987
  • 21
  • 291
  • 190
1

You may wonder how to interpret your probability in the context of looking for a correlation in a particular direction (negative or positive correlation.) Here is a function I wrote to help with that. It might even be right!

It's based on info I gleaned from http://www.vassarstats.net/rsig.html and http://en.wikipedia.org/wiki/Student%27s_t_distribution, thanks to other answers posted here.

# Given (possibly random) variables, X and Y, and a correlation direction,
# returns:
#  (r, p),
# where r is the Pearson correlation coefficient, and p is the probability
# that there is no correlation in the given direction.
#
# direction:
#  if positive, p is the probability that there is no positive correlation in
#    the population sampled by X and Y
#  if negative, p is the probability that there is no negative correlation
#  if 0, p is the probability that there is no correlation in either direction
def probabilityNotCorrelated(X, Y, direction=0):
    x = len(X)
    if x != len(Y):
        raise ValueError("variables not same len: " + str(x) + ", and " + \
                         str(len(Y)))
    if x < 6:
        raise ValueError("must have at least 6 samples, but have " + str(x))
    (corr, prb_2_tail) = stats.pearsonr(X, Y)

    if not direction:
        return (corr, prb_2_tail)

    prb_1_tail = prb_2_tail / 2
    if corr * direction > 0:
        return (corr, prb_1_tail)

    return (corr, 1 - prb_1_tail)
Joshua Richardson
  • 1,827
  • 22
  • 22
0

Calculating Correlation:

Correlation - measures similarity of two different variables

Using Pearson correlation

from scipy.stats import pearsonr

# final_data is the dataframe with n set of columns
pearson_correlation = final_data.corr(method='pearson')
pearson_correlation
# Print correlation of n*n column

Using Spearman correlation

from scipy.stats import spearmanr

# final_data is the dataframe with n set of columns
spearman_correlation = final_data.corr(method='spearman')
spearman_correlation
# Print correlation of n*n column

Using Kendall correlation

kendall_correlation=final_data.corr(method='kendall')
kendall_correlation
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
dataninsight
  • 1,069
  • 6
  • 13
-1
def correlation_score(y_true, y_pred):
    """Scores the predictions according to the competition rules. 
    
    It is assumed that the predictions are not constant.
    
    Returns the average of each sample's Pearson correlation coefficient"""
    
    y2 = y_pred.copy()
    y2 -= y2.mean(axis=0);    y2 /= y2.std(axis=0) 
    y1 = y_true.copy(); 
    y1 -= y1.mean(axis=0);    y1 /= y1.std(axis=0) 
        
    c = (y1*y2).mean().mean()# Correlation for rescaled matrices is just matrix product and average 
        
    return c
tttzof351
  • 359
  • 2
  • 10
-2
def pearson(x,y):
  n=len(x)
  vals=range(n)

  sumx=sum([float(x[i]) for i in vals])
  sumy=sum([float(y[i]) for i in vals])

  sumxSq=sum([x[i]**2.0 for i in vals])
  sumySq=sum([y[i]**2.0 for i in vals])

  pSum=sum([x[i]*y[i] for i in vals])
  # Calculating Pearson correlation
  num=pSum-(sumx*sumy/n)
  den=((sumxSq-pow(sumx,2)/n)*(sumySq-pow(sumy,2)/n))**.5
  if den==0: return 0
  r=num/den
  return r
Source
  • 1
  • 1
    Code-only answers are not considered good practice. Please consider adding a few words to explain how your code addresses the question. (read the help page on how to answer a question on SO) – Yannis Dec 20 '17 at 11:28