I am looking for a function that takes as input two lists, and returns the Pearson correlation, and the significance of the correlation.
18 Answers
You can have a look at scipy.stats
:
from pydoc import help
from scipy.stats.stats import pearsonr
help(pearsonr)
Output:
>>>
Help on function pearsonr in module scipy.stats.stats:
pearsonr(x, y)
Calculates a Pearson correlation coefficient and the p-value for testing
non-correlation.
The Pearson correlation coefficient measures the linear relationship
between two datasets. Strictly speaking, Pearson's correlation requires
that each dataset be normally distributed. Like other correlation
coefficients, this one varies between -1 and +1 with 0 implying no
correlation. Correlations of -1 or +1 imply an exact linear
relationship. Positive correlations imply that as x increases, so does
y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system
producing datasets that have a Pearson correlation at least as extreme
as the one computed from these datasets. The p-values are not entirely
reliable but are probably reasonable for datasets larger than 500 or so.
Parameters
----------
x : 1D array
y : 1D array the same length as x
Returns
-------
(Pearson's correlation coefficient,
2-tailed p-value)
References
----------
http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation

- 30,738
- 21
- 105
- 131

- 3,818
- 2
- 16
- 10
-
2How about correlation coefficient of two dictionaries ?! – Areza May 22 '13 at 23:49
-
2@user702846 Pearson correlation is defined on a 2xN matrix. There is no generally applicable method that converts two dictionaries into a 2xN matrix, but you might use the array of pairs of dictionary values corresponding to the keys of the intersection of the keys of your dictionaries. – winerd Aug 18 '15 at 09:23
The Pearson correlation can be calculated with NumPy's corrcoef
.
import numpy
numpy.corrcoef(list1, list2)[0, 1]

- 30,738
- 21
- 105
- 131

- 1,813
- 1
- 14
- 12
-
1the output is confusing, but actually very simple. check this explanation https://stackoverflow.com/a/3425548/1245622 – linqu Dec 20 '17 at 09:21
-
7This does not produce the requested significance of the correlation right? – Olivier Aug 06 '20 at 07:30
An alternative can be a native SciPy function from linregress which calculates:
slope : slope of the regression line
intercept : intercept of the regression line
r-value : correlation coefficient
p-value : two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero
stderr : Standard error of the estimate
And here is an example:
a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
from scipy.stats import linregress
linregress(a, b)
will return you:
LinregressResult(slope=0.20833333333333337, intercept=13.375, rvalue=0.14499815458068521, pvalue=0.68940144811669501, stderr=0.50261704627083648)

- 30,738
- 21
- 105
- 131

- 214,103
- 147
- 703
- 753
-
2Great answer - by far the most informative. Also works with a two-row pandas.DataFrame: `lineregress(two_row_df)` – dmeu Mar 09 '16 at 10:09
-
-
Watch out: `stderr` is the standard error for the slope, not the Pearson correlation coefficient. – MRule Feb 26 '23 at 19:06
If you don't feel like installing SciPy, I've used this quick hack, slightly modified from Programming Collective Intelligence:
def pearsonr(x, y):
# Assume len(x) == len(y)
n = len(x)
sum_x = float(sum(x))
sum_y = float(sum(y))
sum_x_sq = sum(xi*xi for xi in x)
sum_y_sq = sum(yi*yi for yi in y)
psum = sum(xi*yi for xi, yi in zip(x, y))
num = psum - (sum_x * sum_y/n)
den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
if den == 0: return 0
return num / den

- 30,738
- 21
- 105
- 131

- 4,226
- 2
- 29
- 36
-
2I was surprised to discover this disagrees with Excel, NumPy, and R. See http://stackoverflow.com/questions/3949226/calculating-pearson-correlation-and-significance-in-python/7939259#7939259. – dfrankow Oct 29 '11 at 13:48
-
2As another commenter pointed out, this has a float/int bug. I think sum_y/n is integer division for ints. If you use sum_x = float(sum(x)) and sum_y = float(sum(y)), it works. – dfrankow Oct 30 '11 at 20:05
-
@dfrankow I think it's because imap cannot handle float. python gives an `TypeError: unsupported operand type(s) for -: 'itertools.imap' and 'float'` at `num = psum - (sum_x * sum_y/n)` – alvas Jan 24 '13 at 14:17
-
4As a style note Python frowns on this unnecessary usage of map (in favor of list comprehensions) – Maxim Khesin Nov 22 '13 at 20:58
-
16Just as a comment, consider that libraries as scipy et al are developed by people knowing a lot of numerical analysis. This may avoid you a lot of common pitfalls (for instance, having very large and very little numbers in X or Y may result in catastrofic cancellation) – finiteautomata Sep 30 '14 at 22:28
The following code is a straight-up interpretation of the definition:
import math
def average(x):
assert len(x) > 0
return float(sum(x)) / len(x)
def pearson_def(x, y):
assert len(x) == len(y)
n = len(x)
assert n > 0
avg_x = average(x)
avg_y = average(y)
diffprod = 0
xdiff2 = 0
ydiff2 = 0
for idx in range(n):
xdiff = x[idx] - avg_x
ydiff = y[idx] - avg_y
diffprod += xdiff * ydiff
xdiff2 += xdiff * xdiff
ydiff2 += ydiff * ydiff
return diffprod / math.sqrt(xdiff2 * ydiff2)
Test:
print pearson_def([1,2,3], [1,5,7])
returns
0.981980506062
This agrees with Excel, this calculator, SciPy (also NumPy), which return 0.981980506 and 0.9819805060619657, and 0.98198050606196574, respectively.
R:
> cor( c(1,2,3), c(1,5,7))
[1] 0.9819805
EDIT: Fixed a bug pointed out by a commenter.
-
4Beware of the type of the variables! You have encountered an int/float problem. In `sum(x) / len(x)` you divide ints, not floats. So `sum([1,5,7]) / len([1,5,7]) = 13 / 3 = 4`, according to integer division (whereas you want `13. / 3. = 4.33...`). To fix it rewrite this line as `float(sum(x)) / float(len(x))` (one float suffices, as Python converts it automatically). – Piotr Migdal Oct 30 '11 at 19:46
-
Your code won't work for cases like: [10,10,10],[0,0,0] or [10,10],[10,0]. or even [10,10],[10,10] – madCode May 18 '12 at 16:56
-
4The correlation coefficient is not defined for any of those cases. Putting them into R returns "NA" for all three. – dfrankow May 18 '12 at 17:55
You can do this with pandas.DataFrame.corr
, too:
import pandas as pd
a = [[1, 2, 3],
[5, 6, 9],
[5, 6, 11],
[5, 6, 13],
[5, 3, 13]]
df = pd.DataFrame(data=a)
df.corr()
This gives
0 1 2
0 1.000000 0.745601 0.916579
1 0.745601 1.000000 0.544248
2 0.916579 0.544248 1.000000

- 124,992
- 159
- 614
- 958
Rather than rely on NumPy or SciPy, I think my answer should be the easiest to code and understand the steps in calculating the Pearson correlation coefficient (PCC).
import math
# Calculates the mean
def mean(x):
sum = 0.0
for i in x:
sum += i
return sum / len(x)
# Calculates the sample standard deviation
def sampleStandardDeviation(x):
sumv = 0.0
for i in x:
sumv += (i - mean(x))**2
return math.sqrt(sumv/(len(x)-1))
# Calculates the PCC using both the two functions above
def pearson(x,y):
scorex = []
scorey = []
for i in x:
scorex.append((i - mean(x))/sampleStandardDeviation(x))
for j in y:
scorey.append((j - mean(y))/sampleStandardDeviation(y))
# Multiplies both lists together into 1 list (hence zip) and sums the whole list
return (sum([i*j for i,j in zip(scorex,scorey)]))/(len(x)-1)
The significance of PCC is basically to show you how strongly correlated the two variables/lists are.
It is important to note that the PCC value ranges from -1 to 1. A value between 0 to 1 denotes a positive correlation. Value of 0 = highest variation (no correlation whatsoever). A value between -1 to 0 denotes a negative correlation.

- 30,738
- 21
- 105
- 131

- 709
- 3
- 13
- 28
-
3
-
5It has amazing complexity and slow performance on 2 lists with 500+ values. – Nikolay Fominyh Feb 07 '16 at 13:58
Pearson coefficient calculation using Pandas in Python:
I would suggest trying this approach since your data contains lists. It will be easy to interact with your data and manipulate it from the console since you can visualise your data structure and update it as you wish. You can also export the data set and save it and add new data out of the Python console for later analysis. This code is simpler and contains less lines of code. I am assuming you need a few quick lines of code to screen your data for further analysis
Example:
data = {'list 1':[2,4,6,8],'list 2':[4,16,36,64]}
import pandas as pd #To Convert your lists to pandas data frames convert your lists into pandas dataframes
df = pd.DataFrame(data, columns = ['list 1','list 2'])
from scipy import stats # For in-built method to get PCC
pearson_coef, p_value = stats.pearsonr(df["list 1"], df["list 2"]) #define the columns to perform calculations on
print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", p_value) # Results
However, you did not post your data for me to see the size of the data set or the transformations that might be needed before the analysis.

- 30,738
- 21
- 105
- 131

- 119
- 1
- 4
-
Hello, welcome to StackOverflow! Try adding a short description of why you chose this code and how it applies in this case at the start of your answer! – Tristo Sep 09 '18 at 15:19
Hmm, many of these responses have long and hard-to-read code...
I'd suggest using NumPy with its nifty features when working with arrays:
import numpy as np
def pcc(X, Y):
''' Compute Pearson Correlation Coefficient. '''
# Normalise X and Y
X -= X.mean(0)
Y -= Y.mean(0)
# Standardise X and Y
X /= X.std(0)
Y /= Y.std(0)
# Compute mean product
return np.mean(X*Y)
# Using it on a random example
from random import random
X = np.array([random() for x in xrange(100)])
Y = np.array([random() for x in xrange(100)])
pcc(X, Y)

- 30,738
- 21
- 105
- 131

- 109
- 1
- 3
-
Although I like very much this answer, I would advise to copy/clone both X and Y inside the function. Otherwise both are altered, which might not be a desired behaviour. – antonimmo Jun 27 '17 at 17:56
Here's a variant on mkh's answer that runs much faster than it, and scipy.stats.pearsonr, using Numba.
import numba
@numba.jit
def corr(data1, data2):
M = data1.size
sum1 = 0.
sum2 = 0.
for i in range(M):
sum1 += data1[i]
sum2 += data2[i]
mean1 = sum1 / M
mean2 = sum2 / M
var_sum1 = 0.
var_sum2 = 0.
cross_sum = 0.
for i in range(M):
var_sum1 += (data1[i] - mean1) ** 2
var_sum2 += (data2[i] - mean2) ** 2
cross_sum += (data1[i] * data2[i])
std1 = (var_sum1 / M) ** .5
std2 = (var_sum2 / M) ** .5
cross_mean = cross_sum / M
return (cross_mean - mean1 * mean2) / (std1 * std2)

- 30,738
- 21
- 105
- 131

- 3,200
- 6
- 30
- 38
-
There isn't anyone by the user name "mkh'" here. What answer does it refer to? – Peter Mortensen Jun 26 '23 at 12:49
This is an implementation of the Pearson correlation function using NumPy:
def corr(data1, data2):
"data1 & data2 should be NumPy arrays."
mean1 = data1.mean()
mean2 = data2.mean()
std1 = data1.std()
std2 = data2.std()
#corr = ((data1-mean1)*(data2-mean2)).mean()/(std1*std2)
corr = ((data1*data2).mean()-mean1*mean2)/(std1*std2)
return corr

- 30,738
- 21
- 105
- 131

- 1,992
- 1
- 16
- 8
Here is an implementation for Pearson correlation based on a sparse vector. The vectors here are expressed as a list of tuples expressed as (index, value). The two sparse vectors can be of different length, but over all vector sizes, it will have to be same. This is useful for text mining applications where the vector size is extremely large due to most features being bag of words and hence calculations are usually performed using sparse vectors.
def get_pearson_corelation(self, first_feature_vector=[], second_feature_vector=[], length_of_featureset=0):
indexed_feature_dict = {}
if first_feature_vector == [] or second_feature_vector == [] or length_of_featureset == 0:
raise ValueError("Empty feature vectors or zero length of featureset in get_pearson_corelation")
sum_a = sum(value for index, value in first_feature_vector)
sum_b = sum(value for index, value in second_feature_vector)
avg_a = float(sum_a) / length_of_featureset
avg_b = float(sum_b) / length_of_featureset
mean_sq_error_a = sqrt((sum((value - avg_a) ** 2 for index, value in first_feature_vector)) + ((
length_of_featureset - len(first_feature_vector)) * ((0 - avg_a) ** 2)))
mean_sq_error_b = sqrt((sum((value - avg_b) ** 2 for index, value in second_feature_vector)) + ((
length_of_featureset - len(second_feature_vector)) * ((0 - avg_b) ** 2)))
covariance_a_b = 0
# Calculate covariance for the sparse vectors
for tuple in first_feature_vector:
if len(tuple) != 2:
raise ValueError("Invalid feature frequency tuple in featureVector: %s") % (tuple,)
indexed_feature_dict[tuple[0]] = tuple[1]
count_of_features = 0
for tuple in second_feature_vector:
count_of_features += 1
if len(tuple) != 2:
raise ValueError("Invalid feature frequency tuple in featureVector: %s") % (tuple,)
if tuple[0] in indexed_feature_dict:
covariance_a_b += ((indexed_feature_dict[tuple[0]] - avg_a) * (tuple[1] - avg_b))
del (indexed_feature_dict[tuple[0]])
else:
covariance_a_b += (0 - avg_a) * (tuple[1] - avg_b)
for index in indexed_feature_dict:
count_of_features += 1
covariance_a_b += (indexed_feature_dict[index] - avg_a) * (0 - avg_b)
# Adjust covariance with rest of vector with 0 value
covariance_a_b += (length_of_featureset - count_of_features) * -avg_a * -avg_b
if mean_sq_error_a == 0 or mean_sq_error_b == 0:
return -1
else:
return float(covariance_a_b) / (mean_sq_error_a * mean_sq_error_b)
Unit tests:
def test_get_get_pearson_corelation(self):
vector_a = [(1, 1), (2, 2), (3, 3)]
vector_b = [(1, 1), (2, 5), (3, 7)]
self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 3), 0.981980506062, 3, None, None)
vector_a = [(1, 1), (2, 2), (3, 3)]
vector_b = [(1, 1), (2, 5), (3, 7), (4, 14)]
self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 5), -0.0137089240555, 3, None, None)

- 30,738
- 21
- 105
- 131

- 69
- 1
- 2
I have a very simple and easy to understand solution for this. For two arrays of equal length, Pearson coefficient can be easily computed as follows:
def manual_pearson(a,b):
"""
Accepts two arrays of equal length, and computes correlation coefficient.
Numerator is the sum of product of (a - a_avg) and (b - b_avg),
while denominator is the product of a_std and b_std multiplied by
length of array.
"""
a_avg, b_avg = np.average(a), np.average(b)
a_stdev, b_stdev = np.std(a), np.std(b)
n = len(a)
denominator = a_stdev * b_stdev * n
numerator = np.sum(np.multiply(a-a_avg, b-b_avg))
p_coef = numerator/denominator
return p_coef

- 77
- 2
Starting in Python 3.10, the Pearson’s correlation coefficient (statistics.correlation
) is directly available in the standard library:
from statistics import correlation
# a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
# b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
correlation(a, b)
# 0.1449981545806852

- 30,738
- 21
- 105
- 131

- 54,987
- 21
- 291
- 190
You may wonder how to interpret your probability in the context of looking for a correlation in a particular direction (negative or positive correlation.) Here is a function I wrote to help with that. It might even be right!
It's based on info I gleaned from http://www.vassarstats.net/rsig.html and http://en.wikipedia.org/wiki/Student%27s_t_distribution, thanks to other answers posted here.
# Given (possibly random) variables, X and Y, and a correlation direction,
# returns:
# (r, p),
# where r is the Pearson correlation coefficient, and p is the probability
# that there is no correlation in the given direction.
#
# direction:
# if positive, p is the probability that there is no positive correlation in
# the population sampled by X and Y
# if negative, p is the probability that there is no negative correlation
# if 0, p is the probability that there is no correlation in either direction
def probabilityNotCorrelated(X, Y, direction=0):
x = len(X)
if x != len(Y):
raise ValueError("variables not same len: " + str(x) + ", and " + \
str(len(Y)))
if x < 6:
raise ValueError("must have at least 6 samples, but have " + str(x))
(corr, prb_2_tail) = stats.pearsonr(X, Y)
if not direction:
return (corr, prb_2_tail)
prb_1_tail = prb_2_tail / 2
if corr * direction > 0:
return (corr, prb_1_tail)
return (corr, 1 - prb_1_tail)

- 1,827
- 22
- 22
Calculating Correlation:
Correlation - measures similarity of two different variables
Using Pearson correlation
from scipy.stats import pearsonr
# final_data is the dataframe with n set of columns
pearson_correlation = final_data.corr(method='pearson')
pearson_correlation
# Print correlation of n*n column
Using Spearman correlation
from scipy.stats import spearmanr
# final_data is the dataframe with n set of columns
spearman_correlation = final_data.corr(method='spearman')
spearman_correlation
# Print correlation of n*n column
Using Kendall correlation
kendall_correlation=final_data.corr(method='kendall')
kendall_correlation

- 30,738
- 21
- 105
- 131

- 1,069
- 6
- 13
def correlation_score(y_true, y_pred):
"""Scores the predictions according to the competition rules.
It is assumed that the predictions are not constant.
Returns the average of each sample's Pearson correlation coefficient"""
y2 = y_pred.copy()
y2 -= y2.mean(axis=0); y2 /= y2.std(axis=0)
y1 = y_true.copy();
y1 -= y1.mean(axis=0); y1 /= y1.std(axis=0)
c = (y1*y2).mean().mean()# Correlation for rescaled matrices is just matrix product and average
return c

- 359
- 2
- 10
def pearson(x,y):
n=len(x)
vals=range(n)
sumx=sum([float(x[i]) for i in vals])
sumy=sum([float(y[i]) for i in vals])
sumxSq=sum([x[i]**2.0 for i in vals])
sumySq=sum([y[i]**2.0 for i in vals])
pSum=sum([x[i]*y[i] for i in vals])
# Calculating Pearson correlation
num=pSum-(sumx*sumy/n)
den=((sumxSq-pow(sumx,2)/n)*(sumySq-pow(sumy,2)/n))**.5
if den==0: return 0
r=num/den
return r

- 1
-
1Code-only answers are not considered good practice. Please consider adding a few words to explain how your code addresses the question. (read the help page on how to answer a question on SO) – Yannis Dec 20 '17 at 11:28