35

What is the best way, given a pandas dataframe, df, to get the correlation between its columns df.1 and df.2?

I do not want the output to count rows with NaN, which pandas' built-in correlation does. But I also want it to output a p-value or a standard error, which the built-in does not.

SciPy seems to get tripped up by the NaNs, though I believe it does report significance.

Data example:

     1           2
0    2          NaN
1    NaN         1
2    1           2
3    -4          3
4    1.3         1
5    NaN         NaN
– wolfsatthedoor (edited by Dataman)

9 Answers

55

To calculate all the p-values at once, you can use the calculate_pvalues function (code below):

df = pd.DataFrame({'A':[1,2,3], 'B':[2,5,3], 'C':[5,2,1], 'D':['text',2,3] })
calculate_pvalues(df) 

The output is similar to corr() (but with p-values):

            A       B       C
    A       0  0.7877  0.1789
    B  0.7877       0  0.6088
    C  0.1789  0.6088       0

Details:

  • Column D is automatically ignored because it contains text.
  • P-values are rounded to 4 decimal places.
  • You can subset to specify exact columns: calculate_pvalues(df[['A','B','C']])

Following is the code of the function:

from scipy.stats import pearsonr
import pandas as pd

def calculate_pvalues(df):
    df = df._get_numeric_data()  # keep numeric columns only (this is what skips column D)
    dfcols = pd.DataFrame(columns=df.columns)
    pvalues = dfcols.transpose().join(dfcols, how='outer')
    for r in df.columns:
        for c in df.columns:
            # pairwise deletion: drop only the rows where either column is NaN
            tmp = df[df[r].notnull() & df[c].notnull()]
            pvalues.loc[r, c] = round(pearsonr(tmp[r], tmp[c])[1], 4)
    return pvalues
– toto_tico
46

Statistical significance denoted by asterisks:

from scipy.stats import pearsonr
import numpy as np

rho = df.corr()  # correlation matrix
# p-value matrix, with the diagonal (self-correlation) zeroed out
pval = df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(*rho.shape)
# one star per significance threshold crossed: * p<=.05, ** p<=.01, *** p<=.001
p = pval.applymap(lambda x: ''.join(['*' for t in [.05, .01, .001] if x <= t]))
rho.round(2).astype(str) + p

[Screenshot: correlation matrix with significance asterisks]
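For a self-contained run, here is the same snippet with a small made-up DataFrame (the data below is purely illustrative):

import pandas as pd
import numpy as np
from scipy.stats import pearsonr

# hypothetical example data
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [2, 5, 3, 1, 4],
                   'C': [5, 2, 1, 4, 3]})

rho = df.corr()
pval = df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(*rho.shape)
p = pval.applymap(lambda x: ''.join(['*' for t in [.05, .01, .001] if x <= t]))
print(rho.round(2).astype(str) + p)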

– tozCSS
  • This is great for implementing. Any chance that it can be worked into a sns.heatmap with np.triu as mask? – wjie08 Nov 05 '20 at 06:25
  • Note that the intervals for attributing an asterisk are not correct. It should be: [0.001, 0.01, 0.05]. Source: https://faq.edqm.eu/pages/viewpage.action?pageId=1377305 – Moritz Nov 15 '22 at 13:55
  • Corrected the interval numbers @Moritz pointed out. – tozCSS Dec 06 '22 at 21:16
32

You can use the scipy.stats correlation functions to get the p-value.

For example, if you are looking for a correlation such as the Pearson correlation, you can use the pearsonr function.

from scipy.stats import pearsonr
pearsonr([1, 2, 3], [4, 3, 7])

Gives output

(0.7205766921228921, 0.48775429164459994)

The first value in the tuple is the correlation coefficient, and the second is the p-value.

In your case, you can use pandas' dropna function to remove NaN values first.

df_clean = df[['column1', 'column2']].dropna()
pearsonr(df_clean['column1'], df_clean['column2'])
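As a side note, in newer SciPy versions (1.9+) pearsonr returns a result object rather than a plain tuple; indexing still works, but the named attributes read more clearly:

res = pearsonr(df_clean['column1'], df_clean['column2'])
print(res.statistic, res.pvalue)  # correlation coefficient and p-value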
– Shashank Agarwal
18

In pandas v0.24.0 a method argument was added to corr. Now, you can use it to compute arbitrary functions, e.g. the p-value:

import pandas as pd
import numpy as np
from scipy.stats import pearsonr

df = pd.DataFrame({'A':[1,2,3], 'B':[2,5,3], 'C':[5,2,1]})

# this computes the correlation coefficients
corr = df.corr(method=lambda x, y: pearsonr(x, y)[0]) 

# this computes the p-values
pvalues = df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(len(df.columns)) 
pvalues
          A         B         C
A  0.000000  0.787704  0.178912
B  0.787704  0.000000  0.608792
C  0.178912  0.608792  0.000000

Please note the workaround with np.eye(len(df.columns)) for the p-values, which is needed because self-correlations are always set to 1.0 (see https://github.com/pandas-dev/pandas/issues/25726).

– Fabian Rost
15

The answer provided by @Shashank is nice. However, if you want a solution in pure pandas, you may like this:

import pandas as pd
import numpy as np
from pandas.io.data import DataReader
from datetime import datetime
import scipy.stats as stats


gdp = pd.DataFrame(DataReader("GDP", "fred", start=datetime(1990, 1, 1)))
vix = pd.DataFrame(DataReader("VIXCLS", "fred", start=datetime(1990, 1, 1)))

# Do it with a pandas regression to get the p-value from the F-test
df = gdp.merge(vix,left_index=True, right_index=True, how='left')
vix_on_gdp = pd.ols(y=df['VIXCLS'], x=df['GDP'], intercept=True)
print(df['VIXCLS'].corr(df['GDP']), vix_on_gdp.f_stat['p-value'])

Results:

-0.0422917932738 0.851762475093

Same results as stats function:

#Do it with stats functions. 
df_clean = df.dropna()
stats.pearsonr(df_clean['VIXCLS'], df_clean['GDP'])

Results:

  (-0.042291793273791969, 0.85176247509284908)

To extend this to more variables, here is an ugly loop-based approach:

#Add a third field
oil = pd.DataFrame(DataReader("DCOILWTICO", "fred", start=datetime(1990, 1, 1))) 
df = df.merge(oil,left_index=True, right_index=True, how='left')

# construct two arrays, one of the correlations and the other of the p-values
rho = df.corr()
pval = np.zeros([df.shape[1], df.shape[1]])
for i in range(df.shape[1]):      # loop over every pair of columns
    for j in range(df.shape[1]):
        JonI = pd.ols(y=df.icol(i), x=df.icol(j), intercept=True)
        pval[i, j] = JonI.f_stat['p-value']

Results of rho:

                 GDP    VIXCLS  DCOILWTICO
 GDP        1.000000 -0.042292    0.870251
 VIXCLS    -0.042292  1.000000   -0.004612
 DCOILWTICO 0.870251 -0.004612    1.000000

Results of pval:

 [[  0.00000000e+00   8.51762475e-01   1.11022302e-16]
  [  8.51762475e-01   0.00000000e+00   9.83747425e-01]
  [  1.11022302e-16   9.83747425e-01   0.00000000e+00]]
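As noted in the comments below, pd.ols and pandas.io.data were later removed. A rough sketch of a modern equivalent, assuming the pandas-datareader and statsmodels packages are installed (f_pvalue is the p-value of the regression's F-test):

import pandas as pd
import statsmodels.api as sm
from pandas_datareader.data import DataReader
from datetime import datetime

gdp = DataReader("GDP", "fred", start=datetime(1990, 1, 1))
vix = DataReader("VIXCLS", "fred", start=datetime(1990, 1, 1))
df = gdp.merge(vix, left_index=True, right_index=True, how='left').dropna()

# regression of VIXCLS on GDP; f_pvalue gives the F-test p-value
model = sm.OLS(df['VIXCLS'], sm.add_constant(df['GDP'])).fit()
print(df['VIXCLS'].corr(df['GDP']), model.f_pvalue)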
– BKay (edited by Dataman)
  • What if there are more than 2 columns, is there a way to get a nice output table for correlations? – wolfsatthedoor Aug 29 '14 at 16:46
  • df.corr() will give you the correlation structure for the whole data frame but to use the regression calculation approach of the p-value would be messy. – BKay Aug 29 '14 at 17:20
  • `pd.ols` was deprecated in v 0.20.0, and `DataReader` was moved to `pandas-datareader`: https://github.com/pydata/pandas-datareader – Max Ghenis Mar 03 '19 at 00:52
2

I have tried to sum up the logic in a function. It might not be the most efficient approach, but it will give you output similar to pandas' df.corr(). To use it, just put the following function in your code and call it with your dataframe object, i.e. corr_pvalue(your_dataframe).

I have rounded the values to 4 decimal places; if you want a different precision, change the value in the round function.

from scipy.stats import pearsonr
import numpy as np
import pandas as pd

def corr_pvalue(df):
    # keep only complete rows of the numeric columns
    numeric_df = df.dropna()._get_numeric_data()
    cols = numeric_df.columns
    mat = numeric_df.values

    arr = np.zeros((len(cols), len(cols)), dtype=object)

    for xi, x in enumerate(mat.T):
        for yi, y in enumerate(mat.T[xi:]):
            # store the (correlation, p-value) pair, rounded to 4 decimals;
            # the matrix is symmetric, so fill both triangles at once
            arr[xi, yi+xi] = tuple(round(v, 4) for v in pearsonr(x, y))
            arr[yi+xi, xi] = arr[xi, yi+xi]

    return pd.DataFrame(arr, index=cols, columns=cols)

I have tested it with pandas v0.18.1.

– Somendra Joshi
1

Very useful code by @oztalha. I just changed the formatting: r is rounded to 2 digits everywhere, and asterisks mark the significance level.

    rho = data.corr()
    pval = calculate_pvalues(data)  # toto_tico's function above
    # create formatted string versions of rho, one per significance level
    r1 = rho.applymap(lambda x: '{:.2f}*'.format(x))
    r2 = rho.applymap(lambda x: '{:.2f}**'.format(x))
    r3 = rho.applymap(lambda x: '{:.2f}***'.format(x))
    r4 = rho.applymap(lambda x: '{:.2f}'.format(x))
    # apply them where appropriate -- this could be a one-liner
    rho = rho.mask(pval > 0.1, r4)
    rho = rho.mask(pval <= 0.1, r1)
    rho = rho.mask(pval <= 0.05, r2)
    rho = rho.mask(pval <= 0.01, r3)
    rho
  • Generally, answers are much more helpful if they include an explanation of what the code is intended to do, and why that solves the problem without introducing others. – Tim Diekmann May 23 '18 at 09:47
1

Great answers from @toto_tico and @Somendra-joshi. However, they drop NA values unnecessarily. In this snippet, I'm only dropping the NAs that belong to the pair of columns whose correlation is being computed at the moment. The actual corr implementation does the same.

from scipy.stats import pearsonr
import pandas as pd

def calculate_pvalues(df):
    df = df._get_numeric_data()
    dfcols = pd.DataFrame(columns=df.columns)
    pvalues = dfcols.transpose().join(dfcols, how='outer')
    for r in df.columns:
        for c in df.columns:
            # drop NAs pairwise, only from the two columns involved
            if c == r:
                df_corr = df[[r]].dropna()
            else:
                df_corr = df[[r, c]].dropna()
            pvalues.loc[r, c] = pearsonr(df_corr[r], df_corr[c])[1]
    return pvalues
– Matheus Araujo
0

In a single line of code, using a list comprehension:

>>> import pandas as pd
>>> from scipy.stats import pearsonr
>>> data = {'y':[0, 3, 2, 4, 3, 5, 4, 6, 5, 7, 6],
...         'x1':[0, 4, 2, 6, 2, 8, 6, 10, 4, 13, 5],
...         'x2':[0.0, 1.3, 0.2, 4.5, 1.3, 1.6, 3.5, 1.7, 1.6, 3.7, 1.7]}
>>> df = pd.DataFrame(data)
>>> pvals = pd.DataFrame([[pearsonr(df[c], df[y])[1] for y in df.columns] for c in df.columns],
...                      columns=df.columns, index=df.columns)
>>> pvals
           y        x1        x2
y   0.000000  0.000732  0.069996
x1  0.000732  0.000000  0.036153
x2  0.069996  0.036153  0.000000
– General Grievance
  • Not exactly slick, but this works and gets the desired output, p = pd.DataFrame([[pearsonr(df[c], df[y])[1] for y in df.columns] for c in df.columns], columns=df.columns, index=df.columns).copy() p["type"] = "p" p.index.name="col" p = p.set_index([p.index,"type"]) c = df.corr() c["type"] = "c" c.index.name = "col" c = c.set_index([c.index,"type"]) c.combine_first(p) – Susmit Feb 02 '21 at 10:33
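Reflowed for readability, the snippet from @Susmit's comment above (same logic, reusing the df and imports from the session; it stacks the correlation matrix and the p-value matrix into one frame keyed by a 'type' index level):

# p-value matrix, built the same way as in the answer above
p = pd.DataFrame([[pearsonr(df[c], df[y])[1] for y in df.columns] for c in df.columns],
                 columns=df.columns, index=df.columns)
p["type"] = "p"
p.index.name = "col"
p = p.set_index([p.index, "type"])

# correlation matrix, tagged the same way
c = df.corr()
c["type"] = "c"
c.index.name = "col"
c = c.set_index([c.index, "type"])

# interleave correlations ('c') and p-values ('p') on the (col, type) MultiIndex
c.combine_first(p)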