
To get the correlation between two arrays in Python, I am using:

from scipy.stats import pearsonr
x, y = [1, 2, 3], [1, 5, 7]
cor, p = pearsonr(x, y)  # cor is Pearson's r, p is the two-sided p-value

However, as stated in the docs, the p-value returned by pearsonr() is only reasonable for datasets larger than 500 or so. So how can I get a p-value that is meaningful for small datasets?
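
For reference, to show what I mean by a small-sample p-value: one approach I have seen suggested is an exact permutation test, which makes no normality assumption. The sketch below is my own illustration, not something pearsonr() provides:

from itertools import permutations
import numpy as np
from scipy.stats import pearsonr

x, y = [1, 2, 3], [1, 5, 7]
r_obs, _ = pearsonr(x, y)

# Correlate x with every possible reordering of y; only feasible
# for small n, since there are n! orderings to enumerate.
perms = [pearsonr(x, list(p))[0] for p in permutations(y)]
p_exact = np.mean([abs(r) >= abs(r_obs) for r in perms])  # two-sided p-value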

My temporary solution:

After reading up on linear regression, I have come up with my own small script, which uses the Fisher transformation to get the z-score, from which the p-value is calculated:

import numpy as np
from scipy.stats import zprob
n = len(x)
# Fisher transformation: z = arctanh(r) * sqrt(n - 3)
z = 0.5 * np.log((1 + cor) / (1 - cor)) * np.sqrt(n - 3)
p = zprob(-z)  # one-sided p-value from the standard normal

It works. However, I am not sure whether the p-value it gives is more reasonable than the one returned by pearsonr(). Is there a Python module which already has this functionality? I have not been able to find it in SciPy or Statsmodels.
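
For comparison, the same Fisher-z computation can be written without zprob (which is deprecated as of SciPy 0.14) using scipy.stats.norm; this sketch also makes the p-value two-sided, which is what pearsonr() reports:

import numpy as np
from scipy.stats import norm

n = len(x)
z = np.arctanh(cor) * np.sqrt(n - 3)  # arctanh(r) == 0.5*log((1+r)/(1-r))
p_two_sided = 2 * norm.sf(abs(z))     # two-tailed standard-normal p-value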

Edit to clarify:

The dataset in my example is simplified. My real dataset is two arrays of 10-50 values.

Comments:

  • I think this question is better suited to Cross Validated. – Korem Jul 22 '14 at 21:50
  • A correlation over a sample size of 3 is not sensible... I usually want at least a pair of 50 values before thinking a correlation might be useful. – N1B4 Jul 22 '14 at 23:10
  • @Korem I did consider it, but posted it here instead as it mainly is a coding issue. However, I will move it there if no one can answer here. – dwitvliet Jul 22 '14 at 23:35

0 Answers