
I am running scipy.stats.pearsonr on my data, and I get

(0.9672434106763087, 0.0)

It is reasonable that the r-value is high and the p-value is very low. However, p is obviously not 0, so I would like to know what p=0.0 means. Is it p<10^-10, p<10^-100 or what is the limit?

  • The smallest possible 64-bit floating point number is 2.225e-308 as reported by [`numpy.finfo(float)`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.finfo.html) on my system (illustrated below), so your p is probably less than that. But for any practical purpose, why would you care? – MB-F Aug 28 '17 at 08:12
  • Thanks. I would be surprised if it was the floating point limit that set the p value limit, though. – Marius Folden Simonsen Aug 29 '17 at 07:49
  • If it's not floating point, I have no idea what limit you are talking about. By varying the number of samples we can obtain arbitrarily low p-values - until the floating points bite. – MB-F Aug 29 '17 at 08:04
  • I guess my question is how the p-value is calculated. For example, if done using a Monte Carlo algorithm, I would not expect it to do 2.225e-308 iterations. – Marius Folden Simonsen Aug 30 '17 at 08:57
  • No, that would be very few iterations indeed :) The p-value is computed analytically. [How exactly](https://github.com/scipy/scipy/blob/v0.19.1/scipy/stats/stats.py#L3029) I'm not sure, but I think it is based on the [assumption that the data is normally distributed](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Testing_using_Student.27s_t-distribution). – MB-F Aug 30 '17 at 11:22
  • Thanks for the link, it makes sense. I guess I should have looked there in the first place. – Marius Folden Simonsen Aug 31 '17 at 08:14
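
For reference, the floating-point limits mentioned in the first comment can be checked directly; here is a minimal sketch using numpy.finfo:

import numpy as np

info = np.finfo(float)
print(info.tiny)     # smallest normal 64-bit float, about 2.2e-308
print(5e-324)        # smallest subnormal 64-bit float
print(5e-324 / 2)    # anything smaller underflows to exactly 0.0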

1 Answer


As pointed out by @MB-F in the comments, the p-value is calculated analytically.

In the code for version 0.19.1, you can isolate that part of the calculation and plot the p-value as a function of r:

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import betainc

# avoid r = +-1 exactly, where the formula would divide by zero
r = np.linspace(-1, 1, 1000) * (1 - 1e-10)

for n in [10, 100, 1000]:
    df = n - 2
    # squared t-statistic for testing the correlation against zero
    t_squared = r**2 * (df / ((1.0 - r) * (1.0 + r)))
    # two-sided p-value from the regularized incomplete beta function
    prob = betainc(0.5*df, 0.5, df/(df + t_squared))
    plt.semilogy(r, prob, label=f'n={n}')
plt.axvline(0.9672434106763087, ls='--', color='black', label='r value')
plt.xlabel('r')
plt.ylabel('p-value')
plt.legend()
plt.grid()
plt.show()

The current stable version, 1.9.3, uses a different formula:

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import btdtr

# avoid r = +-1 exactly
r = np.linspace(-1, 1, 1000) * (1 - 1e-10)

for n in [10, 100, 1000]:
    # shape parameter of the null beta distribution of r
    ab = 0.5*n - 1
    # two-sided p-value via the beta distribution CDF (btdtr)
    prob = 2 * btdtr(ab, ab, 0.5*(1 - abs(r)))
    plt.semilogy(r, prob, label=f'n={n}')
plt.axvline(0.9672434106763087, ls='--', color='black', label='r value')
plt.xlabel('r')
plt.ylabel('p-value')
plt.legend()
plt.grid()
plt.show()

Both formulas yield the same results.
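
As a quick sanity check, here is a sketch (assuming SciPy 1.9.x, where btdtr is still available) that evaluates both formulas on the same grid and compares them element-wise:

import numpy as np
from scipy.special import betainc, btdtr

r = np.linspace(-1, 1, 1000) * (1 - 1e-10)
for n in [10, 100, 1000]:
    df = n - 2
    # 0.19.1 formula: two-sided p-value via the t-distribution
    t_squared = r**2 * (df / ((1.0 - r) * (1.0 + r)))
    prob_old = betainc(0.5*df, 0.5, df/(df + t_squared))
    # 1.9.3 formula: two-sided p-value via the beta distribution of r
    ab = 0.5*n - 1
    prob_new = 2 * btdtr(ab, ab, 0.5*(1 - abs(r)))
    print(n, np.allclose(prob_old, prob_new))   # prints True for each n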

You can see that with 1000 points and your correlation coefficient, the p-value is smaller than the smallest representable positive float, so it underflows to 0.0.
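
To make the underflow concrete, here is a small sketch evaluating the 1.9.3 formula at the r value from the question for a few sample sizes (the actual n is not given in the question, so these are only illustrative):

from scipy.special import btdtr

r = 0.9672434106763087
for n in [10, 100, 1000]:
    ab = 0.5*n - 1
    prob = 2 * btdtr(ab, ab, 0.5*(1 - abs(r)))
    print(n, prob)
# for n=1000 the true p-value is far below the smallest subnormal double
# (about 5e-324), so the computation underflows and prints exactly 0.0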

The beta distribution

SciPy provides a collection of probability distributions, among them the beta distribution.

The line

    prob = 2 * btdtr(ab, ab, 0.5*(1 - abs(r)))

could be replaced by

from scipy.stats import beta
prob = 2 * beta(ab, ab).cdf(0.5*(1 - abs(r)))

From the scipy.stats.beta distribution object you can get much more information about the distribution.
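
For example, the null distribution of r can be written as a beta distribution rescaled to the interval [-1, 1], which is essentially how newer pearsonr docstrings describe it; here is a sketch using a frozen scipy.stats.beta object (with an illustrative n, since the question does not give one):

import numpy as np
from scipy.stats import beta

r = 0.9672434106763087
n = 100                      # illustrative sample size
ab = 0.5*n - 1
# beta(ab, ab) rescaled from [0, 1] to [-1, 1]
dist = beta(ab, ab, loc=-1, scale=2)
print(2 * dist.cdf(-abs(r)))     # two-sided p-value
print(dist.sf(abs(r)))           # upper tail; equals cdf(-abs(r)) by symmetry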

Bob