
Theoretically, p-values are uniformly distributed under the null hypothesis.

Therefore, I would expect a G-test or Chi-square test of equal proportions to yield uniformly distributed p-values when applied to random coin-flip simulations generated with Python's random.randint(0,1), which should behave as an unbiased coin, i.e., a Bernoulli(0.5).

Likewise, when n*p is sufficiently large, the assumptions behind a t-test become reasonable, and we would expect a t-test to give uniformly distributed p-values too.

However, that is not what I empirically see.

I plot a histogram of p-values from repeated experiments with sample size 20k, using the following snippet:

import random

from scipy import stats
from matplotlib import pyplot as plt

ps = []
for i in range(5000):
    heads = [random.randint(0,1) for _ in range(20000)]
    tails = [1-x for x in heads]
    p = stats.ttest_ind(heads, tails).pvalue
    ps.append(p)
plt.hist(ps, 100)

This results in the following distribution of p-values, which seems to give p-values close to 0 much more often than expected. Note that this is not due to the approximations of the t-test, as I find similar distributions of p-values when I plug in a Chi-square or G-test.

[Histogram of p-values, with far more mass near 0 than a uniform distribution would give]

Am I running into a situation here where Python's pseudorandom number generator (which is based on the Mersenne Twister algorithm) simply does not have sufficiently good statistical properties and is not random enough? Or is there something else that I am missing here?

Niek Tax

2 Answers

As Sam Mason pointed out in a comment, a two-sample t-test assumes independent samples. The numbers of heads and tails in a given run are perfectly negatively correlated (tails is just 1 - heads), so your program isn't measuring what you think it is.
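To make the dependence concrete (a quick sketch of my own, with an arbitrary seed for reproducibility), the sample correlation between the two "groups" fed to ttest_ind is exactly -1 by construction:

```python
import random

import numpy as np

random.seed(1)  # arbitrary seed, purely for reproducibility
heads = np.array([random.randint(0, 1) for _ in range(1000)])
tails = 1 - heads  # a deterministic function of heads, not an independent sample

# The two "samples" are perfectly negatively correlated
print(np.corrcoef(heads, tails)[0, 1])  # approximately -1.0
```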

The following code produces histograms which are relatively uniform. Given the number of samples involved, it takes several minutes on my laptop.

from scipy import stats
from matplotlib import pyplot as plt

ps = []
for i in range(5000):
    heads = stats.bernoulli.rvs(0.5, size=2000000)
    p = stats.ttest_1samp(heads, popmean=0.5).pvalue
    ps.append(p)
plt.hist(ps, 50)
plt.show()

[Histogram of p-values showing uniformity]

pjs
  • no comment from OP, but I think they wanted to test the `random` module, not `numpy`... not that it matters, as they would both be fine – if a test this simple failed it would have been noticed a **long** time ago – Sam Mason Feb 05 '23 at 19:43
  • @SamMason It produces the same result with `random`, it's just slower. – pjs Feb 05 '23 at 21:30
  • (to other readers) this is true because NumPy uses an MT19937 generator for its [legacy generator](https://numpy.org/doc/stable/reference/random/legacy.html), the same as CPython. The details are a bit different but they shouldn't matter here – Sam Mason Feb 05 '23 at 23:20

The incorrect usage of the ttest_ind API seems to be the main issue, but this question incidentally also touches on the CPython implementation of the random module. A second "stats" issue is the combinatorics of many Bernoulli draws.

The primitive MT operation used by current CPython's random module is genrand_uint32. This is exposed via random.getrandbits (source), which is in turn used by randint via randrange.
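As a quick sanity check (my own sketch, with an arbitrary seed), getrandbits(1) yields one bit per call drawn from the MT19937 output stream, so it can stand in directly for randint(0, 1):

```python
import random

random.seed(0)  # arbitrary seed, for reproducibility only
flips = [random.getrandbits(1) for _ in range(100_000)]

# Each call returns a single bit, 0 or 1
print(min(flips), max(flips))   # the only values that occur are 0 and 1
print(sum(flips) / len(flips))  # the sample mean should sit near 0.5
```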

The stats issue is that there are only a few ways for heads and tails to be "very" similar, i.e. only a few outcomes produce p-values close to 1. This is presumably why the OP takes the time to draw 20k values: using only a thousand values produces histograms with artefacts because some bins end up empty.
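The discreteness is easy to see directly. This small sketch (seed arbitrary) counts how many distinct p-values a two-sample t-test on 0/1 data actually produces — far fewer than the number of simulations, since with 0/1 data the t statistic is a function of the two head counts alone:

```python
import random

from scipy import stats

random.seed(2)  # arbitrary seed, for reproducibility
distinct = set()
for _ in range(2000):
    a = [random.getrandbits(1) for _ in range(50)]
    b = [random.getrandbits(1) for _ in range(50)]
    # The p-value depends only on sum(a) and sum(b), so it can take on
    # relatively few distinct values
    distinct.add(round(stats.ttest_ind(a, b).pvalue, 12))

print(len(distinct))  # far fewer than the 2000 simulations
```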

To fix all of these issues, I'd suggest using random.getrandbits directly and plotting the result as a CDF. Something like:

import random
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

ps = [
    stats.ttest_ind(
        [random.getrandbits(1) for _ in range(1000)],
        [random.getrandbits(1) for _ in range(1000)],
    ).pvalue
    for _ in range(10_000)
]

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], '--k', lw=1, alpha=0.5, label="Expected")
ax.plot(sorted(ps), np.linspace(0, 1, len(ps)), label="Observed")
ax.legend()
ax.set_xlabel("$p$-value")
ax.set_ylabel("CDF");

which might produce:

[CDF of observed p-values against the expected uniform diagonal]

which demonstrates the combinatorial issue nicely while only taking a few seconds to generate.

Sam Mason