If you don't care about the details of what I'm trying to implement, just skip past the horizontal line below.
I am trying to do a bootstrap error estimation on some statistic with NumPy. I have an array `x`, and wish to compute the error on the statistic `f(x)`, for which the usual Gaussian assumptions of error analysis do not hold. `x` is very large.
To do this, I resample `x` using `numpy.random.choice()`, where the size of my resample is the size of the original array, with replacement:
resample = np.random.choice(x, size=len(x), replace=True)
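For example, a quick sanity check of this step with stand-in data (the real `x` is unspecified here, so I assume a seeded normal sample):

```python
import numpy as np

# Stand-in data (assumption): the real x is unspecified in the question.
x = np.random.default_rng(0).normal(size=10_000)

# One bootstrap resample: same length as x, drawn with replacement,
# so individual elements may appear several times or not at all.
resample = np.random.choice(x, size=len(x), replace=True)

print(resample.shape)              # same shape as x
print(np.isin(resample, x).all())  # every drawn value comes from x
```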
This gives me a new realization of `x`. This operation must now be repeated ~1,000 times to give an accurate error estimate. If I generate 1,000 resamples of this nature:
resamples = [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]
and then compute the statistic `f(x)` on each realization:
results = [f(arr) for arr in resamples]
then I have inferred the error of `f(x)` to be something like
np.std(results)
the idea being that even though `f(x)` itself cannot be described using Gaussian error analysis, a distribution of `f(x)` measurements subject to random error can be.
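Putting the pieces together, here is a minimal end-to-end sketch of the above. The data and the statistic are stand-ins I've assumed for illustration (exponential data, `f = np.median`); the real `x` and `f(x)` are whatever the problem dictates:

```python
import numpy as np

# Assumptions for illustration: exponential stand-in data, median as f.
rng = np.random.default_rng(42)
x = rng.exponential(size=5_000)
f = np.median

n_boot = 1000  # number of bootstrap resamples
resamples = [np.random.choice(x, size=len(x), replace=True) for _ in range(n_boot)]
results = [f(arr) for arr in resamples]

err = np.std(results)  # bootstrap estimate of the error on f(x)
print(f"f(x) = {f(x):.4f} +/- {err:.4f}")
```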
Okay, so that's a bootstrap. Now, my problem is that the line
resamples = [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]
is very slow for large arrays. Is there a smarter way to do this without a list comprehension? The second list comprehension
results = [f(arr) for arr in resamples]
can be pretty slow too, depending on the details of the function `f(x)`.