
If you don't care about the details of what I'm trying to implement, just skip past the horizontal line below.

I am trying to do a bootstrap error estimation on some statistic with NumPy. I have an array x, and wish to compute the error on the statistic f(x), for which the usual Gaussian assumptions of error analysis do not hold. x is very large.

To do this, I resample x using numpy.random.choice(), where the size of my resample is the size of the original array, with replacement:

resample = np.random.choice(x, size=len(x), replace=True)

This gives me a new realization of x. This operation must now be repeated ~1,000 times to give an accurate error estimate. If I generate 1,000 resamples of this nature:

resamples = [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]

and then compute the statistic f(x) on each realization:

results = [f(arr) for arr in resamples]

then I have inferred the error of f(x) to be something like

np.std(results)

the idea being that even though f(x) itself cannot be described using Gaussian error analysis, the distribution of f(x) values measured over many randomly perturbed resamples can be.
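Putting those pieces together, a minimal end-to-end sketch (with np.mean standing in for the real statistic f and synthetic data standing in for the real x) would be:

import numpy as np

# Minimal bootstrap sketch; np.mean and the random data below are only
# stand-ins for the real f and x.
x = np.random.randn(10000)
f = np.mean

resamples = [np.random.choice(x, size=len(x), replace=True) for _ in range(1000)]
results = [f(arr) for arr in resamples]
error_estimate = np.std(results)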

---
Okay, so that's a bootstrap. Now, my problem is that the line

resamples = [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]

is very slow for large arrays. Is there a smarter way to do this without a list comprehension? The second list comprehension

results = [f(arr) for arr in resamples]

can be pretty slow too, depending on the details of the function f(x).

pretzlstyle
    An approach that I've used, to avoid constructing the subsamples, is to instead create a flag array which is 1 if the corresponding datum is included in the sample and 0 if it is not. Then creating a subsample means permuting the flags. If allocating and filling the subsample array is a substantial part of the computation time, then working with flags is a win. – Robert Dodier Oct 24 '17 at 17:40
  • @RobertDodier Thanks! This is very similar to Divakar's answer below, though you use array masking rather than indexing. I'm not sure which of those would be faster - it's likely the same, I would think. – pretzlstyle Oct 24 '17 at 18:06
  • Yes, just using a list of indices is equivalent and maybe simpler if you just need to include/exclude data. I have often wanted weights other than 1 or 0, e.g. weight > 1 to represent sampling with replacement or emphasizing some data (e.g. a class which has few examples in the data) more than others. If you don't need non-0/1 weights, you can just use a list of indices. – Robert Dodier Oct 24 '17 at 18:35
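A minimal sketch of that flag/weight idea (my own illustration, not the commenter's code; it assumes the statistic can be written in terms of per-datum counts):

import numpy as np

# Instead of materializing a resampled copy of x, draw how many times each
# datum appears in the resample and use those counts as weights.
x = np.random.randn(5000)                  # placeholder data
n = len(x)

idx = np.random.randint(0, n, size=n)      # one resample with replacement
weights = np.bincount(idx, minlength=n)    # weights[i] = times x[i] was drawn

# Weighted mean as a stand-in statistic; the payoff depends on whether the
# real f can consume weights directly.
weighted_mean = np.sum(weights * x) / n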

2 Answers


Since we are allowing repetitions, we could generate all the indices in one go with np.random.randint and then simply index into x to get the equivalent resamples, like so -

num_samples = 1000
idx = np.random.randint(0,len(x),size=(num_samples,len(x)))
resamples_arr = x[idx]

One more approach would be to generate random numbers from a uniform distribution with numpy.random.rand and scale them by the length of the array, like so -

resamples_arr = x[(np.random.rand(num_samples,len(x))*len(x)).astype(int)]

Runtime test with x of 5000 elems -

In [221]: x = np.random.randint(0,10000,(5000))

# Original soln
In [222]: %timeit [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]
10 loops, best of 3: 84 ms per loop

# Proposed soln-1
In [223]: %timeit x[np.random.randint(0,len(x),size=(1000,len(x)))]
10 loops, best of 3: 76.2 ms per loop

# Proposed soln-2
In [224]: %timeit x[(np.random.rand(1000,len(x))*len(x)).astype(int)]
10 loops, best of 3: 59.7 ms per loop

For very large x

With a very large array x of 600,000 elements, you might not want to create all those indices for 1000 samples at once. In that case, the per-sample solutions would have timings something like this -

In [234]: x = np.random.randint(0,10000,(600000))

# Original soln
In [235]: %timeit np.random.choice(x, size=len(x), replace=True)
100 loops, best of 3: 13 ms per loop

# Proposed soln-1
In [238]: %timeit x[np.random.randint(0,len(x),len(x))]
100 loops, best of 3: 12.5 ms per loop

# Proposed soln-2
In [239]: %timeit x[(np.random.rand(len(x))*len(x)).astype(int)]
100 loops, best of 3: 9.81 ms per loop
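
For such sizes, a plain loop over one of the per-sample variants keeps memory flat; a rough sketch (with np.mean standing in only as a placeholder for f) -

num_samples = 1000
results = np.empty(num_samples)
for i in range(num_samples):
    # generate one resample's worth of indices at a time, never all of them at once
    idx = (np.random.rand(len(x)) * len(x)).astype(int)
    results[i] = np.mean(x[idx])           # np.mean is a placeholder for f
error_estimate = results.std()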
Divakar
  • Ah, of course, thanks so much. I will likely implement the approach with `rand`. Also, I said `x` was "very big"; it is over 600,000 elements long. So, by extrapolating your timing results, I could save around 4 hours of run time – pretzlstyle Oct 24 '17 at 18:04
  • @jphollowed Wow with such a large `x`, you might not get to create all the indices as proposed in this post. But, give it a try I guess if you have enough system memory. – Divakar Oct 24 '17 at 18:06
  • @jphollowed Also check out `For very large x` section. – Divakar Oct 24 '17 at 18:14

As alluded to by @Divakar, you can pass a tuple to size to get a 2D array of resamples rather than using a list comprehension.

Here, assume for a second that f is just np.sum rather than some other function. Then:

import numpy as np

x = np.random.randn(100000)
f = np.sum  # stand-in for the real statistic, per the assumption above
resamples = np.random.choice(x, size=(1000, x.shape[0]), replace=True)
# resamples.shape = (1000, 100000)
results = np.apply_along_axis(f, axis=1, arr=resamples)
print(results.shape)
# (1000,)

Here np.apply_along_axis is admittedly just a glorified for-loop equivalent to [f(arr) for arr in resamples]. But I am not exactly sure if you need to index x here based on your question.
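
If f happens to be a NumPy reduction that takes an axis argument (sum, mean, std, ...), the loop can be skipped entirely, e.g.:

results = resamples.sum(axis=1)  # same result as np.apply_along_axis(np.sum, axis=1, arr=resamples), without the Python-level loop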

Brad Solomon
  • This is very simple and substantially faster, about 4 times. My original code to do the resampling takes 16-ish ms per loop, and specifying the size as a tuple takes less than 4 ms – pretzlstyle Oct 24 '17 at 18:20
  • The only problem with this is that it does not work in general - that is, if `f(x)` happens to be a second-order statistic, implying that I want to sample *without* replacement. The desired behavior would be for each *row* specified by the `size` tuple to forbid replacement, while not minding whether subsequent rows share samples with one another. That prevents me from using this solution with replace=False, since I am told `ValueError: Cannot take a larger sample than population when 'replace=False'` – pretzlstyle Oct 24 '17 at 18:36
  • Simply, this method cannot be used to take N resample realizations of a dataset, where each realization is sampled without replacement. – pretzlstyle Oct 24 '17 at 18:41
  • @jphollowed For `replace=False`, a vectorized approach would be - https://stackoverflow.com/a/45438143/ – Divakar Oct 24 '17 at 18:42
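
A sketch of that kind of vectorized per-row sampling without replacement (my own illustration; the linked answer has the full treatment): argsort random keys so that every row gets an independent permutation of the indices -

# each row of perm is an independent random permutation of range(len(x)),
# so keeping the first sample_size columns samples without replacement per row
num_samples = 1000
sample_size = len(x)              # must be <= len(x) when replacement is forbidden
perm = np.random.rand(num_samples, len(x)).argsort(axis=1)
resamples_no_replace = x[perm[:, :sample_size]]

For x as large as 600,000 elements this allocates a large float matrix, so looping over np.random.permutation(x) per resample may be preferable there.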