The code below takes 2-3 seconds to run. Is it realistic to make it 5-10x or more faster? I tried a Poisson bootstrap, but it runs slower. I also cannot use joblib or multiprocessing here, because I already have parallelization at a higher level of the code.

import numpy as np

lst = range(1000000)
result = np.random.choice(lst, size=(100, len(lst)), replace=True).mean(axis=1)

I think the main bottleneck is the reshuffling inside np.random.choice; maybe that can be targeted for optimization.
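
For context, a Poisson bootstrap replaces the row-wise resampling with independent Poisson(1) weights per observation. A minimal sketch of that idea follows (illustrative names, not the asker's actual attempt); generating 100 million Poisson variates is itself costly, which is consistent with it coming out slower:

import numpy as np

data = np.arange(1000000, dtype=np.float64)  # stand-in for the real data

# Each observation gets an independent Poisson(1) weight; the bootstrap
# statistic is then the weighted mean of the data.
weights = np.random.poisson(1.0, size=(100, data.size))
poisson_means = (weights * data).sum(axis=1) / weights.sum(axis=1)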

  • Have you tried creating a list with the random numbers you need? Still, it's 1 million so it can take a while. – Wolfgang Dec 08 '21 at 22:55
  • @ComputersEnthusiast `range(1000000)` is just an example. I have data of the same length, so I can't just generate random numbers directly. – Aslan Bayramkulov Dec 08 '21 at 23:00
  • I mean random numbers for the indexes, i.e. sampling, such as `lst[randrange(0, len(lst))]` for the list length, though I doubt it will be much faster. – Wolfgang Dec 08 '21 at 23:05
  • Five to ten _times_ faster? You could try implementing this in Cython (or any other performant language, like Julia). For each of the 100 draws: 1. Sample one random element; 2. Use a [moving average](https://en.wikipedia.org/wiki/Moving_average) to accumulate the average; 3. Goto (1). This avoids generating the `100 x 1000000` array, which _might_ speed things up a bit (see the sketch after these comments). – ForceBru Dec 08 '21 at 23:09
  • @ForceBru, are you sure `cython` will help? He is already using compiled `numpy` functions. There aren't any loops or other things to replace with `c` code. I suspect the problem is the sheer size of the task. Creating 1_000_000_000 random values, that's an 8 GB array, followed by a (compiled) mean on the long dimension, returning a (100,) shape array. – hpaulj Dec 09 '21 at 00:25
  • @hpaulj, computing the moving average that I suggested requires looping, so plain Python code will be really slow - that's why I suggested Cython. The moving average consumes values one at a time and only ever stores one number per row being averaged (so 100 numbers total), which reduces the memory usage from about a billion numbers to a bit more than 100 (sampling random elements probably needs some memory too, but that should be negligible). That's just a hypothesis, though. – ForceBru Dec 09 '21 at 00:31
  • @hpaulj There are "only" 100e6 items, not 1e9. Moreover, Numpy default integers typically take 4 bytes on Windows and 8 on Linux (and Mac?). Thus, the resulting array is about 400-800 MiB, not 8 GiB. This is still quite big though. – Jérôme Richard Dec 09 '21 at 01:28
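
To make that suggestion concrete, here is a minimal pure-Python sketch of the streaming (moving-average) idea; the helper name is made up for illustration. In plain Python it is far slower than the vectorized Numpy call; the point is only that it never builds the 100 x 1000000 array, which is what a Cython/Julia port would exploit:

import random

def streaming_bootstrap_means(data, n_means=100):
    # Bootstrap means via a running average; no (n_means, len(data)) array is built.
    m = len(data)
    means = []
    for _ in range(n_means):
        avg = 0.0
        for k in range(m):
            x = data[random.randrange(m)]  # draw one element with replacement
            avg += (x - avg) / (k + 1)     # incremental (moving) average update
        means.append(avg)
    return means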

2 Answers

TL;DR: np.random.choice is the main source of the slowdown. The thing is that Numpy uses a latency-bound algorithm in this case, and it is very hard to do better sequentially (due to the random-access memory pattern). The random number generation is also fairly expensive.

Indeed, the random-indexed items of the input list must be fetched from memory. Since the input list is converted by Numpy to a contiguous array taking about 4 to 8 MiB, the input array likely does not fit in your CPU cache. As a result, values are fetched from the main RAM. Moreover, the random access pattern prevents the processor from loading the values from RAM quickly. Modern mainstream x86 processor cores can fetch about 10 items concurrently (which is remarkable since the code is supposed to be sequential). However, the latency of current (DDR) RAM is about 50-100 ns. Thus, fetching 100 million random items from the RAM should take about a second. Not to mention that 100 million random values must also be generated (rather slow) and that the 400-800 MiB intermediate buffer must be filled too.
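
As a rough sanity check of that estimate (the figures are ballpark and depend on the hardware):

fetches = 100 * 1000000      # random gathers performed overall
latency = 75e-9              # ~50-100 ns per DRAM access; take ~75 ns
concurrency = 10             # ~10 outstanding cache misses per core
print(fetches * latency / concurrency)  # ~0.75 s, i.e. "about a second"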

One simple way to speed up this latency-bound computation is to use multiple threads. You can use Numba (or Cython) to do that. Moreover, smaller temporary arrays can be used to reduce the overhead of using huge Numpy arrays. The means can be computed on-the-fly.

import numpy as np
import numba as nb

# Assume the input array is contiguous
@nb.njit('float64[:](int_[::1])', parallel=True)
def parallelShuffle(arr):
    n, m = 100, arr.size
    res = np.empty(n, np.float64)
    for i in nb.prange(n):
        s = 0.0
        tmp = np.random.randint(0, m, m)
        for j in range(m):
            s += arr[tmp[j]]  # gather from the input array (random-access reads)
        res[i] = s / m
    return res

lst = range(1000000)
lst = np.fromiter(lst, dtype=np.int_)  # Fast iterable to Numpy conversion
result = parallelShuffle(lst)

This implementation is about 4.5 times faster on my 6-core machine. Note that the integer-based Numba random number generator is, sadly, currently slower than Numpy's.
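
One way to check the speedup locally (a sketch; the exact ratio depends on your core count and RAM):

import time

parallelShuffle(lst)  # warm-up call so any compilation cost stays out of the timing

t0 = time.perf_counter()
ref = np.random.choice(lst, size=(100, len(lst)), replace=True).mean(axis=1)
t1 = time.perf_counter()
fast = parallelShuffle(lst)
t2 = time.perf_counter()
print(f"np.random.choice: {t1 - t0:.2f}s   Numba: {t2 - t1:.2f}s")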

Jérôme Richard
  • Given that you're sampling with replacement over a range of indices, you don't need `lst` at all. I think if you combine my answer and yours, you will get something close to 10x improvement, but OP specifically asked for no more parallelization. – Mad Physicist Dec 09 '21 at 04:06
  • Well, the OP said in the comments that "`range(1000000)` is just example" so I think we cannot assume the list contains 1000000 unique items from 0 to 1000000. If so, I think the call to `np.random.choice` is mandatory. There are ways to speed up this in low-level heavily optimized C/C++ code but I do not see how this could be done in a sequential Python code. AFAIK, using multiple threads is the only known way to overcome the memory latency wall in such a case (hyperthreading/SMT should help a lot). – Jérôme Richard Dec 09 '21 at 18:04

You can get a ~3x boost by not converting a Python `range` object to a Numpy array and selecting from it:

result = np.random.choice(1000000, size=(100, 1000000), replace=True).mean(axis=1)

Selecting a bunch of numbers in this manner is identical to just generating numbers in the same range. Unfortunately, the two methods take the same amount of time:

result = np.random.randint(1000000, size=(100, 1000000)).mean(axis=1)
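
A quick way to check both claims on your own machine (absolute times vary; number=1 keeps the run short):

import timeit

setup = "import numpy as np; lst = range(1000000)"
variants = {
    "choice(range)": "np.random.choice(lst, size=(100, len(lst)), replace=True).mean(axis=1)",
    "choice(int)": "np.random.choice(1000000, size=(100, 1000000), replace=True).mean(axis=1)",
    "randint": "np.random.randint(1000000, size=(100, 1000000)).mean(axis=1)",
}
for name, stmt in variants.items():
    print(name, timeit.timeit(stmt, setup=setup, number=1))
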
Mad Physicist