TL;DR: `np.random.choice` is the main source of slowdown. Numpy uses a latency-bound algorithm in this case, and it is very hard to do much better sequentially (due to the random memory access pattern). The random number generation itself is also fairly expensive.
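For reference, here is a minimal sketch of the kind of baseline this answer assumes (the array size and sample count are illustrative, not taken from your code):

```python
import numpy as np

lst = np.arange(1_000_000)  # hypothetical input array

# 100 resamples of 1_000_000 random items each: 100 million random
# fetches, plus a ~400-800 MiB temporary buffer to hold them all.
samples = np.random.choice(lst, size=(100, 1_000_000), replace=True)
means = samples.mean(axis=1)
```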
Indeed, randomly-indexed items of the input list must be fetched from memory. Since the input list is converted by Numpy to a contiguous array taking about 4 to 8 MiB, the input array likely does not fit in your CPU cache. As a result, values are fetched from the main RAM. Moreover, the random access pattern prevents the processor from loading the values quickly from RAM. Modern mainstream x86 processor cores can fetch about 10 items concurrently (which is amazing since the code is supposed to be sequential). However, the latency of current (DDR) RAM is about 50-100 ns. Thus, fetching 100 million random items from the RAM should take about a second. Not to mention that 100 million random values must also be generated (rather slow) and the 400~800 MiB intermediate buffer must be filled too.
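A quick back-of-envelope calculation makes this concrete (the figures are the rough estimates quoted above, not measurements):

```python
fetches = 100_000_000   # random reads, most of them missing the CPU cache
latency = 75e-9         # midpoint of the ~50-100 ns DDR latency range
in_flight = 10          # concurrent misses one core can roughly sustain

print(f"{fetches * latency / in_flight:.2f} s")  # ~0.75 s for the reads alone
```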
One simple way to speed up this latency-bound computation is to use multiple threads. You can use Numba (or Cython) to do that. Moreover, smaller temporary arrays can be used to reduce the overhead of huge Numpy arrays: the means can be computed on the fly, one sample at a time.
```python
import numba as nb
import numpy as np

# Assume the input array is contiguous
@nb.njit('float64[:](int_[::1])', parallel=True)
def parallelShuffle(arr):
    n, m = 100, arr.size
    res = np.empty(n, np.float64)
    for i in nb.prange(n):                # one resample per parallel iteration
        s = 0.0
        tmp = np.random.randint(0, m, m)  # m random indices for this sample
        for j in range(m):
            s += arr[tmp[j]]              # index arr, not the global lst
        res[i] = s / m                    # mean computed on the fly
    return res

lst = range(1_000_000)
lst = np.fromiter(lst, dtype=np.int_)  # Fast iterable-to-Numpy conversion
result = parallelShuffle(lst)
```
This implementation is about 4.5 times faster on my 6-core machine. Note that the integer-based Numba random number generator is sadly currently slower than the Numpy one.
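If you want to check the speedup on your own machine, a minimal timing comparison could look like this (the `(100, lst.size)` shape mirrors the workload above; adjust it to your actual sizes):

```python
import timeit

# Baseline: np.random.choice + mean over a huge temporary array
t_np = timeit.timeit(
    lambda: np.random.choice(lst, (100, lst.size)).mean(axis=1), number=3)

# Numba version (already compiled by the call above, so no JIT overhead here)
t_nb = timeit.timeit(lambda: parallelShuffle(lst), number=3)

print(f"Numpy: {t_np / 3:.2f} s/run, Numba: {t_nb / 3:.2f} s/run")
```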