
I want to generate a random matrix of shape (1e7, 800), but I find that numpy.random.rand() becomes very slow at this scale. Is there a quicker way?

namespace-Pt
  • The answer to this question may solve your problem: https://stackoverflow.com/questions/2709818/fastest-way-to-generate-1-000-000-random-numbers-in-python – cg-zhou Dec 21 '21 at 10:39
  • Do you have at least 64 GiB of RAM available on your machine? This is the size of the generated array. Because of possible copies and some space taken by the OS, you likely need at least 150 GiB of RAM. If you do not have this space, then your OS will use your storage device which is generally much much slower. If you have enough space, then please provide more details about your context and your *precise* requirements. – Jérôme Richard Dec 21 '21 at 19:19
  • 1
    @JérômeRichard In fact I'm running on a cluster that has nearly 1000GB RAM. I just want to create such an array in a fast way. (Using random.rand() takes 5 minutes or more.) – namespace-Pt Dec 22 '21 at 02:45

3 Answers


A simple way to do that is to write a multi-threaded implementation using Numba:

import numba as nb
import numpy as np

@nb.njit('float64[:,:](int_, int_)', parallel=True)
def genRandom(n, m):
    res = np.empty((n, m))

    # Parallel loop
    for i in nb.prange(n):
        for j in range(m):
            res[i, j] = np.random.rand()

    return res

This is 6.4 times faster than np.random.rand() on my 6-core machine.
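For the shape in the question, the call is simply the following; as noted in the comments, the float64 result alone needs roughly 64 GiB of RAM (1e7 * 800 * 8 bytes):

mat = genRandom(10_000_000, 800)  # ~64 GiB output array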

Note that using 32-bit floats may speed up the computation a bit, although the precision will be lower.
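A minimal sketch of that 32-bit variant (my adaptation of the function above, not code from the original answer):

@nb.njit('float32[:,:](int_, int_)', parallel=True)
def genRandom32(n, m):
    # Same parallel loop, but the output array is float32,
    # halving memory use (~32 GiB for the shape in the question).
    res = np.empty((n, m), dtype=np.float32)
    for i in nb.prange(n):
        for j in range(m):
            res[i, j] = np.float32(np.random.rand())
    return res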

Jérôme Richard

Numba is a good option; another option that might work well is dask.array, which creates lazy blocks of numpy arrays and performs parallel computations on the blocks. On my machine I get a factor of 2 improvement in speed (for a 1e6 x 1e3 matrix, since I don't have enough memory for the full size).

import dask.array as da

rows = 10**6
cols = 10**3

x = da.random.random(size=(rows, cols)).compute()  # takes about 5 seconds

# For comparison, plain numpy:
# import numpy as np
# x = np.random.rand(rows, cols)  # takes about 10 seconds

Note that .compute() at the end is only there to bring the computed array into memory. In general, though, you can keep working on the lazy dask array and exploit the parallel computations for much faster calculations (which can also scale beyond a single machine); see the docs. A sketch of that pattern follows.
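For example, a hedged sketch of keeping the array lazy (the chunks value here is an illustrative assumption, not a tuned setting):

import dask.array as da

# Build the array lazily in (1e5, 800) blocks; nothing is computed yet.
x = da.random.random(size=(10**7, 800), chunks=(10**5, 800))

# Downstream reductions run block-wise in parallel; only the small
# reduced result (here 800 values) is ever materialized.
col_means = x.mean(axis=0).compute()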

SultanOrazbayev

An attempt to compare the answers given so far:

I wrote a script that combines the approaches already given (by SultanOrazbayev and Jérôme Richard) into three functions, one each for the Numba, Dask and NumPy approaches, and measures the time each one spends on a series of differently sized arrays.

The code

import dask.array as da
import matplotlib.pyplot as plt
import numba as nb
import timeit
import numpy as np


@nb.njit('float64[:,:](int_, int_)', parallel=True)
def nmb(n, m):
    res = np.empty((n, m))

    # Parallel loop
    for i in nb.prange(n):
        for j in range(m):
            res[i, j] = np.random.rand()

    return res


def nmp(n, m):
    return np.random.random((n, m))


def dask(n, m):
    return da.random.random(size=(n, m)).compute()


if __name__ == '__main__':
    data = []
    for i in range(1, 16):
        dmm = 2 ** i
        s_nmb = timeit.default_timer()
        nmb(dmm, dmm)
        e_nmb = timeit.default_timer()

        s_nmp = timeit.default_timer()
        nmp(dmm, dmm)
        e_nmp = timeit.default_timer()

        s_dask = timeit.default_timer()
        dask(dmm, dmm)
        e_dask = timeit.default_timer()

        data.append([
            dmm,
            e_nmb - s_nmb,
            e_nmp - s_nmp,
            e_dask - s_dask
        ])

    data = np.array(data)

    plt.plot(data[:, 0], data[:, 1], "-r", label="Numba")
    plt.plot(data[:, 0], data[:, 2], "-g", label="Numpy")
    plt.plot(data[:, 0], data[:, 3], "-b", label="Dask")

    plt.xlabel("Number of elements on each axis")
    plt.ylabel("Time spent (s)")
    plt.legend()
    plt.show()
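One caveat about the timing: each size is measured with a single call, so the numbers can be noisy. A small variation (my suggestion, not part of the original script) would repeat each call and keep the best time, e.g. inside the loop:

# Repeat each measurement a few times and keep the minimum,
# which filters out transient system noise (timeit is already imported).
t_nmb = min(timeit.repeat(lambda: nmb(dmm, dmm), number=1, repeat=3))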

The result

[Plot: random 2D generation time for the different methods]

MSH
  • 1
    Thank you for doing this benchmarking! `numba` is hard to beat.. :) for more involved computations there might be a way to integrate both `dask` and `numba`: https://examples.dask.org/applications/stencils-with-numba.html however I'm not sure about correct implementation for this specific use-case... – SultanOrazbayev Dec 22 '21 at 15:52