
I'm working with some rather large, dense numpy float arrays that currently reside on disk in PyTables CArrays. I need to be able to perform efficient dot products using these arrays, for example C = A.dot(B), where A is a huge (~1E4 x 3E5 float32) memory-mapped array, and B and C are smaller numpy arrays that are resident in core memory.

What I'm doing at the moment is copying the data into memory-mapped numpy arrays using np.memmap, then calling np.dot directly on the memory-mapped arrays. This works, but I suspect that the standard np.dot (or rather the underlying BLAS functions it calls) is probably not very efficient in terms of the number of I/O operations required to compute the result.
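
Concretely, what I'm doing looks roughly like this (the filename and exact shapes here are just illustrative placeholders):

import numpy as np

# the PyTables data has been copied out to a raw binary file that backs the memmap
A = np.memmap('A.dat', dtype=np.float32, mode='r', shape=(10000, 300000))
B = np.random.randn(300000, 64).astype(np.float32)  # small enough to hold in core memory
C = np.dot(A, B)                                     # C is an ordinary in-core ndarray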

I came across an interesting example in this review article: a naive dot product computed using three nested loops, like this:

def naive_dot(A, B, C):
    n = A.shape[0]  # A, B and C are all assumed to be square (n, n) arrays
    for ii in xrange(n):
        for jj in xrange(n):
            C[ii,jj] = 0
            for kk in xrange(n):
                C[ii,jj] += A[ii,kk]*B[kk,jj]
    return C

requires O(n^3) I/O operations to compute, since every scalar access to A and B can trigger a separate read from disk.

However, by processing the arrays in appropriately-sized blocks:

def block_dot(A, B, C, M):
    n = A.shape[0]            # again square (n, n) arrays
    b = int(np.sqrt(M / 3))   # block size: three b x b blocks must fit in core memory
    for ii in xrange(0, n, b):
        for jj in xrange(0, n, b):
            C[ii:ii+b,jj:jj+b] = 0
            for kk in xrange(0, n, b):
                # accumulate into a scratch block so that naive_dot's zeroing
                # doesn't clobber the running sum in C
                C[ii:ii+b,jj:jj+b] += naive_dot(A[ii:ii+b,kk:kk+b],
                                                B[kk:kk+b,jj:jj+b],
                                                np.zeros_like(C[ii:ii+b,jj:jj+b]))
    return C

where M is the maximum number of elements that will fit into core memory, the number of I/O operations is reduced to O(n^3 / sqrt(M)).
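
To put rough numbers on that (my figures, not the article's): with 4 GB to spare for float32 blocks, M ≈ 1e9 elements, so b = sqrt(M / 3) ≈ 18,000 and the blocked scheme needs on the order of sqrt(M) ≈ 30,000 times fewer I/O operations than the naive version.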

How smart is np.dot and/or np.memmap? Does calling np.dot perform an I/O-efficient blockwise dot product? Does np.memmap do any fancy caching that would improve the efficiency of this type of operation?

If not, is there some pre-existing library function that performs I/O efficient dot products, or should I try and implement it myself?

Update

I've done some benchmarking with a hand-rolled implementation of np.dot that operates on blocks of the input array, which are explicitly read into core memory. This data at least partially addresses my original question, so I'm posting it as an answer.

– ali_m
  • `numpy.dot` is not a naive algorithm. See, for example, [this SO question](http://stackoverflow.com/questions/10442365/why-is-matrix-multiplication-faster-with-numpy-than-with-ctypes-in-python) – jonrsharpe Jan 07 '14 at 23:11
  • @jonrsharpe I'm aware that BLAS libraries have all kinds of fancy optimizations, multithreading and whatnot. However, I'm not sure whether these optimizations are just for reducing the number of flops, or whether they would also be geared towards I/O efficiency. I suppose it may also depend on the particular BLAS library in question (OpenBLAS in my case). – ali_m Jan 07 '14 at 23:14
  • [SWAG](https://en.wikipedia.org/wiki/Scientific_Wild-Ass_Guess): Have you looked into [numexpr](https://code.google.com/p/numexpr/) on google code and [at the Cheese factory](https://pypi.python.org/pypi/numexpr)? – Mark Mikofski Jan 07 '14 at 23:39
  • @MarkMikofski Thanks, but that's not really the sort of thing I'm looking for - firstly because I want to do fast linear algebra operations on whole matrices rather than elementwise operations, and secondly because I'm mainly I/O bound rather than CPU bound in this case. – ali_m Jan 07 '14 at 23:49
  • in that case what about [cuda](https://pypi.python.org/pypi/pycuda)? – Mark Mikofski Jan 08 '14 at 00:01
  • @MarkMikofski No, when I say that I am "I/O-bound", I mean that the main factor slowing me down is having to read data from the hard disk into system memory. Being able to process things in parallel won't really speed things up at all if the limiting factor is reading it off the hard disk in the first place. – ali_m Jan 08 '14 at 00:11
  • The only benchmark that matters is your code. Have you tried running the various implementations (on a smaller case with an appropriately adjusted environment) to see what kind of performance you get? If possible, try to find an algorithm that doesn't require matrix multiplication, which even in the best (possible future) case is quadratic. – jfs Jan 10 '14 at 04:25
  • related: [Numpy efficient big matrix multiplication](http://stackoverflow.com/q/19358984/4279) – jfs Jan 10 '14 at 04:27
  • @J.F.Sebastian I'm trying to implement [this algorithm](http://arxiv.org/pdf/1007.5510.pdf) for approximating the SVD of large matrices. I don't think there's a way to do it without matrix multiplication. – ali_m Jan 10 '14 at 11:16
  • @J.F.Sebastian I've updated my question with some benchmarking data – ali_m Jan 13 '14 at 03:54
  • hmm, shouldn't you be interested in np.einsum instead of np.dot? Another question: are they sparse? – usethedeathstar Jan 14 '14 at 12:23
  • @usethedeathstar 1) I haven't tried `np.einsum` yet because I couldn't think of any particular reason why it might be faster than `np.dot`. For computing the dot product of two arrays that are in core memory, `np.dot` will be faster than the equivalent call to `np.einsum`, since it can use more heavily optimized BLAS functions. In my case there would probably be almost no difference, since I'm I/O bound. 2) No, as I said in the description they are dense matrices. – ali_m Jan 14 '14 at 14:08
  • @usethedeathstar quick example: `np.dot(a, b)`, where `a` is a `(21474836, 200)` `memmap` and `b` is a `(200, 10)` `ndarray`, takes 36.5s, whereas `np.einsum('ij,jk', a, b)` takes 1min 15s – ali_m Jan 14 '14 at 18:50

3 Answers


I've implemented a function for applying np.dot to blocks that are explicitly read into core memory from the memory-mapped array:

import numpy as np

def _block_slices(dim_size, block_size):
    """Generator that yields slice objects for indexing into
    sequential blocks of an array along a particular axis
    """
    count = 0
    while True:
        yield slice(count, count + block_size, 1)
        count += block_size
        if count >= dim_size:
            # the final (possibly partial) block has been yielded
            return

def blockwise_dot(A, B, max_elements=int(2**27), out=None):
    """
    Computes the dot product of two matrices in a block-wise fashion. 
    Only blocks of `A` with a maximum size of `max_elements` will be 
    processed simultaneously.
    """

    m,  n = A.shape
    n1, o = B.shape

    if n1 != n:
        raise ValueError('matrices are not aligned')

    if A.flags.f_contiguous:
        # prioritize processing as many columns of A as possible
        max_cols = max(1, max_elements / m)
        max_rows =  max_elements / max_cols

    else:
        # prioritize processing as many rows of A as possible
        max_rows = max(1, max_elements / n)
        max_cols =  max_elements / max_rows

    if out is None:
        out = np.empty((m, o), dtype=np.result_type(A, B))
    elif out.shape != (m, o):
        raise ValueError('output array has incorrect dimensions')

    for mm in _block_slices(m, max_rows):
        out[mm, :] = 0
        for nn in _block_slices(n, max_cols):
            A_block = A[mm, nn].copy()  # copy to force a read
            out[mm, :] += np.dot(A_block, B[nn, :])
            del A_block

    return out

I then did some benchmarking to compare my blockwise_dot function to the normal np.dot function applied directly to a memory-mapped array (see below for the benchmarking script). I'm using numpy 1.9.0.dev-205598b linked against OpenBLAS v0.2.9.rc1 (compiled from source). The machine is a quad-core laptop running Ubuntu 13.10, with 8GB RAM and an SSD, and I've disabled the swap file.

Results

As @Bi Rico predicted, the time taken to compute the dot product scales beautifully linearly with the size of A. Operating on cached blocks of A gives a huge performance improvement over just calling the normal np.dot function on the whole memory-mapped array:

[Figure: wall-clock time for C = A.dot(B) vs. the size of A, comparing np.dot on the whole memory-mapped array against blockwise_dot with 1 GB, 2 GB and 4 GB blocks]

It's surprisingly insensitive to the size of the blocks being processed - there's very little difference between the time taken to process the array in blocks of 1GB, 2GB or 4GB. I conclude that whatever caching np.memmap arrays natively implement, it seems to be very suboptimal for computing dot products.

Further questions

It's still a bit of a pain to have to manually implement this caching strategy, since my code will probably have to run on machines with different amounts of physical memory, and potentially different operating systems. For that reason I'm still interested in whether there are ways to control the caching behaviour of memory-mapped arrays in order to improve the performance of np.dot.
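
For what it's worth, one way I can think of to adapt max_elements to whatever machine the code ends up on is to query the available physical memory at runtime. A minimal sketch (this assumes the psutil package; the fraction of free RAM to use is an arbitrary choice):

import numpy as np
import psutil  # assumed third-party dependency; any way of querying free RAM would do

def pick_max_elements(dtype=np.float32, fraction=0.5):
    """Choose max_elements as a fraction of the currently available physical RAM."""
    free_bytes = psutil.virtual_memory().available
    return int(fraction * free_bytes) // np.dtype(dtype).itemsize

# e.g. C = blockwise_dot(A, B, max_elements=pick_max_elements())

This only controls how much of A my own code reads per block, though - it doesn't touch whatever caching the OS does for the memmap itself, which is what I'd really like to get a handle on.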

I noticed some odd memory handling behaviour as I was running the benchmarks - when I called np.dot on the whole of A I never saw the resident set size of my Python process exceed about 3.8GB, even though I have about 7.5GB of RAM free. This leads me to suspect that there is some limit imposed on the amount of physical memory an np.memmap array is allowed to occupy - I had previously assumed that it would use whatever RAM the OS allows it to grab. In my case it might be very beneficial to be able to increase this limit.

Does anyone have any further insight into the caching behaviour of np.memmap arrays that would help to explain this?

Benchmarking script

import os
import time
import numpy as np

# _block_slices and blockwise_dot are defined above

def generate_random_mmarray(shape, fp, max_elements):
    A = np.memmap(fp, dtype=np.float32, mode='w+', shape=shape)
    max_rows = max(1, max_elements / shape[1])
    max_cols =  max_elements / max_rows
    for rr in _block_slices(shape[0], max_rows):
        for cc in _block_slices(shape[1], max_cols):
            A[rr, cc] = np.random.randn(*A[rr, cc].shape)
    return A

def run_bench(n_gigabytes=np.array([16]), max_block_gigabytes=6, reps=3,
              fpath='temp_array'):
    """
    time C = A * B, where A is a big (n, n) memory-mapped array, and B and C are
    (n, o) arrays resident in core memory
    """

    standard_times = []
    blockwise_times = []
    differences = []
    nbytes = n_gigabytes * 2 ** 30
    o = 64

    # float32 elements
    max_elements = int((max_block_gigabytes * 2 ** 30) / 4)

    for nb in nbytes:

        # float32 elements
        n = int(np.sqrt(nb / 4))

        with open(fpath, 'w+') as f:
            A = generate_random_mmarray((n, n), f, (max_elements / 2))
            B = np.random.randn(n, o).astype(np.float32)

            print "\n" + "-"*60
            print "A: %s\t(%i bytes)" %(A.shape, A.nbytes)
            print "B: %s\t\t(%i bytes)" %(B.shape, B.nbytes)

            best = np.inf
            for _ in xrange(reps):
                tic = time.time()
                res1 = np.dot(A, B)
                t = time.time() - tic
                best = min(best, t)
            print "Normal dot:\t%imin %.2fsec" %divmod(best, 60)
            standard_times.append(best)

            best = np.inf
            for _ in xrange(reps):
                tic = time.time()
                res2 = blockwise_dot(A, B, max_elements=max_elements)
                t = time.time() - tic
                best = min(best, t)
            print "Block-wise dot:\t%imin %.2fsec" %divmod(best, 60)
            blockwise_times.append(best)

            diff = np.linalg.norm(res1 - res2)
            print "L2 norm of difference:\t%g" %diff
            differences.append(diff)

        del A, B
        del res1, res2
        os.remove(fpath)

    return (np.array(standard_times), np.array(blockwise_times), 
            np.array(differences))

if __name__ == '__main__':
    n = np.logspace(2,5,4,base=2)
    standard_times, blockwise_times, differences = run_bench(
                                                    n_gigabytes=n,
                                                    max_block_gigabytes=4)

    np.savez('bench_results', standard_times=standard_times, 
             blockwise_times=blockwise_times, differences=differences)
– ali_m
  • please post your system parameters; are python/numpy and all your packages x64? – mrgloom Jan 15 '14 at 14:10
  • @mrgloom Everything is x64. All the other relevant parameters are described in my answer. – ali_m Jan 15 '14 at 22:35
  • You should probably be able to reduce the cost of the dot products on slices with something like the [Strassen algorithm](http://mathworld.wolfram.com/StrassenFormulas.html). But it would cost you more memory (which means smaller slices, I guess). – Mehdi Jan 16 '14 at 16:02
  • @Mehdi That's nice to know for future use. Unfortunately, memory consumption is what I'm most limited by at the moment. The reduced numerical stability might also be an issue. My guess is that for the actual dot product it would be very hard to beat the optimized BLAS functions in terms of speed. – ali_m Jan 16 '14 at 16:49

I don't think numpy optimizes the dot product for memmap arrays. If you look at the code for matrix multiplication, which I got here, you'll see that the function PyArray_MatrixProduct2 (as currently implemented) computes the values of the result matrix in C memory order:

op = PyArray_DATA(ret); os = PyArray_DESCR(ret)->elsize;
axis = PyArray_NDIM(ap1)-1;
it1 = (PyArrayIterObject *)
    PyArray_IterAllButAxis((PyObject *)ap1, &axis);
it2 = (PyArrayIterObject *)
    PyArray_IterAllButAxis((PyObject *)ap2, &matchDim);
NPY_BEGIN_THREADS_DESCR(PyArray_DESCR(ap2));
while (it1->index < it1->size) {
    while (it2->index < it2->size) {
        dot(it1->dataptr, is1, it2->dataptr, is2, op, l, ret);
        op += os;
        PyArray_ITER_NEXT(it2);
    }
    PyArray_ITER_NEXT(it1);
    PyArray_ITER_RESET(it2);
}

In the above code, op is the return matrix, dot is the 1d dot product function and it1 and it2 are iterators over the input matrices.

That being said, it looks like your code might already be doing the right thing. In this case the optimal performance is actually much better than O(n^3 / sqrt(M)): you can limit your I/O to reading each element of A from disk only once, i.e. a number of reads linear in the size of A. Memmap arrays naturally have to do some caching behind the scenes, and the inner loop operates on it2, so if A is in C-order and the memmap cache is big enough, your code might already be working. You can enforce caching of rows of A explicitly by doing something like:

def my_dot(A, B, C):
    n = A.shape[0]
    for ii in xrange(n):
        A_ii = np.array(A[ii, :])  # force the row of A to be read into core memory
        C[ii, :] = A_ii.dot(B)
    return C
– Bi Rico
  • That's reassuring - I wonder to what extent other linalg operations will tend to work nicely with the cache of memmapped arrays. Do you happen to know offhand whether it's possible to control the cache size? I've never found a good resource explaining how caching and memory usage are controlled by memmap. – ali_m Jan 08 '14 at 00:29
  • Note that PyArray_MatrixProduct2 is *only* used by np.dot in the cases where BLAS cannot be called (e.g. non-BLAS compatible memory order, non-float data type, no BLAS library installed). See [here](https://github.com/numpy/numpy/blob/master/numpy/core/blasdot/_dotblas.c#L374) – pv. Jan 08 '14 at 09:38
  • Based on the fact that it uses 4 of my cores, `np.dot` does indeed seem to call BLAS when multiplying a memmapped float32 array with a non-memmapped float32 array, so `PyArray_MatrixProduct2` probably doesn't get called. – ali_m Jan 08 '14 at 15:25

I recommend you use PyTables instead of numpy.memmap. Also read their presentations about compression; it sounds strange, but it seems that the sequence "compress -> transfer -> uncompress" is faster than just transferring the data uncompressed.

Also use np.dot with MKL. I don't know how numexpr (PyTables also seems to have something like it) can be used for matrix multiplication, but for calculating, say, the Euclidean norm it's the fastest way (compared with numpy).
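
For example, the kind of elementwise reduction I mean looks roughly like this (just a sketch, not benchmarked here; numexpr is on PyPI):

import numpy as np
import numexpr as ne

a = np.random.rand(10**7).astype(np.float32)
# numexpr compiles the whole expression and evaluates it in one multithreaded pass
norm = np.sqrt(ne.evaluate('sum(a ** 2)'))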

Try to benchmark this sample code:

import numpy as np
import tables
import time
n_row=1000
n_col=1000
n_batch=100
def test_hdf5_disk():
    rows = n_row
    cols = n_col
    batches = n_batch
    #settings for all hdf5 files
    atom = tables.Float32Atom()
    filters = tables.Filters(complevel=9, complib='blosc') # tune parameters
    Nchunk = 4*1024  # ?
    chunkshape = (Nchunk, Nchunk)
    chunk_multiple = 1
    block_size = chunk_multiple * Nchunk

    fileName_A = 'carray_A.h5'
    shape_A = (n_row*n_batch, n_col)  # predefined size
    h5f_A = tables.open_file(fileName_A, 'w')
    A = h5f_A.create_carray(h5f_A.root, 'CArray', atom, shape_A, chunkshape=chunkshape, filters=filters)
    for i in range(batches):
        data = np.random.rand(n_row, n_col)
        A[i*n_row:(i+1)*n_row]= data[:]
    rows = n_col
    cols = n_row
    batches = n_batch
    fileName_B = 'carray_B.h5'
    shape_B = (rows, cols*batches)  # predefined size
    h5f_B = tables.open_file(fileName_B, 'w')
    B = h5f_B.create_carray(h5f_B.root, 'CArray', atom, shape_B, chunkshape=chunkshape, filters=filters)
    sz= rows/batches
    for i in range(batches):
        data = np.random.rand(sz, cols*batches)
        B[i*sz:(i+1)*sz]= data[:]
    fileName_C = 'CArray_C.h5'
    shape = (A.shape[0], B.shape[1])
    h5f_C = tables.open_file(fileName_C, 'w')
    C = h5f_C.create_carray(h5f_C.root, 'CArray', atom, shape, chunkshape=chunkshape, filters=filters)
    sz= block_size
    t0= time.time()
    for i in range(0, A.shape[0], sz):
        for j in range(0, B.shape[1], sz):
            for k in range(0, A.shape[1], sz):
                C[i:i+sz,j:j+sz] += np.dot(A[i:i+sz,k:k+sz],B[k:k+sz,j:j+sz])
    print (time.time()-t0)
    h5f_A.close()
    h5f_B.close()
    h5f_C.close()

The problem is that I don't know how to tune the chunk size and compression level for the current machine, so I think performance can depend on these parameters.

Also please note that all the matrices in the sample code are stored on disk; if some of them were stored in RAM, I think it would be faster.

By the way, I'm using a 32-bit machine, and with numpy.memmap I have some limitations on matrix size (I'm not sure, but it seems that the view size can only be ~2 GB), whereas PyTables has no such limitation.

– mrgloom
  • Operating on PyTables arrays is somewhat attractive, in part because the data is already stored in PyTables arrays. However, they are trickier to deal with than numpy arrays. I also have to perform dot products on the transpose of `A`, and since they lack a transpose method, this makes my indexing a lot more awkward. The biggest problem may be choosing an appropriate chunkshape, since I also have to perform operations on single rows/columns of `A` as well as dot products which are best performed on square blocks. – ali_m Jan 15 '14 at 22:40
  • Whether PyTables arrays will be faster than memmap arrays will all depend on how compressible my real data is, and therefore how much I/O bandwidth I can save. Unfortunately I don't have a real dataset on my local machine to test (as I said, they're rather large...), but I can tell you that with the Gaussian synthetic data I've been using there is no performance advantage to using PyTables CArrays rather than memmaps. This isn't surprising at all, since random data is by definition incompressible. I'll do some benchmarks with real data when I get a chance. – ali_m Jan 15 '14 at 22:45
  • here is also some advice: https://groups.google.com/forum/#!topic/pytables-users/1jJhriRLDS4 – mrgloom Jan 16 '14 at 07:32
  • In that thread I think Anthony Scopatz assumes that your input arrays are small enough to be held in memory. Sure, it would be quicker to call `np.dot` on the whole arrays, but I obviously can't do that. – ali_m Jan 16 '14 at 09:07