
I'm implementing a special case of EM-GMM.

X is the data matrix of shape [1000000, 900] and is a numpy memmap object.
Q is a precision matrix of shape [900, 900] and is an ndarray.

I'm also using the multiprocessing library to go over 200 Q matrices concurrently on 40 cores, using the same data matrix (X).

It works over smaller dimensions like [1mil, 196] and [1mil, 400],
but when I try to run the [1mil, 900] case, at some point one of the processes throws an exception:

OSError: [Errno 12] Cannot allocate memory

I guess the issue comes from two big calculations I have, which probably allocate large intermediate matrices.

As part of the E-step I need to calculate:
np.sum(X.dot(Q) * X, axis=1)

As part of the M-step I need to calculate (W is a [1mil, 1] weights vector):
(X.T * W).dot(X)
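
For scale, a rough back-of-the-envelope estimate (not a measurement): each X-shaped float64 intermediate these expressions create, such as the result of X.dot(Q), is several gigabytes per process:

rows, cols, float64_bytes = 1_000_000, 900, 8
intermediate_gib = rows * cols * float64_bytes / 2 ** 30  # ~6.7 GiB per X-shaped temporary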

In the future I will have to run this EM-GMM over data of even bigger size (of shape [2mil, 2500] and even [2mil, 10k]).
What can I do to make those calculations more memory efficient?

EDIT:

I've noticed that the worker initialization uses pickle, so the X-matrix is turned into an ndarray and the workers don't share it (which means the X-matrix is duplicated for every worker and fills my RAM).

I have an idea of how to solve it, and will update if it's fixed.
But if anyone has a good idea of how to deal with it, I'd be grateful.

Shahaf Finder
  • Do `np.einsum('ij, jk, ik -> i', X, Q, X)` and `np.einsum('ji, ij, jk -> ik', X, W, X)` help? Maybe with an `optimize = True` parameter in `np.einsum`? Not an expert on `memmap` but that should prevent some intermediate matrices from being made (hopefully). – Daniel F Oct 09 '18 at 08:56
  • The * multiplication is an element-wise multiplication, not a matrix dot product, so I don't think einsum will help – Shahaf Finder Oct 09 '18 at 09:00
  • `np.einsum('ij, ij -> ij', A, B)` *is* element-wise multiplication. It's just that `einsum` allows the linear algebra to be done all in one step (i.e. `ij, jk, ik -> ik, ik -> ik -> i` and `ji, ij, jk -> ij, jk -> ik`). That's the benefit of Einstein summation notation - you can do most common linear algebra operations within one framework. – Daniel F Oct 09 '18 at 10:16
  • Oh, I didn't know that. I'll try that, thanks :) – Shahaf Finder Oct 09 '18 at 10:32
  • After a few test runs on smaller-scale data, it seems like einsum is slower than the way I did it, probably because of the memmap object. My guess is that it allocates memory for X twice to pass it to the C implementation of einsum. I'll also start a big-matrix run to see if it runs out of memory, but it'll take some time until it crashes – Shahaf Finder Oct 09 '18 at 12:05
  • Are you limited to numpy memmap or can you also make use of more sophisticated storage systems like HDF5? Numpy memmap only provides good performance when reading data contiguously from disk (along the last axis in a C-ordered array). All your calculations can be split into parts; basically even in your last example [2mil, 10k], less than 2GB of RAM should be sufficient (well, even less, but then the disk IO will be far too high). You do share the X-matrix between workers rather than copying it to every worker, right? What's your RAM limitation? – max9111 Oct 09 '18 at 15:48
  • I do want to share my X-matrix between workers for read-only access. And I just noticed that the memmap is turned into an ndarray when initializing the workers; I'll write an update on that in the original question. I have about 128GB of RAM, but it's shared with others in my lab. Are you sure 2GB of RAM is enough? Maybe 2GB for each worker (I have 40 of them)? – Shahaf Finder Oct 09 '18 at 17:06

1 Answer


It turned out that there were two unrelated issues that caused the RAM overuse.

First, the memmap object was fully read from disk when pickled for the multiprocessing workers.
This duplication of the data allocated an extra 6.7GB of RAM for each worker.
To solve this, I created a shared RawArray, loaded the data into it, and in each worker wrapped it with np.frombuffer.
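
A minimal sketch of that sharing approach, assuming float64 data and a Pool initializer (the helper names and the per-Q worker function below are illustrative, not my actual code):

import numpy as np
from multiprocessing import Pool, RawArray

_shared = {}  # module-level store so each worker can find the shared X

def init_worker(raw_X, shape):
    # Re-wrap the shared buffer as an ndarray; no data is copied here.
    _shared['X'] = np.frombuffer(raw_X, dtype=np.float64).reshape(shape)

def make_shared_array(X_memmap):
    n, d = X_memmap.shape
    raw_X = RawArray('d', n * d)  # shared, lock-free float64 buffer
    np.frombuffer(raw_X, dtype=np.float64).reshape(n, d)[:] = X_memmap  # one-time read from disk
    return raw_X, (n, d)

# Usage (run_em_for_Q and q_matrices are placeholders):
# raw_X, shape = make_shared_array(X)
# with Pool(processes=40, initializer=init_worker, initargs=(raw_X, shape)) as pool:
#     results = pool.map(run_em_for_Q, q_matrices)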

Second, both X.dot(Q) and (X.T * W) resulted in numpy allocating another X-shaped matrix, which is another 6.7GB of RAM.
I created a variation of the answer from this thread: https://stackoverflow.com/a/21096605/5572523
Since my matrix is tall and skinny, I sliced over rows:

def _block_slices(dim_size, block_size):
    # Yield consecutive row slices of at most block_size rows covering [0, dim_size).
    count = 0
    while count < dim_size:
        yield slice(count, count + block_size, 1)
        count += block_size

And now I can iterate over batches of the data (this also added a bit of extra speedup when dealing with weight = 0).

I set max_elements = 2 ** 27 because I'm using float64: 2**27 elements * 8 bytes is a 1GiB block.
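
To make the block size concrete (an illustration with this question's dimensions, not part of my original code):

max_elements = 2 ** 27                  # as above: ~1GiB worth of float64 values
n, d = 1_000_000, 900                   # dimensions of X in this question
max_rows = max(1, max_elements // d)    # 149130 rows per block
# _block_slices(n, max_rows) then yields 7 slices:
# slice(0, 149130, 1), slice(149130, 298260, 1), ..., slice(894780, 1043910, 1)
# (the last slice overshoots n; plain numpy slicing clips it safely)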

So (X.T * W).dot(X) turned to:

def weighted_outer_prod(X, W):
    n, d = X.shape
    # Bound the block size so each batch holds at most max_elements values.
    max_rows = max(1, int(max_elements / d))
    sigma = np.zeros([d, d])
    for mm in _block_slices(n, max_rows):
        sigma += batch_weighted_outer_prod(X[mm, :], W[mm])
    return sigma

def batch_weighted_outer_prod(batch, W):
    nz = W > 0  # skip rows whose weight is zero
    buff = np.empty(batch[nz].shape)
    np.multiply(batch[nz], W[nz, np.newaxis], out=buff)
    sigma = buff.T.dot(batch[nz])
    del buff
    return sigma

And np.sum(X.dot(Q) * X, axis=1) turned to (don't mind the function name):

def calc_l_k(X, Q):
    n, d = X.shape
    max_rows = max(1, int(max_elements / d))
    l_k = np.empty(n)
    for mm in _block_slices(n, max_rows):
        l_k[mm] = batch_l_k(X[mm, :], Q)
    return l_k


def batch_l_k(batch, Q):
    # Reuse one batch-sized buffer for both the dot product and the element-wise product.
    buff = np.empty(batch.shape)
    np.dot(batch, Q, out=buff)
    np.multiply(buff, batch, out=buff)
    l_k = -0.5 * np.sum(buff, axis=1)
    del buff
    return l_k
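
Putting it together (assuming X, W, Q and max_elements are already defined as above; note that batch_l_k also folds in the -0.5 factor, presumably from the Gaussian log-likelihood, which is why the function name differs from the original expression):

sigma = weighted_outer_prod(X, W)   # replaces (X.T * W).dot(X)
l_k = calc_l_k(X, Q)                # replaces -0.5 * np.sum(X.dot(Q) * X, axis=1)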

Now it runs with X of shape [1mil, 900], and I hope it'll still work with higher dimensions.

Shahaf Finder