4

I have gone through similar questions that have been asked before (for example [1] [2]). However, none of them is completely relevant to my problem.

I am trying to calculate a dot product between two large matrices, and I have a memory constraint that I have to meet.

I have a numpy sparse matrix with a shape of (10000, 600000). For example,

from scipy import sparse as sps
x = sps.random(m=10000, n=600000, density=0.1).toarray()

The second numpy matrix has a size of (600000, 256) and consists only of -1 and 1.

import numpy as np
y = np.random.choice([-1,1], size=(600000, 256))

I need the dot product of x and y with the lowest possible memory usage. Speed is not the primary concern.

Here is what I have tried so far:

Scipy Sparse Format:

Naturally, I converted the numpy sparse matrix to a scipy csr_matrix. However, the task still gets killed due to a memory issue. There is no error; I just see `Killed` in the terminal.

from scipy import sparse as sps
sparse_x = sps.csr_matrix(x, copy=False)
z = sparse_x.dot(y)
# killed

Decreasing dtype precision + Scipy Sparse Format:

from scipy import sparse as sps

x = x.astype("float16", copy=False)
y = y.astype("int8", copy=False)

sparse_x = sps.csr_matrix(x, copy=False)
z = sparse_x.dot(y)
# Increases the memory requirement for some reason and dies 

np.einsum

Not sure if it helps/works with sparse matrices. I found something interesting in this answer. However, the following doesn't help either:

z = np.einsum('ij,jk->ik', x, y)
# similar memory requirement as the scipy sparse dot

Suggestions?

If you have any suggestions to improve any of these, please let me know. Further, I am thinking in the following directions:

  1. It would be great if I could get rid of the dot product itself somehow. My second matrix (i.e. y) is randomly generated and only contains -1 and 1. I am hoping there is a way I could take advantage of this structure.

  2. Maybe dividing the dot product into several smaller dot products and then aggregating (a rough sketch is below).
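
A rough, untested sketch of what I mean in idea 2 (chunk_count is an arbitrary illustrative value):

# Rough sketch of idea 2: compute the product in row chunks and write each
# chunk into a preallocated result array.
chunk_count = 10
m = sparse_x.shape[0]
z = np.empty((m, y.shape[1]))
for i in range(chunk_count):
    start, end = m * i // chunk_count, m * (i + 1) // chunk_count
    z[start:end, :] = sparse_x[start:end, :].dot(y)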

Grayrigel
  • 3,474
  • 5
  • 14
  • 32
  • What's the sparsity of the sparse matrix? Edit: whoops, just saw it's 0.1. – Michael Ruth Dec 21 '22 at 22:23
  • 1
    Have you tried other sparse matrix representations? Some are highly dependent on the distribution of the data in order to significantly lower memory usage. – Michael Ruth Dec 21 '22 at 22:27
  • 4
    `x.dot(y)` when `x` and `y` are two dimensional is just `x @ y`. And given your density and the size of the array, the product of the arrays is *not* going to be sparse. The probability that any complete row of the sparse matrix is all zeros is virtually zero. – Frank Yellin Dec 21 '22 at 22:31
  • Just did a quick calculation. Just the sparse matrix values, i.e. without index information, is ~156GiB using float16 as the data type. I think you may need to store the data on disk and load only what you need for each vector operation, potentially storing the result on disk as well. A quick search suggests that PyTables and HDF5 are appropriate tools for this. – Michael Ruth Dec 21 '22 at 22:50
  • @MichaelRuth Yes, I have tried coo and bsr. I couldn't see much difference. – Grayrigel Dec 21 '22 at 23:06
  • @MichaelRuth Thanks for the suggestion. Yeah. Unfortunately, I am not able to save it on disk, as this dot product is part of a bigger pipeline which runs on pyspark as a udf. – Grayrigel Dec 21 '22 at 23:09
  • 3
    What's the maximum amount of memory available for this? First we need to assess whether this is even feasible. – Michael Ruth Dec 21 '22 at 23:11
  • @MichaelRuth I don't have an exact number as the resources are shared on the cluster. However, let's say 1.5~2GB spare memory for x.dot(y), which might not be feasible [x, y are already present in memory]. Anything slightly better than scipy sparse dot should be good. – Grayrigel Dec 21 '22 at 23:18
  • @MichaelRuth Do you envision a way not using a dot product? May be some approximation for dot product? – Grayrigel Dec 21 '22 at 23:21
  • 1
    The product may be approximated many ways. One such way is by sampling, described [here](https://www.stat.berkeley.edu/~mmahoney/f13-stat260-cs294/Lectures/lecture02.pdf). Another approximation technique, used by LM-BFGS, considers information specific to the problem. – Michael Ruth Dec 21 '22 at 23:46
  • Try your option 2. The product has shape (10000, 256), which should easily fit in memory. So compute it in batches. – Warren Weckesser Dec 22 '22 at 01:25
  • @MichaelRuth, only `csr` implements `dot`.. Plus the array is random without structure. – hpaulj Dec 22 '22 at 01:32
  • 1
    `copy=False` is useless when using `astype` to change size; it has to make a copy. Similarly, it does nothing when making a sparse matrix from a dense one; it has to make a copy of the data. I'm not sure, but it's likely that the `csr` multiplication code is compiled for the common `C` types, float and double. You could test that with small examples that don't have a memory issue. Float * int will result in a float, not a small int. – hpaulj Dec 22 '22 at 01:44
  • In your examples you should not use the same `x` variable name for dense and sparse arrays. It almost looks like you are trying to use the sparse matrix in the `einsum`, which does not make sense. – hpaulj Dec 22 '22 at 01:53
  • When you get memory errors, does the process just die (killed) without explanation, or do you get a memory error with a traceback? I've seen both happen. Sometimes we get information such as `cannot allocate ... Tbytes for array with shape (....)`. I don't like testing memory error cases. – hpaulj Dec 22 '22 at 03:47
  • @hpaulj Thanks for the valuable suggestions. I will edit it accordingly. The processes just die with killed. No explanation. – Grayrigel Dec 22 '22 at 07:21
  • @hpaulj I am forced to use astype as numpy (in certain functions e.g. choice) doesn't allow me to choose dtype upon creation. What do you suggest there? – Grayrigel Dec 22 '22 at 10:13
  • 1
    You can multiply sparse and dense arrays directly into a preallocated dense array with the MKL sparse functions, and there's a python wrapper for them. – CJR Dec 22 '22 at 12:41

3 Answers

4

TL;DR: SciPy consumes significantly more memory than strictly needed for this, due to temporary arrays, type promotion, and inefficient usage. It is also not very fast. The setup can be optimized so as to use less memory, and Numba can be used to perform the computation efficiently (both in memory usage and time).


I have a numpy sparse matrix, which is a shape of (10000,600000). For example,
x = sps.random(m=10000, n=600000, density=0.1).toarray()

sps.random(...) returns a COO sparse matrix, so it is relatively compact in memory. Calling .toarray() on it converts the sparse matrix to a huge dense matrix. Indeed, the resulting dense matrix (x) takes 10000*600000*8 = 44.7 GiB (since the default floating-point type is 64-bit wide).

This alone can likely cause a memory issue. On machines with swap memory or a large RAM (eg. 64 GiB), the program can be much slower, and allocating only a few more GiB of RAM can cause the process to be killed when memory is close to saturation (eg. by the OOM killer on Linux). Note that Windows compresses data in RAM when the remaining memory is quite limited (memory compression, similar to zram on Linux). This method works well on arrays containing a lot of zeros. However, when additional arrays are allocated later and the dense array is read back from RAM, the OS needs to decompress the data and then needs more space. If the rest of the RAM content cannot be compressed as well during the decompression, an out-of-memory failure is likely to occur.

Naturally, I converted the numpy sparse matrix to scipy csr_matrix

CSR matrices are encoded as 3 1D arrays (a small example follows the list):

  • a data array containing the non-zero values of the source matrix;
  • a column index array giving the column of each non-zero item of each row;
  • a row pointer array giving, for each row, the offset of its first item in the two previous arrays.
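
For instance, here is a tiny dense matrix and the three arrays of its CSR form (values chosen purely for illustration):

import numpy as np
from scipy import sparse as sps

a = np.array([[1., 0., 2.],
              [0., 0., 3.],
              [4., 5., 0.]])
mat = sps.csr_matrix(a)
print(mat.data)     # [1. 2. 3. 4. 5.]  non-zero values
print(mat.indices)  # [0 2 2 0 1]       column of each non-zero value
print(mat.indptr)   # [0 2 3 5]         row i spans data[indptr[i]:indptr[i+1]]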

In practice, the last 2 arrays are 32-bit integer arrays on Windows and likely 64-bit integer arrays on Linux. Since your input matrix has 10% non-zero values, data should take 10000*600000*8*0.1 = 4.5 GiB, the column index array should take the same size on Linux (since both then contain 64-bit values) and 2.2 GiB on Windows, and the row pointer array should take 10000*8 = 78 KiB on Linux and even less on Windows. Thus, the CSR matrix should take about 9 GiB overall on Linux and 6.7 GiB on Windows. This is still pretty big.
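
These figures can be checked empirically on a smaller matrix of the same density (a quick sketch; the array sizes scale linearly with the number of non-zero values):

from scipy import sparse as sps

small_csr = sps.random(m=1_000, n=60_000, density=0.1, format='csr')
csr_bytes = small_csr.data.nbytes + small_csr.indices.nbytes + small_csr.indptr.nbytes
print(csr_bytes)           # data + column indices + row pointers
print(small_csr.nnz * 8)   # size of the data array alone (float64 values)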

Using copy=False is not a good idea, since the CSR matrix will reference the huge initial array (AFAIK data will reference x). This means x needs to be kept in memory, so the resulting memory footprint is about 44.7 GiB + 2.2~4.5 GiB = 46.9~49.2 GiB. This is not to mention that matrix multiplications involving sparse matrices are generally slower (unless the sparsity factor is very small). It is better to copy the array content with copy=True so as to only keep the non-zero values.

x.astype("float16", copy=False)

It is not possible to convert an array to another one with a different data-type without copying it, especially when the item sizes differ in memory. Even if it were, it would make no sense to do so here, since the goal is to reduce the size of the input: a new, smaller array must be created and the initial one should not be kept.

# Increases the memory requirement for some reason and dies

There are 2 reasons for this.

Firstly, SciPy does not support the float16 data-type for sparse matrices yet. You can see this by checking sparse_x.dtype: it is float32 when x is float16. This is because the float16 data-type is not natively supported by most platforms (x86-64 processors mainly support conversions from/to this type). As a result, the data part of the CSR matrix is twice as big as it could be.

Secondly, Numpy does not internally support binary operations on arrays with different data-types. Numpy first converts the inputs of a binary operation so that their types match, following semantic rules. This is called type promotion. SciPy generally works the same way (especially since it often uses Numpy internally). This means the int8 array will likely be implicitly converted to a float32 array in your case, that is, an array 4 times bigger in memory. We need to delve into the implementation of SciPy to understand what is really happening.
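
A quick illustration of these promotion rules on tiny arrays (just to show the resulting data-type):

import numpy as np

a = np.ones(4, dtype=np.float32)
b = np.ones(4, dtype=np.int8)
print(np.result_type(a, b))  # float32: the int8 operand gets promoted
print((a * b).dtype)         # float32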


Under the hood

Let's understand what is going on internally in SciPy when we do the matrix multiplication. As pointed out by @hpaulj, sparse_x.dot(y) calls _mul_multivector, which creates the output array and calls csr_matvecs. The latter is a wrapped C++ function calling a template function instantiated by the macro SPTOOLS_CSR_DEFINE_TEMPLATE. This macro is passed to the macro SPTOOLS_FOR_EACH_INDEX_DATA_TYPE_COMBINATION, which is responsible for generating all the possible instances from a list of predefined supported types. The implementation of csr_matvecs is available here.

Based on this code, we can see that the float16 data-type is not supported by SciPy (at least for sparse matrices). We can also see that self and other must have the same type when the C++ function csr_matvecs is called (the type promotion is certainly done before calling the C++ function, in wrapping code like this one). We can also see that the C++ implementation is simple and not particularly optimized. One can easily write code with the same logic and the same performance using Numba or Cython so as to support your specific compact input types. Finally, we can see that nothing is allocated in the C++ computing part (only in the Python code, and certainly in the C++ wrapper).


Memory efficient implementation

First of all, here is some code to set up the arrays in a compact way:

from scipy import sparse as sps
import numpy as np

sparse_x = sps.random(m=10_000, n=600_000, density=0.1, dtype=np.float32)
sparse_x = sps.csr_matrix(sparse_x) # Efficient conversion

y = np.random.randint(0, 2, size=(600_000, 256), dtype=np.int8)
np.subtract(np.multiply(y, 2, out=y), 1, out=y) # In-place mapping of {0, 1} to {-1, 1}

One simple pure-Numpy solution to mitigate the overhead of temporary arrays is to compute the matrix multiplication in chunks. Here is an example computing the result band by band:

chunk_count = 4
m, n = sparse_x.shape
p, q = y.shape
assert n == p
result = np.empty((m, q), dtype=np.float16)
for i in range(chunk_count):
    start, end = m*i//chunk_count, m*(i+1)//chunk_count
    result[start:end,:] = sparse_x[start:end,:] @ y

This is <5% slower on my machine and it takes less memory, since only the temporary arrays of 1 band of the output matrix are created at a time. If you still have memory issues with this code, then please check whether the output matrix can actually fit in RAM using: for l in result: l[:] = np.random.rand(l.size). Indeed, creating an array does not mean the memory space is actually reserved in RAM (see this post).

A faster and more memory-efficient solution is to use Numba or Cython to do manually what SciPy does, without creating any temporary arrays. The bad news is that Numba does not support the float16 data-type yet, so float32 needs to be used instead. Here is an implementation:

import numba as nb

# Only for CSR sparse matrices
@nb.njit(['(float32[::1], int32[::1], int32[::1], int8[:,::1])', '(float32[::1], int64[::1], int64[::1], int8[:,::1])'], fastmath=True, parallel=True)
def sparse_compute(x_data, x_cols, x_rows, y):
    result = np.empty((x_rows.size-1, y.shape[1]), dtype=np.float32)
    for i in nb.prange(x_rows.size-1):
        line = np.zeros(y.shape[1], dtype=np.float32)
        for j in range(x_rows[i], x_rows[i+1]):
            factor = x_data[j]
            y_line = y[x_cols[j],:]
            for k in range(line.size):
                line[k] += y_line[k] * factor
        for k in range(line.size):
            result[i, k] = line[k]
    return result

z = sparse_compute(sparse_x.data, sparse_x.indices, sparse_x.indptr, y)

This is significantly faster than the previous solution and it also consumes far less memory. Indeed, it consumes only 10 MiB to compute the result (that is 50-200 times less than the initial solution on my machine), and it is about 3.5 times faster than SciPy on a 10-year-old mobile processor with only 2 cores (i7-3520M)!


Generalization to LIL sparse matrices

The above Numba code only works for CSR matrices. LIL matrices are not efficient: they are designed for building matrices quickly, but they are inefficient for computations. The docs say to convert them to CSR/CSC matrices (once built) to do computations. Internally, LIL matrices are 2 Numpy arrays of CPython list objects. Lists are inefficient and cannot be optimized much by Numba/Cython/C because of how CPython lists are designed. They also use a lot of memory (COO is certainly better for that). Indeed, on mainstream 64-bit platforms and using CPython, each list item is an object, and a CPython object typically takes 32 bytes, not to mention the 8 bytes taken by the reference in the CPython list. 2 objects are needed per non-zero value, so about 80 bytes per non-zero value. As a result, the resulting LIL input matrix is pretty huge.
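
A back-of-the-envelope estimate based on the ~80 bytes per non-zero value mentioned above (illustrative only; the exact per-object size depends on the CPython version and the value types):

nnz = 10_000 * 600_000 * 0.1   # 6e8 non-zero values
print(nnz * 80 / 2**30)        # ~44.7 GiB of list/object overhead alone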

Since Numba does not really support CPython lists, the lists can be converted to Numpy arrays on the fly so as to perform a relatively fast, memory-efficient computation. Fortunately, this conversion is not too expensive for large matrices. Here is an implementation:

@nb.njit('(float32[::1], int32[::1], int8[:,::1], float32[::1])', fastmath=True)
def nb_compute_lil_row(row_data, row_idx, y, result):
    assert row_data.size == row_idx.size
    assert result.size == y.shape[1]
    for i in range(row_data.size):
        factor = row_data[i]
        y_line = y[row_idx[i],:]
        for k in range(y_line.size):
            result[k] += y_line[k] * factor

def nb_compute_lil(sparse_x, y):
    result = np.zeros((sparse_x.shape[0], y.shape[1]), dtype=np.float32)
    for i in range(len(sparse_x.rows)):
        row_data = np.fromiter(sparse_x.data[i], np.float32)
        row_idx = np.fromiter(sparse_x.rows[i], np.int32)
        nb_compute_lil_row(row_data, row_idx, y, result[i])
    return result

z = nb_compute_lil(sparse_x, y)

This code is twice as slow as SciPy on my machine, but it consumes less memory. The worse performance is due to Numba failing to generate fast SIMD code in this specific case (due to missed optimizations of the internal LLVM JIT compiler).

Jérôme Richard
  • 41,678
  • 6
  • 29
  • 59
  • 1
    Thanks a ton for such a detailed and beautiful answer. This appears miles better than what I had in mind. Let me try it out. About the `.toarray()`, it was only meant to generate the data. I have a sparse numpy array to begin with, which I later convert to a scipy sparse matrix. – Grayrigel Dec 26 '22 at 16:20
  • Indeed, this seems to be better memory-wise. A quick question: how do I modify `sparse_compute` for sps.lil_matrix? I get a `SparseEfficiencyWarning` when using lil_matrix. According to the documentation, it seems to be a better choice for changing sparsity. – Grayrigel Dec 26 '22 at 17:05
  • 2
    Don't use lil matrices for any reason; the memory overhead on python objects is horrifying. That 9GB CSR matrix will be something like 90GB as a LIL matrix and be functionally unusable even if you can allocate it. – CJR Dec 27 '22 at 02:49
  • @CJR Thanks for the suggestion. I saw a decline in performance when using the `LIL` matrix. So, I am continuing to use `CSR` and ignoring the `SparseEfficiencyWarning`. It works very fast. – Grayrigel Dec 28 '22 at 00:26
1

The deepest Python code for M@x, where M is a sparse csr matrix and x is a dense array, is:

In [28]: M._mul_multivector??
Signature: M._mul_multivector(other)
Docstring: <no docstring>
Source:   
    def _mul_multivector(self, other):
        M, N = self.shape
        n_vecs = other.shape[1]  # number of column vectors

        result = np.zeros((M, n_vecs),
                          dtype=upcast_char(self.dtype.char, other.dtype.char))

        # csr_matvecs or csc_matvecs
        fn = getattr(_sparsetools, self.format + '_matvecs')
        fn(M, N, n_vecs, self.indptr, self.indices, self.data,
           other.ravel(), result.ravel())

        return result

The fn here is compiled code. It gets the attribute arrays of the csr matrix, the dense array as a 1d array, and the preallocated result array - also as a flattened view.

I wonder whether the memory error occurs when creating result, because I suspect fn uses that memory as its workspace and does not try to use any more memory.

I know that sparse@sparse matrix multiplication is done in two compiled steps. The first determines the dimensions and nnz of the result, and the second actually does the calculation. The python code between the two steps creates the result array, much as is being done here. And memory errors with this @ occur in the result allocation.
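
For completeness, here is a hedged sketch of driving that same compiled routine with a preallocated output array, mirroring the _mul_multivector source above. Note that _sparsetools is a private SciPy module, so this may break between versions, and (as noted in the other answer) the sparse data and the dense operand should already share a dtype:

import numpy as np
from scipy.sparse import _sparsetools

# Assumes M is a csr_matrix and x is a C-contiguous dense array whose dtype
# matches M.data.dtype (convert beforehand if needed).
n_row, n_col = M.shape
n_vecs = x.shape[1]
result = np.zeros((n_row, n_vecs), dtype=M.dtype)
fn = getattr(_sparsetools, M.format + '_matvecs')
fn(n_row, n_col, n_vecs, M.indptr, M.indices, M.data,
   x.ravel(), result.ravel())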

hpaulj
  • 221,503
  • 14
  • 230
  • 353
0

If you have managed to create your arrays without any memory problem (a problem that I did have for the prepared example sizes), why not just use iteration, if it satisfies your needs? The loop equivalent of the dot product could be a solution, and it can be accelerated with numba:

import numpy as np
import numba as nb

@nb.njit  # ("float64[:, ::1](float64[:, ::1], int32[:, ::1])")
def dot(a, b):
    # Loop-based equivalent of a @ b, accumulating into a preallocated array
    dot_ = np.zeros((a.shape[0], b.shape[1]))
    for i in range(a.shape[1]):
        for j in range(a.shape[0]):
            for k in range(b.shape[1]):
                dot_[j, k] += a[j, i] * b[i, k]
    return dot_

The result will easily fit in memory for your prepared example, as Warren mentioned in the comments.

If there is no other way, creating 1D random (-1, 1) arrays of shape (x.shape[1],) inside loops (based on the desired y_shape) and dotting each of them with each row of x to fill an empty array of the desired shape (x.shape[0], y_shape) might help satisfy the memory constraint (it is perhaps better to use jitted numba loops instead of np.dot). This is very inefficient in time, although it could be handled in batches to avoid many of the loops, i.e.:

def batch(a, y_shape):
    res = np.zeros((a.shape[0], y_shape))
    for row_ in range(a.shape[0]):
        for col_ in range(y_shape):
            res[row_, col_] = np.dot(a[row_, :], np.random.choice([-1, 1], size=a.shape[1]))
    return res
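
Hypothetical usage with the shapes from the question (note that a fresh random (-1, 1) vector is drawn for every output element here, so the result does not correspond to multiplying by one fixed y):

z = batch(x, 256)  # x: (10000, 600000) dense array, result: (10000, 256)
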
Ali_Sh
  • 2,667
  • 3
  • 43
  • 66
  • 1
    `why not just using iterations` there's a pretty good reason why it's best to just use BLAS for matrix math and not to reinvent that wheel. This has no real advantages over `dgemm` (which numpy wraps under the hood) and a bunch of disadvantages (it's not parallelized, doesn't use registers efficiently, etc). – CJR Dec 23 '22 at 19:56
  • @CJR, right, there are differences between the methods, and this solution needs to be evaluated further based on the needs of the case (in this question, I didn't see any constraint ruling out this solution). I have updated the answer with a small hint to involve it. The code could be parallelized or written in other styles, but those are not the main concern for this issue. In a test, this solution consumed less memory, which could satisfy the author's needs. – Ali_Sh Dec 24 '22 at 00:08