
I have a very large sparse NumPy matrix (of type numpy.ndarray). The matrix is so large that it probably has to be stored in virtual memory. How can I efficiently convert it to a sparse SciPy matrix (from scipy.sparse) to be used for arithmetic operations?

The following is a direct conversion with dok_matrix, which fails, presumably due to a memory issue. Changing dok_matrix to csr_matrix causes the same memory issue.

In [1]: N=int(1e6)

In [2]: from scipy.sparse import dok_matrix

In [3]: import numpy as np

In [4]: mat=np.zeros((N,N))

In [5]: dok_matrix(mat)
zsh: killed     ipython

My current method is to use a nested loop, which is slow even when the loop body does nothing.

In [9]: for i in range(N):
   ...:     for j in range(N):
   ...:         pass

Is there an efficient way to convert a large (10^6 × 10^6) sparse NumPy matrix to a SciPy sparse matrix?

zell
  • It's the `mat=np.zeros((N,N))` line that's raising the error. Read the `sparse` docs some more. – hpaulj Aug 19 '22 at 11:13
  • 1
    @hpaulj But mat=np.zeros((N,N)) gives me no errors -- at least when we have only that sentence. – zell Aug 19 '22 at 11:48
  • I get the memory error from `np.zeros`. That could be an OS difference. `dok_matrix((N,N))` works directly, and is much faster (for smaller N). Making a sparse matrix from a dense one requires a search for all non-zero terms, sorting them, and then making the matrix. That is a lot more work for an array that is all zeros. The best way to make a sparse matrix is to construct the data/row/col arrays of the `coo_matrix` format (a sketch is given after these comments). – hpaulj Aug 19 '22 at 15:10
  • `dok_matrix`, when given a dense array, first makes a `coo` and then converts it to a `dok`. `coo` does `atleast_2d` and `nonzero()` to get the `row` and `col`. Those will be length 0, but still require a full scan. – hpaulj Aug 20 '22 at 01:23
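A minimal sketch of the `coo_matrix` construction suggested in the comments; the values and indices below are made up purely for illustration:

```python
import numpy as np
from scipy.sparse import coo_matrix

N = int(1e6)

# Only the non-zero entries are described: one value per (row, col) pair.
data = np.array([1.0, 2.5, -3.0])
row = np.array([0, 42, 999_999])
col = np.array([7, 42, 0])

mat = coo_matrix((data, (row, col)), shape=(N, N))
print(mat.nnz)  # 3 stored values; no dense N x N array is ever created
```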

1 Answer


When you do mat = np.zeros((N,N)), NumPy allocates a big matrix full of zeros. To do that, it requests a big zero-initialized buffer from the operating system (OS). Most OSes do not actually perform the allocation in physical memory but in virtual memory: the memory pages are only mapped on first touch, meaning that any read or write causes the corresponding virtual pages to be mapped into physical memory. Please see this post for more information about this, and also consider reading the famous article What Every Programmer Should Know About Memory for more information about virtual memory.

The thing is that dok_matrix(mat) needs to read the whole matrix in order to create the sparse matrix, so it triggers the mapping of all the pages of the matrix, resulting in an out-of-memory condition. When there is no space left, Linux's OOM killer kills programs using too much memory, hence the killed ipython message. The same problem happens with any kind of sparse matrix. The main problem is that you cannot read the whole created matrix. In fact, creating such a matrix is nearly pointless unless you know that only some tiny parts of it will ever be read or written (never the full matrix).
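To put the sizes in perspective, here is a quick back-of-the-envelope computation of the dense storage the whole matrix would occupy once every page has been touched (the ~7.28 TiB figure also shows up in the comments below):

```python
# Storage needed by a dense 1e6 x 1e6 float64 matrix once fully touched.
N = int(1e6)
bytes_per_value = 8                    # float64
dense_bytes = N * N * bytes_per_value  # 8e12 bytes
print(dense_bytes / 2**40, "TiB")      # ~7.28 TiB
```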

The solution is to create a sparse matrix directly and fill it as you would a dense NumPy matrix. This is significantly slower, but it is the price to pay for using sparse matrices. DOK matrices generally take a lot of memory unless only a few items are filled. That being said, DOK matrices are among the fastest for random accesses (because they are implemented using a hash table internally). CSR matrices are good for matrices that do not change once created (i.e. modifying a CSR matrix is very slow) and that have only a few non-zero items per row. Note that CSR matrices can be created quite quickly from a 1D data array and a (row_ind, col_ind) tuple of index arrays. See the documentation for more information.
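Here is a minimal sketch of both approaches; the indices and values are made up purely for illustration:

```python
import numpy as np
from scipy.sparse import dok_matrix, csr_matrix

N = int(1e6)

# Build an empty N x N DOK matrix directly -- no dense intermediate array.
mat = dok_matrix((N, N), dtype=np.float64)
mat[0, 0] = 1.0            # fill only the non-zero entries
mat[12, 345_678] = 2.5

# Or build a CSR matrix in one go from data/(row_ind, col_ind) arrays.
data = np.array([1.0, 2.5])
row_ind = np.array([0, 12])
col_ind = np.array([345, 345_678])
csr = csr_matrix((data, (row_ind, col_ind)), shape=(N, N))
print(csr.nnz)             # 2 stored values
```

A filled DOK matrix can also be converted afterwards with `mat.tocsr()` once no further modifications are needed.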

The density of the matrix, that is, the ratio NumberOfNonZeroValues / NumberOfValues, needs to be (very) small for the sparse matrix to be useful. Sparse matrices tend to be slow because their internal representation operates on a kind of compressed matrix.
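For reference, this ratio can be checked directly on a SciPy sparse matrix; a small self-contained example:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small matrix with 2 non-zero values out of 2 * 3 = 6 entries.
sA = csr_matrix(np.array([[0.0, 3.0, 0.0],
                          [0.0, 0.0, 4.0]]))

# Density = NumberOfNonZeroValues / NumberOfValues.
density = sA.nnz / (sA.shape[0] * sA.shape[1])
print(density)  # 0.333... -- far too dense for a sparse format to pay off
```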

Jérôme Richard
  • [Here](https://stackoverflow.com/questions/73400948/looking-for-good-indexing-and-sparse-matrix-methods-to-create-a-matrix-from-exis/73416238#73416238) is an answer posted simultaneously for a related question. – Jérôme Richard Aug 19 '22 at 11:53
  • SciPy sparse matrices do not always work well with NumPy. For example, a simple L2 norm would fail: `sA = sparse.csr_matrix(np.arange(6).reshape(2,-1)); np.linalg.norm(sA)` – zell Aug 19 '22 at 12:45
  • 1
    @zell yes that's correct, try https://docs.scipy.org/doc/scipy/reference/sparse.linalg.html – CJR Aug 19 '22 at 13:13
  • 2
    `np.zeros` gives me `MemoryError: Unable to allocate 7.28 TiB ...`, No `touch` needed. Could that be a OS difference? (Linux). @zell, don't try to use `np` functions on sparse matrices - unless you know they delegate the action to sparse methods. `norm` first does a `np.array(sA)`. `sA.toarray()` is correct conversion to dense. – hpaulj Aug 19 '22 at 15:07
  • @hpaulj Some systems have limitations due to their configuration. For example, a 32-bit library/Linux/CPython should not be able to allocate arrays larger than 4 GiB. Alternatively, your Linux may be tuned to disable overcommit. You can check this by fetching the `/proc/sys/vm/overcommit_*` files. For more information, please read [this](https://stackoverflow.com/questions/911860). AFAIK, the overcommit mechanism is nearly always enabled by default on mainstream Linux platforms, so it is a bit surprising that you get this error unless you tuned it. – Jérôme Richard Aug 19 '22 at 18:40
  • @hpaulj Another possible explanation is that 64-bit processors cannot really address the full 64-bit range of virtual addresses. In fact, AFAIK modern x86-64 processors can address only 48 bits (256 TiB) of virtual memory. Older processors can certainly address only a fraction of this. Moreover, the OS can reserve a few bits for its internal usage. – Jérôme Richard Aug 19 '22 at 18:51
  • 1
    The default overcommit heuristic for Ubuntu (and most other Linux distros that I'd come into contact with) rejects allocations that are just wildly unreasonable. I'm sure this example would not allocate for me with a memoryerror. I believe osx still allows any overcommit though. – CJR Aug 19 '22 at 19:25
  • I have a relatively old machine with a small amount of RAM (4 GB). – hpaulj Aug 20 '22 at 01:20
  • Indeed, on Debian/Ubuntu this appears to be the case by default: the OS refuses to allocate more than the RAM+swap memory in one allocation (but doing 1000 allocations like this is fine). `/proc/sys/vm/overcommit_memory` is set to 0 on such systems. Switching it to 1 causes allocations not to be bounded anymore. Switching it to 2 causes the system not to do any overcommit at all (but the mapping is still not done at allocation time in this case, so memory cannot be fully filled in practice). – Jérôme Richard Aug 21 '22 at 10:34
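Returning to the norm discussion above, here is a minimal sketch of the sparse-aware alternative to `np.linalg.norm`, reusing the small `sA` matrix from the comments:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import norm as sparse_norm

sA = csr_matrix(np.arange(6).reshape(2, -1))

# np.linalg.norm(sA) does not handle the sparse type correctly; instead:
print(sparse_norm(sA))               # Frobenius norm computed on the sparse data
print(np.linalg.norm(sA.toarray()))  # same result via an explicit dense conversion
```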