Storing a Sparse Numpy Array

Question

I have a 20,000 x 20,000 NumPy matrix that I wish to store in a file, where the average column has only 12 values in it.

What would be the most efficient way to store only the values in the format of

if array[i][j] == 1:
    file.write("{} {} {{}}\n".format(i, j))

where (i, j) are the indices for the array?

Does the sparse matrix implementation you're using have its own serialization code (e.g. for use with `pickle`)? That might be easier to learn and use than learning enough of its implementation to write your own. — Blckknght, Aug 19 '20 at 19:34
To be clear: you're willing to sacrifice memory for performance (hence loading the values into a normal Numpy array), but wish to conserve disk space? — Karl Knechtel, Aug 19 '20 at 19:41
@Blckknght Right now it's just a NumPy array, so I actually don't know, sorry! — TheAkashain, Aug 19 '20 at 19:46
@KarlKnechtel Exactly! I can sacrifice as much memory as necessary to get maximum performance here. It takes only 1 second to generate the array, but a full minute to store it. — TheAkashain, Aug 19 '20 at 19:47
@hpaulj well, that essentially solves the problem, thank you! — TheAkashain, Aug 19 '20 at 20:10

score 5 · Answer 1 · answered Aug 20 '20 at 23:39

You can use SciPy to convert a dense NumPy array into a sparse matrix, which stores only the nonzero entries together with their indices.

import pickle

import numpy as np
import scipy.sparse

I = np.eye(10000)  # Has 10000 nonzero values along the diagonal
S = scipy.sparse.csr_matrix(I)
S

<10000x10000 sparse matrix of type '<class 'numpy.float64'>'
    with 10000 stored elements in Compressed Sparse Row format>

This is highly memory efficient and you can use pickle to dump / load this sparse matrix when you need it.

# Pickle dump
with open("S.pickle", "wb") as file:  # about 160 KB on disk
    pickle.dump(S, file)

# Pickle load
with open("S.pickle", "rb") as file:
    S = pickle.load(file)

To get back a dense representation, use .toarray() for a NumPy array or .todense() for a matrix-type object.

S.toarray()

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])
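
As an alternative to pickle, SciPy also provides its own on-disk format for sparse matrices through scipy.sparse.save_npz and scipy.sparse.load_npz. The following is only a minimal sketch of that round trip, assuming a small identity matrix as stand-in data and a hypothetical file name (S.npz) inside a temporary directory; it is not part of the original answer.

```python
import os
import tempfile

import numpy as np
import scipy.sparse

# Small stand-in for the 10000 x 10000 example above.
dense = np.eye(100)
S = scipy.sparse.csr_matrix(dense)

# Hypothetical file name; the temporary directory is removed afterwards.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "S.npz")
    scipy.sparse.save_npz(path, S)          # compressed by default
    restored = scipy.sparse.load_npz(path)  # read back as a sparse matrix

# Check that the round trip preserved the data.
round_trip_ok = np.array_equal(restored.toarray(), dense)
```

Unlike a pickle, an .npz file produced this way holds only arrays, which may be preferable when the file is shared with others.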

score 1 · Answer 2 · TheAkashain · edited Aug 20 '20 at 23:07

For those reading after the fact: @hpaulj's suggestion to use `np.nonzero` effectively solves the problem!

Edit: Here is the code I used to solve it!

# array1 holds the row indices and array2 the column indices of the nonzero entries
array1, array2 = np.nonzero(array)
for i in range(array1.size):
    file.write("{} {} {{}}\n".format(array1[i], array2[i]))
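
The same output can be produced with a single write call by building all the lines first. This is only a sketch under stated assumptions: a small hand-made 0/1 array stands in for the real 20,000 x 20,000 one, and an in-memory io.StringIO buffer stands in for the open file. No timing was measured here, so any speed-up is something to verify on the real data.

```python
import io

import numpy as np

# Stand-in data: a small 0/1 array instead of the real matrix.
array = np.zeros((5, 5), dtype=int)
array[0, 1] = array[2, 3] = array[4, 0] = 1

# np.nonzero returns one index array per dimension, in row-major order.
rows, cols = np.nonzero(array)

# Format every "i j {}" line up front; "{{}}" is an escaped literal "{}".
# tolist() converts the indices to plain Python ints before formatting.
lines = [
    "{} {} {{}}\n".format(i, j)
    for i, j in zip(rows.tolist(), cols.tolist())
]

buffer = io.StringIO()         # stand-in for an open text file
buffer.write("".join(lines))   # one write call instead of one per entry
output = buffer.getvalue()
```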

edited Aug 20 '20 at 23:07

answered Aug 19 '20 at 20:29

TheAkashain


This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. - [From Review](/review/low-quality-posts/26974599) – Umutambyi Gad Aug 19 '20 at 21:57
@Gad this is the author himself posting an answer based on one of the comments. – Akshay Sehgal Aug 20 '20 at 23:24
