I need to save a large sparse csr_matrix and a numpy array so that I can read them back later. Let X be the sparse csr_matrix and Y be the numpy array.

Currently I take the following slightly insane route.

from scipy.sparse import csr_matrix
import numpy as np

def save_sparse_csr(filename, array):
    # Store the three CSR arrays plus the shape in one .npz archive.
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # Rebuild the csr_matrix from the arrays saved above.
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])

save_sparse_csr("file1", X)  # np.savez appends the .npz extension
np.save("file2", Y)          # np.save appends the .npy extension

Then when I want to read them in it is:

X = load_sparse_csr("file1.npz")
Y = np.load("file2.npy")

Two questions:

  1. Is there a better way to save a csr_matrix than this?
  2. Can I save both X and Y to the same file somehow? It seems crazy to have to make two files for this.
Simd
  • Not entirely sure, but I think scipy's loadmat supports structured arrays, so different variables could be saved as fields: http://docs.scipy.org/doc/scipy/reference/generated/scipy.io.loadmat.html – Divakar Jul 21 '16 at 20:54
  • I would try to save it with [PyTables](http://stackoverflow.com/questions/11129429/storing-numpy-sparse-matrix-in-hdf5-pytables) – MaxU - stand with Ukraine Jul 21 '16 at 20:56
  • @Divakar, could you please take a look at this [question](http://stackoverflow.com/questions/38506360/randomly-concat-data-frames-by-row) if you will have some time? I'm pretty sure you will find more elegant numpy solution... – MaxU - stand with Ukraine Jul 21 '16 at 20:58
  • `loadmat` handles sparse matrices - in a MATLAB compatible format. I don't think it uses `numpy` structured arrays for this. – hpaulj Jul 22 '16 at 03:50

1 Answer

So you are saving the 3 array attributes of the csr along with its shape. And that is sufficient to recreate the array, right?

What's wrong with that? Even if you find a function that saves the csr for you, I bet it is doing the same thing - saving those same arrays.

The normal way in Python to save an instance of a class is to pickle it. But the class has to provide the appropriate pickle methods. numpy does that (it is essentially what its save function does). But as far as I know scipy.sparse has not provided that.

Since scipy.sparse has its roots in the MATLAB sparse code (and C/Fortran code developed for linear algebra problems), it can load/save using the loadmat/savemat functions. I'd have to double check, but I think they work with csc, the default MATLAB sparse format.
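A minimal sketch of the savemat/loadmat route, which also puts X and Y in a single file. The file name `xy.mat`, the keys `"X"` and `"Y"`, and the tiny example data are made up for illustration. The `.tocsr()` call converts whatever sparse format comes back, and `ravel()` undoes the 2-D shape that MATLAB-style files give to a 1-D array.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat
from scipy.sparse import csr_matrix

# Tiny stand-ins for the real X and Y (illustrative data only).
X = csr_matrix(np.array([[0.0, 1.0, 0.0], [2.0, 0.0, 3.0]]))
Y = np.arange(5.0)

# Write both variables into one MATLAB-compatible file.
path = os.path.join(tempfile.mkdtemp(), "xy.mat")
savemat(path, {"X": X, "Y": Y})

# loadmat returns a dict keyed by variable name.
contents = loadmat(path)
X_loaded = contents["X"].tocsr()  # convert back to CSR
Y_loaded = contents["Y"].ravel()  # a 1-D array is stored as 2-D, so flatten it
```

Whether this is preferable to np.savez mostly depends on whether MATLAB compatibility matters to you.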

There are one or two other scipy.io modules that handle sparse matrices, but I have not worked with those. They are formats for sharing sparse arrays among different packages working on the same kinds of problems (for example PDEs or finite elements). More than likely those formats use a coo-compatible layout (data, rows, cols), either as 3 arrays, a csv of 3 columns, or a 2d array.
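As one example of such an exchange format, scipy.io has `mmwrite` and `mmread` for Matrix Market files, which are plain text. This is only a sketch: the file name `x.mtx` and the small matrix are made up, and I have not checked how it behaves for a really large X.

```python
import os
import tempfile

import numpy as np
from scipy.io import mmread, mmwrite
from scipy.sparse import csr_matrix

# Tiny stand-in for the real X (illustrative data only).
X = csr_matrix(np.array([[0.0, 1.0, 0.0], [2.0, 0.0, 3.0]]))

# Write the matrix as a Matrix Market text file.
path = os.path.join(tempfile.mkdtemp(), "x.mtx")
mmwrite(path, X)

# Read it back and convert to CSR for further work.
X_loaded = mmread(path).tocsr()
```

Being text, the file is easy to inspect but will usually be larger and slower to read than the binary `.npz` archive.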

Mentioning the coo format raises another possibility. Make a structured array with data, row, col fields, and use np.save or even np.savetxt. I don't think it's any faster or cleaner than saving the csr attributes directly. But it does put all the data in one array (though the shape might still need a separate entry).
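Roughly, the structured array idea could look like the sketch below. The helper names `to_records` and `from_records` and the file name are hypothetical, and the shape is simply kept in a variable here; in practice it would have to be stored somewhere alongside the file.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import coo_matrix, csr_matrix


def to_records(X):
    """Pack the COO triplets of X into one structured array (hypothetical helper)."""
    coo = X.tocoo()
    rec = np.empty(coo.nnz, dtype=[("data", coo.data.dtype),
                                   ("row", coo.row.dtype),
                                   ("col", coo.col.dtype)])
    rec["data"], rec["row"], rec["col"] = coo.data, coo.row, coo.col
    return rec


def from_records(rec, shape):
    """Rebuild a csr_matrix from the structured array and an explicit shape."""
    return coo_matrix((rec["data"], (rec["row"], rec["col"])),
                      shape=shape).tocsr()


# Tiny stand-in for the real X (illustrative data only).
X = csr_matrix(np.array([[0.0, 1.0, 0.0], [2.0, 0.0, 3.0]]))
shape = X.shape  # not part of the structured array; must be kept separately

path = os.path.join(tempfile.mkdtemp(), "x_coo.npy")
rec = to_records(X)
np.save(path, rec)

X2 = from_records(np.load(path), shape)
```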

You might also be able to pickle the dok format, since it is a dict subclass.

hpaulj
  • Can I put it all in one file easily? That is X and Y? – Simd Jul 22 '16 at 04:54
  • `np.save` puts just one array in a file. `np.savez` saves multiple arrays (one per file in the archive), just as you are doing with the 3 attributes of `X`. You could include `Y` in that collection. There's no requirement that the arrays match in shape or type. Your `indptr` does not match the `indices` in shape. – hpaulj Jul 22 '16 at 05:15
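The suggestion in that last comment could be sketched as follows. The function names `save_xy` and `load_xy`, the key `Y`, and the file name are made up for illustration; the rest is the same np.savez approach as in the question with one more array added to the archive.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix


def save_xy(filename, X, Y):
    """Store the CSR pieces of X and the dense array Y in one .npz archive."""
    np.savez(filename, data=X.data, indices=X.indices, indptr=X.indptr,
             shape=X.shape, Y=Y)


def load_xy(filename):
    """Rebuild the csr_matrix and return it together with Y."""
    with np.load(filename) as loader:
        X = csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                       shape=tuple(loader['shape']))
        Y = loader['Y']
    return X, Y


# Small round trip in a temporary directory (illustrative data only).
X = csr_matrix(np.array([[0, 1, 0], [2, 0, 3]]))
Y = np.arange(5)
path = os.path.join(tempfile.mkdtemp(), "xy.npz")
save_xy(path, X, Y)
X2, Y2 = load_xy(path)
```

If file size matters, `np.savez_compressed` takes the same arguments and writes a compressed archive.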