I have a dict of scipy.sparse.csr_matrix objects as values, with integer keys. How can I save this in a separate file?

If I had a regular ndarray for each entry, then I could serialize it with json, but when I try this with a sparse matrix:

    with open('filename.txt', 'w') as f:
        f.write(json.dumps(the_matrix))

I get a TypeError:

TypeError: <75x75 sparse matrix of type '<type 'numpy.int64'>' with 10 stored elements in Compressed Sparse Row format> is not JSON serializable

How can I save my dictionary with keys that are integers and values that are sparse csr matrices?

StatsSorceress

3 Answers

I faced the same issue when trying to save a dictionary whose values are csr_matrix objects. I dumped it to disk using pickle. The file must be opened in binary mode ("wb").

import pickle

with open("csr_dict.pkl", "wb") as f:
    pickle.dump(csr_dict_obj, f)

Load the dict back with:

with open("csr_dict.pkl", "rb") as f:
    csr_dict = pickle.load(f)
greenlantern

Newer scipy versions have a scipy.sparse.save_npz function (and a corresponding load_npz). It saves the attributes of a sparse matrix to a numpy savez zip archive. In the case of a csr matrix, it saves the data, indices and indptr arrays, plus the shape.
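Since save_npz writes one matrix per file, a dict needs a small wrapper around it. Here is a minimal sketch that writes each value to its own file named after the integer key. The helper names save_csr_dict and load_csr_dict and the one-file-per-key layout are my own assumptions, not part of scipy:

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def save_csr_dict(directory, csr_dict):
    """Write each matrix to <directory>/<key>.npz with scipy.sparse.save_npz."""
    os.makedirs(directory, exist_ok=True)
    for key, matrix in csr_dict.items():
        sparse.save_npz(os.path.join(directory, f"{key}.npz"), matrix)


def load_csr_dict(directory):
    """Rebuild the dict, recovering the integer keys from the file names."""
    result = {}
    for name in os.listdir(directory):
        stem, ext = os.path.splitext(name)
        if ext == ".npz":
            result[int(stem)] = sparse.load_npz(os.path.join(directory, name))
    return result


# Example round trip in a temporary directory.
original = {
    0: sparse.csr_matrix(np.eye(3, dtype=np.int64)),
    7: sparse.csr_matrix(np.array([[0, 2], [3, 0]])),
}
with tempfile.TemporaryDirectory() as tmp:
    save_csr_dict(tmp, original)
    restored = load_csr_dict(tmp)
```

This keeps every file readable on its own with load_npz, at the cost of having many small files when the dict is large.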

scipy.io.savemat can save a sparse matrix in a MATLAB compatible format (csc). There are one or two other scipy.io formats that can handle sparse matrices, but I haven't worked with them.
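A rough sketch of the savemat route for a whole dict. savemat takes a dict of named variables, so the integer keys are turned into names; the "m" prefix and the file name are arbitrary choices of mine. The matrices come back in csc format, so they are converted with tocsr():

```python
import os
import tempfile

import numpy as np
from scipy import io, sparse

original = {
    0: sparse.csr_matrix(np.eye(3)),
    7: sparse.csr_matrix(np.array([[0.0, 2.0], [3.0, 0.0]])),
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "csr_dict.mat")
    # MATLAB variable names must start with a letter, so prefix the integer keys.
    io.savemat(path, {f"m{key}": matrix for key, matrix in original.items()})
    loaded = io.loadmat(path)

# loadmat also returns metadata entries such as '__header__'; skip those.
restored = {
    int(name[1:]): value.tocsr()
    for name, value in loaded.items()
    if not name.startswith("__")
}
```

I used float data here on purpose; I have not checked how other dtypes survive the round trip.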

While a sparse matrix contains numpy arrays it isn't an array subclass, so the numpy functions can't be used directly.

np.save is numpy's own counterpart to pickle for arrays, and when an array contains objects, np.save uses pickle for them (if possible). So a pickle of a dictionary of arrays should work.

The sparse dok format is a subclass of dict, so might be pickleable. It might even work with json. But I haven't tried it.

By the way, a plain numpy array can't be jsoned either:

In [427]: json.dumps(np.arange(5))
TypeError: array([0, 1, 2, 3, 4]) is not JSON serializable
In [428]: json.dumps(np.arange(5).tolist())
Out[428]: '[0, 1, 2, 3, 4]'

dok doesn't work with json either, because its keys are tuples of indices:

In [433]: json.dumps(M.todok())
TypeError: keys must be a string
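If JSON is really wanted, one workaround is to break each csr matrix into plain lists first, in the same way tolist() is used above for a regular array. This is only a sketch; the helper names csr_to_jsonable and csr_from_jsonable and the layout of the JSON object are assumptions of mine, and it will not handle complex dtypes:

```python
import json

import numpy as np
from scipy import sparse


def csr_to_jsonable(matrix):
    """Describe a sparse matrix with plain lists that json can handle."""
    matrix = matrix.tocsr()
    return {
        "data": matrix.data.tolist(),
        "indices": matrix.indices.tolist(),
        "indptr": matrix.indptr.tolist(),
        "shape": list(matrix.shape),
        "dtype": str(matrix.dtype),
    }


def csr_from_jsonable(obj):
    """Rebuild the csr matrix from the dict produced by csr_to_jsonable."""
    data = np.array(obj["data"], dtype=obj["dtype"])
    return sparse.csr_matrix(
        (data, obj["indices"], obj["indptr"]), shape=tuple(obj["shape"])
    )


original = {
    0: sparse.csr_matrix(np.eye(3, dtype=np.int64)),
    7: sparse.csr_matrix(np.array([[0, 2], [3, 0]])),
}

# JSON object keys are always strings, so the integer keys are converted back on load.
text = json.dumps({str(k): csr_to_jsonable(m) for k, m in original.items()})
restored = {int(k): csr_from_jsonable(v) for k, v in json.loads(text).items()}
```

The resulting text is much larger than a binary format, so this only makes sense when a human-readable file is a requirement.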

MatrixMarket is a text format that handles sparse:

In [444]: io.mmwrite('test.mm', M)   
In [446]: cat test.mm.mtx
%%MatrixMarket matrix coordinate integer general
%
1 5 4
1 2 1
1 3 2
1 4 3
1 5 4
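A short round trip with mmwrite and mmread, as a sketch. M is assumed to be the 1x5 matrix that produced the output above, and I pass the full .mtx name so that the same path works for writing and reading. mmread gives back coo format for a sparse file, hence the tocsr():

```python
import os
import tempfile

import numpy as np
from scipy import io, sparse

# Assumed to match the matrix shown in the output above.
M = sparse.csr_matrix(np.arange(5).reshape(1, 5))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.mtx")
    io.mmwrite(path, M)
    # Convert the loaded matrix back to csr format.
    restored = io.mmread(path).tocsr()
```

Like save_npz, this stores one matrix per file, so a dict would need one file per key.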
hpaulj
  • Thanks @hpaulj, but when I was referring to the `json` serializable object, I was referring to the `dict`, which is serializable. A `dict` of numpy arrays can be serialized. A `dict` of scipy sparse csr matrices cannot. I'm looking for a way to save a `dict` where the keys are integers and the values are scipy sparse csr matrices. – StatsSorceress Dec 04 '17 at 16:45
  • To serialize a dictionary you have to be able to serialize all of its values - that is, the serialization has to be defined for each component object. – hpaulj Dec 04 '17 at 17:55
  • Okay, let's leave serialization alone then, since sparse matrices are not serializable. How can I save a dict whose values are non-serializable? – StatsSorceress Dec 04 '17 at 17:58
import numpy as np
from scipy.sparse import csr_matrix, issparse

def save_sparse_csr(filename, **kwargs):
    # Store each sparse value as its four CSR components; other values go in unchanged.
    arg_dict = dict()
    for key, value in kwargs.items():
        if issparse(value):
            value = value.tocsr()
            arg_dict[key+'_data'] = value.data
            arg_dict[key+'_indices'] = value.indices
            arg_dict[key+'_indptr'] = value.indptr
            arg_dict[key+'_shape'] = value.shape
        else:
            arg_dict[key] = value

    # np.savez appends '.npz' to filename if it is missing.
    np.savez(filename, **arg_dict)

def load_sparse_csr(filename):
    loader = np.load(filename)
    new_d = dict()
    finished_sparse_list = []
    sparse_postfix = ['_data', '_indices', '_indptr', '_shape']

    for key, value in loader.items():
        IS_SPARSE = False
        for postfix in sparse_postfix:
            if key.endswith(postfix):
                IS_SPARSE = True
                # Strip the suffix to recover the name passed to save_sparse_csr.
                key_original = key[:-len(postfix)]
                if key_original not in finished_sparse_list:
                    value_original = csr_matrix((loader[key_original+'_data'], loader[key_original+'_indices'], loader[key_original+'_indptr']),
                                      shape=tuple(loader[key_original+'_shape']))
                    # Note: the result is converted to LIL format, not kept as CSR.
                    new_d[key_original] = value_original.tolil()
                    finished_sparse_list.append(key_original)
                break

        if not IS_SPARSE:
            new_d[key] = value

    return new_d

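You can write a wrapper as shown above. Because the matrices are passed as keyword arguments, integer keys have to be converted to strings before saving and back to integers after loading.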

PLNewbie