I'm trying to convert a TF-IDF sparse matrix to JSON. Converting it to a pandas DataFrame first (via toarray() or todense()) causes a memory error, so I would like to avoid those approaches. Is there another way to convert it to JSON?

Below is my approach to getting the sparse matrix, and my preferred JSON outcome.

Thanks for helping me out ... !


TF-IDF matrix

pip = Pipeline([('hash', HashingVectorizer(ngram_range=(1, 1), non_negative=True)), ('tfidf', TfidfTransformer())])
result_uni_gram = pip.fit_transform(df_news_noun['content_nouns'])

Returned matrix

result_uni_gram

<112537x1048576 sparse matrix of type '<class 'numpy.float64'>'
    with 12605888 stored elements in Compressed Sparse Row format>



print(result_uni_gram)

(0, 1041232)    0.03397010691200069
(0, 1035546)    0.042603425242006505
(0, 1031141)    0.05579563771771019
(0, 1029045)    0.03985981185871279
(0, 1028867)    0.14591155976555212
(0, 1017328)    0.03827279930970525
:   :
(112536, 9046)  0.04444360144902461
(112536, 4920)  0.07335227778871069
(112536, 4301)  0.06667794684006756

Expected outcome

output_json = {
                0: {1041232 : 0.03397, 1035546 : 0.04260, 1031141 : 0.055795 ... }, 
                ...
                ... 112536: {9046 : 0.04444, 4920 : 0.07335, 4301 : 0.06667}
               }


2 Answers


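I managed to do it like this, given that 'test_samples' is your 'scipy.sparse.csr.csr_matrix'. Note that toarray() builds a dense array, so this only works when the dense matrix fits in memory.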

 import json
 # toarray() materializes the full dense matrix before serializing it
 np_test_samples = test_samples.toarray()
 jason_test_samples = json.dumps({"data": np_test_samples.tolist()})
  • This would be a lot more useful if you included instructions to load the serialized data back into Python objects. – rjurney Jan 27 '21 at 21:26
  • @rjurney He is converting it to a dense array, then to a Python list. You would use json.loads(jason_test_samples) to get back the list and use [this](https://stackoverflow.com/a/7922642/841307) solution to convert it to sparse. – vandre Mar 28 '23 at 16:25
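
As a rough sketch of the loading step described in the comment above: json.loads gives back the nested list, and csr_matrix can rebuild a sparse matrix from it. The tiny 2x3 matrix here is a made-up stand-in for 'test_samples', and this still goes through a dense array, so it only suits matrices that fit in memory.

```python
import json
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical small stand-in for 'test_samples'
test_samples = csr_matrix(np.array([[0.0, 0.0, 2.0], [3.0, 0.0, 0.0]]))

# Serialize as in the answer (dense array -> nested list -> JSON string)
jason_test_samples = json.dumps({"data": test_samples.toarray().tolist()})

# Load the nested list back and convert it to a sparse matrix again
loaded = json.loads(jason_test_samples)
restored = csr_matrix(np.array(loaded["data"]))
```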

The script below does not produce your 'preferred' JSON format, but hopefully it helps anyone else trying to convert a sparse matrix to JSON and back. Since an ndarray is not JSON serializable, I converted the arrays to lists and built a custom JSON object from them. This is more efficient than mat.toarray().tolist(), which creates a dense array.

import json
import numpy as np
from scipy.sparse import csr_matrix

row = np.array([0,1])
col = np.array([2,0])
data = np.array([2,3])
mat = csr_matrix((data, (row, col)), shape=(2, 3))

# mat is:
#[[0 0 2]
# [3 0 0]]

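# COO format exposes matching data/row/col arrays for every stored entry.
# The shape is saved too, so trailing all-zero rows or columns survive.
coo = mat.tocoo()
json_str = json.dumps({"data": coo.data.tolist(), "row": coo.row.tolist(),
 "col": coo.col.tolist(), "shape": list(coo.shape)})

obj = json.loads(json_str)

# Rebuild from (data, (row, col)) coordinates without creating a dense array
mat2 = csr_matrix((obj['data'], (obj['row'], obj['col'])), shape=tuple(obj['shape']))

# True when the round trip preserved every entry
print((mat != mat2).nnz == 0)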

print(mat)
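
For the nested {row: {col: value}} layout asked for in the question, one possible sketch is to walk the CSR arrays (indptr, indices, data) row by row, which never builds a dense array. The helper name csr_to_nested_dict and the ndigits rounding parameter are my own assumptions for illustration, not part of scipy.

```python
import json
import numpy as np
from scipy.sparse import csr_matrix


def csr_to_nested_dict(mat, ndigits=5):
    """Build {row: {col: value}} from the stored entries of a CSR matrix."""
    mat = mat.tocsr()  # ensure CSR layout so indptr/indices are per row
    nested = {}
    for i in range(mat.shape[0]):
        start, end = mat.indptr[i], mat.indptr[i + 1]
        if start == end:
            continue  # row has no stored entries, leave it out
        cols = mat.indices[start:end].tolist()
        vals = mat.data[start:end].tolist()
        nested[i] = {c: round(v, ndigits) for c, v in zip(cols, vals)}
    return nested


# Same toy matrix as above, with float values like a TF-IDF result.
mat = csr_matrix((np.array([2.0, 3.0]), (np.array([0, 1]), np.array([2, 0]))),
                 shape=(2, 3))
nested = csr_to_nested_dict(mat)
json_str = json.dumps(nested)  # note: JSON object keys always become strings
```

With millions of stored elements the Python dict itself can get large, so writing each row's dict to a file as its own JSON line may be a better fit than a single json.dumps call.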
vandre