I am trying to build a scipy.sparse matrix from a JSON file.

My JSON file looks like this:

{"reviewerID": "A10000012B7CGYKOMPQ4L", "asin": "000100039X", "reviewerName": "Adam", "helpful": [0, 0], "reviewText": "Spiritually and mentally inspiring! A book that allows you to question your morals and will help you discover who you really are!", "overall": 5.0, "summary": "Wonderful!", "unixReviewTime": 1355616000, "reviewTime": "12 16, 2012"} 

That is the format of each record; the file contains many more like it (it is based on the Amazon review file).

I want to build a SciPy sparse matrix that looks like this:

    count            
object       a   b   c   d
id                   
him       NaN   1 NaN   1
me          1 NaN NaN   1
you         1 NaN   1 NaN

This is what I have tried:

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

df= pd.read_json('C:\\Users\\anto-\\Desktop\\university\\Big Data computing\\Ex. Resource\\test2.json',lines=True)


a= df['reviewerID']
b= df['asin']
data= df.groupby(["reviewerID"]).size()



row = df.reviewerID.astype('category', categories=a).cat.codes
col = df.asin.astype('category', categories=b).cat.codes
sparse_matrix = csr_matrix((data, (row, col)), shape=(len(a), len(b)))

I based this on an older question:

Efficiently create sparse pivot tables in pandas?

My code triggers a warning about a deprecated argument, and I don't understand how to construct this matrix.

This is the error log:

 FutureWarning: specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead
  from ipykernel import kernelapp as app

I'm a bit confused. Can anyone give me a suggestion or a similar example?

theantomc
  • Barring the fact that your input is not proper JSON, loading it as a full matrix into pandas somewhat defeats the purpose of making it sparse, no? I would expect that you would have better luck with the standard Python json module. – Mad Physicist Nov 04 '18 at 17:01
  • Normally we ask for the actual errors and tracebacks, so we can see exactly what the problem(s) is and where it occurs. You don't show us a real `JSON` string or file. You don't show the resulting data frame. Nor the resulting `data, row, col` arrays. – hpaulj Nov 04 '18 at 17:10
  • @MadPhysicist I want a sparse representation so that I can compute a similarity function. You are right, my JSON was represented incorrectly; I have corrected it. – theantomc Nov 04 '18 at 18:59
  • @hpaulj I tried this and ran it several times, and got different errors at different times. I'm not interested in a solution to my specific problem; I just want to understand the procedure so I can apply it to any file. I created my data, row, col based on the example in the link, so I don't think showing them is needed. – theantomc Nov 04 '18 at 19:02
  • Future warnings aren't real errors. They warn you of potential problems later on, but they don't prevent current code from running. The place to explore is the pandas documentation (its `astype` etc.). But is there a problem with the `sparse_matrix` variable? – hpaulj Nov 04 '18 at 19:19
  • @hpaulj Sorry for the confusion. The latest error is `ValueError: Categorical categories must be unique`, raised when I create row and col. Yes, the sparse matrix is not created this way. I tried the pivot function too, but I get a different matrix, not the final output I want (because my input JSON is very complex, with many fields). Thanks a lot for the support. – theantomc Nov 04 '18 at 19:36
  • I'm not sure what you fixed. Can you show the input in the actual JSON format you use rather than the full path of your file? – Mad Physicist Nov 04 '18 at 20:18

1 Answer

To produce a sparse matrix that looks like

    count            
object       a   b   c   d
id                   
him       NaN   1 NaN   1
me          1 NaN NaN   1
you         1 NaN   1 NaN

You need to generate 3 arrays like:

In [215]: from scipy import sparse
In [216]: data = np.array([1,1,1,1,1,1])
In [217]: row = np.array([1,2,0,2,0,1])
In [218]: col = np.array([0,0,1,2,3,3])
In [219]: M = sparse.csr_matrix((data, (row, col)), shape=(3,4))
In [220]: M
Out[220]: 
<3x4 sparse matrix of type '<class 'numpy.int64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [221]: M.A
Out[221]: 
array([[0, 1, 0, 1],
       [1, 0, 0, 1],
       [1, 0, 1, 0]], dtype=int64)

Categories like 'him','me','you' have to be mapped onto unique indices like 0,1,2. Likewise for 'a','b','c','d'.
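As a rough sketch of that mapping (not necessarily the only way to do it), the labels can be turned into integer codes with a pandas `CategoricalDtype` built from the unique values of each column. Passing the raw, repeated column as `categories` is what triggers `ValueError: Categorical categories must be unique`. The toy DataFrame below is a hypothetical stand-in for the review file, using `reviewerID` and `asin` as the row and column fields:

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy.sparse import csr_matrix

# Hypothetical toy data standing in for the real review file:
# one line per (reviewerID, asin) pair that actually occurs.
df = pd.DataFrame({
    "reviewerID": ["him", "him", "me", "me", "you", "you"],
    "asin":       ["b",   "d",   "a",  "d",  "a",   "c"],
})

# Categories must be unique, so build them from the distinct labels.
reviewers = sorted(df["reviewerID"].unique())
items = sorted(df["asin"].unique())

# Map each label to its integer position in the category list.
row = df["reviewerID"].astype(CategoricalDtype(categories=reviewers)).cat.codes.to_numpy()
col = df["asin"].astype(CategoricalDtype(categories=items)).cat.codes.to_numpy()

# One entry per observed pair; repeated (row, col) pairs are summed
# when the CSR matrix is built, which yields a count.
data = np.ones(len(df), dtype=np.int64)

M = csr_matrix((data, (row, col)), shape=(len(reviewers), len(items)))
dense = M.toarray()
print(dense)
```

Pairs that never occur are simply not stored and read back as 0 rather than NaN. `reviewers[i]` and `items[j]` give the labels for row `i` and column `j`.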

hpaulj
  • So I need to transform my JSON fields into arrays like data, row and col? Thanks for your answer. – theantomc Nov 04 '18 at 20:36
  • Supposedly that's what the dataframe grouping and categorizing does for you. But I'll let others explain that process. – hpaulj Nov 04 '18 at 20:46
  • OK @hpaulj, I get the structure, but how do I set the rows from "reviewerID" in my JSON and the columns from "asin"? Moreover, I don't have a "data" array: a value exists only if an "id" has an "asin"; otherwise I have NaN, as in the output structure I wrote. Do I need to preprocess these values? – theantomc Nov 04 '18 at 22:10
  • Pandas sparse allows fills like `nan`, but scipy sparse just stores non-zero values, as my example shows. – hpaulj Nov 04 '18 at 23:55