I import binary data from a SQL database into a pandas DataFrame consisting of the columns UserId (called Id in the dummy data below) and ItemId. I am using implicit/binary data, as you can see in the pivot_table below.
Dummy data:
import pandas as pd
frame=pd.DataFrame()
frame['Id']=[2134, 23454, 5654, 68768]
frame['ItemId']=[123, 456, 789, 101]
I know how to create a pivot_table in pandas using:
print(frame.groupby(['Id', 'ItemId'], sort=False).size().unstack(fill_value=0))
ItemId  123  456  789  101
Id
2134      1    0    0    0
23454     0    1    0    0
5654      0    0    1    0
68768     0    0    0    1
and convert that to a SciPy csr_matrix, but I want to create a sparse matrix right from the get-go without having to convert from a pandas DataFrame. The reason is that I get the error "Unstacked DataFrame is too big, causing int32 overflow", because my original data consists of 378,777 rows.
Any help is much appreciated!
I am trying to do the same as the answers to "Efficiently create sparse pivot tables in pandas?", but I do not have the frame['count'] data yet.
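To make the goal concrete, here is a minimal sketch of the kind of thing I am after, using the dummy data above. It assumes that the missing count column can be replaced by an array of ones (one per row), and that pd.factorize is an acceptable way to turn the Id and ItemId values into row and column positions. The variable names (row_codes, col_codes, sparse) are only illustrative, and I have not confirmed this is the recommended approach for the full data set.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Same dummy data as above.
frame = pd.DataFrame({
    'Id': [2134, 23454, 5654, 68768],
    'ItemId': [123, 456, 789, 101],
})

# factorize maps each distinct value to a 0-based integer code,
# in order of first appearance, and also returns the distinct labels.
row_codes, row_labels = pd.factorize(frame['Id'])
col_codes, col_labels = pd.factorize(frame['ItemId'])

# Assumption: with no 'count' column, each row contributes a 1.
# Duplicate (row, col) pairs are summed by the csr_matrix constructor,
# which matches what groupby(...).size() would produce.
data = np.ones(len(frame), dtype=np.int32)

sparse = csr_matrix(
    (data, (row_codes, col_codes)),
    shape=(len(row_labels), len(col_labels)),
)

print(sparse.toarray())
```

Here row_labels and col_labels keep the mapping from matrix positions back to the original Id and ItemId values, so no dense pivot table is ever built.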