1

I import binary data from a SQL in a pandas Dataframe consisting of the columns UserId and ItemId. I am using implicit/binary data, as you can see in the pivot_table below.

Dummy data

frame=pd.DataFrame()
frame['Id']=[2134, 23454, 5654, 68768]
frame['ItemId']=[123, 456, 789, 101]

I know how to create a pivot_table in Pandas using:

print(frame.groupby(['Id', 'ItemId'], sort=False).size().unstack(fill_value=0))

ItemId  123  456  789  101
Id
2134      1    0    0    0
23454     0    1    0    0
5654      0    0    1    0
68768     0    0    0    1

and convert that to a SciPy csr_matrix, but I want to create a sparse matrix right from the get-go without having to convert from a Pandas df. The reason for this is that I get an error: Unstacked DataFrame is too big, causing int32 overflow, because my original data consists of 378.777 rows.

Any help is much appreciated!

I am trying to do the same as these answers Efficiently create sparse pivot tables in pandas?

But I do not have the frame['count'] data yet.

Louis
  • 25
  • 6

2 Answers2

1

Using the 4th option to instantiate the matrix:

Id = [2134, 23454, 5654, 68768]
ItemId = [123, 456, 789, 101]

csrm = csr_matrix(([1]*len(Id), (Id,ItemId)))

Result:

<68769x790 sparse matrix of type '<class 'numpy.int32'>'
    with 4 stored elements in Compressed Sparse Row format>
Stef
  • 28,728
  • 2
  • 24
  • 52
0

I am assuming that you can somehow read the lines of data values into separate lists in memory, i.e., like you did it in your example (having lists for the Id and ItemId). According to the comments on your post, we also do not expect duplicates. Note that the following will not work, if you have duplicates!

The presented solution also introduces a (sparse) matrix that is not as dense as shown in the example, as we will directly use the Id values as matrix/row entries.

To pass them to the constructor, if you're having a look at the SciPy documentation:

csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])

where data, row_ind and col_ind satisfy the relationship a[row_ind[k], col_ind[k]] = data[k].

Meaning we can directly pass the lists as indices to our sparse matrix as follows:

from scipy.sparse import csr_matrix
Id_values = load_values() # gets the list of entries as in the post example
ItemId_values = load_more_values()

sparse_mat = csr_matrix(([1]*len(Id_values), # entries will be filled with ones
                        (Id_values, ItemId_values)), # at those positions
                        shape=(max(Id_values)+1, max(ItemId_values)+1)) # shape is the respective maximum entry of each dimension

Note that this will not give you any sorting, but instead put the values at their respective Id position, i.e. the first pair would be held at position (2134, 134) instead of (0, 0)

dennlinger
  • 9,890
  • 1
  • 42
  • 63
  • extremely interesting! the combinations of Id*Items (so the positions in the sparse matrix) are never duplicated, however there are duplicate Ids and duplicate Items. As I understand, that is not a problem? – Louis Jul 09 '19 at 12:04
  • Yeah exactly! It basically enters a different row/column positions. This is why I was asking about the duplicates ;-) – dennlinger Jul 09 '19 at 12:07