I'm searching for a better way to create a scipy sparse matrix from a pandas dataframe.
Here is the pseudocode for what I currently have:
row = []; column = []; values = []
for each row of the dataframe
    for each column of the row
        add the row_id to row
        add the column_id to column
        add the value to values
sparse_matrix = sparse.coo_matrix((values, (row, column)), shape=(max(row)+1, max(column)+1))
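For concreteness, the pseudocode above could be written as the runnable sketch below. It assumes the row ids are the integer labels of the dataframe index and that column ids are simply the column positions. The small frame is only there to make the snippet self-contained.

```python
import pandas as pd
from scipy import sparse

# Small example frame; the "id" index carries the row ids.
dataframe = pd.DataFrame(
    {
        "instructor_id": [2093, 2093, 2094, 2095],
        "primary_department_id": [129, 129, 129, 129],
    },
    index=pd.Index([4109, 6633, 6634, 6635], name="id"),
)

row, column, values = [], [], []
# iterrows() yields (index label, row as a Series) pairs.
for row_id, record in dataframe.iterrows():
    # Assumption: the column id is the column's position (0, 1, ...).
    for column_id, value in enumerate(record):
        row.append(int(row_id))
        column.append(column_id)
        values.append(value)

# The row ids become the actual row coordinates of the matrix.
sparse_matrix = sparse.coo_matrix(
    (values, (row, column)),
    shape=(max(row) + 1, max(column) + 1),
)
```

This works, but it loops over every cell in Python, which is the part I would like to avoid.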
But I believe there must be a better way to do this. What almost worked was the following:
dataframe.unstack().to_sparse().to_coo()
However, this returns a triple of (sparse matrix, column ids, row ids). The issue is that I need the row ids to actually be part of the sparse matrix, as its row coordinates.
Here is a full example. I have a dataframe that looks like this:
      instructor_id  primary_department_id
id
4109           2093                    129
6633           2093                    129
6634           2094                    129
6635           2095                    129
If I do the operation I mentioned above, I get:
ipdb> data = dataframe.unstack().to_sparse().to_coo()[0]
ipdb> data
<2x4 sparse matrix of type '<type 'numpy.int64'>'
with 8 stored elements in COOrdinate format>
ipdb> print data
(0, 0) 2093
(0, 1) 2093
(0, 2) 2094
(0, 3) 2095
(1, 0) 129
(1, 1) 129
(1, 2) 129
(1, 3) 129
But I need:
ipdb> print data
(4109, 0) 2093
(6633, 0) 2093
(6634, 0) 2094
etc.
I am open to using any additional libraries or dependencies.
There seems to be a question that asks for the reverse operation, but I haven't found a solution for this direction.