3

I'm searching for an better way to create a scipy sparse matrix from a pandas dataframe.

Here is the pseudocode for what I currently have

row = []; column = []; values = []
for each row of the dataframe
    for each column of the row
        add the row_id to row
        add the column_id to column
        add the value to values
sparse_matrix = sparse.coo_matrix((values, (row, column), shape=(max(row)+1,max(column)+1))

But I personally believe there would be a better way to do things. What almost worked was the following

dataframe.unstack().to_sparse().to_coo()

However, this returned me a triple of (sparse matrix, column ids, and row ids). The issue is that I need the row ids to actually be part of the sparse matrix.

Here is a full example. I have a dataframe that looks like follows

          instructor_id  primary_department_id
id
4109           2093                    129
6633           2093                    129
6634           2094                    129
6635           2095                    129

If I do the operation I mentioned above, I get

ipdb> data = dataframe.unstack().to_sparse().to_coo()[0]
ipdb> data
<2x4 sparse matrix of type '<type 'numpy.int64'>'
    with 8 stored elements in COOrdinate format>
ipdb> print data
  (0, 0)    2093
  (0, 1)    2093
  (0, 2)    2094
  (0, 3)    2095
  (1, 0)    129
  (1, 1)    129
  (1, 2)    129
  (1, 3)    129

But I need

ipdb> print data
  (4109, 0) 2093
  (6633, 0) 2093
  (6634, 0) 2094
  etc.

I am open to using any additional libraries or dependencies.

There seems to be a question that asks for the reverse operation but I haven't found a solution for this operation.

Community
  • 1
  • 1
Kent Shikama
  • 3,910
  • 3
  • 22
  • 55
  • 1
    "The issue is that I need the row ids to actually be part of the sparse matrix" - could you explain what you mean by this? – ali_m May 07 '16 at 03:10
  • A full working example program would help, with hard-coded input data. I'm not sure why you'd want to turn a full, dense DataFrame into a sparse matrix--are you sure you want that? Why? – John Zwinck May 07 '16 at 03:41
  • Have you looked at the pandas version of sparse? There are a couple of recent questions about going back and forth between that and scipy sparse. http://pandas.pydata.org/pandas-docs/stable/sparse.html – hpaulj May 07 '16 at 03:50
  • @JohnZwinck Well I start out with a pandas sparse matrix. I suppose it wouldn't be an efficient idea on a large dataset to convert it to a dense matrix. – Kent Shikama May 07 '16 at 04:10
  • @ali_m I've added an example; thanks for pointing out potential areas of confusion. – Kent Shikama May 07 '16 at 04:10
  • http://stackoverflow.com/questions/33457626/huge-sparse-dataframe-to-scipy-sparse-matrix-without-dense-transform – hpaulj May 07 '16 at 04:12
  • @hpaulj Yeah unfortunately it only documents switching between pandas SparseSeries and scipy sparse matricies – Kent Shikama May 07 '16 at 04:12
  • @hpaulj Yeah I read that answer before I posted this question. – Kent Shikama May 07 '16 at 04:14

2 Answers2

2

I don't have pandas installed, so can't start with a dataframe. But lets assume I have extracted a numpy array from dataframe (doesn't a method or attribute like values do that?):

In [40]: D
Out[40]: 
array([[4109, 2093],    # could be other columns
       [6633, 2093],
       [6634, 2094],
       [6635, 2095]])

Making a sparse matrix from that is straight forward - I just need to extract or construct the 3 arrays:

In [41]: M=sparse.coo_matrix((D[:,1], (D[:,0], np.zeros(D.shape[0]))),
   shape=(7000,1))

In [42]: M
Out[42]: 
<7000x1 sparse matrix of type '<class 'numpy.int32'>'
    with 4 stored elements in COOrdinate format>

In [43]: print(M)
  (4109, 0) 2093
  (6633, 0) 2093
  (6634, 0) 2094
  (6635, 0) 2095

=======================

Generalized to two 'data' columns

In [70]: D
Out[70]: 
array([[4109, 2093,  128],
       [6633, 2093,  129],
       [6634, 2094,  127],
       [6635, 2095,  126]])

In [76]: i,j,data=[],[],[]

In [77]: for col in range(1,D.shape[1]):
    i.extend(D[:,0])
    j.extend(np.zeros(D.shape[0],int)+(col-1))
    data.extend(D[:,col])
   ....:     

In [78]: i
Out[78]: [4109, 6633, 6634, 6635, 4109, 6633, 6634, 6635]

In [79]: j
Out[79]: [0, 0, 0, 0, 1, 1, 1, 1]

In [80]: data
Out[80]: [2093, 2093, 2094, 2095, 128, 129, 127, 126]

In [83]: M=sparse.coo_matrix((data,(i,j)),shape=(7000,D.shape[1]-1))

In [84]: M
Out[84]: 
<7000x2 sparse matrix of type '<class 'numpy.int32'>'
    with 8 stored elements in COOrdinate format>

In [85]: print(M)
  (4109, 0) 2093
  (6633, 0) 2093
  (6634, 0) 2094
  (6635, 0) 2095
  (4109, 1) 128
  (6633, 1) 129
  (6634, 1) 127
  (6635, 1) 126

I suspect you could also make separate matrices for each column, and combine them with the sparse.bmat (block) mechanism, but I'm most familiar with the coo format.

See Compiling n submatrices into an NxN matrix in numpy

for another example of building a large sparse matrix from submatrices (here they overlap). There I found a way of joining the blocks with a faster array operation. It might be possible to do that here. But I suspect that the iteration over a few columns (and extend over many rows) is is ok speed wise.

With bmat I could construct the same thing as:

In [98]: I, J = D[:,0], np.zeros(D.shape[0],int)

In [99]: M1=sparse.coo_matrix((D[:,1],(I, J)), shape=(7000,1))
In [100]: M2=sparse.coo_matrix((D[:,2],(I, J)), shape=(7000,1))

In [101]: print(sparse.bmat([[M1,M2]]))
  (4109, 0) 2093
  (6633, 0) 2093
  (6634, 0) 2094
  (6635, 0) 2095
  (4109, 1) 128
  (6633, 1) 129
  (6634, 1) 127
  (6635, 1) 126
Community
  • 1
  • 1
hpaulj
  • 221,503
  • 14
  • 230
  • 353
  • So for more columns would you suggest combining sparse matrices? Or should I attempt something like `M=sparse.coo_matrix((append(D[:,1], D[:2]) (append(D[:,0],D[:,0]), append(np.zeros(D.shape[0])),np.ones(D.shape[0]))), shape=(7000,2))`? – Kent Shikama May 07 '16 at 05:00
  • I mean I need it for an arbitrary amount of columns so wouldn't I end up needing a for loop like I proposed in my original post? – Kent Shikama May 07 '16 at 05:02
  • Does my generalization to 2 data columns help? – hpaulj May 07 '16 at 07:00
  • Yes that is exactly what I need. Thank you! – Kent Shikama May 07 '16 at 08:03
  • At some point going back and starting this properly with a dataframe would be a nice edit – Hack-R Jun 24 '18 at 01:59
1

A simple solution would be:

import numpy as np
import pandas as pd
df = pd.DataFrame(data = [[2093, 129], [2093, 129], [2094, 129], [2095, 129]], index = [4109, 6633, 6634, 6635], columns = ['instructor_id', 'primary_department_id'])

from scipy.sparse import lil_matrix
sparse_matrix = lil_matrix((df.index.max()+1, len(df.columns)))
for k, column_name in enumerate(df.columns):
    sparse_matrix[df.index.values, np.full(len(df), k)] = df[column_name].values

If you want to use a compressed format, you can just convert it:

sparse_matrix = sparse_matrix.tocsc()
Ben
  • 639
  • 7
  • 6