How do I create a scipy sparse matrix from a pandas dataframe?

Question

I'm searching for an better way to create a scipy sparse matrix from a pandas dataframe.

Here is the pseudocode for what I currently have

row = []; column = []; values = []
for each row of the dataframe
    for each column of the row
        add the row_id to row
        add the column_id to column
        add the value to values
sparse_matrix = sparse.coo_matrix((values, (row, column), shape=(max(row)+1,max(column)+1))

But I personally believe there would be a better way to do things. What almost worked was the following

dataframe.unstack().to_sparse().to_coo()

However, this returned me a triple of (sparse matrix, column ids, and row ids). The issue is that I need the row ids to actually be part of the sparse matrix.

Here is a full example. I have a dataframe that looks like follows

          instructor_id  primary_department_id
id
4109           2093                    129
6633           2093                    129
6634           2094                    129
6635           2095                    129

If I do the operation I mentioned above, I get

ipdb> data = dataframe.unstack().to_sparse().to_coo()[0]
ipdb> data
<2x4 sparse matrix of type '<type 'numpy.int64'>'
    with 8 stored elements in COOrdinate format>
ipdb> print data
  (0, 0)    2093
  (0, 1)    2093
  (0, 2)    2094
  (0, 3)    2095
  (1, 0)    129
  (1, 1)    129
  (1, 2)    129
  (1, 3)    129

But I need

ipdb> print data
  (4109, 0) 2093
  (6633, 0) 2093
  (6634, 0) 2094
  etc.

I am open to using any additional libraries or dependencies.

There seems to be a question that asks for the reverse operation but I haven't found a solution for this operation.

"The issue is that I need the row ids to actually be part of the sparse matrix" - could you explain what you mean by this? — ali_m, May 07 '16 at 03:10
A full working example program would help, with hard-coded input data. I'm not sure why you'd want to turn a full, dense DataFrame into a sparse matrix--are you sure you want that? Why? — John Zwinck, May 07 '16 at 03:41
Have you looked at the pandas version of sparse? There are a couple of recent questions about going back and forth between that and scipy sparse. http://pandas.pydata.org/pandas-docs/stable/sparse.html — hpaulj, May 07 '16 at 03:50
@JohnZwinck Well I start out with a pandas sparse matrix. I suppose it wouldn't be an efficient idea on a large dataset to convert it to a dense matrix. — Kent Shikama, May 07 '16 at 04:10
@ali_m I've added an example; thanks for pointing out potential areas of confusion. — Kent Shikama, May 07 '16 at 04:10
http://stackoverflow.com/questions/33457626/huge-sparse-dataframe-to-scipy-sparse-matrix-without-dense-transform — hpaulj, May 07 '16 at 04:12
@hpaulj Yeah unfortunately it only documents switching between pandas SparseSeries and scipy sparse matricies — Kent Shikama, May 07 '16 at 04:12
@hpaulj Yeah I read that answer before I posted this question. — Kent Shikama, May 07 '16 at 04:14

score 2 · Accepted Answer · edited May 23 '17 at 12:18

I don't have pandas installed, so can't start with a dataframe. But lets assume I have extracted a numpy array from dataframe (doesn't a method or attribute like values do that?):

In [40]: D
Out[40]: 
array([[4109, 2093],    # could be other columns
       [6633, 2093],
       [6634, 2094],
       [6635, 2095]])

Making a sparse matrix from that is straight forward - I just need to extract or construct the 3 arrays:

In [41]: M=sparse.coo_matrix((D[:,1], (D[:,0], np.zeros(D.shape[0]))),
   shape=(7000,1))

In [42]: M
Out[42]: 
<7000x1 sparse matrix of type '<class 'numpy.int32'>'
    with 4 stored elements in COOrdinate format>

In [43]: print(M)
  (4109, 0) 2093
  (6633, 0) 2093
  (6634, 0) 2094
  (6635, 0) 2095

=======================

Generalized to two 'data' columns

In [70]: D
Out[70]: 
array([[4109, 2093,  128],
       [6633, 2093,  129],
       [6634, 2094,  127],
       [6635, 2095,  126]])

In [76]: i,j,data=[],[],[]

In [77]: for col in range(1,D.shape[1]):
    i.extend(D[:,0])
    j.extend(np.zeros(D.shape[0],int)+(col-1))
    data.extend(D[:,col])
   ....:     

In [78]: i
Out[78]: [4109, 6633, 6634, 6635, 4109, 6633, 6634, 6635]

In [79]: j
Out[79]: [0, 0, 0, 0, 1, 1, 1, 1]

In [80]: data
Out[80]: [2093, 2093, 2094, 2095, 128, 129, 127, 126]

In [83]: M=sparse.coo_matrix((data,(i,j)),shape=(7000,D.shape[1]-1))

In [84]: M
Out[84]: 
<7000x2 sparse matrix of type '<class 'numpy.int32'>'
    with 8 stored elements in COOrdinate format>

In [85]: print(M)
  (4109, 0) 2093
  (6633, 0) 2093
  (6634, 0) 2094
  (6635, 0) 2095
  (4109, 1) 128
  (6633, 1) 129
  (6634, 1) 127
  (6635, 1) 126

I suspect you could also make separate matrices for each column, and combine them with the sparse.bmat (block) mechanism, but I'm most familiar with the coo format.

See Compiling n submatrices into an NxN matrix in numpy

for another example of building a large sparse matrix from submatrices (here they overlap). There I found a way of joining the blocks with a faster array operation. It might be possible to do that here. But I suspect that the iteration over a few columns (and extend over many rows) is is ok speed wise.

With bmat I could construct the same thing as:

In [98]: I, J = D[:,0], np.zeros(D.shape[0],int)

In [99]: M1=sparse.coo_matrix((D[:,1],(I, J)), shape=(7000,1))
In [100]: M2=sparse.coo_matrix((D[:,2],(I, J)), shape=(7000,1))

In [101]: print(sparse.bmat([[M1,M2]]))
  (4109, 0) 2093
  (6633, 0) 2093
  (6634, 0) 2094
  (6635, 0) 2095
  (4109, 1) 128
  (6633, 1) 129
  (6634, 1) 127
  (6635, 1) 126

So for more columns would you suggest combining sparse matrices? Or should I attempt something like `M=sparse.coo_matrix((append(D[:,1], D[:2]) (append(D[:,0],D[:,0]), append(np.zeros(D.shape[0])),np.ones(D.shape[0]))), shape=(7000,2))`? — Kent Shikama, May 07 '16 at 05:00
I mean I need it for an arbitrary amount of columns so wouldn't I end up needing a for loop like I proposed in my original post? — Kent Shikama, May 07 '16 at 05:02
At some point going back and starting this properly with a dataframe would be a nice edit — Hack-R, Jun 24 '18 at 01:59

score 1 · Answer 2 · answered Aug 25 '17 at 01:37

A simple solution would be:

import numpy as np
import pandas as pd
df = pd.DataFrame(data = [[2093, 129], [2093, 129], [2094, 129], [2095, 129]], index = [4109, 6633, 6634, 6635], columns = ['instructor_id', 'primary_department_id'])

from scipy.sparse import lil_matrix
sparse_matrix = lil_matrix((df.index.max()+1, len(df.columns)))
for k, column_name in enumerate(df.columns):
    sparse_matrix[df.index.values, np.full(len(df), k)] = df[column_name].values

If you want to use a compressed format, you can just convert it:

sparse_matrix = sparse_matrix.tocsc()

How do I create a scipy sparse matrix from a pandas dataframe?

2 Answers2

Linked