
I have a SQL table which I can read in as a Pandas data frame, that has the following structure:

user_id    value
1          100
1          200
2          100
4          200

It's a representation of a matrix, for which all the values are 1 or 0. The dense representation of this matrix would look like this:

    100  200
1   1    1
2   1    0
4   0    1

Normally you would use pivot for this conversion, but in my case, with tens or hundreds of millions of rows in the first table, you get a big dense matrix full of zeros which is expensive to drag around. You can convert it to sparse afterwards, but getting that far requires a lot of resources.
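For reference, here is the dense approach on the toy data, written with `pandas.crosstab` (a one-line equivalent of the pivot; the names here are just for illustration). This is exactly the step that becomes infeasible at scale, since it materializes every zero:

```python
import pandas

df = pandas.DataFrame({'user_id': [1, 2, 1, 4], 'value': [100, 100, 200, 200]})

# crosstab counts (user_id, value) pairs; each pair appears at most once
# here, so the result is the dense 0/1 matrix from above.
dense = pandas.crosstab(df['user_id'], df['value'])
```

This gives the small matrix shown above, but as a fully dense DataFrame.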

Right now I'm working on a solution that assigns row numbers to each user_id, sorts, and then splits the 'value' column into SparseSeries before recombining them into a SparseDataFrame. Is there a better way?

Patrick McCarthy
  • I haven't seen much discussion of SparseSeries on SO. I've answered a few questions on transferring back and forth between that and the `scipy` `sparse` matrices. My impression is that the Pandas `sparse` structure is still under development. – hpaulj Feb 23 '16 at 19:02
  • I've found a few, like your answer here http://stackoverflow.com/questions/34181494/populate-a-pandas-sparsedataframe-from-a-scipy-sparse-coo-matrix The problem is it doesn't seem to scale. Right now I'm trying to convert a csc matrix that's 40,000 x 15,000 and it's been running for more than 30 minutes. – Patrick McCarthy Feb 23 '16 at 19:09

1 Answer


I arrived at a solution, albeit a slightly imperfect one.

What one can do is manually create a Pandas SparseSeries from each column, combine them into a dict, and then cast that dict to a DataFrame (not a SparseDataFrame). Casting to SparseDataFrame currently hits an immature constructor, which deconstructs the whole object into dense form and then back into sparse form regardless of the input. Building SparseSeries into a conventional DataFrame, however, maintains sparsity but still yields a viable and otherwise complete DataFrame object.

Here's a demonstration of how to do it, written more for clarity than for performance. One difference from my own implementation is that I built the dict of sparse vectors with a dict comprehension instead of a loop.

import pandas
import numpy

df = pandas.DataFrame({'user_id': [1, 2, 1, 4], 'value': [100, 100, 200, 200]})

# Get the sorted unique users and unique features
unique_users = df['user_id'].unique()
unique_features = df['value'].unique()
unique_users.sort()
unique_features.sort()
num_rows = len(unique_users)
num_features = len(unique_features)

# Assign each user_id to a row number
user_lookup = pandas.DataFrame({'uid': range(num_rows), 'user_id': unique_users})

vec_dict = {}

# Create a sparse vector for each feature
for i in range(num_features):
    # All users who have this feature
    users_with_feature = df[df['value'] == unique_features[i]]['user_id']

    # Row numbers of those users
    uid_rows = user_lookup[user_lookup['user_id'].isin(users_with_feature)]['uid']

    # Build a dense 0/1 column for this feature, then sparsify it
    vec = numpy.zeros(num_rows)
    vec[uid_rows] = 1
    vec_dict[unique_features[i]] = pandas.Series(vec).to_sparse(fill_value=0)

# A plain DataFrame built from SparseSeries keeps its columns sparse
my_pandas_frame = pandas.DataFrame(vec_dict)
my_pandas_frame = my_pandas_frame.set_index(user_lookup['user_id'])

The results:

>>> my_pandas_frame
         100  200
user_id          
1          1    1
2          1    0
4          0    1

>>> type(my_pandas_frame)
<class 'pandas.core.frame.DataFrame'>

>>> type(my_pandas_frame[100])
<class 'pandas.sparse.series.SparseSeries'>

Complete, but still sparse. There are a few caveats: if you do a simple copy, or subset not-in-place, it will forget itself and try to recast to dense. But for my purposes I'm pretty happy with it.
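As a footnote, following up on the scipy discussion in the comments: if the end goal is a scipy sparse matrix rather than a Pandas object, one can skip the dense intermediate entirely by building a COO matrix from integer codes. This is a sketch of that alternative, not the method above:

```python
import numpy
import pandas
from scipy import sparse

df = pandas.DataFrame({'user_id': [1, 2, 1, 4], 'value': [100, 100, 200, 200]})

# factorize maps each label to a 0-based integer code; sort=True makes the
# row/column order match the sorted unique labels
row_codes, row_labels = pandas.factorize(df['user_id'], sort=True)
col_codes, col_labels = pandas.factorize(df['value'], sort=True)

# One 1 per (user_id, value) pair; COO would sum any duplicate pairs
mat = sparse.coo_matrix(
    (numpy.ones(len(df)), (row_codes, col_codes)),
    shape=(len(row_labels), len(col_labels)),
)
```

`row_labels` and `col_labels` then serve the same role as the index and columns of the frame above, and `mat` never passes through a dense representation.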

Patrick McCarthy