I want to convert a pandas SparseDataFrame to a scipy.sparse.csc_matrix. But I don't want to convert it back to a dense matrix first.

Right now I have something like this:

df = pd.get_dummies(df, sparse=True)

Basically what I need is to further get a scipy.sparse.csc_matrix from df. Is there a way to do it?

George Sovetov
Han Fang

2 Answers

Thanks to @hpaulj's reply, I ended up using the template from https://stackoverflow.com/a/38157234/7298911.

Here is the modified implementation.

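import numpy as np
import pandas as pd
from scipy import sparse

def sparseDfToCsc(df):
    # Relies on the legacy SparseDataFrame column attributes (sp_values, sp_index, fill_value).
    columns = df.columns
    # Per column: stored values (offset by the fill value) and the row positions they occupy.
    dat, rows = map(list, zip(*[(df[col].sp_values - df[col].fill_value, df[col].sp_index.to_int_index().indices) for col in columns]))
    cols = [np.full(len(a), i) for (i, a) in enumerate(dat)]  # column index repeated per stored value
    datF, rowsF, colsF = np.concatenate(dat), np.concatenate(rows), np.concatenate(cols)
    arr = sparse.coo_matrix((datF, (rowsF, colsF)), df.shape, dtype=np.float64)
    return arr.tocsc()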

df = pd.get_dummies(df, sparse=True)
cscMatrix = sparseDfToCsc(df)
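For readers on newer pandas: SparseDataFrame was deprecated in pandas 0.25 and removed in 1.0, so the function above only runs on older releases. A minimal sketch of the same conversion with the `DataFrame.sparse` accessor follows, assuming pandas 0.25 or later with scipy installed. The sample frame `raw` and its `color` column are made up for illustration, and the sketch assumes every column of the dummy-encoded frame is sparse.

```python
import pandas as pd

# Hypothetical sample data: the only column is categorical, so get_dummies
# returns sparse columns only and the .sparse accessor applies to the frame.
raw = pd.DataFrame({"color": ["red", "green", "red", "blue"]})

# dtype=float gives float columns with a fill value of 0.0,
# since to_coo() expects zero as the fill value.
dummies = pd.get_dummies(raw, sparse=True, dtype=float)

# to_coo() builds a scipy COO matrix from the stored (non-fill) values only;
# tocsc() then converts it to compressed sparse column format.
csc = dummies.sparse.to_coo().tocsc()

print(csc.shape)  # (number of rows, number of dummy columns)
print(csc.nnz)    # number of stored entries
```

If the encoded frame also contains dense columns (for example, numeric columns that get_dummies leaves untouched), the accessor would not apply to the whole frame and those columns would need to be converted to a sparse dtype first.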
Han Fang

I've participated in various questions about converting sparse Pandas objects to scipy sparse matrices.

There is a Pandas method for converting a MultiIndex sparse Series to a coo matrix:

http://pandas-docs.github.io/pandas-docs-travis/sparse.html#interaction-with-scipy-sparse
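As a rough illustration of that Series method, here is a sketch using the accessor form `Series.sparse.to_coo` found in current pandas (the original SparseSeries class has since been removed). The index labels and level names are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical two-level index: one level becomes rows, the other columns.
index = pd.MultiIndex.from_tuples(
    [("r0", "c0"), ("r0", "c1"), ("r1", "c0"), ("r1", "c1")],
    names=["row", "col"],
)
s = pd.Series([1.0, np.nan, np.nan, 2.0], index=index)

# NaN is the default fill value for a sparse float Series,
# so only the two non-NaN values are stored.
ss = s.astype("Sparse[float64]")

# Returns the COO matrix plus the labels used for its rows and columns.
coo, row_labels, col_labels = ss.sparse.to_coo(
    row_levels=["row"], column_levels=["col"], sort_labels=True
)

print(coo.toarray())
```

Note that this works on a Series with a MultiIndex, not on a DataFrame, which is why it does not apply directly to the question.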

But see "Pandas sparse dataFrame to sparse matrix, without generating a dense matrix in memory" for converting a data frame to a sparse matrix,

and

"How do I create a scipy sparse matrix from a pandas dataframe?"

and more recently, 'How can I "sparsify" on two values?'

Once you have a coo matrix, you can easily convert it to csr or csc.
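For instance, a small sketch with scipy alone (the values and coordinates are arbitrary):

```python
import numpy as np
from scipy import sparse

# Three stored values with their (row, column) coordinates.
data = np.array([1.0, 2.0, 3.0])
rows = np.array([0, 2, 1])
cols = np.array([0, 0, 2])

coo = sparse.coo_matrix((data, (rows, cols)), shape=(3, 3))

# Both conversions work on the stored entries only; no dense array is built.
csc = coo.tocsc()
csr = coo.tocsr()

print(csc.format, csr.format)
```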

To avoid confusion I'd suggest creating a sample dataframe, first dense and then converted to sparse. That way we have something concrete to test. I used to recommend the Pandas method, without realizing that a MultiIndex Series was different from a DataFrame.
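A concrete test case along those lines might look like the sketch below. It assumes a recent pandas, where a dense frame is made sparse with `astype(pd.SparseDtype(...))`; the column names and values are placeholders.

```python
import numpy as np
import pandas as pd

# Small dense frame that is easy to check by eye.
dense = pd.DataFrame(
    {"a": [0.0, 1.0, 0.0], "b": [2.0, 0.0, 0.0], "c": [0.0, 0.0, 3.0]}
)

# Sparse copy that treats 0.0 as the fill value, so zeros are not stored.
sparse_df = dense.astype(pd.SparseDtype("float64", 0.0))

# Convert without densifying, then compare against the known dense values.
coo = sparse_df.sparse.to_coo()
matches = np.array_equal(coo.toarray(), dense.to_numpy())

print(matches, coo.nnz)
```

Densifying in the final comparison is acceptable here because the sample is tiny; the point is only to confirm that a conversion routine reproduces the known values.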

hpaulj
  • Thanks for the reply @hpaulj. If I understand you correctly, the best approach should be [Pandas sparse dataFrame to sparse matrix, without generating a dense matrix in memory](http://stackoverflow.com/questions/31084942/pandas-sparse-dataframe-to-sparse-matrix-without-generating-a-dense-matrix-in-m). Right? – Han Fang Dec 16 '16 at 15:48