
Is there a way to convert a `pandas.SparseDataFrame` to a `scipy.sparse.csr_matrix` without generating a dense matrix in memory?

scipy.sparse.csr_matrix(df.values)

doesn't work, as it first builds a dense matrix which is then cast to a `csr_matrix`.
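To see the problem with a small made-up frame (written against the modern sparse-dtype API, since `SparseDataFrame` was later removed from pandas): `df.values`/`df.to_numpy()` materializes a plain dense ndarray before scipy ever sees the data.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# A tiny illustrative frame with a sparse-typed column
df = pd.DataFrame({"a": pd.arrays.SparseArray([0, 0, 1, 0])})

dense = df.to_numpy()  # densifies the sparse column into an ndarray
assert isinstance(dense, np.ndarray)

# so this route always goes through a dense intermediate
csr = sparse.csr_matrix(df.to_numpy())
print(csr.nnz)  # 1
```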

Thanks in advance!

Jake0x32
  • Run this in reverse? http://stackoverflow.com/questions/17818783/populate-a-pandas-sparsedataframe-from-a-scipy-sparse-matrix – JohnE Jun 27 '15 at 13:28

6 Answers


Pandas 0.20.0+:

As of pandas version 0.20.0, released May 5, 2017, there is a one-liner for this:

from scipy import sparse


def sparse_df_to_csr(df):
    return sparse.csr_matrix(df.to_coo())

This uses the new to_coo() method.

Earlier Versions:

Building on Victor May's answer, here's a slightly faster implementation, but it only works if the entire SparseDataFrame is sparse with all columns stored as BlockIndex (note: if it was created with get_dummies, this will be the case).

Edit: I modified this so it will work with a non-zero fill value. CSR has no native notion of a non-zero fill value, so you will have to record it externally.

import numpy as np
import pandas as pd
from scipy import sparse

def sparse_BlockIndex_df_to_csr(df):
    columns = df.columns
    zipped_data = zip(*[(df[col].sp_values - df[col].fill_value,
                         df[col].sp_index.to_int_index().indices)
                        for col in columns])
    data, rows = map(list, zipped_data)
    cols = [np.ones_like(a)*i for (i,a) in enumerate(data)]
    data_f = np.concatenate(data)
    rows_f = np.concatenate(rows)
    cols_f = np.concatenate(cols)
    arr = sparse.coo_matrix((data_f, (rows_f, cols_f)),
                            df.shape, dtype=np.float64)
    return arr.tocsr()
T.C. Proctor
  • How about using `series.to_coo()` to convert each column, and `sparse.bmat()` to join those into one matrix? – hpaulj Nov 15 '16 at 20:27
  • @hpaulj That sounds like a distinct answer - you should write it up! – T.C. Proctor Nov 15 '16 at 22:28
  • Digging further I see that the Multiindex mapping is very different from simple column vectors I had in mind. It's more like the feature matrix that `sklearn` people like. – hpaulj Nov 16 '16 at 19:24
  • 1
    It seems this works now. dataset = sparse.csr_matrix(df.to_coo()) – Simd Nov 30 '17 at 10:25

As of pandas 0.25, SparseSeries and SparseDataFrame are deprecated. DataFrames now support sparse dtypes for columns with sparse data, and the sparse methods are available through the `.sparse` accessor, so the conversion one-liner now looks like this:

sparse_matrix = scipy.sparse.csr_matrix(df.sparse.to_coo())
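A minimal, runnable sketch of this with the modern API (the column data here is made up for illustration):

```python
import pandas as pd
from scipy import sparse

# Build a DataFrame whose columns all use the Sparse dtype (pandas >= 0.25)
df = pd.DataFrame({
    "a": pd.arrays.SparseArray([0, 0, 1, 0]),
    "b": pd.arrays.SparseArray([0, 2, 0, 0]),
})

# .sparse.to_coo() builds a scipy COO matrix without densifying first
csr = sparse.csr_matrix(df.sparse.to_coo())
print(csr.shape, csr.nnz)  # (4, 2) 2
```

Note that `.sparse.to_coo()` requires every column to have a sparse dtype; a mixed frame needs to be converted first, e.g. with `df.astype(pd.SparseDtype("float", 0))`.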
Claygirl
  • A follow-up question is: How to convert categorical columns with a large number of values to `Sparse Dtypes` efficiently? `pd.get_dummies(df, sparse = True)` takes a lot of time. – learner May 18 '20 at 17:44

The answer by @Marigold does the trick, but it is slow due to accessing all elements in each column, including the zeros. Building on it, I wrote the following quick n' dirty code, which runs about 50x faster on a 1000x1000 matrix with a density of about 1%. My code also handles dense columns appropriately.

import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
# BlockIndex's import path varies across old pandas versions; in 0.20+ it was:
from pandas._libs.sparse import BlockIndex

def sparse_df_to_array(df):
    num_rows = df.shape[0]   

    data = []
    row = []
    col = []

    for i, col_name in enumerate(df.columns):
        if isinstance(df[col_name], pd.SparseSeries):
            column_index = df[col_name].sp_index
            if isinstance(column_index, BlockIndex):
                column_index = column_index.to_int_index()

            ix = column_index.indices
            data.append(df[col_name].sp_values)
            row.append(ix)
            col.append(len(df[col_name].sp_values) * [i])
        else:
            data.append(df[col_name].values)
            row.append(np.array(range(0, num_rows)))
            col.append(np.array(num_rows * [i]))

    data_f = np.concatenate(data)
    row_f = np.concatenate(row)
    col_f = np.concatenate(col)

    arr = coo_matrix((data_f, (row_f, col_f)), df.shape, dtype=np.float64)
    return arr.tocsr()
nojka_kruva

The pandas docs describe an experimental conversion to scipy sparse, `SparseSeries.to_coo`:

http://pandas-docs.github.io/pandas-docs-travis/sparse.html#interaction-with-scipy-sparse

================

Edit: this is a special function for a MultiIndex series, not a DataFrame. See the other answers for that. Note the difference in dates.

============

As of 0.20.0, there is an `sdf.to_coo()` for a SparseDataFrame and a MultiIndex `ss.to_coo()` for a SparseSeries. Since a sparse matrix is inherently 2d, it makes sense to require a MultiIndex for the (effectively) 1d series, while a DataFrame can already represent a table or 2d array.

When I first responded to this question (June 2015), this sparse dataframe/series feature was experimental.
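In today's pandas, the MultiIndex route lives on the `.sparse` accessor of a sparse-typed Series. A small sketch (the data and level choices are made up for illustration):

```python
import pandas as pd

# A 1d Series needs a MultiIndex to define the rows and columns of a 2d matrix
s = pd.Series(
    [3.0, 1.0, 2.0],
    index=pd.MultiIndex.from_tuples([(0, 0), (1, 2), (2, 1)]),
    dtype="Sparse[float64]",
)

# row_levels/column_levels pick which index levels become matrix rows/columns
coo, row_labels, col_labels = s.sparse.to_coo(row_levels=[0], column_levels=[1])
print(coo.shape)  # (3, 3)
```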

hpaulj
  • This is only for `MultiIndex`-ed `SparseSeries`, not for a DataFrame. – T.C. Proctor Jul 01 '16 at 21:08
  • As @eleanora mentioned, [this does actually work now](http://pandas-docs.github.io/pandas-docs-travis/generated/pandas.SparseDataFrame.to_coo.html#pandas.SparseDataFrame.to_coo) (as of version 0.20.0, released May 5, 2017). `sparse.csr_matrix(df.to_coo())` is the one-liner that will do the trick. Maybe you should edit your answer to make that clear? – T.C. Proctor Nov 30 '17 at 17:09
  • Maybe we should close to topic as dated? – hpaulj Nov 30 '17 at 17:31
  • Is it common to close a perfectly valid question because the answers have become dated? I didn't think that was a thing, and it seems like a bad idea in general. – T.C. Proctor Nov 30 '17 at 18:19
  • It would save me from downvotes because my answer is no longer valid. Not that it really matters. :) – hpaulj Nov 30 '17 at 18:28
  • If you think you should remove your *answer* because you're afraid of getting downvotes, you're welcome to. Personally, I would never downvote an answer just because it required an update. I really doubt you're going to get downvotes for this. – T.C. Proctor Nov 30 '17 at 18:40

Here's a solution that fills the sparse matrix column by column (it assumes you can fit at least one column in memory).

import pandas as pd
import numpy as np
from scipy.sparse import lil_matrix

def sparse_df_to_array(df):
    """Convert a sparse DataFrame to the sparse csr_matrix used by
    scikit-learn."""
    arr = lil_matrix(df.shape, dtype=np.float32)
    for i, col in enumerate(df.columns):
        ix = df[col] != 0
        arr[np.where(ix), i] = df.loc[ix, col]

    return arr.tocsr()
Marigold

Edit: this method builds a dense representation at some stage, so it doesn't solve the question.

You should be able to use the experimental .to_coo() method in pandas [1] in the following way:

coo, idx_rows, idx_cols = df.stack().to_sparse().to_coo()
csr = coo.tocsr()

Instead of taking a DataFrame (rows/columns), this method takes a Series with rows and columns in a MultiIndex (this is why you need the `.stack()` method). This MultiIndex Series needs to be a SparseSeries, and even if your input is a SparseDataFrame, `.stack()` returns a regular Series. So you need to use the `.to_sparse()` method before calling `.to_coo()`.

The Series returned by `.stack()`, even if it's not a SparseSeries, only contains the elements that are not null, so it shouldn't take more memory than the sparse version (at least with `np.nan` when the type is `np.float`).

  1. http://pandas.pydata.org/pandas-docs/stable/sparse.html#interaction-with-scipy-sparse
Marc Garcia
  • This method seems to use a huge amount of memory sadly. – Simd Nov 28 '17 at 20:15
  • You're right @eleanora, not sure how I tested it before, but it looks like internally this method has a dense internal representation of the array, so it's pointless for the question. Sorry for the wrong answer. – Marc Garcia Nov 30 '17 at 10:00
  • It seems this works now. `dataset = sparse.csr_matrix(df.to_coo())` – Simd Nov 30 '17 at 10:23