
I have a dataset of around 400 million rows, and some of the features are categorical. I apply pandas.get_dummies to one-hot encode them, and I have to use the sparse=True option because the data is fairly large (otherwise exceptions/errors are raised).

result = result.drop(["time", "Ds"], axis=1)
result_encoded = pd.get_dummies(result, columns=["id1", "id2", "id3", "id4"], sparse=True)

This gives me a sparse dataframe (result_encoded) with 9000 features. After that, I want to run a ridge regression on the data. At first, I tried to feed dataframe.values into sklearn,

train_data = result_encoded.drop(['count'], axis=1).values

but this raised the error "array is too big". Then I fed the sparse dataframe to sklearn directly, and a similar error message appeared.

train_data = result_encoded.drop(['count'], axis=1)

Do I need to consider a different method or preparation of the data so it can be used by sklearn directly?

user3162587
  • (Reposting an old comment) Rather than describing your data in words, you can write a short, runnable piece of code. If you make it so people can copy, paste and run the code in your question without undefined variables and other problems, then a) you will make your desired output crystal clear and b) you are more likely to get good answers. [Here's an example.](http://stackoverflow.com/q/31714641/553404) – YXD May 22 '16 at 14:30
  • Check the `pandas` documentation for sparse frames. I believe there's a method, possibly experimental, to produce a scipy sparse matrix. I've answered a few questions on this, but don't have the links at hand. – hpaulj May 22 '16 at 14:56
  • http://stackoverflow.com/a/34185851/901925 – hpaulj May 22 '16 at 14:59
  • I would consider using scikit-learn's OneHotEncoder instead. The resulting sparse matrix can be used directly with scikit-learn's predictors. – Alexander Bauer May 22 '16 at 15:29
  • Possible duplicate of [Pandas sparse dataFrame to sparse matrix, without generating a dense matrix in memory](http://stackoverflow.com/questions/31084942/pandas-sparse-dataframe-to-sparse-matrix-without-generating-a-dense-matrix-in-m) – Marc Garcia Jan 03 '17 at 16:58

1 Answer


You should be able to use the experimental .to_coo() method in pandas [1] in the following way:

result_encoded, idx_rows, idx_cols = result_encoded.stack().to_sparse().to_coo()
result_encoded = result_encoded.tocsr()

Instead of taking a DataFrame (rows / columns), this method takes a Series with the rows and columns in a MultiIndex, which is why you need the .stack() method. This Series needs to be a SparseSeries, and even if your input is a SparseDataFrame, .stack() returns a regular Series, so you need to call .to_sparse() before calling .to_coo().

The Series returned by .stack(), even though it is not a SparseSeries, only contains the non-null elements, so it shouldn't take more memory than the sparse version (at least when the missing value is np.nan and the dtype is float).
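For readers on newer pandas: .to_sparse() and SparseSeries were removed in pandas 1.0, so the same stack-then-convert idea would go through a sparse dtype and the .sparse accessor instead. The following is only a minimal sketch under that assumption; the toy frame and the variable names are hypothetical and not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; NaN marks the empty cells.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan],
        "b": [np.nan, 2.0, np.nan],
        "c": [3.0, np.nan, 4.0],
    },
    index=["r0", "r1", "r2"],
)

# .stack() moves the column labels into a second index level,
# giving a Series indexed by a (row, column) MultiIndex.
stacked = df.stack()

# Convert to a sparse Series whose fill value is NaN, so that only
# the non-null entries are stored.
sparse_series = stacked.astype(pd.SparseDtype("float64", np.nan))

# Level 0 of the MultiIndex becomes the matrix rows, level 1 the columns.
A, row_labels, col_labels = sparse_series.sparse.to_coo()
print(A.shape, A.nnz)
```

The returned row_labels and col_labels record which index label each matrix row and column corresponds to, which is useful for mapping coefficients back to feature names later.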

In general, you'll want the more efficient CSR or CSC format for your scipy sparse matrix rather than the simpler COO, so you can convert it with the .tocsr() method.
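As a rough end-to-end illustration, the sketch below one-hot encodes a small frame, converts it to CSR, and fits a ridge regression without ever building a dense array. It assumes a recent pandas (where DataFrame.sparse.to_coo() is available and requires every column to be sparse with a fill value of 0) and scikit-learn; the toy data, the column subset, and alpha=1.0 are hypothetical stand-ins for the question's data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Hypothetical toy data standing in for the question's frame.
rng = np.random.default_rng(0)
n = 1000
result = pd.DataFrame(
    {
        "id1": rng.integers(0, 20, n).astype(str),
        "id2": rng.integers(0, 15, n).astype(str),
        "count": rng.poisson(5.0, n).astype(float),
    }
)

# Separate the target first so that only sparse dummy columns remain;
# the .sparse accessor rejects frames that contain dense columns.
y = result["count"].to_numpy()
features = pd.get_dummies(
    result.drop(columns=["count"]),
    columns=["id1", "id2"],
    sparse=True,
    dtype=float,
)

# COO is convenient to build, CSR is efficient for row-wise arithmetic.
X = features.sparse.to_coo().tocsr()

# Ridge accepts scipy sparse input directly.
model = Ridge(alpha=1.0)
model.fit(X, y)
print(X.shape, X.nnz)
```

Each row stores exactly one nonzero per encoded column, so memory use grows with the number of rows rather than with rows times features.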

  1. http://pandas.pydata.org/pandas-docs/stable/sparse.html#interaction-with-scipy-sparse
Marc Garcia