3

I want to understand how to work with sparse matrices. I have this code to generate a multi-label classification dataset as a sparse matrix.

from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(sparse = True, n_labels = 20, return_indicator = 'sparse', allow_unlabeled = False)

This code gives me X in the following format:

<100x20 sparse matrix of type '<class 'numpy.float64'>' 
with 1797 stored elements in Compressed Sparse Row format>

y:

<100x5 sparse matrix of type '<class 'numpy.int64'>'
with 471 stored elements in Compressed Sparse Row format>

Now I need to split X and y into X_train, X_test, y_train and y_test so that the training set constitutes 70% of the data. How can I do this?

This is what I tried:

X_train, X_test, y_train, y_test = train_test_split(X.toarray(), y, stratify=y, test_size=0.3)

and got the error message:

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Fluxy
  • The error message itself suggests a solution. Run `train_test_split()` after converting the sparse matrices to dense arrays by calling `X.toarray()` and `y.toarray()`. – Chinni Sep 09 '19 at 20:26
  • @Chinni: Thanks! Can you post the answer? – Fluxy Sep 09 '19 at 20:31

2 Answers

1

The error message itself suggests the solution: you need to convert both X and y to dense arrays.

Do the following:

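from sklearn.model_selection import train_test_split

# Convert both sparse matrices to dense NumPy arrays
X = X.toarray()
y = y.toarray()

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3)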
Chinni
  • Could you please elaborate what is the meaning of `stratify=y`? – Fluxy Sep 09 '19 at 20:35
  • Also, for me only this statement works: `X_train, X_test, y_train, y_test = train_test_split(X.toarray(), y, test_size=0.3)` – Fluxy Sep 09 '19 at 20:36
  • Could you please check which version of sklearn you are using? - https://stackoverflow.com/questions/34842405/parameter-stratify-from-method-train-test-split-scikit-learn – Chinni Sep 09 '19 at 20:37
  • I use the version '0.23.0' – Fluxy Sep 09 '19 at 20:42
  • Is there any error that you get when you use `stratify=y`? If so, could you please post it or update your question? – Chinni Sep 09 '19 at 20:43
  • And I hope you are importing `train_test_split` correctly - https://stackoverflow.com/a/46716676/3476748 – Chinni Sep 09 '19 at 20:47
  • Yes, sure, I import it as follows `from sklearn.model_selection import train_test_split`. If I run your code, I get this error `TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.` – Fluxy Sep 09 '19 at 20:51
  • Interesting. Could you please check `type(X)` and `type(y)` after executing `toarray()` on them? – Chinni Sep 09 '19 at 20:57
1

The problem is caused by stratify=y. The documentation for train_test_split describes the two parameters differently:

*arrays :

  • Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.

stratify :

  • array-like (does not mention sparse matrices)
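The distinction between the two parameters can be seen in a minimal sketch. It assumes a scikit-learn version in which `return_indicator='sparse'` is accepted, and the `random_state` values are additions made here only for reproducibility. With no `stratify` argument, the sparse X and y are passed straight to `train_test_split` without calling `toarray()`:

```python
from scipy.sparse import issparse
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split

# random_state is an assumption added for reproducibility; it is not in the question
X, y = make_multilabel_classification(
    sparse=True, n_labels=20, return_indicator="sparse",
    allow_unlabeled=False, random_state=0,
)

# No stratify argument: the sparse matrices are accepted as they are
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Row counts of the two parts, and whether the outputs are still sparse
print(X_train.shape, X_test.shape)
print(issparse(X_train), issparse(y_train))
```

The split keeps the inputs sparse, which avoids materializing dense copies when the matrices are large.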

Unfortunately, this dataset does not work with stratify even when y is cast to a dense array, because stratification needs at least two samples for every distinct combination of labels:

>>> X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y.toarray(), test_size=0.3)
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
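One way to check whether your own data will hit this error is to count how often each distinct row of y occurs. The following is a sketch, not part of the original answer; the `random_state` value and the variable names `combos` and `counts` are chosen here for illustration:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

# random_state is an assumption added for reproducibility
X, y = make_multilabel_classification(
    sparse=True, n_labels=20, return_indicator="sparse",
    allow_unlabeled=False, random_state=0,
)

# Each distinct row of the dense indicator matrix is one label combination
combos, counts = np.unique(y.toarray(), axis=0, return_counts=True)

print("distinct label combinations:", len(combos))
print("combinations seen only once:", int((counts == 1).sum()))
```

If any combination is seen only once, stratifying on the full y will probably raise the ValueError shown above. Leaving out stratify, as in the call the asker reports working in the comments, is the simplest way around it.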
Matt Eding