Scikit-learn cross val score: too many indices for array

Question

I have the following code

 import numpy as np
 from sklearn.ensemble import ExtraTreesClassifier
 from sklearn.cross_validation import cross_val_score
 #split the dataset for train and test
 combnum['is_train'] = np.random.uniform(0, 1, len(combnum)) <= .75
 train, test = combnum[combnum['is_train']==True], combnum[combnum['is_train']==False]

 et = ExtraTreesClassifier(n_estimators=200, max_depth=None, min_samples_split=10, random_state=0)

 labels = train[list(label_columns)].values
 tlabels = test[list(label_columns)].values

 features = train[list(columns)].values
 tfeatures = test[list(columns)].values

 et_score = cross_val_score(et, features, labels, n_jobs=-1)
 print("{0} -> ET: {1}".format(label_columns, et_score))

Checking the shape of the arrays:

 features.shape
 Out[19]:(43069, 34)

And

 labels.shape
 Out[20]:(43069, 1)

and I'm getting:

IndexError: too many indices for array

and this relevant part of the traceback:

---> 22 et_score = cross_val_score(et, features, labels, n_jobs=-1)

I'm creating the data from pandas DataFrames. I searched here and saw some references to possible errors with this approach, but I can't figure out how to correct it. Here is what the data arrays look like:

features

Out[21]:
array([[ 0.,  1.,  1., ...,  0.,  0.,  1.],
   [ 0.,  1.,  1., ...,  0.,  0.,  1.],
   [ 1.,  1.,  1., ...,  0.,  0.,  1.],
   ..., 
   [ 0.,  0.,  1., ...,  0.,  0.,  1.],
   [ 0.,  0.,  1., ...,  0.,  0.,  1.],
   [ 0.,  0.,  1., ...,  0.,  0.,  1.]])

labels

Out[22]:
array([[1],
   [1],
   [1],
   ..., 
   [1],
   [1],
   [1]])

Please post the full traceback. Which version of scikit-learn are you on? And can you try to pass ``labels.ravel()`` instead? — Andreas Mueller, Aug 13 '15 at 18:01
labels.ravel() did it! I was just reading about another error that suggested using different code. I'm on scikit-learn 0.17 dev0 — dartdog, Aug 13 '15 at 18:04
@AndreasMueller Thank you so much for the quick response! Might help a few others if you can make an answer.. — dartdog, Aug 13 '15 at 18:33
I don't think this is a good error. Can you please open an issue in the issue tracker? — Andreas Mueller, Aug 13 '15 at 21:21
MMM where should I post the error to? Pandas or scikit-learn? — dartdog, Aug 13 '15 at 22:40
Scikit-learn. with a minimum reproducing example that doesn't rely on pandas with generated numpy arrays. — Andreas Mueller, Aug 14 '15 at 14:59
Given that a pandas dataframe doesn't have a `ravel` method, is `numpy.squeeze`-ing whatever you want y values to be a more robust solution? (@AndreasMueller) — James, Nov 10 '15 at 12:30
@James What happens to a dataframe if you squeeze it? Is ``DataFrame.values`` not always 2d? — Andreas Mueller, Nov 12 '15 at 17:34
@AndreasMueller values of a DataFrame with one column are n x 1 (so yes). `np.squeeze` will shrink to shape `(n,)` if the dataframe shape is n x 1 and won't modify shape otherwise (which will throw this error). `np.ravel` will flatten any numpy array, so if your shape is wrong to begin with, you'll end up with something really long that also can't be aligned I think? — James, Nov 12 '15 at 18:15
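
To make the squeeze-versus-ravel point from the comments concrete, here is a minimal sketch on made-up NumPy arrays. The array names `column` and `wide` are illustrative only and do not come from the question.

```python
import numpy as np

# Hypothetical arrays: one shaped like a single-column target, one with two columns.
column = np.arange(6).reshape(6, 1)   # shape (6, 1)
wide = np.arange(12).reshape(6, 2)    # shape (6, 2)

# squeeze only drops axes of length 1, so a two-column array is left unchanged.
squeezed_column = np.squeeze(column)
squeezed_wide = np.squeeze(wide)

# ravel always flattens to 1-D, so a two-column array becomes twice as long.
raveled_column = np.ravel(column)
raveled_wide = np.ravel(wide)

print(squeezed_column.shape, squeezed_wide.shape)
print(raveled_column.shape, raveled_wide.shape)
```

With a target that really is n x 1, both calls give the same 1-D result. They differ only when the target has more than one column, which is the case the comments warn about.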

score 36 · Accepted Answer · edited Aug 23 '16 at 17:40

When we do cross-validation in scikit-learn, the process requires labels of shape (R,) rather than (R, 1). Although the two hold the same values, they are indexed differently. So in your case, just add:

n_rows, n_cols = labels.shape
labels = labels.reshape(n_rows,)

before passing it to the cross-validation function.
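
As a rough end-to-end sketch of this fix, the snippet below builds synthetic data (an assumption, not the asker's dataset), flattens an (R, 1) label array to (R,), and passes it to `cross_val_score`. It imports from `sklearn.model_selection`, which is where `cross_val_score` lives in current scikit-learn releases. The `n_estimators=20` and `cv=3` settings are arbitrary choices to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for the question's arrays (assumed data).
rng = np.random.RandomState(0)
features = rng.rand(200, 5)
labels = (features[:, 0] > 0.5).astype(int).reshape(-1, 1)  # shape (200, 1)

# Flatten the column vector to shape (200,) before cross-validation.
labels_1d = labels.reshape(-1)

et = ExtraTreesClassifier(n_estimators=20, random_state=0)
et_score = cross_val_score(et, features, labels_1d, cv=3)

print(labels.shape, labels_1d.shape)
print(et_score)
```

`labels.reshape(-1)` and `labels.ravel()` produce the same 1-D shape here. Either works as long as the label array has exactly one column.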

answered Aug 23 '16 at 16:58 by YE LIANG HARRY · edited Aug 23 '16 at 17:40 by Yasel

score 15 · Answer 2 · answered Mar 08 '16 at 17:11

It seems to be fixable if you specify the target labels as a single data column from Pandas. If the target has multiple columns, I get a similar error. For example try:

labels = train['Y']

answered Mar 08 '16 at 17:11 by Bud

This answer was more simple and direct. Thank you. – Rajesh Mappu Feb 04 '18 at 03:30

score 2 · Answer 3 · edited Sep 23 '17 at 03:23

Adding .ravel() to the y/labels variable passed into the function solved this problem with KNN as well.

answered Sep 23 '17 at 02:59 by MSalty · edited Sep 23 '17 at 03:23 by Stephen Rauch

score 0 · Answer 4 · edited Oct 17 '17 at 09:25

Try this as the target:

y = df['Survived']

Instead, I had used

y = df[['Survived']]

which made the target y a DataFrame; it seems a Series works fine.
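
The difference between the two selections can be checked directly on a small made-up frame. The `Survived` column name follows this answer; the `Age` column and all values are invented for illustration.

```python
import pandas as pd

# Tiny illustrative frame (values are made up).
df = pd.DataFrame({"Survived": [0, 1, 1, 0], "Age": [22, 38, 26, 35]})

y_series = df["Survived"]     # single brackets select a Series, shape (4,)
y_frame = df[["Survived"]]    # double brackets select a one-column DataFrame, shape (4, 1)

# If only the DataFrame form is available, its values can be flattened to 1-D.
y_flat = y_frame.values.ravel()

print(type(y_series).__name__, y_series.shape)
print(type(y_frame).__name__, y_frame.shape)
print(y_flat.shape)
```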

answered Oct 17 '17 at 09:07 by Yang Zhao · edited Oct 17 '17 at 09:25 by Marian Nasry

score 0 · Answer 5 · answered Feb 03 '18 at 12:18

You might need to play with the dimensions a bit, e.g.

et_score = cross_val_score(et, features, labels, n_jobs=-1)[:,n]

or

 et_score = cross_val_score(et, features, labels, n_jobs=-1)[n,:]

n being the dimension.

answered Feb 03 '18 at 12:18 by Gursel Karacor
