228

I need to fit RandomForestRegressor from sklearn.ensemble.

forest = ensemble.RandomForestRegressor(**RF_tuned_parameters)
model = forest.fit(train_fold, train_y)
yhat = model.predict(test_fold)

This code always worked until I did some preprocessing of the data (train_y). The warning says:

DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().

model = forest.fit(train_fold, train_y)

Previously train_y was a Series; now it's a numpy array (a column vector). If I apply train_y.ravel(), it becomes a 1d array and the warning disappears, though the prediction step then takes a very long time (actually it never finishes...).

In the docs of RandomForestRegressor I found that train_y should be defined as `y : array-like, shape = [n_samples] or [n_samples, n_outputs]`. Any idea how to solve this issue?

Klausos Klausos
  • what is `train_fold.shape` and `train_y.shape`? – Alexander Dec 08 '15 at 21:05
  • @Alexander: `train_fold`: tuple (749904,24)... `train_y.ravel()`: tuple (749904,) – Klausos Klausos Dec 08 '15 at 21:16
  • Looks fine. Have you tried training on 100 rows of the data to ensure it works properly (since you said it never finished)? Also, have you examined the contents of your `train_y` data to ensure preprocessing didn't corrupt it? – Alexander Dec 08 '15 at 21:22
  • Print `RF_tuned_parameters` for us please. – Imanol Luengo Dec 08 '15 at 21:25
  • @imaluengo: {'n_estimators': 40, 'max_features': 0.8, 'n_jobs': 2, 'verbose': True, 'min_samples_split': 6, 'random_state': 123} – Klausos Klausos Dec 08 '15 at 21:29
  • @Alexander doesn't need to be. The documentation says that a float is treated as a percentage. However @KlausosKlausos, can you try with `max_features='sqrt'`, just in case it's a bug? Also, which version of sklearn do you have? – Imanol Luengo Dec 08 '15 at 21:35
  • @Alexander: Everything works with 100 entries. But with the whole set it lasts forever... – Klausos Klausos Dec 08 '15 at 21:45
  • OK. Use %timeit to check for performance difference using 10k rows using both your old method and your new method (using the same RF_tuned_parameters). You can even use `old_Y_series.values.ravel()` to get it in the same format as what you are currently using. Is the transformation of the Y values itself responsible for the extra computational time? – Alexander Dec 08 '15 at 21:49
  • did the data type change from the original series? you could put it back as a series (using `pd.Series(train_y)` and try again?) Also what was your preprocessing of `y`? Finally 750K rows is a very large data set, i would expect this to take a long time to train, possibly on the order of hours. – maxymoo Dec 08 '15 at 21:51
  • @maxymoo: Surprisingly it takes 7 minutes to train, BUT prediction takes a very long time. Actually I've been waiting for a few hours and didn't get a result. It's strange, because fitting should take longer than predicting. – Klausos Klausos Dec 08 '15 at 21:57
  • @Alexander: I eliminated max_features and now the prediction runs in a moment. Hmm, but it's very strange why this happened. – Klausos Klausos Dec 08 '15 at 22:24
  • Are any of the solutions correct? If so, can you please mark one as such? – Danny Bullis Feb 04 '22 at 07:37

9 Answers

382

Change this line:

model = forest.fit(train_fold, train_y)

to:

model = forest.fit(train_fold, train_y.values.ravel())

Explanation:

.values will give the values in a numpy array (shape: (n, 1))

.ravel() will flatten that array to shape (n,)
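A quick sketch of the shape change (the `target` column and its values here are illustrative, not from the question):

```python
import pandas as pd

# A single-column DataFrame, as train_y might look after preprocessing
train_y = pd.DataFrame({"target": [1.0, 2.0, 3.0]})

print(train_y.values.shape)          # (3, 1) -- a column vector
print(train_y.values.ravel().shape)  # (3,)   -- the 1d shape sklearn expects
```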

Linda MacPhee-Cobb
25

I also encountered this situation when I was trying to train a KNN classifier, but the warning went away after I changed:

knn.fit(X_train, y_train)

to

knn.fit(X_train, np.ravel(y_train, order='C'))

Ahead of this line I used import numpy as np.

Simon Leung
  • When using the `.ravel()` approach my column vector was converted to a row vector rather than an array, but this fix worked for me. – kabdulla Oct 30 '18 at 10:45
22

I had the same problem. The labels were in column format while a 1d array was expected. Use np.ravel():

knn.score(training_set, np.ravel(training_labels))

Hope this solves it.

Pramesh Bajracharya
14

Use the code below:

model = forest.fit(train_fold, train_y.ravel())

If you still get an error similar to the one below:

Unknown label type: %r" % y

use this code:

y = train_y.ravel()
train_y = np.array(y).astype(int)
model = forest.fit(train_fold, train_y)
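As a toy illustration of what that conversion does (the label values here are made up):

```python
import numpy as np

train_y = np.array([[0.0], [1.0], [1.0]])  # float labels as a column vector
y = train_y.ravel()                        # flatten to shape (3,)
train_y = np.array(y).astype(int)          # cast to integer class labels
print(train_y)        # [0 1 1]
print(train_y.shape)  # (3,)
```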
Soumyaansh
4
Y = y.values[:, 0]

Here Y is the formatted train_y and y is the original train_y.
AlSub
  • Please add a few lines to explain your answer, just posting the code does not do any good to any of the readers. Thanks. – Costa Oct 06 '21 at 15:45
3

Another way of doing this is to use reshape, which is equivalent to ravel here:

model = forest.fit(train_fold, train_y.values.reshape(-1,))
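For a column vector, reshape(-1,) and ravel() give the same 1d result (a quick numpy check with illustrative data):

```python
import numpy as np

col = np.arange(5).reshape(-1, 1)  # shape (5, 1), a column vector
flat = col.reshape(-1,)            # shape (5,)
assert np.array_equal(flat, col.ravel())
print(flat.shape)  # (5,)
```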
sushmit
2

With neuraxle, you can easily solve this:

p = Pipeline([
   # expected outputs shape: (n, 1)
   OutputTransformerWrapper(NumpyRavel()), 
   # expected outputs shape: (n, )
   RandomForestRegressor(**RF_tuned_parameters)
])

p, outputs = p.fit_transform(data_inputs, expected_outputs)

Neuraxle is a sklearn-like framework for hyperparameter tuning and AutoML in deep learning projects!

AlexB
1
format_train_y = []
for n in train_y:
    format_train_y.append(n[0])

This builds a flat list by taking the first element of each row of the column vector.
Bibby Wang
    While this code may solve the question, [including an explanation](//meta.stackexchange.com/q/114762) of how and why this solves the problem would really help to improve the quality of your post, and probably result in more up-votes. Remember that you are answering the question for readers in the future, not just the person asking now. Please [edit] your answer to add explanations and give an indication of what limitations and assumptions apply. – Dharman Apr 04 '20 at 11:47
1

TL;DR
use

y = np.squeeze(y)

instead of

y = y.ravel()

While ravel() may be a valid way to achieve the desired result in this particular case, I would recommend using numpy.squeeze() instead.
The problem is that if the shape of your y (a numpy array) is e.g. (100, 2), then y.ravel() will concatenate the two columns along the first axis, resulting in shape (200,). That is probably not what you want when dealing with independent variables that have to be regarded on their own.
numpy.squeeze(), on the other hand, will only trim redundant dimensions (i.e. those of size 1). So if your numpy array's shape is (100, 1), the result is an array of shape (100,), whereas an array of shape (100, 2) is left unchanged, since none of its dimensions has size 1.
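A quick numpy demonstration of the difference (array shapes chosen to match the example above):

```python
import numpy as np

y_single = np.zeros((100, 1))  # one output column
y_multi = np.zeros((100, 2))   # two independent output columns

print(np.squeeze(y_single).shape)  # (100,)   -- size-1 axis removed
print(y_single.ravel().shape)      # (100,)   -- same result here

print(np.squeeze(y_multi).shape)   # (100, 2) -- unchanged: no size-1 axis
print(y_multi.ravel().shape)       # (200,)   -- columns concatenated!
```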

Marcel H.