228

I need to fit RandomForestRegressor from sklearn.ensemble.

forest = ensemble.RandomForestRegressor(**RF_tuned_parameters)
model = forest.fit(train_fold, train_y)
yhat = model.predict(test_fold)

This code always worked until I did some preprocessing of the data (train_y). The warning says:

DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().

model = forest.fit(train_fold, train_y)

Previously train_y was a Series; now it's a numpy array (a column vector). If I apply train_y.ravel(), it becomes a 1d array and the warning disappears, though the prediction step then takes a very long time (actually it never finishes...).

In the docs of RandomForestRegressor I found that train_y should be defined as `y : array-like, shape = [n_samples] or [n_samples, n_outputs]`. Any idea how to solve this issue?

Klausos Klausos
  • what is `train_fold.shape` and `train_y.shape`? – Alexander Dec 08 '15 at 21:05
  • @Alexander: `train_fold`: tuple (749904,24)... `train_y.ravel()`: tuple (749904,) – Klausos Klausos Dec 08 '15 at 21:16
  • Looks fine. Have you tried training on 100 rows of the data to ensure it works properly (since you said it never finished)? Also, have you examined the contents of your `train_y` data to ensure preprocessing didn't corrupt it? – Alexander Dec 08 '15 at 21:22
  • Print `RF_tuned_parameters` for us please. – Imanol Luengo Dec 08 '15 at 21:25
  • @imaluengo: {'n_estimators': 40, 'max_features': 0.8, 'n_jobs': 2, 'verbose': True, 'min_samples_split': 6, 'random_state': 123} – Klausos Klausos Dec 08 '15 at 21:29
  • @Alexander doesn't need to be. The documentation says that a float is treated as a percentage. However @KlausosKlausos, can you try with `max_features='sqrt'`, just in case it's a bug? Also, which version of sklearn do you have? – Imanol Luengo Dec 08 '15 at 21:35
  • @Alexander: Everything works with 100 entries. But with the whole set it lasts forever... – Klausos Klausos Dec 08 '15 at 21:45
  • OK. Use %timeit to check for performance difference using 10k rows using both your old method and your new method (using the same RF_tuned_parameters). You can even use `old_Y_series.values.ravel()` to get it in the same format as what you are currently using. Is the transformation of the Y values itself responsible for the extra computational time? – Alexander Dec 08 '15 at 21:49
  • did the data type change from the original series? you could put it back as a series (using `pd.Series(train_y)` and try again?) Also what was your preprocessing of `y`? Finally 750K rows is a very large data set, i would expect this to take a long time to train, possibly on the order of hours. – maxymoo Dec 08 '15 at 21:51
  • @maxymoo: Surprisingly it takes 7 minutes to train, BUT prediction takes a very long time. Actually I've been waiting for a few hours and didn't get a result. It's strange, because fitting should take longer than predicting. – Klausos Klausos Dec 08 '15 at 21:57
  • @Alexander: I eliminated max_features and now the prediction runs in a moment. Hmm, but it's very strange why this happened. – Klausos Klausos Dec 08 '15 at 22:24
  • Are any of the solutions correct? If so, can you please mark one as such? – Danny Bullis Feb 04 '22 at 07:37

9 Answers

382

Change this line:

model = forest.fit(train_fold, train_y)

to:

model = forest.fit(train_fold, train_y.values.ravel())

Explanation:

.values will give the values in a numpy array (shape: (n, 1))

.ravel() will flatten that array to shape (n,)
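A quick sketch of the shape change (the `target` column and its values here are illustrative, not from the question):

```python
import pandas as pd

# A single-column DataFrame, as train_y might look after preprocessing
train_y = pd.DataFrame({"target": [1.0, 2.0, 3.0]})

print(train_y.values.shape)          # (3, 1) -- a column vector
print(train_y.values.ravel().shape)  # (3,)   -- the 1d shape sklearn expects
```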

Linda MacPhee-Cobb
25

I also encountered this situation when I was trying to train a KNN classifier, but the warning went away after I changed:

knn.fit(X_train, y_train)

to

knn.fit(X_train, np.ravel(y_train, order='C'))

Ahead of this line I used import numpy as np.

Simon Leung
  • When using the `.ravel()` approach my column vector was converted to a row vector rather than an array, but this fix worked for me. – kabdulla Oct 30 '18 at 10:45
22

I had the same problem. The labels were in column format while a 1d array was expected. Use np.ravel():

knn.score(training_set, np.ravel(training_labels))

Hope this solves it.

Pramesh Bajracharya
14

Use the code below:

model = forest.fit(train_fold, train_y.ravel())

If you still get an error similar to the one below:

Unknown label type: %r" % y

use this code:

y = train_y.ravel()
train_y = np.array(y).astype(int)
model = forest.fit(train_fold, train_y)
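As a toy illustration of what that conversion does (the label values here are made up):

```python
import numpy as np

train_y = np.array([[0.0], [1.0], [1.0]])  # float labels as a column vector
y = train_y.ravel()                        # flatten to shape (3,)
train_y = np.array(y).astype(int)          # cast to integer class labels
print(train_y)        # [0 1 1]
print(train_y.shape)  # (3,)
```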
Soumyaansh
4
Y = y.values[:, 0]

Here Y is the formatted train_y and y is the original train_y.
AlSub
  • Please add a few lines to explain your answer, just posting the code does not do any good to any of the readers. Thanks. – Costa Oct 06 '21 at 15:45
3

Another way of doing this is to use reshape, which is equivalent to ravel here:

model = forest.fit(train_fold, train_y.values.reshape(-1,))
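For a column vector, reshape(-1,) and ravel() give the same 1d result (a quick numpy check with illustrative data):

```python
import numpy as np

col = np.arange(5).reshape(-1, 1)  # shape (5, 1), a column vector
flat = col.reshape(-1,)            # shape (5,)
assert np.array_equal(flat, col.ravel())
print(flat.shape)  # (5,)
```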
sushmit
2

With neuraxle, you can easily solve this:

p = Pipeline([
   # expected outputs shape: (n, 1)
   OutputTransformerWrapper(NumpyRavel()), 
   # expected outputs shape: (n, )
   RandomForestRegressor(**RF_tuned_parameters)
])

p, outputs = p.fit_transform(data_inputs, expected_outputs)

Neuraxle is a sklearn-like framework for hyperparameter tuning and AutoML in deep learning projects!

AlexB
1
format_train_y = []
for n in train_y:
    format_train_y.append(n[0])

This builds a flat list by taking the first element of each row of the column vector.
Bibby Wang
    While this code may solve the question, [including an explanation](//meta.stackexchange.com/q/114762) of how and why this solves the problem would really help to improve the quality of your post, and probably result in more up-votes. Remember that you are answering the question for readers in the future, not just the person asking now. Please [edit] your answer to add explanations and give an indication of what limitations and assumptions apply. – Dharman Apr 04 '20 at 11:47
1

TL;DR
use

y = np.squeeze(y)

instead of

y = y.ravel()

While ravel() may be a valid way to achieve the desired result in this particular case, I would recommend using numpy.squeeze() instead.
The problem is that if the shape of your y (a numpy array) is e.g. (100, 2), then y.ravel() will concatenate the two columns along the first axis, resulting in shape (200,). That is probably not what you want when dealing with independent variables that have to be regarded on their own.
numpy.squeeze(), on the other hand, will only trim redundant dimensions (i.e. those of size 1). So if your numpy array's shape is (100, 1), the result is an array of shape (100,), whereas an array of shape (100, 2) is left unchanged, since none of its dimensions has size 1.
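A quick numpy demonstration of the difference (array shapes chosen to match the example above):

```python
import numpy as np

y_single = np.zeros((100, 1))  # one output column
y_multi = np.zeros((100, 2))   # two independent output columns

print(np.squeeze(y_single).shape)  # (100,)   -- size-1 axis removed
print(y_single.ravel().shape)      # (100,)   -- same result here

print(np.squeeze(y_multi).shape)   # (100, 2) -- unchanged: no size-1 axis
print(y_multi.ravel().shape)       # (200,)   -- columns concatenated!
```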

Marcel H.