sklearn: Found arrays with inconsistent numbers of samples when calling LinearRegression.fit()

Question

Just trying to do a simple linear regression but I'm baffled by this error for:

regr = LinearRegression()
regr.fit(df2.iloc[1:1000, 5].values, df2.iloc[1:1000, 2].values)

which produces:

ValueError: Found arrays with inconsistent numbers of samples: [  1 999]

These selections must have the same dimensions, and they should be numpy arrays, so what am I missing?

I did reshape(-1,1) and it worked – Its Fragilis Oct 04 '21 at 10:15

score 122 · Accepted Answer · edited Sep 03 '19 at 06:54

It looks like sklearn expects X to have the shape (number of rows, number of columns). A 1-D array of shape (number of rows,), such as (999,), does not work. Use numpy.reshape() to change the shape of the array to (999, 1), e.g.:

data = data.reshape((999, 1))

That worked in my case.
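To make the reshape fix concrete, here is a minimal sketch using synthetic data in place of the question's df2 columns (the data and the names x, y and X are assumptions for illustration). The `1` in `[  1 999]` suggests the 1-D X was read as a single sample with 999 features; newer scikit-learn versions report the same problem with a different message ("Expected 2D array, got 1D array instead").

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for the two columns in the question (assumed data).
rng = np.random.default_rng(0)
x = rng.normal(size=999)   # shape (999,): one feature, stored as a 1-D array
y = 3.0 * x + 1.0          # shape (999,): noiseless target

# -1 lets NumPy infer the number of rows, so the length is not hard-coded.
X = x.reshape(-1, 1)       # shape (999, 1): 999 samples, 1 feature

regr = LinearRegression()
regr.fit(X, y)             # X must be 2-D; y can stay 1-D
```

Using reshape(-1, 1) instead of reshape((999, 1)) keeps the code working if the number of rows changes.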

answered Jun 13 '15 at 12:00 by Yul · edited Sep 03 '19 at 06:54 by JMA

My data shape is (10L,). How do I convert it to (10L,1)? When I use data=data.reshape(len(data),1), the resulting shape is (10L,1L), not (10L,1) – user3841581 Nov 17 '15 at 15:18
@user3841581 please refer to this [post](http://stackoverflow.com/q/40440997/4896087). – George Liu Nov 05 '16 at 17:12
@Boern Thanks for the comment. I also discovered that X_train should be of size (N,1) but y_train should be of size (N,) not (N,1), otherwise it does not work, at least not for me. – Vahid S. Bokharaie Aug 17 '17 at 14:04
data.reshape(...) may show a deprecation warning if data is a Series object. Use data.values.reshape(...) instead – NightFurry Oct 26 '17 at 19:09
data = data.reshape(-1,1) – Itachi Apr 09 '19 at 12:43

score 25 · Answer 2 · answered Sep 18 '16 at 03:04

It looks like you are using a pandas DataFrame (judging by the name df2).

You could also do the following:

regr = LinearRegression()
regr.fit(df2.iloc[1:1000, 5].to_frame(), df2.iloc[1:1000, 2].to_frame())

NOTE: I have removed .values because it converts the pandas Series to a numpy.ndarray, and numpy.ndarray has no to_frame() method.
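As a small sketch of what to_frame() changes, the example below fits the same model with a 2-D target and with a 1-D target. The column names and values are made up for illustration. Both calls run, but the shape of the fitted coef_ follows the shape of y, which may matter for code that reads the coefficients afterwards.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data; "feature" and "target" are illustrative column names.
df2 = pd.DataFrame({"feature": [0.0, 1.0, 2.0, 3.0],
                    "target": [1.0, 3.0, 5.0, 7.0]})

X = df2["feature"].to_frame()                                  # shape (4, 1)

fit_2d = LinearRegression().fit(X, df2["target"].to_frame())   # y has shape (4, 1)
fit_1d = LinearRegression().fit(X, df2["target"])              # y has shape (4,)

print(fit_2d.coef_.shape)  # (1, 1)
print(fit_1d.coef_.shape)  # (1,)
```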

score 14 · Answer 3 · answered Dec 19 '17 at 11:23

Seen on the Udacity deep learning foundation course:

df = pd.read_csv('my.csv')
...
regr = LinearRegression()
regr.fit(df[['column x']], df[['column y']])
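The double brackets are what make this work: they select a one-column DataFrame rather than a Series. A quick shape check on a small made-up frame (the values are assumptions) shows the difference.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up.
df = pd.DataFrame({"column x": [1.0, 2.0, 3.0],
                   "column y": [2.0, 4.0, 6.0]})

as_series = df["column x"]     # single brackets -> Series, 1-D
as_frame = df[["column x"]]    # double brackets -> one-column DataFrame, 2-D

print(as_series.shape)  # (3,)
print(as_frame.shape)   # (3, 1)
```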

answered Dec 19 '17 at 11:23 by xilef

Thanks! This is really the simplest and easiest to understand! – Juan A. Navarro May 02 '18 at 09:04
Actually, the Y parameter is expected as a (length, ) shape. Thanks! – Michael_Zhang Nov 23 '19 at 02:45

score 6 · Answer 4 · answered May 24 '16 at 16:32

I think the "X" argument of regr.fit needs to be a 2-D matrix, so selecting the column with a list ([5] instead of 5) should work:

regr = LinearRegression()
regr.fit(df2.iloc[1:1000, [5]].values, df2.iloc[1:1000, 2].values)
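The same idea can be checked positionally. In this sketch (a small made-up frame, not the question's df2), a scalar column index gives a 1-D array, while a one-element list keeps the column axis.

```python
import numpy as np
import pandas as pd

# Small illustrative frame: 4 rows, 3 columns of made-up numbers.
df2 = pd.DataFrame(np.arange(12.0).reshape(4, 3))

col_1d = df2.iloc[1:4, 2].values     # scalar column index -> 1-D array
col_2d = df2.iloc[1:4, [2]].values   # list column index   -> 2-D array

print(col_1d.shape)  # (3,)
print(col_2d.shape)  # (3, 1)
```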

answered May 24 '16 at 16:32 by Anish

score 4 · Answer 5 · edited Nov 11 '16 at 03:49

I encountered this error because I converted my data to a 1-D np.array. I fixed the problem by converting my data to an np.matrix instead and taking the transpose, which gives a column of shape (N, 1).

Raises ValueError: regr.fit(np.array(x_list), np.array(y_list))

Works: regr.fit(np.transpose(np.matrix(x_list)), np.transpose(np.matrix(y_list)))
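NumPy's documentation no longer recommends np.matrix, and newer scikit-learn releases may reject it, so a plain ndarray reshaped into a column is probably the safer route. A minimal sketch, with made-up x_list and y_list values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up lists standing in for the answer's x_list and y_list.
x_list = [1.0, 2.0, 3.0, 4.0]
y_list = [2.0, 4.0, 6.0, 8.0]

X = np.asarray(x_list).reshape(-1, 1)   # column of shape (4, 1), plain ndarray
y = np.asarray(y_list)                  # target can stay 1-D, shape (4,)

regr = LinearRegression().fit(X, y)
```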

answered Nov 11 '16 at 03:31 by Josh Grinberg · edited Nov 11 '16 at 03:49

score 3 · Answer 6 · edited Jul 24 '16 at 22:52

fit() expects X to be a feature matrix.

Try putting your feature names in a list and selecting them from the DataFrame like this:

features = ['TV', 'Radio', 'Newspaper']
X = data[features]

answered Jul 24 '16 at 21:22 by Yuanxu Xu · edited Jul 24 '16 at 22:52 by The SE I loved is dead

score 1 · Answer 7 · answered Mar 13 '19 at 09:40

I faced a similar problem. In my case, the number of rows in X was not equal to the number of rows in y.

That is, the number of entries in the feature columns was not equal to the number of entries in the target variable, because I had dropped some rows from the feature columns only.
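One way to avoid that kind of mismatch is to drop rows from the whole DataFrame before splitting it into X and y. The sketch below uses made-up data and column names to show how the counts drift apart and how they stay aligned.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing feature value.
df = pd.DataFrame({
    "feature": [1.0, np.nan, 3.0, 4.0],
    "target": [2.0, 4.0, 6.0, 8.0],
})

# Dropping rows from the features only leaves X and y with different lengths.
X_bad = df[["feature"]].dropna()
y_bad = df["target"]
print(len(X_bad), len(y_bad))  # 3 4

# Dropping rows from the whole frame first keeps X and y aligned.
clean = df.dropna(subset=["feature", "target"])
X = clean[["feature"]]
y = clean["target"]
print(len(X), len(y))  # 3 3
```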

score 0 · Answer 8 · edited Jul 11 '17 at 01:48

To analyze two arrays (array1 and array2) they need to meet the following two requirements:

1) They need to be a numpy.ndarray

Check with

type(array1)
# and
type(array2)

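If that is not the case for at least one of them, convert it with numpy.asarray() (calling numpy.ndarray() directly would treat the argument as a shape, not as data):

array1 = numpy.asarray(array1)
# or
array2 = numpy.asarray(array2)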

2) The dimensions need to be as follows:

array1.shape #shall give (N, 1)
array2.shape #shall give (N,)

N is the number of items that are in the array. To provide array1 with the right number of axes perform:

array1 = array1[:, numpy.newaxis]
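The two requirements can be wrapped in one small helper. This is only a sketch: as_feature_column is a hypothetical name, not part of NumPy or scikit-learn, and the input values are made up.

```python
import numpy as np

def as_feature_column(values):
    """Return values as an ndarray with at least a column axis.

    Hypothetical helper for illustration; not a NumPy or scikit-learn API.
    """
    arr = np.asarray(values)       # lists, Series, etc. -> ndarray
    if arr.ndim == 1:
        arr = arr[:, np.newaxis]   # (N,) -> (N, 1)
    return arr

array1 = as_feature_column([1, 2, 3])   # features: shape (3, 1)
array2 = np.asarray([10, 20, 30])       # target:   shape (3,)

print(array1.shape, array2.shape)  # (3, 1) (3,)
```

An input that is already 2-D passes through unchanged, so the helper can be applied without checking the shape first.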

score 0 · Answer 9 · answered Jul 30 '17 at 22:13

As mentioned above, the X argument must be a 2-D matrix or NumPy array. So you could probably use a column slice instead of a single column index:

df2.iloc[1:1000, 5:some_last_index].values

A slice keeps the column axis, so the DataFrame is converted to a 2-D array and you won't need to reshape it.

score 0 · Answer 10 · answered May 21 '20 at 06:47

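You may have made a mistake when unpacking the train/test split.

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, Y, test_size=test_size)

The code above is correct: train_test_split returns the train and test parts of X first, then those of Y.

The following unpacking order is wrong, because it assigns the test part of X to y_train and the train part of Y to x_test:

x_train, y_train, x_test, y_test = sklearn.model_selection.train_test_split(X, Y, test_size=test_size)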
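A quick way to catch a swapped unpacking is to print the shapes right after the split. The sketch below uses made-up arrays and assumes test_size=0.2 and random_state=0 purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each, and 10 targets.
X = np.arange(20).reshape(10, 2)
Y = np.arange(10)

# Return order is: X train, X test, then Y train, Y test.
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

print(x_train.shape, y_train.shape)  # (8, 2) (8,)
print(x_test.shape, y_test.shape)    # (2, 2) (2,)
```

If the names are unpacked in the wrong order, x_train and y_train end up with different numbers of rows, which is exactly what fit() complains about.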
