
I am reading a CSV and trying to build a linear regression model with df['LSTAT'] as the feature (x) and df['MEDV'] as the target (y). However, the error "ValueError: Found arrays with inconsistent numbers of samples: [ 1 343]" keeps popping up at the model-fitting stage.

I have tried reshaping the data (I am not sure I did it correctly) and converting the pd.DataFrame into NumPy arrays and lists. None of these worked. I still don't quite understand the issue after reading this post: sklearn: Found arrays with inconsistent numbers of samples when calling LinearRegression.fit(). The script and the error messages are below.

Could any guru offer a solution with a detailed explanation? Thank you!

import scipy.stats as stats
import pylab 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pylab as pl
import sklearn
from sklearn.cross_validation import train_test_split
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression


df=pd.read_csv("input.csv")


X_train1, X_test1, y_train1, y_test1 = train_test_split(df['LSTAT'],df['MEDV'],test_size=0.3,random_state=1)

lin=LinearRegression()

################## This line, "lin_train=lin.fit(X_train1,y_train1)", causes the trouble.

lin_train=lin.fit(X_train1,y_train1)

################## The following lines just report and plot the results after fitting the linear regression

# The coefficients
print('Coefficients: \n', lin.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % np.mean((lin.predict(X_test1) - y_test1) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % lin.score(X_test1, y_test1))

# Plot outputs
plt.scatter(X_test1, y_test1,  color='black')
plt.plot(X_test1, lin.predict(X_test1), color='blue',linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

Here is the warning & error message:

Warning (from warnings module):
  File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 386
    DeprecationWarning)
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.

Traceback (most recent call last):
  File "C:/Users/Pin-Chih/Google Drive/Real_estate_projects/test.py", line 36, in <module>
    lin_train=lin.fit(X_train1,y_train1)
  File "C:\Python27\Lib\site-packages\sklearn\linear_model\base.py", line 427, in fit
    y_numeric=True, multi_output=True)
  File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 520, in check_X_y
    check_consistent_length(X, y)
  File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 176, in check_consistent_length
    "%s" % str(uniques))
ValueError: Found arrays with inconsistent numbers of samples: [  1 343]

If I print out the "X_train1":

X_train1:  
61     26.82
294    12.86
39     29.29
458     4.85
412     8.05
Name: LSTAT, dtype: float64

If I print out the "y_train1":

y_train1:  
61     13.4
294    22.5
39     11.8
458    35.1
412    29.0
Name: MEDV, dtype: float64
Chubaka

1 Answer

Certainly not a guru, but I've had similar problems in the past. The model expects the X argument to have at least 2 dimensions, even if the second dimension is 1. The first thing I would try is to replace

lin_train=lin.fit(X_train1,y_train1)

with

lin_train=lin.fit(X_train1.values.reshape(X_train1.shape[0], 1), y_train1)

which should give you data with shape (343, 1) rather than just (343,). X_test1 needs the same reshape before it is passed to lin.predict and lin.score.
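A fuller sketch of the fix, for reference. The CSV from the question isn't available, so the data below is synthetic (the lstat and medv arrays and the numbers used to generate them are made up for illustration). It also imports train_test_split from sklearn.model_selection, which is where newer scikit-learn releases keep it; on the 0.17 install shown in the traceback, the sklearn.cross_validation import from the question is the one to use. The only point is that X is reshaped to (n_samples, 1) for fit, predict and score alike:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the question's data (assumption: input.csv is not available).
rng = np.random.RandomState(1)
lstat = rng.uniform(2.0, 38.0, size=490)                     # plays the role of df['LSTAT']
medv = 34.0 - 0.95 * lstat + rng.normal(0.0, 2.0, size=490)  # plays the role of df['MEDV']

X_train1, X_test1, y_train1, y_test1 = train_test_split(
    lstat, medv, test_size=0.3, random_state=1)

# One column per feature: (n_samples,) -> (n_samples, 1)
X_train_2d = X_train1.reshape(-1, 1)
X_test_2d = X_test1.reshape(-1, 1)

lin = LinearRegression()
lin.fit(X_train_2d, y_train1)  # y may stay 1-D

print("Coefficients:", lin.coef_)
print("Mean squared error: %.2f"
      % np.mean((lin.predict(X_test_2d) - y_test1) ** 2))
print("Variance score: %.2f" % lin.score(X_test_2d, y_test1))
```

With a pandas Series, as in the question, go through .values first (X_train1.values.reshape(-1, 1)), or select the column with double brackets, df[['LSTAT']], so that it stays a one-column DataFrame.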

mi_dominic
  • Thank you, guru mi_dominic! But I don't quite understand the meaning of the "1" in (343, 1). Is it just a dummy variable/value? Moreover, if we check the scikit-learn example here: http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html, "print diabetes_X_train" only gives us [ 0.05415152] [-0.00836158] etc., which is one-dimensional. So I guess the model can take one-dimensional data? Please let me know if I am wrong. Thanks! – Chubaka Feb 17 '16 at 09:50
  • I think of it as a 343 x 1 dimensional array rather than just a series of 343 elements, but you're right; it should be the same thing. The documentation says that X should be an array in the format [n_samples, n_features] and y should be [n_samples, n_targets], but it seems that it will assume 1 for n_targets for y if not specified, but forces you to specify n_features for X. – mi_dominic Feb 17 '16 at 10:54
  • Problem solved! Thanks for your answer! But I have to say that the scikit-learn manual is not very clear, which makes it kind of hard to understand. statsmodels and its examples are much clearer! – Chubaka Feb 18 '16 at 04:14
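
To make the meaning of the "1" in (343, 1) concrete, here is a small NumPy-only sketch (the array x is made up; it just has 343 elements like X_train1). The second number in the shape is the number of features, so (343, 1) means 343 samples with one feature each. The [  1 343] in the error message is consistent with the 1-D input being read the other way round, as one sample with 343 features, which is the ambiguity the DeprecationWarning describes. The bracketed values printed in the scikit-learn example, [ 0.05415152] [-0.00836158], are the rows of such a one-column 2-D array, not a 1-D array.

```python
import numpy as np

# Hypothetical 1-D data: 343 values of a single feature, like X_train1.
x = np.arange(343, dtype=float)

column = x.reshape(-1, 1)  # 343 samples, 1 feature
row = x.reshape(1, -1)     # 1 sample, 343 features

print(x.shape)                 # (343,)
print(column.shape)            # (343, 1)
print(row.shape)               # (1, 343)
print(np.atleast_2d(x).shape)  # (1, 343): promoting a 1-D array yields a single row

# A one-column array prints each value in its own brackets.
print(column[:2])
```

So the "1" is not a dummy value: it tells the model that each row carries exactly one feature, and it would be 2, 3, ... if more columns were used as predictors.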