I am reading a CSV file and trying to build a linear regression model with df['LSTAT'] as the feature (x) and df['MEDV'] as the target (y). However, the error "ValueError: Found arrays with inconsistent numbers of samples: [ 1 343]" keeps popping up at the model-fitting stage.
I have tried reshaping the data (I am not sure I did it correctly) and converting the pandas objects into NumPy arrays and lists. None of these worked. I still don't quite understand the issue after reading this post: sklearn: Found arrays with inconsistent numbers of samples when calling LinearRegression.fit(). The script and the error messages are below.
Could any guru offer some solutions with detailed explanations? Thank you!
import scipy.stats as stats
import pylab
import numpy as np
import matplotlib.pyplot as plt
import pylab as pl
import sklearn
from sklearn.cross_validation import train_test_split
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import pandas as pd
df=pd.read_csv("input.csv")
X_train1, X_test1, y_train1, y_test1 = train_test_split(df['LSTAT'],df['MEDV'],test_size=0.3,random_state=1)
lin=LinearRegression()
################## This line: " lin_train=lin.fit(X_train1,y_train1)" causes the trouble.
lin_train=lin.fit(X_train1,y_train1)
################## The following lines just print and plot the results after fitting the linear regression
# The coefficients
print('Coefficients: \n', lin.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % np.mean((lin.predict(X_test1) - y_test1) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % lin.score(X_test1, y_test1))
# Plot outputs
plt.scatter(X_test1, y_test1, color='black')
plt.plot(X_test1, lin.predict(X_test1), color='blue',linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
Here is the warning & error message:
Warning (from warnings module):
File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 386
DeprecationWarning)
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
Traceback (most recent call last):
File "C:/Users/Pin-Chih/Google Drive/Real_estate_projects/test.py", line 36, in <module>
lin_train=lin.fit(X_train1,y_train1)
File "C:\Python27\Lib\site-packages\sklearn\linear_model\base.py", line 427, in fit
y_numeric=True, multi_output=True)
File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 520, in check_X_y
check_consistent_length(X, y)
File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 176, in check_consistent_length
"%s" % str(uniques))
ValueError: Found arrays with inconsistent numbers of samples: [ 1 343]
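To make sure I am reading the warning correctly, here is a minimal sketch of what the two suggested reshapes do to a 1-D array. The values are made up (a hypothetical stand-in for my 343 training rows, not the real LSTAT column). My guess, which I have not confirmed in the sklearn source, is that the 1-D input is being promoted to a single row, which would explain the "[ 1 343]" in the error.

```python
import numpy as np

# Hypothetical stand-in for the 343 training values of df['LSTAT'].
x = np.linspace(1.0, 35.0, 343)

print(x.shape)                 # (343,)   1-D: no separate samples/features axes
print(x.reshape(-1, 1).shape)  # (343, 1) 343 samples, 1 feature
print(x.reshape(1, -1).shape)  # (1, 343) 1 sample, 343 features

# Promoting a 1-D array to 2-D the naive way gives one row, not one column.
print(np.atleast_2d(x).shape)  # (1, 343)
```

If that guess is right, X would be seen as 1 sample while y has 343, which matches the two numbers in the message.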
If I print out the "X_train1":
X_train1:
61 26.82
294 12.86
39 29.29
458 4.85
412 8.05
Name: LSTAT, dtype: float64
If I print out the "y_train1":
y_train1:
61 13.4
294 22.5
39 11.8
458 35.1
412 29.0
Name: MEDV, dtype: float64