Finding Root Mean Squared Error with Pandas dataframe

Question

I am trying to calculate the root mean squared error from a pandas DataFrame. I have checked previous questions on Stack Overflow, such as "Root mean square error in python", and the scikit-learn documentation (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html). I was hoping someone could shed some light on what I am doing wrong. Here is the dataset. Here is my code.

import pandas as pd
import numpy as np
sales = pd.read_csv("home_data.csv")

from sklearn.cross_validation import train_test_split
train_data,test_data = train_test_split(sales,train_size=0.8)

from sklearn.linear_model import LinearRegression
X = train_data[['sqft_living']]
y=train_data.price
#build the linear regression object
lm=LinearRegression()
# Train the model using the training sets
lm.fit(X,y)
#print the y intercept
print(lm.intercept_)
#print the coefficients
print(lm.coef_)

lm.predict(300)



from math import sqrt
from sklearn.metrics import mean_squared_error
y_true=train_data.price.loc[0:5,]
test_data=test_data[['price']].reset_index()
y_pred=test_data.price.loc[0:5,]
predicted =y_pred.as_matrix()
actual= y_true.as_matrix()
mean_squared_error(actual, predicted)

EDIT

So this is what worked for me. I had to transform the test dataset values for sqft living from row to column.

from sklearn.linear_model import LinearRegression
X = train_data[['sqft_living']]
y=train_data.price
#build the linear regression object
lm=LinearRegression()
# Train the model using the training sets
lm.fit(X,y)

New code

test_X = test_data.sqft_living.values
print(test_X)
print(np.shape(test_X))
print(len(test_X))
test_X = np.reshape(test_X, [4323, 1])
print(test_X)
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score
MSE = mean_squared_error(y_true = test_data.price.values, y_pred = lm.predict(test_X))
MSE
MSE**(0.5)
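
The hard-coded length in np.reshape(test_X, [4323, 1]) only works for this particular split. As a minimal sketch, assuming test_X is a 1-D NumPy array (the values below are made up for illustration and are not from home_data.csv), passing -1 lets NumPy infer the number of rows:

```python
import numpy as np

# Hypothetical stand-in for test_data.sqft_living.values (a 1-D array)
test_X = np.array([1180, 2570, 770, 1960, 1680])

# -1 asks NumPy to infer the row count, so no length is hard-coded
test_X_col = test_X.reshape(-1, 1)

print(test_X.shape)      # (5,)
print(test_X_col.shape)  # (5, 1)
```

The same call should work for any test-set size, since only the column count is fixed.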

train_data and test_data are not pandas DataFrames anymore; they are numpy.ndarray types. — Zero, Nov 01 '15 at 03:57
Your code is not predicting anything: you are simply splitting the data into two portions and then comparing the labels. Because the portions are different sizes, ``mean_squared_error`` cannot compare them. Could you describe what you expect this code to do? — jakevdp, Nov 01 '15 at 04:00
@jakevdp I edited my code a bit. I created a linear regression model based on the training data, and I wanted to see how close the test data comes to predicting the training data. — Zaynaib Giwa, Nov 01 '15 at 05:04

score 16 · Answer 1 · answered Nov 01 '15 at 05:21

You're comparing test-set labels to training-set labels. I believe that what you actually want to do is compare test-set labels to predicted test-set labels.

For example:

import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

sales = pd.read_csv("home_data.csv")
train_data, test_data = train_test_split(sales,train_size=0.8)

# Train the model
X = train_data[['sqft_living']]
y = train_data.price
lm = LinearRegression()
lm.fit(X, y)

# Predict on the test data
X_test = test_data[['sqft_living']]
y_test = test_data.price
y_pred = lm.predict(X_test)

# Compute the root-mean-square error
rms = np.sqrt(mean_squared_error(y_test, y_pred))
print(rms)
# 260435.511036

Note that scikit-learn can in general handle Pandas DataFrames and Series inputs without explicit conversion to numpy arrays. The error in the code snippet in your question has to do with the fact that the two arrays passed to mean_squared_error() are different sizes.
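
To make the size requirement concrete, here is a small sketch of the RMSE formula written with NumPy alone. The rmse helper and the sample numbers are hypothetical, introduced only for illustration; it should mirror what np.sqrt(mean_squared_error(y_test, y_pred)) computes, and it raises an error when the two inputs differ in shape:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error of two equally shaped arrays (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        # Mismatched lengths cannot be compared element by element
        raise ValueError(
            "shape mismatch: %s vs %s" % (y_true.shape, y_pred.shape)
        )
    # Square the residuals, average them, then take the square root
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up labels and predictions
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

result = rmse(actual, predicted)
print(result)  # square root of 1.5, roughly 1.2247
```

Comparing test-set labels with predictions for the same test rows guarantees that both inputs have the same length.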

Thank you! I made a slight tweak to the code that you posted: I had to transform X_test using np.reshape. Also, do you know the significance of using double brackets in pandas? I know you use them for selecting multiple rows. — Zaynaib Giwa, Nov 01 '15 at 16:23
``df[['col']]`` will return a DataFrame. ``df['col']`` will return a Series. — jakevdp, Nov 01 '15 at 16:31
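
The difference matters here because scikit-learn estimators expect a 2-D feature matrix. A small sketch with a made-up DataFrame (the column values are invented for illustration) shows the shape each form produces:

```python
import pandas as pd

# Hypothetical miniature of the housing data
df = pd.DataFrame({
    "sqft_living": [1180, 2570, 770],
    "price": [221900, 538000, 180000],
})

as_frame = df[["sqft_living"]]   # list of column names -> DataFrame (2-D)
as_series = df["sqft_living"]    # single column name -> Series (1-D)

print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
print(type(as_series).__name__, as_series.shape)  # Series (3,)
```

Selecting with double brackets therefore gives the column shape that otherwise has to be produced with np.reshape.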
