I'm sort of taking inspiration from here.
My problem
So I have a dataset with 3 features and n observations, and I also have the n responses. Basically, I want to see whether a linear regression model is a good fit to this data or not.
In the question linked above, people use R^2 for this purpose, but I am not sure I understand it. Can I just fit the model and then calculate the mean squared error (MSE)? Should I use a train/test split?
All of these approaches seem to revolve around prediction, but here I just want to see how well the model fits the data it was trained on.
For instance, this is my idea:
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]  # single feature, just for the MWE

# my idea: fit on all the data, then compute the in-sample MSE
regr = linear_model.LinearRegression()
regr.fit(diabetes_X, diabetes.target)
print(np.mean((regr.predict(diabetes_X) - diabetes.target) ** 2))
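Since the linked question uses R^2, I could apparently get the in-sample version of that the same way; as far as I understand, sklearn's regr.score(X, y) returns the R^2 on whatever data you pass it, so passing the training data back in gives the in-sample R^2:
# in-sample R^2 of the fit above (score computes R^2 on the data you pass)
print('In-sample R^2: %.3f' % regr.score(diabetes_X, diabetes.target))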
However, I often see people doing things like this instead:
diabetes_X = diabetes.data[:, np.newaxis, 2]
# split X
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# split y
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# instantiate and fit
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
# MSE but based on the prediction on test
print('Mean squared error: %.2f' % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
In the first case we get 3890.4565854612724, while in the second case we get 2548.07. Which one is correct?
IMPORTANT: I want this to work for multiple regression; the single feature above is just a minimal working example (MWE)!
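To make the multiple-regression version concrete, here is a sketch of what I mean with all the features, using sklearn's train_test_split and mean_squared_error (the test_size and random_state values are arbitrary choices of mine):
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes()

# use every feature this time, not just column 2
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=0)

regr = LinearRegression()
regr.fit(X_train, y_train)

# in-sample (train) MSE vs. out-of-sample (test) MSE
print('Train MSE: %.2f' % mean_squared_error(y_train, regr.predict(X_train)))
print('Test MSE: %.2f' % mean_squared_error(y_test, regr.predict(X_test)))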