
Sort of taking inspiration from here.

My problem

So I have a dataset with 3 features and n observations, and I also have n responses. Basically I want to see whether a linear regression model is a good fit to this data or not.

From the question linked above, people seem to use R^2 for this purpose, but I am not sure I understand it.

Can I just fit the model and then calculate the Mean Squared Error? Should I use train/test split?

All of these seem to revolve around prediction, but here I just want to see how good the model is at fitting the data.

For instance, this is my idea:

import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]  # single feature (this was undefined before)

# my idea: fit on all the data and compute the in-sample MSE
regr = linear_model.LinearRegression()
regr.fit(diabetes_X, diabetes.target)
print(np.mean((regr.predict(diabetes_X) - diabetes.target)**2))
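
I guess the R^2 that people mention could also be read off the same in-sample fit (this just continues the snippet above, to show what I mean):

from sklearn.metrics import r2_score

# in-sample R^2 of the fit above; regr.score(diabetes_X, diabetes.target) gives the same value
print('R^2: %.3f' % r2_score(diabetes.target, regr.predict(diabetes_X)))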

However, I often see people doing things like this:

diabetes_X = diabetes.data[:, np.newaxis, 2]
# split X
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# split y
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# instantiate and fit
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
# MSE but based on the prediction on test 
print('Mean squared error: %.2f' % np.mean((regr.predict(diabetes_X_test)-diabetes_y_test)**2))

In the first case we get 3890.4565854612724, while in the second we get 2548.07. Which one is correct?

IMPORTANT: I WANT THIS TO WORK IN MULTIPLE REGRESSION, THIS IS JUST A MWE!

Euler_Salter

2 Answers


Can I just fit the model and then calculate the Mean Squared Error? Should I use train/test split?

No, you will run the risk of overfitting the model. That's the reason the data is split into train and test sets (or even a separate validation set): so that the model doesn't just 'memorize' what it sees, but learns to perform well on new, unseen samples.
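
If it helps, here is a rough sketch of that idea on the full diabetes feature matrix, using scikit-learn's train_test_split and mean_squared_error (the 80/20 split and the random_state are arbitrary choices for illustration):

from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target  # all 10 features, so this works for multiple regression too

# hold out 20% of the data that the model never sees during fitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

# training error (how well it 'memorized') vs. test error (how well it generalizes)
print('Train MSE: %.2f' % mean_squared_error(y_train, regr.predict(X_train)))
print('Test MSE:  %.2f' % mean_squared_error(y_test, regr.predict(X_test)))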

ShreyasG

It's always preferable to evaluate the performance of the model on a new set of data that wasn't observed during training. If you're going to optimize hyper-parameters or choose among several models, an additional validation set is the right choice.

However, sometimes data is scarce and entirely removing part of it from the training process is prohibitive. In these cases, I strongly recommend using more efficient ways of validating your models, such as k-fold cross-validation (see KFold and StratifiedKFold in scikit-learn).
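
A minimal sketch of the k-fold idea on the diabetes data (5 folds and the neg_mean_squared_error scoring are arbitrary choices for illustration):

from sklearn import datasets, linear_model
from sklearn.model_selection import KFold, cross_val_score

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

# 5-fold CV: every observation is used for validation exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(linear_model.LinearRegression(), X, y,
                         scoring='neg_mean_squared_error', cv=cv)

print('MSE per fold:', -scores)
print('Mean MSE: %.2f' % -scores.mean())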

Finally, it is a good idea to ensure that your partitions behave similarly in the training and test sets. I recommend sampling the data uniformly over the target space, so that you train/validate your model with the same distribution of target values.
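
One way to approximate that, sketched below with arbitrary quartile bins (not something scikit-learn prescribes), is to bin the continuous target and stratify the split on those bins:

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

# bin the continuous target at its quartiles and stratify on the bins,
# so train and test get similar distributions of target values
y_bins = np.digitize(y, np.quantile(y, [0.25, 0.5, 0.75]))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y_bins, random_state=0)

print('train target mean: %.1f / test target mean: %.1f' % (y_train.mean(), y_test.mean()))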

kelwinfc