
While reading documentation and procedures for machine learning techniques (both classification and regression), I came across a topic that is new to me. It seems that a recommended procedure is to split the data into three different sets before training and testing: training, validation and testing. This makes sense to me, but I am wondering how I should proceed with it. Let's say we split the data into these three sets, for example following an approach like the one I found here: Stratified Train/Validation/Test-split in scikit-learn
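
For concreteness, a minimal sketch of such a three-way stratified split might look like this, assuming a feature matrix X and label vector y are already loaded (the 70/15/15 proportions and variable names are only illustrative):

# minimal sketch of a stratified train/validation/test split
# (assumes X and y already exist; proportions are illustrative)
from sklearn.model_selection import train_test_split

# first split off the test set (15% of the data), stratifying on the labels
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)

# then split the remainder into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15 / 0.85,
    stratify=y_trainval, random_state=42)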

Taking this into account, let's say we want to build a classifier using LogisticRegression (any classifier, actually). As far as I understand, the procedure should be something like this, right?

# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

Now if we want to make predictions we could use:

# make class predictions for the testing set
y_pred_class = logreg.predict(X_test)

Then, when one has to estimate the accuracy of the model, a common approach is:

# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

And here is where my question comes in. Should the validation set that was split off earlier be used for calculating accuracy, or should it be used for validating somehow, for instance with k-fold cross-validation instead? For example:

# Perform 10-fold cross validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(logreg, df, y, cv=10)

Any hint about the procedure with these three sets would be really appreciated. What I was thinking is that the validation set should be used together with the training set, but I don't really know in which way.

Steve Jade
  • This is a much longer discussion that doesn't belong here, since it's about the fundamentals of how to do ML (or how to structure an experiment). The short version: you use cross-validation to score predictions, for the purpose of _model selection_. `cross_val_score` automatically slices pieces out of `train` to validate on. Once you've compared models and found the best one, and **only once**: evaluate the model on the held-out test set. – Arya McCarthy Feb 26 '18 at 00:23
  • What exactly do you mean by only evaluating the model on the held-out test set once I've compared models? Do you mean I should avoid making predictions as I mentioned above, or avoid calculating accuracy? Thanks. – Steve Jade Feb 26 '18 at 00:35
  • I mean that you should go through the whole process of tuning your parameters and hyperparameters *without ever touching the test set*. You can use `cross_val_score` or just use `cross_val_predict` and score the predictions. Once you have your One True Model, then you can use it on the test set with the prediction and accuracy functions you mentioned. – Arya McCarthy Feb 26 '18 at 01:28
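A minimal sketch of the workflow described in these comments, assuming the X_train/y_train and X_test/y_test splits from above (the candidate models and their parameter settings are only illustrative), might be:

# model selection with cross-validation on the training data only,
# then a single evaluation on the held-out test set
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn import metrics

# candidate models to compare (illustrative choices only)
candidates = {
    'logreg_C1': LogisticRegression(C=1.0),
    'logreg_C10': LogisticRegression(C=10.0),
}

# score each candidate with 10-fold CV on the training data
cv_means = {name: cross_val_score(model, X_train, y_train, cv=10).mean()
            for name, model in candidates.items()}
best_name = max(cv_means, key=cv_means.get)

# only once: refit the chosen model on all training data and
# evaluate it on the held-out test set
best_model = candidates[best_name].fit(X_train, y_train)
print(metrics.accuracy_score(y_test, best_model.predict(X_test)))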

0 Answers