
I have this piece of Python code, taken from SoloLearn:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

scores = []
kf = KFold(n_splits=5, shuffle=True)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LogisticRegression()
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))
print(scores)
print(np.mean(scores))

My question then is: do I need to create a new model in every split? Why don't we just create one LogisticRegression before the loop?

I would move it outside the loop to save computation time, but since it has been presented this way I thought there might be a reason.

Matte

3 Answers


Great question! The answer is: you don't have to create the model each time. Your intuition is correct. Feel free to move model = LogisticRegression() to the top, outside the loop, and re-run to confirm.

The model object that exists after model.fit(X_train, y_train) each time through the loop will be the same either way.
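A quick check bears this out. The sketch below uses synthetic data from make_classification as a stand-in for the X and y in the question (an assumption, since the original data isn't shown), and compares the two variants fold by fold:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for the X, y in the question
X, y = make_classification(n_samples=200, random_state=0)

# Fixed random_state so both loops see identical splits
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Variant 1: a fresh model created inside the loop
scores_new = []
for train_index, test_index in kf.split(X):
    model = LogisticRegression()
    model.fit(X[train_index], y[train_index])
    scores_new.append(model.score(X[test_index], y[test_index]))

# Variant 2: one model created once, re-fit each iteration
model = LogisticRegression()
scores_reused = []
for train_index, test_index in kf.split(X):
    model.fit(X[train_index], y[train_index])
    scores_reused.append(model.score(X[test_index], y[test_index]))

# sklearn's fit() retrains from scratch, so the per-fold scores match
print(scores_new == scores_reused)  # True
```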

Max Power
  • While this is OK, I would not recommend moving `model = LogisticRegression()` outside of the loop. The code will work, but it is not intuitive which variable actually is your model. I suggest combining the two lines into `model = LogisticRegression().fit(X_train, y_train)`. This is recommended in sklearn and is faster than calling model.fit() on a model that has already been trained – Moss Sep 09 '20 at 20:56
  • Interesting. I strongly disagree about instantiating the model object outside the kfold loop being "not intuitive," but different strokes I guess. Do you have a source you can point me to on it being "recommended in sklearn" to instantiate a new model object each time before calling `model.fit()` if `model.fit()` is called repeatedly? – Max Power Sep 09 '20 at 21:19
  • I just found this https://github.com/keras-team/keras/issues/4446, is it the same question? It looks like it, and it says that repeating model.fit() incrementally trains the model, so it wouldn't be a good thing for KFold. – Matte Sep 09 '20 at 22:11
  • It is recommended in sklearn to instantiate a new model object each time, although it is not necessary, to avoid confusion with other libraries. In TensorFlow/Keras, calling model.fit() followed by a second model.fit() **continues** training. For example, `model.fit(train_images, train_labels, epochs=10)` followed by `model.fit(train_images, train_labels, epochs=8)` is the same as training 18 epochs. See related - https://stackoverflow.com/questions/62120508/python-tensorflow-running-model-fit-multiple-times-without-reinstantiating-the?noredirect=1&lq=1 – Moss Sep 09 '20 at 22:13
  • Matte - that's for Keras, not sklearn, which I assumed you're using based on the API you're using and that your model is Logistic Regression. – Max Power Sep 10 '20 at 03:29
  • Moss - Got you. Yes in sklearn a fit to update a model's weights (not retrain from scratch) is done with `partial_fit` (see e.g. https://scikit-learn.org/stable/auto_examples/cluster/plot_dict_face_patches.html?highlight=partial%20fit). I hadn't thought of the potential for confusion vs how neural network libraries would use `.fit()` to update weights, but that makes sense. – Max Power Sep 10 '20 at 03:30
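The distinction drawn in this thread can be sketched in code. The example below uses SGDClassifier rather than LogisticRegression, since it supports both fit() and partial_fit() (an illustrative choice, not from the original discussion), and synthetic data as a stand-in:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=200, random_state=0)

# fit() always retrains from scratch: with a fixed random_state,
# a second call reproduces the first rather than adding training
clf = SGDClassifier(random_state=0)
clf.fit(X, y)
coef_after_first = clf.coef_.copy()
clf.fit(X, y)
print(np.allclose(clf.coef_, coef_after_first))  # True

# partial_fit() instead updates the existing weights incrementally
clf2 = SGDClassifier(random_state=0)
clf2.partial_fit(X, y, classes=np.unique(y))  # classes required on first call
coef_one_pass = clf2.coef_.copy()
clf2.partial_fit(X, y)  # continues from the previous state
# the weights typically move here, unlike the repeated fit() above
```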

The short answer is yes.

The reason is that this is k-fold cross-validation.

Simply put, this means that you are training k models, evaluating each one, and averaging the results together.

We do this in cases where we do not have separate datasets for training and testing. Cross-validation splits the training data into k subgroups (we call these folds); in each iteration, one fold serves as the test set and the remaining folds form the training set. We train a model on the training portion of the first split and test it on the held-out fold, then repeat for all folds with a new model each time, and now we have proper predictions for the full dataset.

Here is a link to a detailed description of cross validation - https://machinelearningmastery.com/k-fold-cross-validation/
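This procedure is also what sklearn's cross_val_score helper does in a single call: it clones the estimator for each fold, trains k models, and returns the k test scores. A sketch with synthetic data standing in for the question's X and y (an assumption on my part):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# One fresh (cloned) model per fold, exactly like the loop in the question
scores = cross_val_score(LogisticRegression(), X, y, cv=kf)
print(scores)         # five per-fold accuracies
print(scores.mean())  # their average
```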

Moss
  • If you are looking to change this line, I would suggest combining the two lines into `model = LogisticRegression().fit(X_train, y_train)` – Moss Sep 09 '20 at 20:58

KFold is used for cross-validation, which means training a model and evaluating it.

Here is an example of documentation on the subject.

When doing that you obviously need two datasets: a training AND an evaluation dataset.

When using KFold, you split your training set into a number of folds (5 in your example) and train five models, using a different fifth as the validation set each time and the rest of the dataset as the training set.

Now, to answer the question: you need a new model each time because you are training five models, and each of the five iterations has a different training set as well as a different validation set. In scikit-learn you must create a new one because when you run model.fit() the model is trained on a specific dataset, so its fitted state cannot carry over to another training dataset.

If you want to create it only once, you can make copies of it, for example:

from sklearn.base import clone

model = LogisticRegression(**params)

def parse_kfold(model):
    kf = KFold(n_splits=5, shuffle=True)
    for train_index, test_index in kf.split(X):
        # clone() returns an unfitted copy with the same parameters;
        # a plain `model_fold = model` would only create another
        # reference to the same object, not a copy
        model_fold = clone(model)
        ...
Catalina Chircu