
I'm following a kernel on Kaggle and came across this code.

# Validation function
import numpy as np
from sklearn.model_selection import KFold, cross_val_score

n_folds = 5

def rmsle_cv(model):
    kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(train.values)
    rmse= np.sqrt(-cross_val_score(model, train.values, y_train, scoring="neg_mean_squared_error", cv = kf))
    return rmse

I understand the purpose and use of KFold, and that it is passed to 'cross_val_score'. What I don't get is why 'get_n_splits' is used. As far as I am aware, it returns the number of iterations used for cross-validation, i.e. it returns a value of 5 in this case. So surely, for this line:

rmse= np.sqrt(-cross_val_score(model, train.values, y_train, scoring="neg_mean_squared_error", cv = kf))

cv = 5? This doesn't make any sense to me. Why is it even necessary to use get_n_splits if it returns an integer? I thought KFold returns a class whereas get_n_splits returns an integer.

Can anyone clarify this for me?

desertnaut
apang

1 Answer


I thought KFold returns a class whereas get_n_splits returns an integer.

Sure, KFold is a class, and one of its methods is get_n_splits, which returns an integer. The kf variable you show,

kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(train.values)

is not a KFold object; it is the return value of the KFold().get_n_splits() method, and it is indeed an integer. In fact, if you check the documentation, get_n_splits() does not even need any arguments: they are ignored and exist only for compatibility with other classes and methods.
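To make the difference concrete, here is a minimal sketch comparing the two ways of setting cv. The data (X, y) and the LinearRegression model are synthetic stand-ins assumed for illustration, since the kernel's train and y_train are not shown here, and rmse_cv is a hypothetical helper modeled on the question's function. Because the chained call collapses to the integer 5, cross_val_score builds its own default splitter, so for a regressor like this one the shuffle=True and random_state=42 settings do not reach it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-ins for the kernel's train.values and y_train (assumed data).
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

n_folds = 5

# What the kernel does: the chained call yields a plain int, not a splitter.
n_splits = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(X)

# Keeping the splitter object itself preserves shuffle and random_state.
kf = KFold(n_folds, shuffle=True, random_state=42)


def rmse_cv(model, cv):
    """Per-fold RMSE for the given cv setting (hypothetical helper)."""
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=cv)
    return np.sqrt(-scores)


rmse_int = rmse_cv(LinearRegression(), n_splits)           # cv=5, default folds
rmse_default = rmse_cv(LinearRegression(), KFold(n_folds))  # explicit unshuffled KFold
rmse_obj = rmse_cv(LinearRegression(), kf)                  # shuffled folds, seed 42

print(type(n_splits).__name__, n_splits)
print(np.allclose(rmse_int, rmse_default))
```

If the shuffled, seeded folds are what is wanted, passing cv=kf with the KFold object itself (without calling get_n_splits) is the way to get them.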

As for the utility of the get_n_splits method: it is never a bad idea to be able to query such objects to get back their parameter settings. Imagine a situation where you have several different KFold objects and you need to get their respective number of CV folds programmatically during program flow.
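As a small illustration of that scenario, the sketch below keeps several KFold objects in a dictionary and reads their fold counts back with get_n_splits(). The dictionary keys and the particular fold counts are made up for the example.

```python
from sklearn.model_selection import KFold

# Hypothetical collection of splitters with different settings.
splitters = {
    "coarse": KFold(n_splits=3),
    "default": KFold(n_splits=5),
    "fine": KFold(n_splits=10, shuffle=True, random_state=0),
}

# get_n_splits() needs no arguments for KFold; it reports the configured fold count.
fold_counts = {name: cv.get_n_splits() for name, cv in splitters.items()}

print(fold_counts)  # {'coarse': 3, 'default': 5, 'fine': 10}
```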

desertnaut