
I am trying to run cross_val_score in sklearn with a split that I supply myself. The sklearn documentation gives the following example:

>>> import numpy as np
>>> from sklearn.model_selection import PredefinedSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 1, 1])
>>> test_fold = [0, 1, -1, 1]
>>> ps = PredefinedSplit(test_fold)
>>> ps.get_n_splits()
2
>>> print(ps)       
PredefinedSplit(test_fold=array([ 0,  1, -1,  1]))
>>> for train_index, test_index in ps.split():
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
TRAIN: [1 2 3] TEST: [0]
TRAIN: [0 2] TEST: [1 3]

I am having trouble understanding this example. In particular,

  1. why does ps.get_n_splits() return 2 in this example; and
  2. why does the test_fold array lead to the splits shown at the bottom of the code snippet?

Additionally, I would like to ask, in this case, if I pass the ps object to the cross_val_score function in sklearn, will it perform cross validation with these two splits?

clog14

2 Answers


The number of splits is the number of unique values in test_fold, excluding -1.

Using this example with test_fold = [0, 1, -1, 1]:

  • The 0th entry is 0, which indicates that the test set is [0] and the remaining indices 1, 2, 3 form the training set:

    ---> TRAIN: [1 2 3] TEST: [0]

  • The 1st and 3rd entries are 1, which indicates that the test set is [1, 3] and the remaining indices 0, 2 form the training set:

    ---> TRAIN: [0 2] TEST: [1 3]

  • The 2nd entry is -1, which indicates that this sample is never placed in any test set; it stays in the training set of every split.

  • Note that the integer values themselves make no difference, only which entries share a value and which are -1; test_fold = [5, 0, -1, 0] produces the same two splits (possibly in a different order), as the sketch below shows.

Finally, for a typical k-fold split, one can use test_fold = [0, 1, 2, 3], which puts each sample in its own test fold.
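
Here is a minimal sketch (it only assumes numpy and scikit-learn are importable) that verifies these points: both test_fold arrays yield the same two splits, possibly in a different order, and [0, 1, 2, 3] yields one test sample per split:

import numpy as np
from sklearn.model_selection import PredefinedSplit

# The fold labels differ, but the grouping of indices is the same.
for test_fold in ([0, 1, -1, 1], [5, 0, -1, 0]):
    ps = PredefinedSplit(test_fold)
    print(test_fold, "->", ps.get_n_splits(), "splits")
    for train_index, test_index in ps.split():
        print("  TRAIN:", train_index, "TEST:", test_index)

# One fold per sample: every sample is its own test set.
ps = PredefinedSplit([0, 1, 2, 3])
print([test.tolist() for _, test in ps.split()])   # [[0], [1], [2], [3]]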

Haibin Chen

  • Can you format your answer better to make it clearer? – m13op22 Jun 27 '19 at 15:22
  • This should be added to the Sklearn documentation. – jds Jun 22 '21 at 16:57
  • I don't understand this explanation either. Why can't the scikit-learn documentation just show which rows of X actually go into the train data and which rows go into the test data? This should be a pretty straightforward concept, but explaining it with indices instead of actual hard examples is very unhelpful. – NaiveBae Jun 17 '22 at 22:55

From the class PredefinedSplit(BaseCrossValidator):

self.unique_folds = np.unique(self.test_fold)
self.unique_folds = self.unique_folds[self.unique_folds != -1]

It first determines the unique integers in the 1D array test_fold.

In the case of your example, the unique values are [-1, 0, 1].

It then removes -1 from that array of unique values, giving [0, 1]. This implies a 2-fold cross validation (len(self.unique_folds) == 2). The distribution of indices is subject to 3 constraints in this case.

  1. Constraint 1: Since index 2 is excluded from the test set as it is set to -1, we are left with indices [0, 1, 3] to be distributed amongst the 2 arrays/lists that represent the 2-fold cross validation.

  2. Constraint 2: Indices [1, 3] have to be together because they both have a value of 1 in the variable self.test_fold.

  3. Constraint 3: Index 0 cannot be in the same test set as [1, 3], because it has a different value in self.test_fold.

Each index is assigned to exactly one of the two sets per split (technically this is also a constraint). Therefore, if 0 goes to TRAIN, then 2 must also be in TRAIN since it is set to -1; hence we have TRAIN = [0, 2] and the rest goes into TEST = [1, 3] due to the constraints above.

Due to the constraints, the only other possibility is TRAIN = [1, 3] + [2] and TEST = [0].
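
Here is a rough sketch in plain NumPy (mirroring the two lines quoted above, not the library's exact code) of how each remaining fold value turns into a train/test split:

import numpy as np

test_fold = np.array([0, 1, -1, 1])
unique_folds = np.unique(test_fold)
unique_folds = unique_folds[unique_folds != -1]   # array([0, 1]) -> 2 splits

indices = np.arange(len(test_fold))
for fold in unique_folds:
    test_index = indices[test_fold == fold]    # samples labelled with this fold value
    train_index = indices[test_fold != fold]   # everything else, including the -1 sample
    print("TRAIN:", train_index, "TEST:", test_index)

This prints the same two splits shown in the question: TRAIN: [1 2 3] TEST: [0] and TRAIN: [0 2] TEST: [1 3].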

wontleave