Python fold indices for an outer cross-validation

Asked Jan 11 '17 at 09:37

Active Jan 11 '17 at 09:51

I have a pandas dataframe df containing data from 2 classes. I would like to have randomly generated indices for a stratified K-fold cross-validation.

What I do at the moment is:

df_folds = np.array_split(df, 5)
for k in range(5):
    # We use 'list' to copy, in order to 'pop' later on
    df_train = list(df_folds)
    df_test  = df_train.pop(k)
    df_train = pd.concat(df_train)

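However, this is not a stratified 5-fold cross-validation, as it just splits the dataframe into 5 consecutive chunks without looking at the class labels.

I also tried StratifiedKFold from sklearn, but the code below fails with the error shown after it: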

from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=3)
skf.get_n_splits(df)

print(skf)  

for train_index, test_index in skf.split(df):
   print("TRAIN:", train_index, "TEST:", test_index)

TypeError: split() takes at least 3 arguments (2 given)

edited Jan 11 '17 at 09:51

Tagc


asked Jan 11 '17 at 09:37

gabboshow


sklearn already provides this: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html. Have you tried it? – EdChum Jan 11 '17 at 09:38
I couldn't make it work with a pandas dataframe – gabboshow Jan 11 '17 at 09:39
Please show the erroneous code in your question as sklearn is compatible with pandas dataframes – EdChum Jan 11 '17 at 09:40
@EdChum please see the code that I tried – gabboshow Jan 11 '17 at 09:43
also see http://stackoverflow.com/q/38250710/2336654 – piRSquared Jan 11 '17 at 09:44
The error is clear: the docs show that split() takes 2 args. You need to pass the columns that contain the data, and then the column that contains the class label: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html – EdChum Jan 11 '17 at 09:45
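
To illustrate the last comment, here is a minimal sketch of passing both the data and the class labels to split(). The dataframe contents and the column names (feature_a, feature_b, label) are made up for the example, since the question does not show them. shuffle=True with a fixed random_state is one way to get randomly generated but repeatable folds.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical two-class dataframe; column names and sizes are assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=25),
    "feature_b": rng.normal(size=25),
    "label": [0] * 15 + [1] * 10,
})

X = df.drop(columns="label")  # columns that contain the data
y = df["label"]               # column that contains the class label

# shuffle=True assigns rows to folds at random within each class;
# random_state makes the assignment repeatable.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

folds = []
for train_index, test_index in skf.split(X, y):
    # split() yields positional indices, so select rows with .iloc
    df_train = df.iloc[train_index]
    df_test = df.iloc[test_index]
    folds.append((train_index, test_index))
    print("TEST:", test_index,
          "class counts:", df_test["label"].value_counts().to_dict())
```

Each test fold should keep roughly the same class proportions as the full dataframe, which is what the np.array_split approach in the question does not guarantee.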
