I have a pandas DataFrame df containing data from two classes. I would like to generate random indices for a stratified K-fold cross-validation.

What I do at the moment is:

df_folds = np.array_split(df, 5)
for k in range(5):
    # We use 'list' to copy, in order to 'pop' later on
    df_train = list(df_folds)
    df_test  = df_train.pop(k)
    df_train = pd.concat(df_train)

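However, this is not a stratified 5-fold cross-validation: it just splits the dataframe into 5 consecutive chunks, without regard to the class labels.

I also tried StratifiedKFold from scikit-learn, but it raises the error shown below the code: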

from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=3)
skf.get_n_splits(df)

print(skf)  

for train_index, test_index in skf.split(df):
   print("TRAIN:", train_index, "TEST:", test_index)

TypeError: split() takes at least 3 arguments (2 given)
Tagc
gabboshow
  • sklearn already provides this: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html have you tried this? – EdChum Jan 11 '17 at 09:38
  • I couldn't make it work with a pandas dataframe – gabboshow Jan 11 '17 at 09:39
  • Please show the erroneous code in your question, as sklearn is compatible with pandas dataframes – EdChum Jan 11 '17 at 09:40
  • @EdChum please see the code that I tried – gabboshow Jan 11 '17 at 09:43
  • also see http://stackoverflow.com/q/38250710/2336654 – piRSquared Jan 11 '17 at 09:44
  • The error is clear: the docs show it takes 2 args. You need to pass the columns that contain the data, and then the column that contains the class label: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html – EdChum Jan 11 '17 at 09:45
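
Following the last comment, here is a minimal sketch of how the split might look with both arguments passed. The toy dataframe and the column names "f1", "f2" and "label" are made up for illustration, since the question does not show the real columns. shuffle=True with a fixed random_state is an assumption to get randomly generated but reproducible indices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-in for df: two feature columns plus a binary "label" column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "f1": rng.normal(size=20),
    "f2": rng.normal(size=20),
    "label": [0] * 14 + [1] * 6,
})

X = df.drop(columns="label")  # the columns that contain the data
y = df["label"]               # the column that contains the class label

# shuffle=True randomizes which rows land in which fold;
# random_state makes the shuffling reproducible.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

folds = []
for train_index, test_index in skf.split(X, y):
    # split() yields positional indices, so select rows with iloc.
    df_train = df.iloc[train_index]
    df_test = df.iloc[test_index]
    folds.append((df_train, df_test))
    print("TEST class counts:", df_test["label"].value_counts().to_dict())
```

Because split() returns positional indices, iloc is used rather than loc, so this should also work when df has a non-default index.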
