
I have a dataset that I want to split based on the values in one column. At every iteration, the training set should include all rows except those belonging to 2 of the values, which are held out for the test set.

As an example, we have column x with values a, b, c, d, e and f.

At the moment I am selecting the two values manually, but since I want to try every possible combination, I am not sure how best to do that.

train = df.loc[~df['x'].isin(['a','b'])]
test = df.loc[df['x'].isin(['a','b'])]

How do I change this code to consider all possible combinations?

I would also like to print each combination so I can see which values were used for the training and test sets.

martineau
MugB
  • Do you want to split the data into test and train, or find all the possible combinations of test and train? – gregory Aug 13 '18 at 14:18
  • It's to find all possible combinations of test and train, with the test set holding the data that belongs to 2 values of x in every iteration. I will try the suggestions provided below. – MugB Aug 13 '18 at 14:46

2 Answers


Not tested, but how about using itertools.combinations like:

import itertools

for holdouts in itertools.combinations(df['x'].unique(), 2):
    print(holdouts)  # the two values held out for the test set
    train = df[~df['x'].isin(holdouts)]
    test = df[df['x'].isin(holdouts)]

You could save one evaluation per iteration by computing mask = df['x'].isin(holdouts) once and then indexing with df[~mask] and df[mask].

Note that .loc isn't necessary for indexing with a boolean mask.
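A minimal sketch of the mask variant, in case it helps. The toy DataFrame, the extra column y, and the splits list are assumptions made up for illustration; only the column name x and the values a to f come from the question.

```python
import itertools

import pandas as pd

# Hypothetical toy data: two rows for each of the six values of x.
df = pd.DataFrame({
    "x": list("aabbccddeeff"),
    "y": range(12),
})

splits = []
for holdouts in itertools.combinations(df["x"].unique(), 2):
    # Evaluate isin() once and reuse the boolean mask for both subsets.
    mask = df["x"].isin(holdouts)
    train, test = df[~mask], df[mask]

    # Show which values went to the test set and which stayed in training.
    train_values = sorted(set(df["x"]) - set(holdouts))
    print(f"test: {holdouts}  train: {train_values}")

    splits.append((holdouts, train, test))
```

Keeping the (holdouts, train, test) tuples in a list is optional; if the data is large, it may be better to fit and score the model inside the loop instead of storing every split.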

C8H10N4O2

itertools.combinations should work.
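To see what itertools.combinations yields before involving pandas, here is a small standard-library sketch. The values list mirrors the example values from the question; the variable names are only illustrative. With six values there are 15 unordered pairs.

```python
import itertools

values = ["a", "b", "c", "d", "e", "f"]

# Every unordered pair of distinct values, in the order they appear in the list.
pairs = list(itertools.combinations(values, 2))
print(len(pairs))

# For the first few pairs, show the test values and the remaining training values.
for test_values in pairs[:3]:
    train_values = [v for v in values if v not in test_values]
    print(test_values, train_values)
```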

user3701435