How to split data into train and test sets based on a column's values and shuffle the combinations?

Question

I have a dataset that I want to split based on the values in one column. At every iteration, the training set should include all rows except those belonging to 2 of the values, which are held out as the test set.

As an example, we have column x with values a, b, c, d, e and f.

At the moment I am selecting the values manually, but since I want to try every possible combination, I am not sure how best to do that.

train = df.loc[~df['x'].isin(['a','b'])]
test = df.loc[df['x'].isin(['a','b'])]

How do I change this code to consider all possible combinations?

I would also like to print each combination to see which values were used for the training and test sets.

Do you want to split the data in test and train, or find all the possible combinations of test and train? — gregory, Aug 13 '18 at 14:18
It's to find all possible combinations of test and train, with the test set containing the data for 2 values of x in every iteration. I will try the suggestions provided below. — MugB, Aug 13 '18 at 14:46

C8H10N4O2 · Accepted Answer · 2018-08-13T14:22:01.397

4

Not tested, but how about using itertools.combinations like:

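import itertools

# Hold out every pair of distinct values of x in turn
for holdouts in itertools.combinations(df['x'].unique(), 2):
    print(holdouts)  # the two values used for the test set
    train = df[~df['x'].isin(holdouts)]
    test = df[df['x'].isin(holdouts)]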

You could avoid evaluating isin twice by computing mask = df['x'].isin(holdouts) once and then indexing with df[~mask] for train and df[mask] for test.

Note that .loc isn't necessary when indexing with a boolean mask.
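As a rough sketch of the mask suggestion above, the loop could be wrapped in a generator that yields each held-out pair together with its train and test frames. The toy DataFrame, the column y, and the helper name holdout_splits are assumptions made up for illustration; only the column x and its six values come from the question.

```python
import itertools

import pandas as pd

# Hypothetical toy data: column 'x' takes the six values from the question,
# two rows per value. Column 'y' is a made-up payload column.
df = pd.DataFrame({
    "x": list("aabbccddeeff"),
    "y": range(12),
})


def holdout_splits(frame, column, n_holdout=2):
    """Yield (holdouts, train, test) for every combination of n_holdout values."""
    values = sorted(frame[column].unique())
    for holdouts in itertools.combinations(values, n_holdout):
        # Evaluate isin once and reuse the mask for both halves.
        mask = frame[column].isin(list(holdouts))
        yield holdouts, frame[~mask], frame[mask]


splits = list(holdout_splits(df, "x"))
for holdouts, train, test in splits:
    print(f"test values: {holdouts}, train rows: {len(train)}, test rows: {len(test)}")
```

With six distinct values there are 6 choose 2 = 15 pairs, so the loop runs 15 times. Sorting the unique values first only makes the printed order predictable; it is not required for correctness.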

edited Aug 13 '18 at 14:22

answered Aug 13 '18 at 14:14


score 0 · Answer 2 · answered Aug 13 '18 at 14:21

iteratetools.combinations should work.

user3701435


It's `itertools` not `iteratetools` and my answer already says that. – C8H10N4O2 Aug 13 '18 at 14:22
