
I have the following data:

import pandas as pd

df = pd.DataFrame({'Group_ID': [1, 1, 1, 2, 2, 2, 3, 4, 5, 5],
                   'Item_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'Target': [0, 0, 1, 0, 1, 1, 0, 0, 0, 1]})

   Group_ID  Item_id  Target
0         1        1       0
1         1        2       0
2         1        3       1
3         2        4       0
4         2        5       1
5         2        6       1
6         3        7       0
7         4        8       0
8         5        9       0
9         5       10       1

I need to split the dataset into training and test sets based on "Group_ID", so that 80% of the data goes into the training set and 20% into the test set.

That is, I need my training set to look something like:

   Group_ID  Item_id  Target
0         1        1       0
1         1        2       0
2         1        3       1
3         2        4       0
4         2        5       1
5         2        6       1
6         3        7       0
7         4        8       0

And my test set to look like:

   Group_ID  Item_id  Target
8         5        9       0
9         5       10       1

What would be the simplest way to do this? As far as I know, the standard train_test_split function in sklearn does not support splitting by groups in a way that also lets me specify the size of the split (e.g. 80/20).

  • What have you tried? Using random selection can work. – Rob Feb 21 '19 at 00:47
  • @Rob Could you provide an example? I've relied so much on sklearn in the past that I'm completely lost as to how to split any other way. – Negative Correlation Feb 21 '19 at 01:06
  • I can think of two ways, but it depends on your complete dataset. 1) Let's say you have 10 records in the dataset; sort it by Group_ID and then just use train = df.iloc[:8, :], test = df.iloc[8:, :]. 2) Use a conditional subset: make a list of groups, for example a = [5, 6], and use df['Group_ID'].isin(a) (see the sketch after these comments). – Aditya Kansal Feb 21 '19 at 02:07
  • @AdityaKansal The data is about 4 GB in size. Could I use something like sklearn's GroupShuffleSplit? – Negative Correlation Feb 21 '19 at 02:28
  • Also, you should use k-fold cross-validation for training and testing. This is where you split your data into k (usually k=10) random sets; you then loop k times, each time using k-1 sets to train and 1 to test (a different one each loop). This ensures that all of the data is used for both training and testing. – Rob Feb 21 '19 at 14:57
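For reference, a minimal sketch of the conditional-subset approach from the comment above; the choice of which groups to hold out is illustrative (group 5 matches the desired test set in the question):

# Hold out a hand-picked set of groups as the test set.
# The group IDs in test_groups are illustrative.
test_groups = [5]
test = df[df['Group_ID'].isin(test_groups)]
train = df[~df['Group_ID'].isin(test_groups)]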

1 Answer


I figured out the answer. This seems to work:

from sklearn.model_selection import GroupShuffleSplit

# Hold out 20% of the data by group; random_state makes the split reproducible
splitter = GroupShuffleSplit(test_size=0.20, n_splits=2, random_state=7)
split = splitter.split(df, groups=df['Group_ID'])
train_inds, test_inds = next(split)

train = df.iloc[train_inds]
test = df.iloc[test_inds]
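As a quick sanity check (not part of the original answer), you can confirm that the two sets share no groups and inspect the resulting proportions:

# No group appears in both sets, and the sizes are roughly 0.8 / 0.2
assert set(train['Group_ID']).isdisjoint(test['Group_ID'])
print(len(train) / len(df), len(test) / len(df))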
  • Shouldn't it be `n_splits=1`? It will still work with `n_splits=2`, but will generate an extra split that is never used. – Iakov Davydov Feb 15 '22 at 11:39
  • The number of splits determines the relative sizes of train and test: if you want 50:50 then you need to use n_splits=2, for 80:20 use n_splits=5, etc. – David Waterworth Mar 01 '23 at 04:19
  • What if we want to split keeping whole groups but at the same time stratify (to keep the same proportion of classes)? How do you mix group-wise and stratified splitting? – skan Jul 24 '23 at 17:20
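Regarding the last comment: scikit-learn 1.0 and later ship StratifiedGroupKFold, which keeps each group intact while approximately preserving class proportions across folds. A minimal sketch on the question's data (the fold count here is illustrative):

from sklearn.model_selection import StratifiedGroupKFold

# Keeps each Group_ID in a single fold while trying to balance
# the Target class proportions across folds (best effort)
sgkf = StratifiedGroupKFold(n_splits=2, shuffle=True, random_state=7)
train_inds, test_inds = next(sgkf.split(df, y=df['Target'], groups=df['Group_ID']))

train = df.iloc[train_inds]
test = df.iloc[test_inds]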