I have a data frame like the one below, and I need to split it into training and test sets so that no ID appears in both: if an ID is in the training set, none of its rows may appear in the test set.

   Row  ID  AGE GENDER  TIME  CODE
    0    1   66      M     1     0
    1    1   66      M     2     0
    2    1   66      M     3     1
    3    2   20      F     1     0
    4    2   20      F     2     0
    5    2   20      F     3     0
    6    2   20      F     4     0
    7    3   18      F     1     0
    8    3   18      F     2     0
    9    3   18      F     3     0
    10   3   18      F     4     1

The desired training set should look like this:

  Row   ID  AGE GENDER  TIME  CODE
    0    1   66      M     1     0
    1    1   66      M     2     0
    2    1   66      M     3     1
    3    2   20      F     1     0
    4    2   20      F     2     0
    5    2   20      F     3     0
    6    2   20      F     4     0

and the test set should look like this:

   Row   ID  AGE GENDER  TIME  CODE
    0    3   18      F     1     0
    1    3   18      F     2     0
    2    3   18      F     3     0
    3    3   18      F     4     1

How can I do this with pandas in Python?

Thanks in advance

Mostafa Alishahi
  • Please read [how-to-make-good-reproducible-pandas-examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Brown Bear Jun 01 '18 at 09:30
  • `df_train[~df_train['ID'].isin(df_test['ID'])]` ? – jpp Jun 01 '18 at 09:31
  • @jpp I have a single df that I need to split into df_train and df_test according to this condition. I tested your suggestion and it doesn't work. Any ideas? – Mostafa Alishahi Jun 01 '18 at 09:43
  • @MohamedThasinah Thanks for your comment, but if you look at it, my case is different from that one. I need to split by group ID, and I don't know how many rows each group has, so I guess I need to use groupby somehow, but I don't know how. – Mostafa Alishahi Jun 01 '18 at 09:46

1 Answer

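Try this: take the unique IDs, assign the first 60% of them to the training set, and put every row with any of the remaining IDs in the test set.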

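# Unique IDs, in order of first appearance
ids = df['ID'].unique()
# Keep the first 60% of the IDs for training
t = ids[:int(round(len(ids) * 0.60))]

# Rows whose ID is in t go to train; all other rows go to test
train = df[df['ID'].isin(t)]
test = df[~df['ID'].isin(t)]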

Input:

    Row  ID  AGE GENDER  TIME  CODE
0     0   1   66      M     1     0
1     1   1   66      M     2     0
2     2   1   66      M     3     1
3     3   2   20      F     1     0
4     4   2   20      F     2     0
5     5   2   20      F     3     0
6     6   2   20      F     4     0
7     7   3   18      F     1     0
8     8   3   18      F     2     0
9     9   3   18      F     3     0
10   10   3   18      F     4     1

Output:

Train:

   Row  ID  AGE GENDER  TIME  CODE
0    0   1   66      M     1     0
1    1   1   66      M     2     0
2    2   1   66      M     3     1
3    3   2   20      F     1     0
4    4   2   20      F     2     0
5    5   2   20      F     3     0
6    6   2   20      F     4     0

Test:

    Row  ID  AGE GENDER  TIME  CODE
7     7   3   18      F     1     0
8     8   3   18      F     2     0
9     9   3   18      F     3     0
10   10   3   18      F     4     1
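
The split above always takes the first IDs in order of appearance. If you would rather pick the training IDs at random, a minimal sketch could look like the following. It is a variation on the same idea, not something the question asked for: the helper name `split_by_id`, its parameters, and the seed value are my own assumptions for illustration, and the frame is rebuilt from the question's data so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Example frame copied from the question.
df = pd.DataFrame({
    "Row":    list(range(11)),
    "ID":     [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "AGE":    [66, 66, 66, 20, 20, 20, 20, 18, 18, 18, 18],
    "GENDER": ["M", "M", "M", "F", "F", "F", "F", "F", "F", "F", "F"],
    "TIME":   [1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4],
    "CODE":   [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
})


def split_by_id(frame, id_col="ID", train_frac=0.6, seed=0):
    """Hypothetical helper: split rows so that no ID lands in both parts."""
    rng = np.random.default_rng(seed)       # seeded, so the split is repeatable
    ids = frame[id_col].unique()
    shuffled = rng.permutation(ids)         # random order of the unique IDs
    n_train = int(round(len(shuffled) * train_frac))
    train_ids = shuffled[:n_train]
    in_train = frame[id_col].isin(train_ids)
    return frame[in_train], frame[~in_train]


train, test = split_by_id(df)

# Sanity check: the two parts share no ID.
assert set(train["ID"]).isdisjoint(test["ID"])
```

Which IDs end up in train depends on the seed, but every row of a given ID always stays on the same side. Note that `train_frac` is a fraction of the IDs, not of the rows, so the row counts can be uneven when groups differ in size.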
Mohamed Thasin ah