
I would like to split my dataframe into training and test data. There is a great post here on how to do this randomly. However, I need to split it based on the names of the observations, to make sure that (for instance) 2/3 of the observations with sample name 'X' are allocated to the training data and 1/3 of the observations with sample name 'X' are allocated to the test data.

Here is the top of my DF:

             136       137       138       139  141  143  144  145       146  \
Sample                                                                         
HC10    0.000000  0.000000  0.000000  0.000000  0.0  0.0  0.0  0.0  0.140901   
HC10    0.000000  0.000000  0.000000  0.267913  0.0  0.0  0.0  0.0  0.000000   
HC10    0.000000  0.000000  0.000000  0.000000  0.0  0.0  0.0  0.0  0.174445   
HC11    0.059915  0.212442  0.255549  0.000000  0.0  0.0  0.0  0.0  0.000000   
HC11    0.000000  0.115988  0.144056  0.070028  0.0  0.0  0.0  0.0  0.000000   

        147       148  149       150  151       152      154       156  158  \
Sample                                                                        
HC10    0.0  0.189937  0.0  0.052635  0.0  0.148751  0.00000  0.000000  0.0   
HC10    0.0  0.000000  0.0  0.267764  0.0  0.000000  0.00000  0.000000  0.0   
HC10    0.0  0.208134  0.0  0.130212  0.0  0.165507  0.00000  0.000000  0.0   
HC11    0.0  0.000000  0.0  0.000000  0.0  0.000000  0.06991  0.102209  0.0   
HC11    0.0  0.065779  0.0  0.072278  0.0  0.060815  0.00000  0.060494  0.0   

             160  173  
Sample                 
HC10    0.051911  0.0  
HC10    0.281227  0.0  
HC10    0.000000  0.0  
HC11    0.000000  0.0  
HC11    0.073956  0.0

Sample is the index of the dataframe; the remaining columns are numerical.

If I use a solution such as:

train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)

as was suggested here, then all of the rows for a sample such as HC10 may end up in the training data, leaving none to test the model on. Does anyone know a quick way (ideally using pandas) to partition the data in this way?

Many thanks

user3062260
  • The normal way to deal with this type of worry is to use cross-validation, whereby you train and test on multiple random splits of the data. – Stev Feb 15 '18 at 14:38
  • Are you looking for [StratifiedShuffleSplit](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html)? – MaxU - stand with Ukraine Feb 15 '18 at 14:40
  • I've heard about using cross-validation (I'm not a statistician so am not very familiar with it). So would you suggest partitioning as above and then running cross validation? Would I just do this several times in a loop and pipe into a model? Or is there another approach which is considered best practice? – user3062260 Feb 15 '18 at 14:44
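
For reference, here is a minimal sketch of the StratifiedShuffleSplit idea mentioned in the comments (an assumption about how it would be applied here, not code from the thread). It uses the 'Sample' index values themselves as the stratification labels, so each sample name is split roughly 2/3 : 1/3 between train and test:

from sklearn.model_selection import StratifiedShuffleSplit

# df is the dataframe shown above, indexed by 'Sample'
splitter = StratifiedShuffleSplit(n_splits=1, test_size=1/3, random_state=0)
train_idx, test_idx = next(splitter.split(df, df.index))  # stratify on the sample names
train, test = df.iloc[train_idx], df.iloc[test_idx]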

1 Answer


You can do the sampling group-wise, to keep each group balanced. I'll illustrate with a small example:

import pandas as pd
df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'x': range(10)
})

train = (df.reset_index()                        # keep the original index as a column
           .groupby('group')                     # split by "group"
           .apply(lambda g: g.sample(frac=0.6))  # do the random split within each group
           .reset_index(drop=True)               # drop the group-level index added by apply
           .set_index('index'))                  # restore the original index
test = df.drop(train.index)                      # the remaining rows become the test set

Another solution would be to use stratified sampling algorithms, e.g. from scikit-learn.
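
As a minimal sketch of that route (not part of the original answer, and assuming the same toy df as above), train_test_split can stratify directly on the group column, which keeps each group's train/test proportions close to the requested split:

from sklearn.model_selection import train_test_split

# stratify on the group labels so 'a' and 'b' are both split roughly 60/40
train, test = train_test_split(df, test_size=0.4, stratify=df['group'], random_state=0)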

David Dale
  • So is this basically adding another variable to assign each observation to a group, then partitioning based on that? – user3062260 Feb 15 '18 at 15:46
  • Yes, exactly. The goal is to distribute this variable as evenly as possible. – David Dale Feb 15 '18 at 16:48
  • I had tried something very similar to this originally before I posted here, but the problem is it's not very dynamic, i.e. if I then want to take a different train/test set, for example for cross-validation as others have mentioned, the dataset has a fixed group label - unless I misunderstood your code? – user3062260 Feb 15 '18 at 17:13
  • Group label is fixed, but it may be split multiple times in different ways (try it!). – David Dale Feb 15 '18 at 17:53
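
To make that last point concrete, here is a hedged sketch (not from the thread) of re-running the group-wise split with different seeds to obtain several different train/test folds from the same fixed group label; the fit_and_score call is a hypothetical placeholder:

for seed in range(3):
    train = (df.reset_index()                                         # keep the original index as a column
               .groupby('group', group_keys=False)                    # split within each group
               .apply(lambda g: g.sample(frac=0.6, random_state=seed))
               .set_index('index'))                                   # restore the original index
    test = df.drop(train.index)
    # fit_and_score(train, test)                                      # hypothetical model-evaluation step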