Split data into train/ test files such that at least one sample is picked for both the files

Question

I have a CSV file which is read into a dataframe. I split it into training and test files based on the values of one column.

Let us say the column is called "category" and it has several category names as values, such as cat1, cat2, cat3 and so on, each of which appears more than once.

I need to split the files such that each category name comes in both the files at least once.

So far I am able to split the file into two based on a ratio. I have tried many options but this is the best one so far.

  def executeSplitData(self):
      data = self.readCSV()
      df = data
      if self.column in data:
          train, test = train_test_split(df, stratify=None, test_size=0.5)
          self.writeTrainFile(train)
          self.writeTestFile(test)

I do not fully understand the stratify option in train_test_split. Please help. Thanks

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html — chrisaycock, Jun 22 '16 at 14:03

score 3 · Answer 1 · answered Jun 22 '16 at 14:29

I tried to use it according to the docs and couldn't get stratify to work.

Setup

from sklearn.cross_validation import train_test_split
import pandas as pd
import numpy as np

np.random.seed([3,1415])
p = np.arange(1, 5.) / np.arange(1, 5.).sum()
df = pd.DataFrame({'category': np.random.choice(('cat1', 'cat2', 'cat3', 'cat4'), (1000,), p=p),
                   'x': np.random.rand(1000), 'y': np.random.choice(range(2), (1000,))})


def get_freq(s):
    return s.value_counts() / len(s)

print(get_freq(df.category))

cat4    0.400
cat3    0.284
cat2    0.208
cat1    0.108
Name: category, dtype: float64

If I try to:

train, test = train_test_split(df, stratify=df.category, test_size=.5)
train, test = train_test_split(df, stratify=df.category.values, test_size=.5)
train, test = train_test_split(df, stratify=df.category.values.tolist(), test_size=.5)

All three raised:

TypeError: Invalid parameters passed:

The docs say:

stratify : array-like or None (default is None)

I can't think why this wouldn't work.

I decided to build a workaround that splits each category separately and concatenates the pieces:

def stratify_train_test(df, stratifyby, *args, **kwargs):
    train, test = pd.DataFrame(), pd.DataFrame()
    gb = df.groupby(stratifyby)
    for k in gb.groups:
        traink, testk = train_test_split(gb.get_group(k), *args, **kwargs)
        train = pd.concat([train, traink])
        test = pd.concat([test, testk])
    return train, test

train, test = stratify_train_test(df, 'category', test_size=.5)
# this also works
# train, test = stratify_train_test(df, df.category, test_size=.5)

print(get_freq(train.category))
print(len(train))

cat4    0.400
cat3    0.284
cat2    0.208
cat1    0.108
Name: category, dtype: float64
500

print(get_freq(test.category))
print(len(test))

cat4    0.400
cat3    0.284
cat2    0.208
cat1    0.108
Name: category, dtype: float64
500
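
For comparison, here is a minimal sketch of the same split using the stratify option directly. It assumes a scikit-learn version in which train_test_split lives in sklearn.model_selection and accepts stratify; the data and the random_state values are hypothetical and only there to make the example reproducible. It is not what I ran above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split  # assumes a newer scikit-learn

# Hypothetical data, similar in shape to the setup above.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    'category': rng.choice(['cat1', 'cat2', 'cat3', 'cat4'],
                           size=1000, p=[0.1, 0.2, 0.3, 0.4]),
    'x': rng.rand(1000),
})

# stratify takes the labels whose proportions should be preserved in both halves.
train, test = train_test_split(
    df, test_size=0.5, stratify=df['category'], random_state=0)

# Every category in the data should appear in both halves.
print(sorted(train['category'].unique()))
print(sorted(test['category'].unique()))
```

Note that a stratified split needs at least two rows per category, otherwise scikit-learn raises a ValueError, so very rare categories would have to be handled separately.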

Hi piRsquared.. thanks for the help. But I need a bit more info.. each category is added to both the files at the end but the splitting is not happening. — Anila A, Jun 23 '16 at 08:09
print test1
   category Column2
0         1       A
1         1       A
2         2       B
3         3       C
4         3       C
5         4       D
1         1       A
2         2       B
4         3       C
5         4       D
>>> print train1
   category Column2
0         1       A
1         1       A
2         2       B
3         3       C
4         3       C
5         4       D
0         1       A
3         3       C
— Anila A, Jun 23 '16 at 08:10
