
I want to split a raw dataframe into three subsets: train, test, and validate.

I see three solutions, but I am afraid they are not correct and may cause a bottleneck effect.

1) Add a dictionary with keys

my_dict = {'train': raw_df.loc[start:end],
           'test': raw_df.loc[start:end],
           'val': raw_df.loc[start:end]}

2) Create three dataframes

train_df = df.loc[start:end]
test_df = df.loc[start:end]
val_df = df.loc[start:end]

3) Add a new column with one of three random values

df['train/test/val'] = pd.Series('train', index=df.index)

Also, will holding a dataframe in a dictionary or list cause a bottleneck, i.e. lose the performance advantages of the dataframe? Adding new columns in theory increases the dimensionality of the data. Creating new dataframes is, I think, the worst variant, because it will eat tons of memory.

Demaunt
  • Adding a new column will not increase the dimension as you will not include it in a learning process. If you are worried about the space (though I don't think it will take much space) you can just store the cutpoints (i.e. 0, 25, 70, 100) and when needed use the slices of the dataframe (df[0:25], df[25:70] etc). Dividing them into three different dataframes will also not increase the memory usage much. – ayhan Jul 01 '16 at 07:22
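
A minimal sketch of the cutpoint idea from the comment above (the cut positions and the stand-in data are illustrative, not from the question):

import numpy as np
import pandas as pd

raw_df = pd.DataFrame({'x': np.arange(100)})  # stand-in for the real data

# Store only the cut positions instead of three separate copies.
cuts = [0, 25, 70, 100]

# Slice on demand whenever a subset is needed.
train = raw_df.iloc[cuts[0]:cuts[1]]
test = raw_df.iloc[cuts[1]:cuts[2]]
val = raw_df.iloc[cuts[2]:cuts[3]]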

3 Answers


Adding a new column won't eat lots of memory, but you will pay a slicing cost every time you want to access one of your three sets. Creating new dataframes means the slicing is done only once.
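
For illustration, a rough sketch of that repeated slicing with the label-column approach (the column name and the 80/10/10 proportions are assumptions, not from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(1000)})  # stand-in data

# Label every row once...
df['set'] = np.random.choice(['train', 'test', 'val'], size=len(df), p=[0.8, 0.1, 0.1])

# ...but each access re-scans the whole column with a boolean mask.
train = df[df['set'] == 'train']
val = df[df['set'] == 'val']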

To create the three dataframes you can use sample. Say you want 80% of your dataframe in train, and 10% each in test and validate:

# Take a random 80% of the rows for train.
train = df.sample(frac=0.8)
# Of the remaining 20%, take half (10% of the total) for test.
test = df.drop(train.index).sample(frac=0.5)
# The leftover 10% becomes validate.
validate = df.drop(train.index).drop(test.index)
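
If you need the split to be reproducible, sample also accepts a random_state seed (the value 42 below is arbitrary):

train = df.sample(frac=0.8, random_state=42)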
ysearka

Technically, I don't think you need three dataframes to test your machine learning model. Why? Because you build your model on your training_set and you validate it with your validation_set. You only use your test_set once your model is validated. Also, your test_set doesn't contain the Y label.

Several libraries contain functions to split your data easily.
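
For example, scikit-learn offers train_test_split; a minimal sketch, assuming a version where it lives in sklearn.model_selection (older releases exposed it from sklearn.cross_validation):

from sklearn.model_selection import train_test_split

# Hold out 20% of the rows; one call returns both parts.
train, validation = train_test_split(df, test_size=0.2)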

Without using any external library you can do this:

import numpy as np

# Boolean mask that is True for roughly 80% of the rows.
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
validation = df[~msk]

(answer from: How do I create test and train samples from one dataframe with pandas?)

Hope this helps!

Elliott Addi
  • The train/test/validate split is used, for instance, in Kaggle's competitions. When you train on that kind of platform, it can be useful to split your data into three sets. As for the rest of the time, I totally agree with you. – ysearka Jul 01 '16 at 07:37
  • @ysearka I checked out Kaggle some time ago, do you know why they would want three splits? – Elliott Addi Jul 01 '16 at 07:40
  • You can use `train`/`test` to cross-validate your model, and keep a `validate` set (which can be common with other people) to compare your results. This permits an objective comparison on the same set. – ysearka Jul 01 '16 at 07:47

A pure pandas solution for just a single iteration would be to use sample, and then exclude the sampled index from the next step with pd.Index.difference:

# Sample the validation rows first.
validation = df.sample(validation_size)
# Get the other part of the dataframe
train_test = df.loc[df.index.difference(validation.index)]

# Repeat the pattern: sample test from what remains...
test = train_test.sample(test_size)
# ...and keep the leftover rows as train.
train = train_test.loc[train_test.index.difference(test.index)]

Note that validation_size and test_size are the number of rows you want for your validation and test frames, respectively.
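
If you prefer to think in fractions, the two counts can be derived from the length of the dataframe (the 10% figures below are just example values):

validation_size = int(len(df) * 0.1)  # 10% of the rows for validation
test_size = int(len(df) * 0.1)        # 10% of the rows for test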

Sklearn also has great functionality for doing splits in a loop for easier cross validation. Documentation here: http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.ShuffleSplit.html
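
A minimal sketch of such a loop, assuming a recent scikit-learn where ShuffleSplit has moved to sklearn.model_selection (the linked page documents the older sklearn.cross_validation location):

from sklearn.model_selection import ShuffleSplit

# Five independent random 80/20 splits of the row positions.
ss = ShuffleSplit(n_splits=5, test_size=0.2)
for train_idx, test_idx in ss.split(df):
    train = df.iloc[train_idx]
    test = df.iloc[test_idx]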

breucopter