Create three sample groups in a pd.DataFrame

Question

I have a pd.DataFrame with a similar structure to the sample below:

index  x     y     z
0      x0    y0    None
1      x1    y1    None
2      x2    y2    None
3      x3    y3    None
4      x4    y4    None
5      x5    y5    None
6      x6    y6    None

My goal is to create 3 subsets of the DataFrame:

Group1 is a training set that can be used to train a model to predict z from x and y;
Group2 is a validation set used to evaluate the accuracy of the model (or of different models/parameter tunings) trained on Group1. I will fill in the correct value of z for both Group1 and Group2;
Group3 is held out until a model has been chosen to predict z.

In this case, what would be the most efficient way to do the assignment? I was thinking of simply creating subgroups within one DataFrame, as below:

index  x     y     z       group
- - - - - - - - - - - - - - - - - - - - 
0      x0    y0    None    training
1      x1    y1    None    validation
2      x2    y2    None    held out
3      x3    y3    None    held out
4      x4    y4    None    validation
5      x5    y5    None    training
6      x6    y6    None    held out

But the tips on random assignment I've seen elsewhere normally create a new DataFrame. Is that because it is more practical?

rows = np.random.choice(df.index.values, 10)
sampled_df = df.loc[rows]  # .loc replaces the removed .ix indexer

Also, since I want to sample 3 groups at once instead of 2, I am not sure what the best way to sample without replacement is.

Here's a similar question http://stackoverflow.com/q/38250710/2285236 — ayhan, Oct 30 '16 at 18:57

score 2 · Accepted Answer · edited May 23 '17 at 12:19

You could use

df['group'] = np.random.choice(
    np.repeat(['training', 'validation', 'held out'], (2,2,3)), len(df), replace=False)

to assign a training/validation/held out label to each row. The (2,2,3) above indicates the number of rows of each type you wish to have. Since each row should get a label, the sum of the tuple should equal len(df).
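As a quick check of the sizes, here is a minimal runnable sketch on a toy 7-row frame. The placeholder values and the names sizes, labels, and counts are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the question's layout (values are placeholders).
df = pd.DataFrame({
    "x": [f"x{i}" for i in range(7)],
    "y": [f"y{i}" for i in range(7)],
    "z": [None] * 7,
})

sizes = (2, 2, 3)  # rows wanted per group; must sum to len(df)
labels = np.repeat(["training", "validation", "held out"], sizes)
assert labels.size == len(df)

# Drawing len(df) items without replacement amounts to shuffling the labels.
df["group"] = np.random.choice(labels, len(df), replace=False)

# Each label should appear exactly as many times as requested in sizes.
counts = df["group"].value_counts().to_dict()
print(counts)
```

Which rows land in which group changes from run to run, but the per-group counts always match sizes.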

Is assigning labels better than creating sub-DataFrames?

If you assign labels, you'll end up with code like:

df['group'] = np.random.choice(
    np.repeat(['training', 'validation', 'held out'], (2,2,3)), len(df), replace=False)
goodness = dict()
params = dict()
for model in models:
    # labels must match the ones assigned above: 'training', 'validation', 'held out'
    params[model] = fit(model, df.loc[df['group'] == 'training'])
    goodness[model] = validate(model, params[model], df.loc[df['group'] == 'validation'])
best_model = max(models, key=goodness.get)
result = process(best_model, params[best_model], df.loc[df['group'] == 'held out'])

If you split df (using MaxU's solution), you'll end up with code like:

# the validation frame is named `validation` so it does not shadow the validate() function
train, validation, held_out = np.split(df.sample(frac=1), [2,4])
goodness = dict()
params = dict()
for model in models:
    params[model] = fit(model, train)
    goodness[model] = validate(model, params[model], validation)
best_model = max(models, key=goodness.get)
result = process(best_model, params[best_model], held_out)

Each time Python evaluates df['group'] == 'training', the entire Series df['group'] is scanned, which is an O(N) operation. df.loc[df['group'] == 'training'] then copies the matching rows from df to form a new sub-DataFrame. Since this happens inside the loop for model in models, twice per iteration, the O(N) scan and copy are performed 2*len(models) times.

In contrast, if you split the DataFrame at the very beginning, then the copying is only done once. So MaxU's code is faster.
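The split-once idea can also be sketched with plain positional slicing. This is a variant of the shuffle-then-split approach, not MaxU's exact code; it may be useful if your pandas version warns about passing a DataFrame to np.split, which I believe recent releases do. The group sizes and the fixed random_state are assumptions for the example.

```python
import pandas as pd

# Toy data standing in for the real frame.
df = pd.DataFrame({"x": range(7), "y": range(7, 14)})

# Shuffle all rows once; random_state is fixed only to make the sketch reproducible.
shuffled = df.sample(frac=1, random_state=0)

n_train, n_valid = 2, 2  # the remaining rows become the held-out set

# Three non-overlapping positional slices of the shuffled frame.
train = shuffled.iloc[:n_train]
validation = shuffled.iloc[n_train:n_train + n_valid]
held_out = shuffled.iloc[n_train + n_valid:]
```

Because the three slices partition the shuffled frame, every row ends up in exactly one subset, so the sampling is without replacement.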

On the other hand, using the labels to create sub-DataFrames on demand saves a bit of memory, since you never instantiate all three sub-DataFrames at once. However, unless you are really tight on memory, you'll probably prefer faster code over more memory-efficient code. If so, use MaxU's solution.

Of course, you could use

df['group'] = np.random.choice(
    np.repeat(['training', 'validation', 'held out'], (2,2,3)), len(df), replace=False)
train, validation, held_out = [df.loc[df['group'] == label] for label in ['training', 'validation', 'held out']]

instead of

train, validation, held_out = np.split(df.sample(frac=1), [2,4])

but there is no speed or memory advantage to doing it this way either. You'd still be scanning and copying from the DataFrame three times instead of once. So again MaxU's solution should be preferred.
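If you do want to keep the group column around, for example to inspect the assignment later, one option not mentioned above is to build the sub-DataFrames through a single groupby instead of three separate boolean masks. This is only a sketch: the subsets dictionary is a name introduced here, and the rows are still copied, so it is not claimed to beat splitting once.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real frame.
df = pd.DataFrame({"x": range(7), "y": range(7, 14)})

# Shuffle the labels and attach them as a column.
labels = np.repeat(["training", "validation", "held out"], (2, 2, 3))
df["group"] = np.random.permutation(labels)

# Iterating over a groupby yields (label, sub-DataFrame) pairs.
subsets = {label: sub for label, sub in df.groupby("group")}

train = subsets["training"]
validation = subsets["validation"]
held_out = subsets["held out"]
```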

thanks so much for the detailed explanation! really helped me understand the differences in each approach. — Carl H, Oct 31 '16 at 22:24
