I have a pd.DataFrame with a similar structure to the sample below:
index x y z
0 x0 y0 None
1 x1 y1 None
2 x2 y2 None
3 x3 y3 None
4 x4 y4 None
5 x5 y5 None
6 x6 y6 None
My goal is to create 3 subsets of the DataFrame:
Group 1 is a training set that can be used to train a model to predict z from x and y;
Group 2 is a validation set used to evaluate the accuracy of the model (or of different models/parameter tunings) trained on Group 1; I will fill in the correct value of z for both Group 1 and Group 2;
Group 3 is held out until a model is chosen to predict z.
In this case, what would be the most efficient way to do the assignment? I was thinking about simply creating subgroups within one DataFrame, as below:
index x y z group
- - - - - - - - - - - - - - - - - - - -
0 x0 y0 None training
1 x1 y1 None validation
2 x2 y2 None held out
3 x3 y3 None held out
4 x4 y4 None validation
5 x5 y5 None training
6 x6 y6 None held out
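To make the single-DataFrame idea concrete, here is a minimal sketch of what I have in mind. The toy data, the group sizes (2/2/3), and the seed are all assumptions for illustration, not part of my real data:

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the sample above (values are placeholders).
df = pd.DataFrame({
    "x": [f"x{i}" for i in range(7)],
    "y": [f"y{i}" for i in range(7)],
    "z": [None] * 7,
})

# Hypothetical group sizes; they must sum to len(df).
sizes = {"training": 2, "validation": 2, "held out": 3}

# Build one label per row, then shuffle the labels so that every row
# lands in exactly one group.
labels = np.repeat(list(sizes.keys()), list(sizes.values()))
rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
df["group"] = rng.permutation(labels)

# A subset is then just a boolean filter on the label column.
training = df[df["group"] == "training"]
```

With this layout the subsets stay in one frame and are pulled out by filtering on the group column when needed.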
But the tips on random assignment I've seen elsewhere normally create a new DataFrame. Is that because creating a new DataFrame is more practical?
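import numpy as np
# note: np.random.choice samples with replacement by default
rows = np.random.choice(df.index.values, 10)
sampled_df = df.loc[rows]  # .loc instead of .ix, which was removed in pandas 1.0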
Also, since I want to sample 3 groups at once instead of 2, I am not sure what the best way to sample without replacement is.
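For reference, this is a rough sketch of the separate-DataFrame variant I am considering: shuffle the index once and slice it, so no row can be drawn twice. The frame, the sizes (4/3/3), and the seed are assumptions made up for the example, and I don't know whether this is the recommended approach:

```python
import numpy as np
import pandas as pd

# Placeholder data; the real frame has columns x, y, z.
df = pd.DataFrame({"x": range(10), "y": range(10, 20)})

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# A permutation contains every index label exactly once, so slicing it
# gives disjoint groups, i.e. sampling without replacement.
shuffled = rng.permutation(df.index.to_numpy())

n_train, n_valid = 4, 3  # hypothetical sizes; the rest is held out
train_idx = shuffled[:n_train]
valid_idx = shuffled[n_train:n_train + n_valid]
held_idx = shuffled[n_train + n_valid:]

train_df = df.loc[train_idx]
valid_df = df.loc[valid_idx]
held_df = df.loc[held_idx]
```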