I need to split my dataset df randomly into two sets (proportion 70:30) using batches of 2. By "batch", I mean that every 2 (the batch size) sequential rows should always belong to the same set.
col1 col2 col3
1 0.5 10
1 0.3 11
5 1.4 1
3 1.5 2
1 0.9 10
3 0.4 7
1 1.2 9
3 0.1 11
Sample result (due to randomness, the outputs might be different, but this serves as an example):
set1
col1 col2 col3
1 0.5 10
1 0.3 11
1 0.9 10
3 0.4 7
1 1.2 9
3 0.1 11
set2
col1 col2 col3
5 1.4 1
3 1.5 2
I know how to split data randomly using batches of 1:
import numpy as np
msk = np.random.rand(len(df)) < 0.7
set1 = df[msk]
set2 = df[~msk]
However, I am not sure how to make the batch size flexible.
Thanks.
Update:
This is what I currently have, but the last line of code fails. set1 and set2 should be pandas DataFrames.
n = 3
df_batches = [df[i:i+n] for i in range(0, df.shape[0],n)]
set1_idx = np.random.randint(len(df_batches), size=int(0.7*len(df_batches)))
set2_idx = np.random.randint(len(df_batches), size=int(0.3*len(df_batches)))
set1, set2 = df_batches[set1_idx,:], df_batches[set2_idx,:]
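For reference, the last line probably fails because df_batches is a plain Python list, which cannot be indexed with a NumPy array and a `:` slice. Separately, np.random.randint draws with replacement, so the same batch could be picked twice or land in both sets. One possible way around both issues is sketched below: label every row with a batch number, shuffle the batch numbers once, and build a boolean mask from them. The function name split_in_batches and its parameters (batch_size, frac, seed) are made up for illustration, and rounding the number of batches to the nearest integer is an assumption about what "70:30" should mean when the batches do not divide evenly.

```python
import numpy as np
import pandas as pd


def split_in_batches(df, batch_size=2, frac=0.7, seed=None):
    """Randomly split df in two, keeping each run of batch_size
    consecutive rows in the same set (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Batch label per row: 0, 0, 1, 1, 2, 2, ... when batch_size=2.
    batch_ids = np.arange(len(df)) // batch_size
    n_batches = int(np.ceil(len(df) / batch_size))
    # Shuffle the batch numbers without replacement, so every batch
    # ends up in exactly one of the two sets.
    shuffled = rng.permutation(n_batches)
    # Assumption: round to the nearest whole number of batches.
    n_first = int(round(frac * n_batches))
    in_set1 = np.isin(batch_ids, shuffled[:n_first])
    # Boolean masks keep the original row order and return DataFrames.
    return df[in_set1], df[~in_set1]


df = pd.DataFrame({
    "col1": [1, 1, 5, 3, 1, 3, 1, 3],
    "col2": [0.5, 0.3, 1.4, 1.5, 0.9, 0.4, 1.2, 0.1],
    "col3": [10, 11, 1, 2, 10, 7, 9, 11],
})
set1, set2 = split_in_batches(df, batch_size=2, frac=0.7, seed=0)
```

With the 8-row example and batch_size=2 there are 4 batches, so set1 receives 3 of them (6 rows) and set2 the remaining one (2 rows). Which batches end up where depends on the seed. If the number of rows is not a multiple of batch_size, the last batch is simply shorter.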