
I need to split my dataset df randomly into two sets (proportion 70:30) using batches of 2. By "batch", I mean that each group of 2 (the batch size) sequential rows should always belong to the same set.

  col1    col2    col3
  1       0.5     10
  1       0.3     11
  5       1.4     1
  3       1.5     2
  1       0.9     10
  3       0.4     7
  1       1.2     9
  3       0.1     11

Sample result (due to randomness, the outputs might be different, but this serves as an example):

set1
      col1    col2    col3
      1       0.5     10
      1       0.3     11
      1       0.9     10
      3       0.4     7
      1       1.2     9
      3       0.1     11

set2
      5       1.4     1
      3       1.5     2

I know how to split data randomly using batches of 1:

import numpy as np

msk = np.random.rand(len(df)) < 0.7
set1 = df[msk]
set2 = df[~msk] 

However, I'm not sure how to introduce a flexible batch size.

Thanks.

Update:

This is what I currently have, but the last line of code fails. set1 and set2 should be pandas DataFrames.

n = 3
df_batches = [df[i:i + n] for i in range(0, df.shape[0], n)]

set1_idx = np.random.randint(len(df_batches), size=int(0.7 * len(df_batches)))
set2_idx = np.random.randint(len(df_batches), size=int(0.3 * len(df_batches)))
# fails: df_batches is a plain Python list, which can't be indexed with an array
set1, set2 = df_batches[set1_idx, :], df_batches[set2_idx, :]
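For reference: a list can't be fancy-indexed with a NumPy array, and `np.random.randint` draws with replacement, so the same batch could repeat or land in both sets. A minimal working sketch of the batch idea (the sample data mirrors the question; the seed is only so the sketch is reproducible, and a default RangeIndex is assumed):

```python
import numpy as np
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    "col1": [1, 1, 5, 3, 1, 3, 1, 3],
    "col2": [0.5, 0.3, 1.4, 1.5, 0.9, 0.4, 1.2, 0.1],
    "col3": [10, 11, 1, 2, 10, 7, 9, 11],
})

np.random.seed(0)  # illustration only

n = 2  # batch size: n sequential rows stay together
batches = [df[i:i + n] for i in range(0, df.shape[0], n)]

# one random draw per batch, so all rows of a batch share its fate
msk = np.random.rand(len(batches)) < 0.7

set1 = pd.concat([b for b, keep in zip(batches, msk) if keep])
set2 = pd.concat([b for b, keep in zip(batches, msk) if not keep])
```

Note that `pd.concat` raises on an empty list, so with few batches one set can come up empty; guarding against that is left out of the sketch.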
Fluxy
  • You can probably use `Series.shift()` or `DataFrame.shift()`. See: https://stackoverflow.com/questions/52711358/apply-function-on-pairs-of-rows-in-pandas-dataframe, https://stackoverflow.com/questions/51443725/pandas-iterate-over-dataframe-row-pairs – AMC Feb 22 '20 at 19:17
  • @AMC: Thanks. I need a flexible solution that would allow changing a batch size. – Fluxy Feb 22 '20 at 19:21
  • Hmm, is there anything in the data itself which determines the number of batches? – AMC Feb 22 '20 at 19:22
  • @AMC: No, there is nothing that could determine this. – Fluxy Feb 22 '20 at 19:28
  • Ah that's too bad. – AMC Feb 22 '20 at 19:31
  • Hello, what do you mean by "the 2 sequential rows should always belong to the same set"? What is a sequential row? – Omar Aldakar Feb 22 '20 at 21:10

2 Answers


For more randomness you can use the NumPy function np.random.permutation. Here is an example:

batchsizes = np.asarray([0.7])
permutations = np.random.permutation(len(df))

batchsizes *= len(permutations)
slices = np.split(permutations, batchsizes.round().astype(int))
batches = [df.loc[s] for s in slices]

This has better randomness because it no longer depends on the initial order of your dataframe, and you can split into more than 2 parts. For example, batchsizes = np.asarray([0.3, 0.1, 0.3]) will slice in proportions of 30:10:30:30.
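As written, the permutation shuffles individual rows, so the batches-of-2 constraint from the question is lost. The same idea can be lifted to the batch level (a sketch, assuming a default RangeIndex and that len(df) is a multiple of the batch size; the sample data mirrors the question):

```python
import numpy as np
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    "col1": [1, 1, 5, 3, 1, 3, 1, 3],
    "col2": [0.5, 0.3, 1.4, 1.5, 0.9, 0.4, 1.2, 0.1],
    "col3": [10, 11, 1, 2, 10, 7, 9, 11],
})

n = 2                     # batch size
n_batches = len(df) // n  # assumes len(df) is a multiple of n
perm = np.random.permutation(n_batches)

split_at = round(0.7 * n_batches)  # 70:30 split at batch level
set1_batches, set2_batches = perm[:split_at], perm[split_at:]

# expand each batch index back into its n consecutive row positions
rows1 = np.concatenate([np.arange(b * n, b * n + n) for b in set1_batches])
rows2 = np.concatenate([np.arange(b * n, b * n + n) for b in set2_batches])
set1, set2 = df.iloc[rows1], df.iloc[rows2]
```

Unlike the row-level version, the set sizes here are exact multiples of the batch size, and the n rows of each batch always land in the same set.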

Omar Aldakar

Here's a function that picks a random starting row and takes a contiguous slice of the requested fraction (e.g. 30%) from there:

def split_data(df, batchsize):
    # random starting row for the contiguous slice
    x = np.random.randint(0, len(df))
    # number of rows in the slice
    idx = round(len(df) * batchsize)

    # shift back so we don't get out of the bounds of our index
    if x + idx > len(df):
        x = x - idx

    # assumes a default RangeIndex (labels 0..len(df)-1)
    batch1 = df.loc[np.arange(x, x + idx)]
    batch2 = df.loc[~df.index.isin(batch1.index)]

    return batch1, batch2

df1, df2 = split_data(df, 0.3)
print(df1, '\n')
print(df2)

   col1  col2  col3
4     1   0.9    10
5     3   0.4     7 

   col1  col2  col3
0     1   0.5    10
1     1   0.3    11
2     5   1.4     1
3     3   1.5     2
6     1   1.2     9
7     3   0.1    11
Erfan
  • Thanks. Which parameter defines the batch size? – Fluxy Feb 22 '20 at 20:04
  • Can you please explain your solution? Are the rows always sequential? (this is a mandatory requirement). I mean that it would be wrong if `df1` included row 3 and then row 5. – Fluxy Feb 22 '20 at 20:06
  • I added parameter `batchsize` and rows will always be sequential because of `np.arange` – Erfan Feb 22 '20 at 20:14
  • Ok, it's a bit unclear to me why `batchsize` is `0.3`? In my case, `batchsize` is integer, which represents the number of rows that should be sequential. For example, if `batchsize` is 10, then it is necessary to use 70% of batches of 10 for set1 and 30% of batches of 10 as set2. – Fluxy Feb 22 '20 at 20:19
  • Then I think I misunderstood you; I thought you wanted a batch of 30% of your data which is sequential. That's the way it looks from your example. – Erfan Feb 22 '20 at 20:21
  • Sorry. Please check my update, where I posted my current code. Probably it will give more clarity. – Fluxy Feb 22 '20 at 20:22