The problem: I need to divide several large dataframes (e.g., ~50k rows) into smaller chunks that each have the same number of rows. However, I don't want to manually set the chunk size for each dataset. Instead, I want code that:
- Examines the length of the dataframe and determines how many chunks of roughly a few thousand rows it can be broken into
- Minimizes the number of "leftover" rows that must be discarded (i.e., the remainder, as sketched below)
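Put arithmetically (my own framing, not from any answer): for n rows and chunk size s, the discarded leftover is the remainder n %% s, so I want the s that keeps it small. For instance:

n <- 20752   # total rows
s <- 5000    # candidate chunk size
n %/% s      # 4 full chunks
n %% s       # 752 leftover rows to discard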
The answers provided here are relevant: Split a vector into chunks in R
However, those answers still require setting the chunk size manually. I want the code to find the "optimal" chunk size, the one that minimizes the remainder.
Example (based on Harlan's answer at the link above):
df <- rnorm(20752)               # stand-in numeric vector; a real dataframe would be split by row
max <- 5000                      # target chunk size
x <- seq_along(df)
df <- split(df, ceiling(x/max))  # group indices 1..5000 -> 1, 5001..10000 -> 2, ...
str(df)
> List of 5
> $ 1: num [1:5000] -1.4 -0.496 -1.185 -2.071 -1.118 ...
> $ 2: num [1:5000] 0.522 1.607 -2.228 -2.044 0.997 ...
> $ 3: num [1:5000] 0.295 0.486 -1.085 0.515 0.96 ...
> $ 4: num [1:5000] 0.695 -0.58 -1.676 1.052 1.266 ...
> $ 5: num [1:752] -0.6468 0.1731 0.5788 -0.0584 0.8479 ...
If I had instead chosen a chunk size of 4100 rows, I would get 5 chunks with a remainder of only 252 rows. That is more desirable because fewer datapoints are discarded. As long as each chunk is at least a few thousand rows, I don't care exactly what size the chunks are.
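To make the goal concrete, here is a minimal sketch of the kind of search I have in mind; the best_chunk_size name and the 2000:10000 candidate range are my own placeholders for "a few thousand rows":

best_chunk_size <- function(n, candidates = 2000:10000) {
  remainders <- n %% candidates       # leftover rows for each candidate chunk size
  candidates[which.min(remainders)]   # smallest remainder wins (smallest size on ties)
}
best_chunk_size(20752)   # 2594, which happens to divide 20752 evenly: 8 chunks, 0 discarded

Is there an idiomatic way to do this search and then split the dataframe accordingly?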