
The problem: I need to divide several different large dataframes (e.g. 50k rows) into smaller chunks that each have the same number of rows. However, I don't want to have to manually set the size of the chunks for each dataset. Instead, I want code that:

  • Examines the length of the dataframe and determines how many chunks of roughly a few thousand rows the original dataframe can be broken into
  • Minimizes the number of "leftover" rows that must be discarded

The answers provided here are relevant: Split a vector into chunks in R

However, I don't want to have to manually set a chunk size. I want the code to find the "optimal" chunk size that will minimize the remainder.

Example (based on Harlan's answer at the above link):

df <- rnorm(20752)               # example data: 20752 values
max <- 5000                      # manually chosen chunk size
x <- seq_along(df)
df <- split(df, ceiling(x/max))
str(df)
> List of 5
> $ 1: num [1:5000] -1.4 -0.496 -1.185 -2.071 -1.118 ...
> $ 2: num [1:5000] 0.522 1.607 -2.228 -2.044 0.997 ...
> $ 3: num [1:5000] 0.295 0.486 -1.085 0.515 0.96 ...
> $ 4: num [1:5000] 0.695 -0.58 -1.676 1.052 1.266 ...
> $ 5: num [1:752] -0.6468 0.1731 0.5788 -0.0584 0.8479 ...

If I had chosen a chunk size of 4100 rows, I would have 5 chunks with a remainder of 252 rows. That's more desirable because I would discard fewer datapoints. As long as the chunks are a few thousand rows at least, I don't care exactly what size they are.
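For reference, the leftover row counts can be checked directly with R's integer-division and modulo operators (a quick check of the arithmetic above, not part of the original example):

20752 %/% 5000  # 4 full chunks
20752 %%  5000  # 752 rows left over
20752 %/% 4100  # 5 full chunks
20752 %%  4100  # 252 rows left over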

  • You need to decide on at least the maximum and minimum number of rows you consider "good" for a sub-data.frame. You can't say "roughly a few thousand" to an algorithm... – digEmAll Aug 27 '14 at 20:51
  • This problem isn't well-defined without some limits on the size of chunks or the number of chunks you end up with. For example, using a chunk size equal to the greatest prime factor of `length(df)` (not equal to `length(df)`) will give you zero leftover rows, but your chunk size might be small (which I assume is not desired). Alternatively, using a chunk size equal to `length(df)` gives zero leftover rows as well, but results in a very large chunk (also presumably undesired). – tkmckenzie Aug 27 '14 at 20:51
  • Indeed, I was ambiguous. Could we say a minimum of 4000 and a maximum of 10000 rows? – LCM Aug 27 '14 at 20:56

1 Answer


Here's a brute-force approach (but very fast):

# number of rows of your data.frame (from your example... )
nrows <- 20752

# acceptable range for sub-data.frame size
subSetSizes <- 4000:10000

remainders <- nrows %% subSetSizes
minIndexes <- which(remainders == min(remainders))
chunkSizesHavingMinRemainder <- subSetSizes[minIndexes]

# > chunkSizesHavingMinRemainder
# [1] 5188

# the remainder of 20752 / 5188 is indeed 0 (the only minimum)
# > nrows %% 5188
# [1] 0
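Once a size is chosen, it can be plugged back into the split() idiom from the question. Here is a minimal sketch (not part of the original answer; it reuses nrows and chunkSizesHavingMinRemainder from above, and simply drops any leftover rows):

# hypothetical follow-up: take the first size having minimal remainder
chunkSize <- chunkSizesHavingMinRemainder[1]

dat  <- rnorm(nrows)                          # stand-in for the real data
keep <- seq_len(nrows - nrows %% chunkSize)   # indexes with leftover rows dropped
chunks <- split(dat[keep], ceiling(seq_along(keep) / chunkSize))

# > length(chunks)
# [1] 4    (four chunks of 5188 rows each in this example)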
digEmAll