
The problem: I need to divide several different large dataframes (e.g. 50k rows) into smaller chunks that each have the same number of rows. However, I don't want to have to manually set the size of the chunks for each dataset. Instead, I want code that:

  • Examines the length of the dataframe and determines how many chunks of roughly a few thousand rows the original dataframe can be broken into
  • Minimizes the number of "leftover" rows that must be discarded

The answers provided here are relevant: Split a vector into chunks in R

However, I don't want to have to manually set a chunk size. I want the code to find the "optimal" chunk size that will minimize the remainder.

Example (based on Harlan's answer at the above link):

df <- rnorm(20752)               # example data: 20752 values
max <- 5000                      # manually chosen chunk size
x <- seq_along(df)
df <- split(df, ceiling(x/max))
str(df)
> List of 5
> $ 1: num [1:5000] -1.4 -0.496 -1.185 -2.071 -1.118 ...
> $ 2: num [1:5000] 0.522 1.607 -2.228 -2.044 0.997 ...
> $ 3: num [1:5000] 0.295 0.486 -1.085 0.515 0.96 ...
> $ 4: num [1:5000] 0.695 -0.58 -1.676 1.052 1.266 ...
> $ 5: num [1:752] -0.6468 0.1731 0.5788 -0.0584 0.8479 ...

If I had chosen a chunk size of 4100 rows, I would have 5 chunks with a remainder of 252 rows. That's more desirable because I would discard fewer datapoints. As long as the chunks are a few thousand rows at least, I don't care exactly what size they are.
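For reference, the leftover row counts can be checked directly with R's integer-division and modulo operators (a quick check of the arithmetic above, not part of the original example):

20752 %/% 5000  # 4 full chunks
20752 %%  5000  # 752 rows left over
20752 %/% 4100  # 5 full chunks
20752 %%  4100  # 252 rows left over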

  • You need to decide on at least the maximum and minimum number of rows you consider "good" for a sub-data.frame. You can't say "roughly a few thousand" to an algorithm... – digEmAll Aug 27 '14 at 20:51
  • This problem isn't well-defined without some limits on the size of chunks or the number of chunks you end up with. For example, using a chunk size equal to the greatest prime factor of `length(df)` (not equal to `length(df)`) will give you zero leftover rows, but your chunk size might be small (which I assume is not desired). Alternatively, using a chunk size equal to `length(df)` gives zero leftover rows as well, but results in a very large chunk (also presumably undesired). – tkmckenzie Aug 27 '14 at 20:51
  • Indeed, I was ambiguous. Could we say a minimum of 4000 and a maximum of 10000 rows? – LCM Aug 27 '14 at 20:56

1 Answer


Here's a brute-force approach (but very fast):

# number of rows of your data.frame (from your example... )
nrows <- 20752

# acceptable range for sub-data.frame size
subSetSizes <- 4000:10000

remainders <- nrows %% subSetSizes
minIndexes <- which(remainders == min(remainders))
chunkSizesHavingMinRemainder <- subSetSizes[minIndexes]

# > chunkSizesHavingMinRemainder
# [1] 5188

# the remainder of 20752 / 5188 is indeed 0 (the only minimum)
# > nrows %% 5188
# [1] 0
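Once a size is chosen, it can be plugged back into the split() idiom from the question. Here is a minimal sketch (not part of the original answer; it reuses nrows and chunkSizesHavingMinRemainder from above, and simply drops any leftover rows):

# hypothetical follow-up: take the first size having minimal remainder
chunkSize <- chunkSizesHavingMinRemainder[1]

dat  <- rnorm(nrows)                          # stand-in for the real data
keep <- seq_len(nrows - nrows %% chunkSize)   # indexes with leftover rows dropped
chunks <- split(dat[keep], ceiling(seq_along(keep) / chunkSize))

# > length(chunks)
# [1] 4    (four chunks of 5188 rows each in this example)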
digEmAll