
My question is extremely closely related to this one:

Split a vector into chunks in R

I'm trying to split a large vector into chunks of a known size, and it's slow. A solution for vectors that divide evenly is covered in the question linked above. A quick solution for when a factor exists is here:

Split dataframe into equal parts based on length of the dataframe

I would like to handle the case where no (large) factor exists, since I want fairly large chunks.

My example for a vector much smaller than the one in my real life application:

d <- 1:6510321
# Sloooow
chunks <- split(d, ceiling(seq_along(d)/2000))
kennyB
  • That takes me 5 seconds on my modest work machine here. How fast do you need when processing 6.5M cases? I'm serious - I understand this could be a pain if you're doing it a lot in a function. – thelatemail Jun 25 '15 at 23:33
  • Sorry, I didn't want my example to take too long. The actual application is 100 times bigger, so quickly would be great! Thanks – kennyB Jun 25 '15 at 23:35
  • Righty-o then, that changes things a bit. So we're talking a vector of length 600M? – thelatemail Jun 25 '15 at 23:46
  • Is the order important? Replacing the second argument with 1:(length(d)/2000) will (surprisingly) speed things up by almost 30x. But it will result in a different ordering, I believe. – Cliff AB Jun 25 '15 at 23:58
  • The ordering isn't all that important, no, this could work. – kennyB Jun 26 '15 at 00:08
  • @CliffAB - That code would recycle a vector of length 2000 over the length of `d` instead of needing to create a whole other vector of length `d`. – thelatemail Jun 26 '15 at 01:00
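
A minimal sketch of the recycling idea from the comments above, assuming the interleaved ordering is acceptable (the exact grouping expression is taken from the comment; the rest is illustrative):

# The short grouping vector is recycled over d, so elements are dealt out
# round-robin and chunks are no longer contiguous. split() also warns when
# length(d) is not an exact multiple of the number of groups.
d <- 1:6510321
chunks <- split(d, 1:(length(d)/2000))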

2 Answers


Using `llply` from the plyr package, I was able to reduce the time.

# Baseline: split() on a grouping vector built with ceiling()
chunks <- function(d, n) {
  chunks <- split(d, ceiling(seq_along(d) / n))
  names(chunks) <- NULL
  return(chunks)
}

require(plyr)

# Faster: precompute the chunk start positions, then let llply() slice d
plyrChunks <- function(d, n) {
  is <- seq(from = 1, to = length(d), by = ceiling(n))   # chunk start indices
  if (tail(is, 1) != length(d)) {
    is <- c(is, length(d))                               # make sure the last position is covered
  }
  chunks <- llply(head(seq_along(is), -1),
                  function(i) {
                    start <- is[i]
                    end   <- is[i + 1] - 1
                    d[start:end]
                  })
  # append the final element, which the slicing above stops just short of
  lc <- length(chunks)
  td <- tail(d, 1)
  chunks[[lc]] <- c(chunks[[lc]], td)
  return(chunks)
}

# testing
d <- 1:6510321
n <- 2000

system.time(chks <- chunks(d, n))
#    user  system elapsed 
#   5.472   0.000   5.472 

system.time(plyrChks <- plyrChunks(d, n))
#    user  system elapsed 
#   0.068   0.000   0.065 

identical(chks, plyrChks)
# TRUE

You can speed things up even more using the .parallel parameter of llply, or add a progress bar using the .progress parameter.
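
A minimal sketch of that, assuming the doParallel package as the backend that .parallel = TRUE needs (the variables here are illustrative, not part of the answer above):

library(plyr)
library(doParallel)

registerDoParallel(cores = 2)      # .parallel = TRUE needs a registered backend

d  <- 1:6510321
n  <- 2000
is <- seq(1, length(d), by = n)    # chunk start positions

chunks <- llply(seq_along(is),
                function(i) d[is[i]:min(is[i] + n - 1, length(d))],
                .parallel = TRUE)  # slice the chunks across the registered workers
# For a serial run with feedback, use .progress = "text" instead
# (plyr drops the progress bar when running in parallel).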

celacanto

A speed improvement from the parallel package:

chunks <- parallel::splitIndices(6510321, ncl = ceiling(6510321/2000))
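
Note that splitIndices returns a list of index vectors rather than the values themselves; a small follow-up step (a sketch, reusing d from the question) recovers the actual chunks:

d   <- 1:6510321
idx <- parallel::splitIndices(length(d), ncl = ceiling(length(d)/2000))
chunks <- lapply(idx, function(i) d[i])   # index into d to get the value chunks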
kennyB
  • Hi @kennyB. Cool function you found there, so +1 for that. I had a look at the code of `splitIndices` and one can see it makes use of the `split` & `cut` functions, like the solutions proposed [here](https://stackoverflow.com/questions/3318333/split-a-vector-into-chunks-in-r), which you also referenced. So I'm not sure where the speed improvement would come from. Did you see significant gains? – Valentin_Ștefan Aug 31 '17 at 12:18
  • @Valentine, I believe the second answer that uses cut wasn't there when I referenced that page. – kennyB Sep 05 '17 at 22:33
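
For reference, a minimal sketch of the cut-based grouping those comments refer to (my reconstruction of the approach in the linked question, not code from either poster):

d      <- 1:6510321
groups <- cut(seq_along(d), breaks = ceiling(length(d)/2000), labels = FALSE)
chunks <- split(d, groups)   # contiguous chunks of roughly 2000 elements each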