
My question is extremely closely related to this one:

Split a vector into chunks in R

I'm trying to split a large vector into chunks of a known size, and it's slow. A solution for vectors that divide evenly is covered in the question linked above. A quick solution for when a factor exists is here:

Split dataframe into equal parts based on length of the dataframe

I would like to handle the case where no (large) factor exists, since I want fairly large chunks.

My example for a vector much smaller than the one in my real life application:

d <- 1:6510321
# Sloooow
chunks <- split(d, ceiling(seq_along(d)/2000))
kennyB
  • That takes me 5 seconds on my modest work machine here. How fast do you need when processing 6.5M cases? I'm serious - I understand this could be a pain if you're doing it a lot in a function. – thelatemail Jun 25 '15 at 23:33
  • Sorry, I didn't want my example to take too long. The actual application is 100 times bigger, so quickly would be great! Thanks – kennyB Jun 25 '15 at 23:35
  • Righty-o then, that changes things a bit. So we're talking a vector of length 600M? – thelatemail Jun 25 '15 at 23:46
  • Is the order important? Replacing the second argument with 1:(length(d)/2000) will (surprisingly) speed things up by almost 30x. But it will result in a different ordering, I believe. – Cliff AB Jun 25 '15 at 23:58
  • The ordering isn't all that important, no, this could work. – kennyB Jun 26 '15 at 00:08
  • @CliffAB - That code would recycle a vector of length 2000 over the length of `d` instead of needing to create a whole other vector of length `d`. – thelatemail Jun 26 '15 at 01:00
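
A minimal sketch of the recycling idea from the comments above, assuming the interleaved ordering is acceptable (the exact grouping expression is taken from the comment; the rest is illustrative):

# The short grouping vector is recycled over d, so elements are dealt out
# round-robin and chunks are no longer contiguous. split() also warns when
# length(d) is not an exact multiple of the number of groups.
d <- 1:6510321
chunks <- split(d, 1:(length(d)/2000))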

2 Answers


Using `llply` from the plyr package, I was able to reduce the time.

# Baseline: split() on a grouping vector built with ceiling()
chunks <- function(d, n) {
  chunks <- split(d, ceiling(seq_along(d) / n))
  names(chunks) <- NULL
  return(chunks)
}

require(plyr)

# Faster: precompute the chunk start positions, then let llply() slice d
plyrChunks <- function(d, n) {
  is <- seq(from = 1, to = length(d), by = ceiling(n))   # chunk start indices
  if (tail(is, 1) != length(d)) {
    is <- c(is, length(d))                               # make sure the last position is covered
  }
  chunks <- llply(head(seq_along(is), -1),
                  function(i) {
                    start <- is[i]
                    end   <- is[i + 1] - 1
                    d[start:end]
                  })
  # append the final element, which the slicing above stops just short of
  lc <- length(chunks)
  td <- tail(d, 1)
  chunks[[lc]] <- c(chunks[[lc]], td)
  return(chunks)
}

# testing
d <- 1:6510321
n <- 2000

system.time(chks <- chunks(d, n))
#    user  system elapsed 
#   5.472   0.000   5.472 

system.time(plyrChks <- plyrChunks(d, n))
#    user  system elapsed 
#   0.068   0.000   0.065 

identical(chks, plyrChks)
# TRUE

You can speed things up even more using the .parallel parameter of llply, or add a progress bar using the .progress parameter.
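
A minimal sketch of that, assuming the doParallel package as the backend that .parallel = TRUE needs (the variables here are illustrative, not part of the answer above):

library(plyr)
library(doParallel)

registerDoParallel(cores = 2)      # .parallel = TRUE needs a registered backend

d  <- 1:6510321
n  <- 2000
is <- seq(1, length(d), by = n)    # chunk start positions

chunks <- llply(seq_along(is),
                function(i) d[is[i]:min(is[i] + n - 1, length(d))],
                .parallel = TRUE)  # slice the chunks across the registered workers
# For a serial run with feedback, use .progress = "text" instead
# (plyr drops the progress bar when running in parallel).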

celacanto

A speed improvement from the parallel package:

chunks <- parallel::splitIndices(6510321, ncl = ceiling(6510321/2000))
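
Note that splitIndices returns a list of index vectors rather than the values themselves; a small follow-up step (a sketch, reusing d from the question) recovers the actual chunks:

d   <- 1:6510321
idx <- parallel::splitIndices(length(d), ncl = ceiling(length(d)/2000))
chunks <- lapply(idx, function(i) d[i])   # index into d to get the value chunks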
kennyB
  • Hi @kennyB. Cool function you found there, so +1 for that. I had a look at the code of `splitIndices` and one can see it makes use of the `split` & `cut` functions, like the solutions proposed [here](https://stackoverflow.com/questions/3318333/split-a-vector-into-chunks-in-r), which you also referenced. So I'm not sure where the speed improvement would come from. Did you see significant gains? – Valentin_Ștefan Aug 31 '17 at 12:18
  • @Valentine, I believe the second answer that uses cut wasn't there when I referenced that page. – kennyB Sep 05 '17 at 22:33
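
For reference, a minimal sketch of the cut-based grouping those comments refer to (my reconstruction of the approach in the linked question, not code from either poster):

d      <- 1:6510321
groups <- cut(seq_along(d), breaks = ceiling(length(d)/2000), labels = FALSE)
chunks <- split(d, groups)   # contiguous chunks of roughly 2000 elements each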