Split a vector into chunks

Question

I have to split a vector into n chunks of equal size in R. I couldn't find any base function to do that, and Google didn't get me anywhere either. Here is what I came up with so far:

x <- 1:10
n <- 3
chunk <- function(x,n) split(x, factor(sort(rank(x)%%n)))
chunk(x,n)
$`0`
[1] 1 2 3

$`1`
[1] 4 5 6 7

$`2`
[1]  8  9 10

Yes, it's very unclear that what you get is the solution to "n chunks of equal size". But maybe this gets you there too: x <- 1:10; n <- 3; split(x, cut(x, n, labels = FALSE)) — mdsumner, Jul 23 '10 at 14:08
both the solution in the question, and the solution in the preceding comment are incorrect, in that they might not work, if the vector has repeated entries. Try this: > foo <- c(rep(1, 12), rep(2,3), rep(3,3)) [1] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 > chunk(foo, 2) (gives wrong result) > chunk(foo, 3) (also wrong) — mathheadinclouds, Apr 29 '13 at 09:21
(continuing preceding comment) why? rank(x) doesn't need to be an integer > rank(c(1,1,2,3)) [1] 1.5 1.5 3.0 4.0 so that's why the method in the question fails. this one works (thanks to Harlan below) > chunk2 <- function(x,n) split(x, cut(seq_along(x), n, labels = FALSE)) — mathheadinclouds, Apr 29 '13 at 09:33
As @mathheadinclouds suggests, the example data is a very special case. Examples that are more general would be more useful and better tests. E.g. `x <- c(NA, 4, 3, NA, NA, 2, 1, 1, NA ); y <- letters[x]; z <- factor(y)` gives examples with missing data, repeated values, that are not already sorted, and are in different classes (integer, character, factor). — Kalin, Feb 21 '18 at 17:39
Additionally, the distribution of values in the original is even, so a more general example would load up on values that might be put in one "bin" in case of solutions that rely on distribution cuts. How about this? (fixes the length to be 10 again) `x <- c(NA, 4, 2, NA, NA, 1, 1, 1, 3, NA ); y <- letters[x]; z <- factor(y)` — Kalin, Feb 21 '18 at 17:44
Why make it so complicated with the `factor(sort(rank()))`? This generates unequal chunks if values in the vector are repeated. Why not just `split(x, factor(1:length(x)%%n))`? — sharchaea, Nov 06 '20 at 09:10

score 391 · Answer 1 · edited Dec 23 '13 at 18:41

A one-liner splitting d into chunks of size 20:

split(d, ceiling(seq_along(d)/20))

More details: I think all you need is seq_along(), split() and ceiling():

> d <- rpois(73,5)
> d
 [1]  3  1 11  4  1  2  3  2  4 10 10  2  7  4  6  6  2  1  1  2  3  8  3 10  7  4
[27]  3  4  4  1  1  7  2  4  6  0  5  7  4  6  8  4  7 12  4  6  8  4  2  7  6  5
[53]  4  5  4  5  5  8  7  7  7  6  2  4  3  3  8 11  6  6  1  8  4
> max <- 20
> x <- seq_along(d)
> d1 <- split(d, ceiling(x/max))
> d1
$`1`
 [1]  3  1 11  4  1  2  3  2  4 10 10  2  7  4  6  6  2  1  1  2

$`2`
 [1]  3  8  3 10  7  4  3  4  4  1  1  7  2  4  6  0  5  7  4  6

$`3`
 [1]  8  4  7 12  4  6  8  4  2  7  6  5  4  5  4  5  5  8  7  7

$`4`
 [1]  7  6  2  4  3  3  8 11  6  6  1  8  4
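
The one-liner above fixes the chunk size, so the number of chunks depends on the length of the vector. A minimal sketch of the reverse case, fixing the number of chunks instead, assuming that chunk sizes differing by at most one element are acceptable. `chunk_into()` is a hypothetical helper name, not a base R function:

```r
# Hypothetical helper: split x into n consecutive chunks whose sizes
# differ by at most one element.
chunk_into <- function(x, n) {
  # seq_along(x) * n / length(x) rises from n / length(x) to exactly n,
  # so ceiling() yields group ids between 1 and n.
  split(x, ceiling(seq_along(x) * n / length(x)))
}

d <- 1:73
lengths(chunk_into(d, 4))  # sizes 18 18 18 19
```

If n is larger than length(x), some group ids never occur and fewer than n chunks come back.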

edited Dec 23 '13 at 18:41

dfrankow

answered Jul 23 '10 at 19:22

Harlan

The question asks for `n` chunks of equal size. This gets you an unknown number of chunks of size `n`. I had the same problem and used the solutions from @mathheadinclouds. – rrs Apr 21 '14 at 18:26
As one can see from the output of d1, this answer does not split d into groups of equal size (4 is obviously shorter). Thus it does not answer the question. – Calimo Jan 23 '15 at 16:39
@rrs : split(d, ceiling(seq_along(d)/(length(d)/n))) – gkcn Jun 05 '15 at 11:45
I know this is quite old but it may be of help to those who stumble here. Although the OP's question was to split into chunks of equal size, if the vector's length happens not to be a multiple of the divisor, the last chunk will have a different size than the others. To split into `n-chunks` I used `max <- length(d)%/%n`. I used this with a vector of 31 strings and obtained a list of 3 vectors of 10 sentences and one of 1 sentence. – salvu Feb 04 '17 at 12:59
@Harlan Is there a way to shuffle the split as well? your solution worked well for me but I would like to make sure the splits are randomly assigned and not just consecutive – Spooked Oct 21 '20 at 23:22
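
On the shuffling question in the last comment: one possible approach, offered as a sketch rather than as the answer author's method, is to permute the vector first and then apply the same grouping. `shuffled_chunks()` is a hypothetical name:

```r
# Hypothetical helper: randomly assign elements to chunks of a given size.
shuffled_chunks <- function(x, size) {
  # Permute via indices; sample(x) would misbehave for a length-one numeric x.
  shuffled <- x[sample.int(length(x))]
  split(shuffled, ceiling(seq_along(shuffled) / size))
}

set.seed(1)  # only to make the example reproducible
d <- rpois(73, 5)
parts <- shuffled_chunks(d, 20)
lengths(parts)  # chunk sizes are unchanged: 20 20 20 13
```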

score 113 · Answer 2 · edited Apr 29 '13 at 10:10

chunk2 <- function(x,n) split(x, cut(seq_along(x), n, labels = FALSE))
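
A quick check of why grouping on positions (seq_along) is robust to repeated values while grouping on values (rank) is not. This is an illustrative sketch that reuses the test vector from the comments on the question and repeats both functions so it runs on its own:

```r
# Test vector with repeated values, taken from the comments on the question.
foo <- c(rep(1, 12), rep(2, 3), rep(3, 3))

# Rank-based grouping from the question: ties produce fractional ranks.
chunk <- function(x, n) split(x, factor(sort(rank(x) %% n)))

# Position-based grouping from this answer.
chunk2 <- function(x, n) split(x, cut(seq_along(x), n, labels = FALSE))

lengths(chunk(foo, 3))   # only two groups, of sizes 12 and 6
lengths(chunk2(foo, 3))  # three groups of 6 elements each
```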

edited Apr 29 '13 at 10:10

Dis Shishkov

answered Apr 29 '13 at 09:37

mathheadinclouds

This is the fastest way I've tried so far! Setting `labels = FALSE` makes it twice as fast, and using `cut()` is 4 times faster than using `ceiling(seq_along(x) / n)` on my data. – Drumy Oct 21 '20 at 06:25
Correction: this is the fastest among the `split()` approaches. @verbamour's answer below is the fastest overall. It is blazing fast because it doesn't have to work with factors, nor does it need to sort. That answer deserves a lot more upvotes. – Drumy Oct 21 '20 at 07:05

score 54 · Answer 3 · edited Sep 07 '21 at 11:31

A simplified version:

n = 3
split(x, sort(x%%n))

NB: This will only work on numeric vectors.

edited Sep 07 '21 at 11:31

andschar

answered Apr 20 '16 at 21:03

zhan2383

I like this as it gives you chunks that are as equally sized as possible (good for dividing up large task e.g. to accommodate limited RAM or to run a task across multiple threads). – alexvpickering Jul 21 '16 at 22:13
This is useful, but keep in mind this will only work on numeric vectors. – Keith Hughitt Aug 24 '16 at 17:49
@KeithHughitt this can be solved with factors and returning the levels as numeric. Or at least this is how I implemented it. – drmariod Apr 05 '18 at 07:02
@drmariod can also be extended by doing `split(x, sort(1:length(x) %% n))` – Richard DiSalvo Sep 14 '20 at 19:28
@RichardDiSalvo is there a faster way to implement this on objects with very high nrow? – Jessica Burnett Dec 13 '21 at 16:14
@JessicaBurnett I think `split()` is the slowest part of this code (because it calls `as.factor`). So maybe consider using a data.frame and do something like `data$group <- sort(seq_len(nrow(data)) %% n)`, then use the group column in the rest of your code. – Richard DiSalvo Dec 14 '21 at 19:40

score 26 · Answer 4 · answered Nov 01 '18 at 04:47

Using base R's rep_len:

x <- 1:10
n <- 3

split(x, rep_len(1:n, length(x)))
# $`1`
# [1]  1  4  7 10
# 
# $`2`
# [1] 2 5 8
# 
# $`3`
# [1] 3 6 9

And as already mentioned if you want sorted indices, simply:

split(x, sort(rep_len(1:n, length(x))))
# $`1`
# [1] 1 2 3 4
# 
# $`2`
# [1] 5 6 7
# 
# $`3`
# [1]  8  9 10

score 23 · Answer 5 · edited Jan 12 '17 at 02:01

Try the ggplot2 function, cut_number:

library(ggplot2)
x <- 1:10
n <- 3
cut_number(x, n) # labels = FALSE if you just want an integer result
#>  [1] [1,4]  [1,4]  [1,4]  [1,4]  (4,7]  (4,7]  (4,7]  (7,10] (7,10] (7,10]
#> Levels: [1,4] (4,7] (7,10]

# if you want it split into a list:
split(x, cut_number(x, n))
#> $`[1,4]`
#> [1] 1 2 3 4
#> 
#> $`(4,7]`
#> [1] 5 6 7
#> 
#> $`(7,10]`
#> [1]  8  9 10

edited Jan 12 '17 at 02:01

Sam Firke

answered Jan 09 '15 at 13:41

Scott Worland

This does not work for splitting up the `x`, `y`, or `z` defined in [this comment](https://stackoverflow.com/questions/3318333/split-a-vector-into-chunks-in-r#comment84830680_3318333). In particular, it sorts the results, which may or may not be okay, depending on the application. – Kalin Feb 21 '18 at 17:42
Rather, [this comment](https://stackoverflow.com/questions/3318333/split-a-vector-into-chunks-in-r#comment84830878_3318333). – Kalin Feb 21 '18 at 17:48

score 20 · Answer 6 · answered Dec 23 '14 at 18:26

If you don't like split() and you don't like matrix() (with its dangling NAs), there's this:

chunk <- function(x, n) {
  starts <- seq.int(from = 1, to = length(x), by = n)  # first index of each chunk
  ends <- pmin(starts + (n - 1), length(x))            # last index, capped at the end of x
  mapply(function(a, b) x[a:b], starts, ends, SIMPLIFY = FALSE)
}

Like split(), it returns a list, but it doesn't waste time or space with labels, so it may be more performant.

answered Dec 23 '14 at 18:26

verbamour

This is blazing fast! – Drumy Oct 21 '20 at 07:03
This also does chunks of size n rather than n chunks. – nelliott Dec 08 '21 at 00:41
Just what I needed to prevent an "out of memory" error. Thanks! – Jeff Feb 28 '23 at 16:25
Great. Is there a way to quickly set this up to make each group randomized each time? – theforestecologist Aug 28 '23 at 18:46

score 19 · Answer 7 · edited Sep 29 '20 at 16:13

This will split it differently to what you have, but is still quite a nice list structure I think:

chunk.2 <- function(x, n, force.number.of.groups = TRUE, len = length(x), groups = trunc(len/n), overflow = len%%n) { 
  if(force.number.of.groups) {
    f1 <- as.character(sort(rep(1:n, groups)))
    f <- as.character(c(f1, rep(n, overflow)))
  } else {
    f1 <- as.character(sort(rep(1:groups, n)))
    f <- as.character(c(f1, rep("overflow", overflow)))
  }
  
  g <- split(x, f)
  
  if(force.number.of.groups) {
    g.names <- names(g)
    g.names.ordered <- as.character(sort(as.numeric(g.names)))
  } else {
    g.names <- names(g[-length(g)])
    g.names.ordered <- as.character(sort(as.numeric(g.names)))
    g.names.ordered <- c(g.names.ordered, "overflow")
  }
  
  return(g[g.names.ordered])
}

Which will give you the following, depending on how you want it formatted:

> x <- 1:10; n <- 3
> chunk.2(x, n, force.number.of.groups = FALSE)
$`1`
[1] 1 2 3

$`2`
[1] 4 5 6

$`3`
[1] 7 8 9

$overflow
[1] 10

> chunk.2(x, n, force.number.of.groups = TRUE)
$`1`
[1] 1 2 3

$`2`
[1] 4 5 6

$`3`
[1]  7  8  9 10

Running a couple of timings using these settings:

set.seed(42)
x <- rnorm(1e7)
n <- 3

Then we have the following results:

> system.time(chunk(x, n)) # your function 
   user  system elapsed 
 29.500   0.620  30.125 

> system.time(chunk.2(x, n, force.number.of.groups = TRUE))
   user  system elapsed 
  5.360   0.300   5.663

Note: Changing as.factor() to as.character() made my function twice as fast.

score 13 · Answer 8 · answered Jul 23 '10 at 14:38

A few more variants to the pile...

> x <- 1:10
> n <- 3

Note that you don't need to use the factor function here, but you still want to sort; otherwise your first vector would be 1 2 3 10:

> chunk <- function(x, n) split(x, sort(rank(x) %% n))
> chunk(x,n)
$`0`
[1] 1 2 3
$`1`
[1] 4 5 6 7
$`2`
[1]  8  9 10

Or you can assign character indices instead of the numbers in backticks above:

> my.chunk <- function(x, n) split(x, sort(rep(letters[1:n], each=n, len=length(x))))
> my.chunk(x, n)
$a
[1] 1 2 3 4
$b
[1] 5 6 7
$c
[1]  8  9 10

Or you can use plain-word names stored in a vector. Note that using sort to get consecutive values in x alphabetizes the labels:

> my.other.chunk <- function(x, n) split(x, sort(rep(c("tom", "dick", "harry"), each=n, len=length(x))))
> my.other.chunk(x, n)
$dick
[1] 1 2 3
$harry
[1] 4 5 6
$tom
[1]  7  8  9 10

score 10 · Answer 9 · answered Jul 23 '10 at 14:22

You could combine the split/cut, as suggested by mdsumner, with quantile to create even groups:

split(x,cut(x,quantile(x,(0:n)/n), include.lowest=TRUE, labels=FALSE))

This gives the same result for your example, but not for skewed variables.

answered Jul 23 '10 at 14:22

SiggyF

score 10 · Answer 10 · edited Aug 02 '22 at 12:07

Yet another possibility is the splitIndices function from package parallel:

library(parallel)
splitIndices(20, 3)

Gives:

[[1]]
[1] 1 2 3 4 5 6 7

[[2]]
[1]  8  9 10 11 12 13

[[3]]
[1] 14 15 16 17 18 19 20

NB: splitIndices() takes a length and returns index vectors, so it does not split the values themselves. If you want to split a character vector, you need to index it with the result: lapply(splitIndices(20, 3), \(x) letters[1:20][x])
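
The indexing step can be wrapped up so that any vector type is handled the same way. A small sketch, where `split_by_indices()` is a hypothetical wrapper name and only `splitIndices()` itself comes from the parallel package:

```r
library(parallel)

# Hypothetical wrapper: split any vector into ncl consecutive chunks
# by indexing it with the positions returned by splitIndices().
split_by_indices <- function(x, ncl) {
  lapply(splitIndices(length(x), ncl), function(idx) x[idx])
}

parts <- split_by_indices(letters[1:20], 3)
lengths(parts)
```

The result is an unnamed list, and concatenating its elements gives back the original vector in its original order.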

edited Aug 02 '22 at 12:07

answered Sep 10 '18 at 21:31

Matifou

Only works with numeric values – Julien Jul 31 '22 at 15:48

score 7 · Answer 11 · edited Sep 14 '13 at 23:08

Here's another variant.

NOTE: with this sample you're specifying the CHUNK SIZE in the second parameter

all chunks are uniform, except for the last;
the last will at worst be smaller, never bigger than the chunk size.

chunk <- function(x,n)
{
    f <- sort(rep(1:(trunc(length(x)/n)+1),n))[1:length(x)]
    return(split(x,f))
}

#Test
n<-c(1,2,3,4,5,6,7,8,9,10,11)

c<-chunk(n,5)

q<-lapply(c, function(r) cat(r,sep=",",collapse="|") )
#output
1,2,3,4,5,|6,7,8,9,10,|11,|

score 7 · Answer 12 · edited Jul 23 '10 at 18:10

split(x,matrix(1:n,n,length(x))[1:length(x)])

Perhaps this is clearer, but it is the same idea:
split(x,rep(1:n, ceiling(length(x)/n),length.out = length(x)))

If you want it ordered, wrap the grouping vector in sort().

edited Jul 23 '10 at 18:10

answered Jul 23 '10 at 16:30

frankc

score 6 · Answer 13 · answered Jun 23 '13 at 07:41

I needed the same function and have read the previous solutions. However, I also needed the unbalanced chunk to be at the end, i.e. if I have 10 elements to split into vectors of 3 each, then my result should have vectors with 3, 3, 4 elements respectively. So I used the following (I left the code unoptimised for readability; otherwise there is no need for so many variables):

chunk <- function(x,n){
  numOfVectors <- floor(length(x)/n)
  elementsPerVector <- c(rep(n,numOfVectors-1),n+length(x) %% n)
  elemDistPerVector <- rep(1:numOfVectors,elementsPerVector)
  split(x,factor(elemDistPerVector))
}
set.seed(1)
x <- rnorm(10)
n <- 3
chunk(x,n)
$`1`
[1] -0.6264538  0.1836433 -0.8356286

$`2`
[1]  1.5952808  0.3295078 -0.8204684

$`3`
[1]  0.4874291  0.7383247  0.5757814 -0.3053884

score 5 · Answer 14 · answered Feb 08 '18 at 14:30

A simple function for splitting a vector using only indexes; no need to overcomplicate this:

vsplit <- function(v, n) {
    l = length(v)
    r = l/n
    return(lapply(1:n, function(i) {
        s = max(1, round(r*(i-1))+1)
        e = min(l, round(r*i))
        return(v[s:e])
    }))
}
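
A short usage sketch for the function above, repeated here so the snippet runs on its own. It assumes n does not exceed length(v); with more chunks than elements the start index can pass the end index and v[s:e] would count backwards.

```r
vsplit <- function(v, n) {
  l <- length(v)
  r <- l / n  # ideal (fractional) chunk length
  lapply(1:n, function(i) {
    s <- max(1, round(r * (i - 1)) + 1)  # start just after the previous chunk's end
    e <- min(l, round(r * i))            # end at the rounded cumulative length
    v[s:e]
  })
}

parts <- vsplit(1:10, 3)
lengths(parts)  # sizes 3 4 3
```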

score 3 · Answer 15 · answered Aug 21 '18 at 13:29

Sorry if this answer comes so late, but maybe it can be useful for someone else. Actually there is a very useful solution to this problem, explained at the end of ?split.

> testVector <- c(1:10) #I want to divide it into 5 parts
> VectorList <- split(testVector, 1:5)
> VectorList
$`1`
[1] 1 6

$`2`
[1] 2 7

$`3`
[1] 3 8

$`4`
[1] 4 9

$`5`
[1]  5 10

answered Aug 21 '18 at 13:29

Laura Paladini

this will break if there are unequal number of values in each group! – Matifou Sep 10 '18 at 21:31

score 2 · Answer 16 · edited May 23 '17 at 12:02

Credit to @Sebastian for this function

chunk <- function(x,y){
         split(x, factor(sort(rank(row.names(x))%%y)))
         }

edited May 23 '17 at 12:02

Community

answered Dec 05 '14 at 15:24

score 2 · Answer 17 · answered Dec 23 '14 at 17:42

If you don't like split() and you don't mind NAs padding out your short tail:

chunk <- function(x, n) { if((length(x)%%n)==0) {return(matrix(x, nrow=n))} else {return(matrix(append(x, rep(NA, n-(length(x)%%n))), nrow=n))} }

The columns of the returned matrix ([,1:ncol]) are the droids you are looking for.

score 2 · Answer 18 · answered Mar 26 '17 at 21:24

I needed a function that takes the name of a data.table (as a quoted string) and a second argument giving the upper limit on the number of rows in each subset of that data.table. It produces as many data.tables as that upper limit requires:

library(data.table)
split_dt <- function(x, y) {
    tbl <- get(x)  # look up the data.table by its name
    for (i in seq(from = 1, to = nrow(tbl), by = y)) {
        # rows i to i + y - 1, capped at the last row, so that no chunk
        # exceeds y rows, overlaps the next one, or runs past the end
        rows <- i:min(i + y - 1, nrow(tbl))
        assign(paste0("df_", i), tbl[rows], envir = globalenv())
    }
}

This function gives me a series of data.tables named df_[number], with the starting row from the original data.table in the name. The last data.table can be shorter than the others. This type of function is useful because certain GIS software has limits on how many address pins you can import, for example. So slicing up data.tables into smaller chunks may not be recommended, but it may not be avoidable.

score 1 · Answer 19 · edited Sep 29 '20 at 16:14

I have come up with this solution:

library(magrittr)
create.chunks <- function(x, elements.per.chunk){
    # plain R version
    # split(x, rep(seq_along(x), each = elements.per.chunk)[seq_along(x)])
    # magrittr version - because that's what people use now
    # (extract is namespaced so that another attached package's extract(), e.g. tidyr's, cannot mask it)
    x %>% seq_along %>% rep(., each = elements.per.chunk) %>% magrittr::extract(seq_along(x)) %>% split(x, .)
}
create.chunks(letters[1:10], 3)
$`1`
[1] "a" "b" "c"

$`2`
[1] "d" "e" "f"

$`3`
[1] "g" "h" "i"

$`4`
[1] "j"

The key is to use the `each` argument of rep() to make it work. Using seq_along acts like rank(x) in my previous solution, but is actually able to produce the correct result with duplicated entries.

edited Sep 29 '20 at 16:14

M--

answered Sep 19 '18 at 11:08

Sebastian

For those concerned that rep(seq_along(x), each = elements.per.chunk) might be too straining on the memory: yes, it is. You could try a modified version of my previous suggestion: chunk <- function(x,n) split(x, factor(seq_along(x)%%n)) – Sebastian Sep 19 '18 at 11:13
For me, it produces the following error: `no applicable method for 'extract_' applied to an object of class "c('integer', 'numeric')` – sharchaea Nov 06 '20 at 09:02

score 0 · Answer 20 · answered Oct 05 '21 at 10:38

Here's yet another one, allowing you to control if you want the result ordered or not:

split_to_chunks <- function(x, n, keep.order=TRUE){
  if(keep.order){
    return(split(x, sort(rep(1:n, length.out = length(x)))))
  }else{
    return(split(x, rep(1:n, length.out = length(x))))
  }
}

split_to_chunks(x = 1:11, n = 3)
$`1`
[1] 1 2 3 4

$`2`
[1] 5 6 7 8

$`3`
[1]  9 10 11

split_to_chunks(x = 1:11, n = 3, keep.order=FALSE)

$`1`
[1]  1  4  7 10

$`2`
[1]  2  5  8 11

$`3`
[1] 3 6 9

score 0 · Answer 21 · answered Oct 25 '22 at 23:08

Not sure if this answers OP's question, but I think the %% can be useful here

df # some data.frame
N_CHUNKS <- 10
I_VEC <- 1:nrow(df)
df_split <- split(df, sort(I_VEC %% N_CHUNKS))

answered Oct 25 '22 at 23:08

cmilando

score -1 · Answer 22 · answered Jul 31 '19 at 08:27

This splits into chunks of size ⌊n/k⌋+1 or ⌊n/k⌋ and does not use the O(n log n) sort.

get_chunk_id<-function(n, k){
    r <- n %% k
    s <- n %/% k
    i<-seq_len(n)
    1 + ifelse (i <= r * (s+1), (i-1) %/% (s+1), r + ((i - r * (s+1)-1) %/% s))
}

split(1:10, get_chunk_id(10,3))
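
To make the arithmetic concrete: for n = 10 and k = 3 we get r = 1 and s = 3, so the first r = 1 chunk receives s + 1 = 4 elements and the remaining two receive 3 each. The snippet below repeats the function from this answer, with comments added, so that it runs on its own and shows the resulting ids:

```r
get_chunk_id <- function(n, k) {
  r <- n %% k    # number of chunks that get one extra element
  s <- n %/% k   # base chunk size
  i <- seq_len(n)
  # the first r * (s + 1) positions fill chunks of size s + 1, the rest chunks of size s
  1 + ifelse(i <= r * (s + 1), (i - 1) %/% (s + 1), r + ((i - r * (s + 1) - 1) %/% s))
}

ids <- get_chunk_id(10, 3)
ids                        # 1 1 1 1 2 2 2 3 3 3
lengths(split(1:10, ids))  # sizes 4 3 3
```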
