
I have seen many variations on the "split vector X into Y chunks in R" question on here (see here and here for just two examples). So when I realized I needed to split a vector into Y chunks of random size, I was surprised to find that the randomness requirement might be "new": I couldn't find an existing way to do this on here.

So, here's what I've drawn up:

library(magrittr) # needed for the %>% pipe used below

k.chunks = function(seq.size, n.chunks) {
  break.pts = sample(1:seq.size, n.chunks, replace = FALSE) %>% sort() #Get a set of break points chosen from along the length of the vector, without replacement so there are no duplicate selections.
  groups = rep(NA, seq.size) #Set up the empty output vector.
  groups[1:break.pts[1]] = 1 #Set the first group's affiliations manually because it has a unique start point of 1.

  for (i in 2:(n.chunks)) { #For all other chunks...
    groups[break.pts[i-1]:break.pts[i]] = i #Set the respective group affiliations.
  }
  groups[break.pts[n.chunks]:seq.size] = n.chunks #Set the last group's affiliation manually because it has a unique end point of seq.size.
  return(groups)
}
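
For example, a quick illustrative call (added as a sketch; the exact split varies with the seed):

set.seed(42) # any seed; just for a reproducible illustration
groups = k.chunks(seq.size = 20, n.chunks = 4)
table(groups)       # how many elements landed in each chunk
split(1:20, groups) # the chunks themselves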

My question is: Is this inelegant or inefficient somehow? It will get called thousands of times in the code I plan to write, so efficiency is important to me. It'd be especially nice to avoid the for loop and to avoid having to set the first and last groups "manually." My other question: Are there logical inputs that could break this? I recognize that n.chunks cannot exceed seq.size, so I mean other than that.

Bajcz
  • What are the typical sizes that you need to handle here? It seems that, for a not-too-large `n.chunks`, your code should be fast enough. – F. Privé Aug 31 '21 at 19:58
  • Also, there are two things odd in your code: you're reassigning the last element of each group to the next group (`break.pts[i-1]:break.pts[i]`), and the last group gets the same assignment as the one before (see the sketch after these comments). – F. Privé Aug 31 '21 at 20:08
  • For random Y, wouldn't you be in `sort(sample(1:length(vector), sample(1:length(vector), 1, replace = FALSE), replace = FALSE))` territory (i.e., drawing the chunk count from the vector again), unless there is a minimum chunk size, at which point you could toss in a `seq`? – Chris Aug 31 '21 at 21:30
  • @Chris I think the `sort` will kill your execution time. The simplest code I can think of is `sort(sample(1:n.chunks, seq.size, replace = TRUE))`. But this ends up being rather slow (relatively), actually. – Adam Aug 31 '21 at 21:38
  • @Adam, I take your point and admire your code below. My `sort` wrap was just following the OP above; which input is random, if not both, was more my question, as randomizing starts/ends for ranges seems a pretty tricky problem. – Chris Aug 31 '21 at 21:46
  • @F.Privé I would say n.chunks is going to be less than 100, and seq.size will be between 100 and 10000. Also, please go ahead and edit my question if my code is doing the indexing wrong--I wouldn't be surprised if it is! – Bajcz Sep 01 '21 at 12:54
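
A quick sketch (added for illustration) of the boundary overlap F. Privé describes:

set.seed(2) # any seed shows the same overlap
k.chunks(10, 3)
# Chunk i starts at break.pts[i-1], the same index where chunk i-1 ends,
# so each boundary element's label gets overwritten by the next chunk.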

2 Answers


That should be pretty quick for smaller numbers. But here is a more concise way.

k.chunks2 = function(seq.size, n.chunks) {
  # Draw n.chunks - 1 distinct break points, then take the gaps between them
  # (padded with 0 and seq.size) as the n.chunks run lengths.
  break.pts <- sort(sample(1:seq.size, n.chunks - 1, replace = FALSE))
  break.len <- diff(c(0, break.pts, seq.size))

  # Expand each group label by its run length.
  groups <- rep(1:n.chunks, times = break.len)
  return(groups)
}
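
A quick sanity check (illustrative only; the sizes vary with the seed):

set.seed(1)
g2 <- k.chunks2(seq.size = 1000, n.chunks = 10)
length(g2) # 1000: exactly one group label per element
table(g2)  # chunk sizes summing to 1000 (the last chunk comes out empty if a break point equals seq.size)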

If you really get a huge number of groups, I think the sort will start to cost you execution time. So you can do something like this (probably can be tweaked to be even faster) to split based on proportions. I am not sure how I feel about this, because as n.chunks gets very large, the proportions will get very small. But it is faster.

k.chunks3 = function(seq.size, n.chunks) {
  # Draw random proportions and normalize them to sum to 1.
  props <- runif(n.chunks)
  grp.props <- props / sum(props)

  # Round the first n.chunks - 1 sizes down; the last chunk takes whatever
  # remains, so the lengths always sum to seq.size exactly.
  chunk.size <- floor(grp.props[-n.chunks] * seq.size)
  break.len <- c(chunk.size, seq.size - sum(chunk.size))

  groups <- rep(1:n.chunks, times = break.len)
  return(groups)
}
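
The same sanity check applies here (again just illustrative):

g3 <- k.chunks3(seq.size = 1000, n.chunks = 10)
length(g3) # always 1000: the last chunk absorbs the rounding remainder
table(g3)  # the proportion-based chunk sizes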

Running a benchmark, I think any of these will be fast enough (unit is microseconds).

n <- 1000
y <- 10

microbenchmark::microbenchmark(k.chunks(n, y),
                               k.chunks2(n, y),
                               k.chunks3(n, y))

Unit: microseconds
            expr  min    lq   mean median    uq   max neval
  k.chunks(n, y) 49.9 52.05 59.613  53.45 58.35 251.7   100
 k.chunks2(n, y) 46.1 47.75 51.617  49.25 52.55 107.1   100
 k.chunks3(n, y)  8.1  9.35 11.412  10.80 11.75  44.2   100

But as the numbers get larger, you will notice a meaningful speedup (note the unit is now milliseconds).

n <- 1000000
y <- 100000

microbenchmark::microbenchmark(k.chunks(n, y),
                               k.chunks2(n, y),
                               k.chunks3(n, y))

Unit: milliseconds
            expr     min       lq     mean   median       uq      max neval
  k.chunks(n, y) 46.9910 51.38385 57.83917 54.54310 56.59285 113.5038   100
 k.chunks2(n, y) 17.2184 19.45505 22.72060 20.74595 22.73510  69.5639   100
 k.chunks3(n, y)  7.7354  8.62715 10.32754  9.07045 10.44675  58.2093   100

All said and done, I would probably use my k.chunks2() function.

  • Nice answer! I didn't know about `diff()`, or at least its potential usage in this context. I have three questions. First, the outputs I get for your two approaches are different in structure--the latter always seems to produce more normally distributed group lengths than the former (the former is what I prefer because chaos is good in my context). Second, is there a way to do this using `rle()` somehow? I puzzled over that a while. Third, did you avoid indexing for a reason? Is indexing generally slow? – Bajcz Sep 01 '21 at 13:03
  • Yes, the two would have different structure. Honestly, I did not play with it enough to really observe the behaviors. So grab the one you like! You probably could use `rle()` (see the sketch after these comments), but I am not sure what you would gain. Essentially, it would be a replacement for `rep()`, which should be a fast primitive. And third, I am not sure exactly what you are referring to as indexing in this context, but in this case, I viewed the problem as coming up with the break lengths. Once you have those, it seemed only natural to expand with `rep()`. So that was the cleanest way I could think of. – Adam Sep 01 '21 at 13:41
  • Mine uses `[` a lot but yours doesn't--that's what I was referring to as indexing. I think I remember reading that that is a relatively slow function compared to others? – Bajcz Sep 01 '21 at 13:57
  • Ah, gotcha. Honestly, the for loop and doing it that way isn't really that bad. But in this case, I saw a clear way to vectorize the output, so I did it. It generally can be faster, but it also just makes for more readable code to me. – Adam Sep 01 '21 at 14:00
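
For what it's worth, here is a sketch of the `rle()` route mentioned above (my illustration, not code from the answer); `inverse.rle()` expands a lengths/values pair, standing in for `rep()`:

k.chunks2.rle = function(seq.size, n.chunks) {
  break.pts <- sort(sample(1:seq.size, n.chunks - 1, replace = FALSE))
  break.len <- diff(c(0, break.pts, seq.size))
  # inverse.rle() only needs a list with 'lengths' and 'values' components.
  inverse.rle(list(lengths = break.len, values = 1:n.chunks))
}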

Random is probably inefficient, but that seems to be expected; "random" suggests all of the input elements should also be random. So, considering a desired random selection from a vector Y, it would seem the effort should be applied to an index of Y, and of successive Y(s), that would be (or would at least seem) random. With sufficient sets of Y(s), it could be discerned how far from completely random the indexing is, but maybe that is not material, or perhaps mere thousands of repetitions are insufficient to demonstrate it.

Nonetheless, my sense is that both inputs to `sample` need to be 'random' in some way, as certainty in one reduces the randomness of the other.

my_vector <- 1:100000
sample_1 <- sample(my_vector, 50, replace = FALSE)
sample_2 <- sample(my_vector, 80, replace = FALSE)
full_range <- c(1, sort(unique(c(sample_1, sample_2))), 100000)
starts <- full_range[c(TRUE, FALSE)] # odd positions, per https://stackoverflow.com/questions/33257610/how-to-return-the-elements-in-the-odd-position
ends <- full_range[c(FALSE, TRUE)]   # even positions
unique(diff(full_range))             # spot-check the gaps between successive break points
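
To pair those starts and ends into ranges, a possible continuation (my sketch; `full_range` can have odd length, so trim to the shorter vector):

n.pairs <- min(length(starts), length(ends))
ranges <- data.frame(start = starts[seq_len(n.pairs)], end = ends[seq_len(n.pairs)])
head(ranges) # candidate chunk boundaries drawn from the random indices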

And absent setting a seed, I think non-reproducible is as close as you get to a random selection upon Y(s). This answer is just to suggest an approach to indexing Y; the use of the indices thereafter might follow @Adam's approach. And, of course, I could be completely wrong about all of this. Clearer random thinkers than I might well weigh in...

Chris
  • I'm very sorry but I have to admit to not following you here. Are you saying that what I'm achieving in my code isn't really "random" in some sense? – Bajcz Sep 01 '21 at 13:05
  • I'm sorry to say it is only an intuition -- one that something like the above might address, and that may helpfully add to your sense of chaos. – Chris Sep 01 '21 at 13:11