
I want to group the elements of a vector so that the running sum within each group stays less than or equal to n. Assume the following:

set.seed(1)
x <- sample(10, 20, replace = TRUE)
#[1]  3  4  6 10  3  9 10  7  7  1  3  2  7  4  8  5  8 10  4  8

#Where,
n = 15

The expected output groups consecutive values for as long as their running sum is <= 15, i.e.

y <- c(1, 1, 1, 2, 2, 3, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10)

As you can see, no group sum is ever greater than 15:

sapply(split(x, y), sum)
# 1  2  3  4  5  6  7  8  9 10 
#13 13  9 10 15 12 12 13 14  8 

NOTE: I will be running this on huge datasets (usually 150-200 GB), so efficiency is a must.

A method I tried that comes close but fails is:

as.integer(cut(cumsum(x), breaks = seq(0, max(cumsum(x)) + 15, 15)))
#[1] 1 1 1 2 2 3 3 4 4 4 5 5 5 6 6 6 7 8 8 8

It fails because the breaks are fixed multiples of 15 laid over the cumulative sum, whereas the running sum needs to reset each time a group closes.
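For reference, the intended rule can be written as a plain base R loop. This is only a sketch to pin down the semantics (group_by_sum is a made-up name, and a loop like this will be far too slow for the data sizes above):

# reference implementation: start a new group whenever adding
# the next element would push the running sum past n
group_by_sum <- function(x, n) {
  g <- integer(length(x))  # group id per element
  id <- 1L
  run <- 0
  for (i in seq_along(x)) {
    run <- run + x[i]
    if (run > n) {         # close the current group ...
      id <- id + 1L
      run <- x[i]          # ... and start a new one at x[i]
    }
    g[i] <- id
  }
  g
}

group_by_sum(x, 15)
#[1]  1  1  1  2  2  3  4  5  5  5  6  6  6  7  7  8  8  9  9 10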
  • Have you checked [here](https://stackoverflow.com/questions/34531568/conditional-cumsum-with-reset) and the Rcpp implementation [here](https://stackoverflow.com/questions/29054459/how-to-speed-up-or-vectorize-a-for-loop/29055443#29055443)? – akrun Aug 07 '17 at 15:09
  • @akrun Thanks for the links. I will give them a read ASAP. – Sotos Aug 07 '17 at 15:22
  • Yes, it's a duplicate. @akrun, you had a solution here that might be generalized too: https://stackoverflow.com/questions/44512075/resetting-cumsum-if-value-goes-to-negative-in-r – moodymudskipper Aug 07 '17 at 15:34

2 Answers


Here is my Rcpp solution (close to Khashaa's solution, but a bit shorter/stripped down). Since you said speed is important, Rcpp is probably the way to go:

# create the data
set.seed(1)
x <- sample(10, 20, replace = TRUE)
y <- c(1, 1, 1, 2, 2, 3, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10)

# create the Rcpp function
library(Rcpp)
cppFunction('
IntegerVector sotosGroup(NumericVector x, int cutoff) {
  IntegerVector groupVec (x.size());
  int group = 1;
  double runSum = 0;
  for (int i = 0; i < x.size(); i++) {
    // add the current element to the running sum
    runSum += x[i];
    // if the cutoff is exceeded, close the group and start
    // a new one whose running sum begins at the current element
    if (runSum > cutoff) {
      group++;
      runSum = x[i];
    }
    groupVec[i] = group;
  }
  return groupVec;
}
')

# use the function as usual
y_cpp <- sotosGroup(x, 15)
sapply(split(x, y_cpp), sum)
#>  1  2  3  4  5  6  7  8  9 10 
#> 13 13  9 10 15 12 12 13 14  8


all.equal(y, y_cpp)
#> [1] TRUE

In case anyone needs to be convinced of the speed:

# Speed Benchmarks
library(data.table)
library(microbenchmark)
dt <- data.table(x)

# Frank's data.table approach (from the answer below), wrapped as a function
frank <- function(DT, n = 15) {
 DT[, xc := cumsum(x)]
 b = DT[.(shift(xc, fill=0) + n + 1), on=.(xc), roll=-Inf, which=TRUE]
 z = 1; res = z
 while (!is.na(z)) 
  res <- c(res, z <- b[z])
 DT[, g := cumsum(.I %in% res)][]
}

microbenchmark(
 frank(dt),
 sotosGroup(x, 15),
 times = 100
)
#> Unit: microseconds
#>               expr      min       lq       mean    median       uq       max neval cld
#>          frank(dt) 1720.589 1831.320 2148.83096 1878.0725 1981.576 13728.830   100   b
#>  sotosGroup(x, 15)    2.595    3.962    6.47038    7.5035    8.290    11.579   100  a
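Note that this benchmark uses only the 20-element sample vector. Given the 150-200 GB target, it may be worth repeating the check on a larger input as well; a made-up example (timings depend on the machine, so none are shown):

# sanity-check scaling on a larger simulated input
x_big <- sample(10, 1e7, replace = TRUE)
system.time(y_big <- sotosGroup(x_big, 15))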
– David

This works, but can probably be improved:

x <- c(3L, 4L, 6L, 10L, 3L, 9L, 10L, 7L, 7L, 1L, 3L, 2L, 7L, 4L, 8L, 5L, 8L, 10L, 4L, 8L)
y <- as.integer(c(1, 1, 1, 2, 2, 3, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10))
n = 15
library(data.table)
DT = data.table(x, y)
DT[, xc := cumsum(x)]   # running total of x
# rolling join: for each row treated as a group start, find the row where
# the next group must start (the first row with xc >= previous xc + n + 1)
b = DT[.(shift(xc, fill=0) + n + 1), on=.(xc), roll=-Inf, which=TRUE]
# walk the chain of group starts, beginning at row 1
z = 1; res = logical(length(x))
while (!is.na(z) && z <= length(x)) {
    res[z] <- TRUE
    z <- b[z]
}
DT[, g := cumsum(res)]  # group id = running count of group starts
DT
     x  y  xc  g
 1:  3  1   3  1
 2:  4  1   7  1
 3:  6  1  13  1
 4: 10  2  23  2
 5:  3  2  26  2
 6:  9  3  35  3
 7: 10  4  45  4
 8:  7  5  52  5
 9:  7  5  59  5
10:  1  5  60  5
11:  3  6  63  6
12:  2  6  65  6
13:  7  6  72  6
14:  4  7  76  7
15:  8  7  84  7
16:  5  8  89  8
17:  8  8  97  8
18: 10  9 107  9
19:  4  9 111  9
20:  8 10 119 10

DT[, all(y == g)] # TRUE

How it works

The rolling join asks, for each row treated as the start of a group, "where will the next group start?" You can then iterate over the result, beginning at the first position, to find all the group starts.
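To make this concrete, here is what b looks like for the sample data (the values below were traced by hand from the xc column; NA means there is no later row where a group could start):

b
# [1]  4  4  4  6  7  7  8 11 13 14 14 15 15 16 17 18 18 20 NA NA

Following the chain from row 1 gives 1 -> 4 -> 6 -> 7 -> 8 -> 11 -> 14 -> 16 -> 18 -> 20 -> NA, which are exactly the rows where g increments in the table above.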

The last line DT[, g := cumsum(res)] could also be done as a rolling join (maybe faster?):

DT[, g := data.table(r = which(res))[, g := .I][.(.I), on=.(r), roll=TRUE, x.g ]]
– Frank
  • Benchmark with my slightly edited function, which of course still finds David's far faster: https://chat.stackoverflow.com/transcript/message/38542501#38542501 – Frank Aug 07 '17 at 18:29
  • Thanks Frank. I'm trying to understand what's going on in there and I get stuck at `logical(length(x))`. How is a length converted to logical? – Sotos Aug 08 '17 at 06:20
  • @Sotos No problem. It's the same as `rep(FALSE, length(x))`; there's a set of those functions for vectors: http://franknarf1.github.io/r-tutorial/_book/basics.html#initializing – Frank Aug 08 '17 at 12:15