My data is unique at the level of one variable, Y. Another variable, Z, tells me how many people are in each Y. I want to create groups of 45 from these Y and Z: whenever the running total of Z reaches 45, one group is closed and the code moves on to create the next group.

My data looks something like this

ID  X   Y   Z
1   A   A   1
2   A   B   5
3   A   C   2
4   A   D   42
5   A   E   10
6   A   F   2
7   A   G   0
8   A   H   3
9   A   I   0
10  A   J   8
11  A   K   19
12  A   L   3
13  A   M   1
14  A   N   1
15  A   O   2
16  A   P   0
17  A   Q   1
18  A   R   2

What I want is something like this

ID  X   Y   Z   CumSum  Group
1   A   A   1   1   1
2   A   B   5   6   1
3   A   C   2   8   1
4   A   D   42  50  1
5   A   E   10  10  2
6   A   F   2   12  2
7   A   G   0   12  2
8   A   H   3   15  2
9   A   I   0   15  2
10  A   J   8   23  2
11  A   K   19  42  2
12  A   L   3   45  2
13  A   M   1   1   3
14  A   N   1   2   3
15  A   O   2   4   3   
16  A   P   0   4   3
17  A   Q   1   5   3
18  A   R   2   7   3

Please let me know how I can achieve this with R.

EDIT: I extended the minimum reproducible example for more clarity

EDIT 2: I have one extra question on this topic. What if the variable X, which is only A right now, also changes? For example, it could be B for a while and then become C. How can I prevent the code from generating groups that span two categories of X? For example, if Group = 3, how can I make sure that group 3 is not in both category A and category B?

unaeem
    [Relevant](https://stackoverflow.com/questions/49076769/dplyr-r-cumulative-sum-with-reset) – Sotos Jun 24 '19 at 13:00

5 Answers

A function for this is available in the MESS package:

library(MESS)
library(data.table)

DT[, Group := MESS::cumsumbinning(Z, 50)][, Cumsum := cumsum(Z), by = .(Group)][]

output

    ID X Y  Z Group Cumsum
 1:  1 A A  1     1      1
 2:  2 A B  5     1      6
 3:  3 A C  2     1      8
 4:  4 A D 42     1     50
 5:  5 A E 10     2     10
 6:  6 A F  2     2     12
 7:  7 A G  0     2     12
 8:  8 A H  3     2     15
 9:  9 A I  0     2     15
10: 10 A J  8     2     23
11: 11 A K 19     2     42
12: 12 A L  3     2     45

sample data

DT <- fread("ID  X   Y   Z
            1   A   A   1
            2   A   B   5
            3   A   C   2
            4   A   D   42
            5   A   E   10
            6   A   F   2
            7   A   G   0
            8   A   H   3
            9   A   I   0
            10  A   J   8
            11  A   K   19
            12  A   L   3")
Wimpel
Define Accum, which adds x to acc, resetting to x if acc is 45 or more. Use Reduce to apply it to Z, giving the cumulative-sum column. Values greater than or equal to 45 mark the group ends, so attach a unique group id to each group by taking a cumulative sum that starts from the end and runs backwards toward the beginning, then renumber those ids so that they start from 1. We run this with the input in the Note at the end, which duplicates the last line several times so that three groups can be shown. No packages are used.

Accum <- function(acc, x) if (acc < 45)  acc + x else x
applyAccum <- function(x) Reduce(Accum, x, accumulate = TRUE)
cumsumr <- function(x) rev(cumsum(rev(x))) # reverse cumsum
GroupNo <- function(x) {
  y <- cumsumr(x >= 45)
  max(y) - y + 1
}
transform(transform(DF, Cumsum = ave(Z, ID, FUN = applyAccum)), 
  Group = ave(Cumsum, ID, FUN = GroupNo))

giving:

   ID X Y  Z Cumsum Group
1   1 A A  1      1     1
2   2 A B  5      6     1
3   3 A C  2      8     1
4   4 A D 42     50     1
5   5 A E 10     10     2
6   6 A F  2     12     2
7   7 A G  0     12     2
8   8 A H  3     15     2
9   9 A I  0     15     2
10 10 A J  8     23     2
11 11 A K 19     42     2
12 12 A L  3     45     2
13 12 A L  3      3     3
14 12 A L  3      6     3

Note

The input in reproducible form:

Lines <- "ID  X   Y   Z
1   A   A   1
2   A   B   5
3   A   C   2
4   A   D   42
5   A   E   10
6   A   F   2
7   A   G   0
8   A   H   3
9   A   I   0
10  A   J   8
11  A   K   19
12  A   L   3
12  A   L   3
12  A   L   3"
DF <- read.table(text = Lines, as.is = TRUE, header = TRUE)
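For the follow-up in EDIT 2 (X changing), a possible base-R sketch along the same lines: compute the grouping within each level of X, then offset the ids so that no group ever spans two X categories. DF2 and its values here are hypothetical sample data, and the helper functions are repeated from the answer so the snippet is self-contained.

```r
# Helpers repeated from the answer above
Accum <- function(acc, x) if (acc < 45) acc + x else x
applyAccum <- function(x) Reduce(Accum, x, accumulate = TRUE)
cumsumr <- function(x) rev(cumsum(rev(x)))                  # reverse cumsum
GroupNo <- function(x) { y <- cumsumr(x >= 45); max(y) - y + 1 }

# Hypothetical sample data with two X categories
DF2 <- data.frame(X = c("A", "A", "A", "B", "B"),
                  Z = c(40, 10, 3, 44, 2))

# Same computation as before, but per level of X
DF2$Cumsum <- ave(DF2$Z, DF2$X, FUN = applyAccum)
DF2$Group  <- ave(DF2$Cumsum, DF2$X, FUN = GroupNo)  # restarts at 1 per X

# Shift each X block's ids by the running total of earlier blocks' maxima,
# using the order in which the X categories first appear
m   <- tapply(DF2$Group, factor(DF2$X, levels = unique(DF2$X)), max)
off <- setNames(cumsum(c(0, head(m, -1))), names(m))
DF2$Group <- DF2$Group + off[as.character(DF2$X)]
```

With this data the groups come out as 1, 1, 2 within A and 3, 3 within B, so group ids are never shared across categories.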
G. Grothendieck
  • Hello. Thank you for the answer. However, I just checked again by adding more rows and it is not quite working. It is still assigning Group = 2 if I add a 13th or 14th row to the data. After row 12, when CumSum = 45, it should move on and assign Group = 3. Please have a look. – unaeem Jun 25 '19 at 05:42
  • Have fixed the code and modified the example to duplicate the last row several times so that there are 3 groups. – G. Grothendieck Jun 25 '19 at 15:12
  • Hello. I am very thankful for your response. This code is working perfectly now. What if, the variable `X` which is `A` only right now is also changing. For example, it can be `B` for a while then can go to being `C`. How can I prevent the code from generating groups that are not within two categories of `X`. For example if `Group = 3`, then how can I make sure that 3 is not in category `A` and `B`? – unaeem Jun 26 '19 at 07:14
  • I also asked it in a separate thread. [Thread Link](https://stackoverflow.com/questions/56768930/r-creating-groups-based-on-running-totals-that-are-non-repeating) – unaeem Jun 26 '19 at 08:58
One tidyverse possibility could be:

library(dplyr)
library(purrr)

df %>% 
 mutate(Cumsum = accumulate(Z, ~ if_else(.x >= 45, .y, .x + .y)),
        Group = cumsum(Cumsum >= 45),
        Group = if_else(Group > lag(Group, default = first(Group)), lag(Group), Group) + 1)

   ID X Y  Z Cumsum Group
1   1 A A  1      1     1
2   2 A B  5      6     1
3   3 A C  2      8     1
4   4 A D 42     50     1
5   5 A E 10     10     2
6   6 A F  2     12     2
7   7 A G  0     12     2
8   8 A H  3     15     2
9   9 A I  0     15     2
10 10 A J  8     23     2
11 11 A K 19     42     2
12 12 A L  3     45     2
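For the EDIT 2 case, a hedged tidyverse sketch of the same idea: do the grouping per level of X with group_by(), then make the ids globally unique with match() over the (X, Group) pairs. df2 and its values are hypothetical sample data.

```r
library(dplyr)
library(purrr)

# Hypothetical sample data with two X categories
df2 <- tibble(X = c("A", "A", "A", "B", "B"),
              Z = c(40, 10, 3, 44, 2))

res <- df2 %>%
 group_by(X) %>%
 mutate(Cumsum = accumulate(Z, ~ if_else(.x >= 45, .y, .x + .y)),
        Group = cumsum(Cumsum >= 45),
        Group = if_else(Group > lag(Group, default = first(Group)), lag(Group), Group) + 1) %>%
 ungroup() %>%
 # dense, globally unique ids: a group can never straddle two X categories
 mutate(Group = match(paste(X, Group), unique(paste(X, Group))))

res
```

Here the groups come out as 1, 1, 2 within A and 3, 3 within B.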
tmfmnk
Not a pretty solution, but functional.

library(dplyr)  # lag() below is dplyr's vector lag, not stats::lag

df$Group <- 0
group <- 1
# Loop until every row has a group: recompute the running total over the
# still-unassigned rows, then stamp the current group id on the rows whose
# previous running total had not yet passed 45 (or that start a new group).
while (df$Group[nrow(df)] == 0) {
  df$ww[df$Group == 0] <- cumsum(df$Z[df$Group == 0])
  df$Group[df$Group == 0 & (lag(df$ww) <= 45 | is.na(lag(df$ww)) | lag(df$Group != 0))] <- group
  group <- group + 1
}

df
   ID X Y  Z ww Group
1   1 A A  1  1  1
2   2 A B  5  6  1
3   3 A C  2  8  1
4   4 A D 42 50  1
5   5 A E 10 10  2
6   6 A F  2 12  2
7   7 A G  0 12  2
8   8 A H  3 15  2
9   9 A I  0 15  2
10 10 A J  8 23  2
11 11 A K 19 42  2
12 12 A L  3 45  2

OK, yeah, @tmfmnk's solution is vastly better:

Unit: milliseconds
 expr       min        lq     mean    median        uq      max neval
   tm  2.224536  2.805771  6.76661  3.221449  3.990778 303.7623   100
  iod 19.198391 22.294222 30.17730 25.765792 35.768616 110.2062   100
iod
Or using data.table:

library(data.table)
n <- 45L
DT[, cs := Reduce(function(tot, z) if (tot+z > n) z else tot+z, Z, accumulate=TRUE)][, 
    Group := .GRP, by=cumsum(c(1L, diff(cs))<0L)]

output:

    ID X Y  Z cs Group
 1:  1 A A  1  1     1
 2:  2 A B  5  6     1
 3:  3 A C  2  8     1
 4:  4 A D 42 42     1
 5:  5 A E 10 10     2
 6:  6 A F  2 12     2
 7:  7 A G  0 12     2
 8:  8 A H  3 15     2
 9:  9 A I  0 15     2
10: 10 A J  8 23     2
11: 11 A K 19 42     2
12: 12 A L  3 45     2
13: 13 A M  1  1     3
14: 14 A N  1  2     3
15: 15 A O  2  4     3
16: 16 A P  0  4     3
17: 17 A Q  1  5     3
18: 18 A R  2  7     3

data:

library(data.table)
DT <- fread("ID  X   Y   Z
1   A   A   1
2   A   B   5
3   A   C   2
4   A   D   42
5   A   E   10
6   A   F   2
7   A   G   0
8   A   H   3
9   A   I   0
10  A   J   8
11  A   K   19
12  A   L   3
13  A   M   1
14  A   N   1
15  A   O   2
16  A   P   0
17  A   Q   1
18  A   R   2")
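For the EDIT 2 case, a hedged data.table sketch of the same idea: run the reset logic per X with `by = X`, then make the ids unique across blocks with `rleid()`. DT2 is hypothetical sample data; note that this answer's rule starts a new group *before* the total would exceed n, so the breaks differ slightly from the other answers.

```r
library(data.table)

n <- 45L
# Hypothetical sample data with two X categories
DT2 <- data.table(X = c("A", "A", "A", "B", "B"),
                  Z = c(40L, 10L, 3L, 44L, 2L))

# running total that resets per X category
DT2[, cs := Reduce(function(tot, z) if (tot + z > n) z else tot + z,
                   Z, accumulate = TRUE), by = X]
# within-X group counter, then globally unique ids via rleid()
DT2[, g := cumsum(c(1L, diff(cs)) < 0L), by = X]
DT2[, Group := rleid(X, g)][, g := NULL]
```

Because `rleid()` runs over the (X, g) pairs in row order, a group id can never be reused across two X categories.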
chinsoon12