
I have a data.table with two columns (date and status), and I want to add new columns derived from the original table.

data rules:

  1. the status column contains only 0 and 1
  2. the date column increases by one second per row

new variables:

  1. group: numbers each group (or cycle) of statuses. A cycle follows the order (0, 1): it starts with status 0, and when the status becomes 0 again, the cycle is complete and a new one begins.
  2. cycle_time: the duration of each group
  3. group_0: the time spent in status 0 within a group
  4. group_1: the time spent in status 1 within a group

For example, a simple input:


the code to generate the data:

 library(data.table)

 dd <- data.table(
   date = c("2015-07-01 00:00:12", "2015-07-01 00:00:13", "2015-07-01 00:00:14",
            "2015-07-01 00:00:15", "2015-07-01 00:00:16", "2015-07-01 00:00:17",
            "2015-07-01 00:00:18", "2015-07-01 00:00:19", "2015-07-01 00:00:20",
            "2015-07-01 00:00:21", "2015-07-01 00:00:22", "2015-07-01 00:00:23",
            "2015-07-01 00:00:24", "2015-07-01 00:00:25"),
   status = c(0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0)
 )

the desired output includes the four new columns described above.


actually I have already solved this with a basic approach:

  1. the main idea: if the current status is 0 and the next status is 1, mark it as the start of a group.
  2. this works, but the calculation takes far too long because of the many loops.

I suppose there is an easier solution for this case.
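Putting the rules together, here is a minimal sketch of all four columns (my own sketch, not a tested production answer): it numbers groups by accumulating the 1 → 0 transitions and counts rows per group, assuming each row represents exactly one second.

```r
library(data.table)

dd <- data.table(
  date   = as.POSIXct(sprintf("2015-07-01 00:00:%02d", 12:25), tz = "UTC"),
  status = c(0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0)
)

# a 1 -> 0 transition (diff == -1) starts a new cycle
dd[, group := cumsum(c(1, diff(status) == -1))]

# each row covers one second, so durations are just row counts
dd[, `:=`(cycle_time = .N,
          group_0    = sum(status == 0),
          group_1    = sum(status == 1)),
   by = group]
```

For irregular timestamps you would sum the differences of date within each group instead of counting rows.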

ZAWD

2 Answers


So a transition from 1 to 0 marks the boundary of a group. You can use cumsum and diff to get this working. For the x example in @zx8754's answer:

data.frame(x, group_id = c(1, cumsum(diff(x) == -1) + 1))
  x group_id
1 0        1
2 0        1
3 0        1
4 1        1
5 1        1
6 0        2
7 0        2
8 1        2
9 0        3

For a more realistically sized example:

res = data.frame(status = sample(c(0,1), 10e7, replace = TRUE))
system.time(res$group_id <- c(1, cumsum(diff(res$status) == -1) + 1))
   user  system elapsed 
  2.770   1.680   4.449 
head(res, 20)
   status group_id
1       0        1
2       0        1
3       1        1
4       0        2
5       0        2
6       0        2
7       1        2
8       1        2
9       0        3
10      1        3
11      1        3
12      0        4
13      1        4
14      0        5
15      0        5
16      1        5
17      0        6
18      0        6
19      1        6
20      0        7

Under 5 seconds for 100 million (10e7) records is quite fast (although that depends on your definition of fast :)).
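The same idea can be written with data.table's shift(), which avoids the c(1, …) concatenation (an equivalent reformulation I'm suggesting, not part of the original answer):

```r
library(data.table)

res <- data.table(status = c(0, 0, 1, 0, 0, 0, 1, 1, 0))

# status drops below its predecessor exactly at the 1 -> 0 boundaries;
# filling the first comparison with status[1] makes row 1 a non-transition
res[, group_id := cumsum(status < shift(status, fill = status[1])) + 1]
```

This produces the same group numbering as c(1, cumsum(diff(status) == -1) + 1).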


Benchmarking

set.seed(1)
res = data.frame(status = sample(c(0,1), 10e4, replace = TRUE))

microbenchmark::microbenchmark(
  rleid = {
    gr <- data.table::rleid(res$status)
    x1 <- as.numeric(as.factor(ifelse(gr %% 2 == 0, gr - 1, gr)))
    # removing "as.numeric(as.factor" helps, but still not as fast as cumsum
    #x1 <- ifelse(gr %% 2 == 0, gr - 1, gr)
  },
  cumsum = { x2 <- c(1, cumsum(diff(res$status) == -1) + 1) }
)

# Unit: milliseconds
#    expr        min         lq       mean     median         uq       max neval cld
#   rleid 118.161287 120.149619 122.673747 121.736122 123.271881 168.88777   100   b
# cumsum   1.511811   1.559563   2.221273   1.826404   2.475402   6.88169   100  a 

identical(x1, x2)
# [1] TRUE
Paul Hiemstra

Try this:

#dummy data
x <- c(0,0,0,1,1,0,0,1,0)

#get group id using rleid from data.table
gr <- data.table::rleid(x)

#merge separated 0,1 groups
gr <- ifelse(gr %% 2 == 0, gr - 1, gr)

#result
cbind(x, gr)
#      x gr
# [1,] 0  1
# [2,] 0  1
# [3,] 0  1
# [4,] 1  1
# [5,] 1  1
# [6,] 0  3
# [7,] 0  3
# [8,] 1  3
# [9,] 0  5

#if we need to have group names sequential then
cbind(x, gr = as.numeric(as.factor(gr)))
#      x gr  
# [1,] 0  1
# [2,] 0  1
# [3,] 0  1
# [4,] 1  1
# [5,] 1  1
# [6,] 0  2
# [7,] 0  2
# [8,] 1  2
# [9,] 0  3
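Since the merged ids are always odd (1, 3, 5, …), the factor round-trip can be replaced by integer division — a cheaper renumbering I am suggesting here, not part of the original answer:

```r
x  <- c(0, 0, 0, 1, 1, 0, 0, 1, 0)
gr <- data.table::rleid(x)
gr <- ifelse(gr %% 2 == 0, gr - 1, gr)

# odd ids 1, 3, 5, ... map to 1, 2, 3, ... via (gr + 1) %/% 2
cbind(x, gr = (gr + 1) %/% 2)
```

This keeps the whole pipeline vectorized, with no sorting or factor construction.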
zx8754
    +1, although my solution above using `diff` and `cumsum` is much shorter in code, and probably faster (although I have not tested this)? – Paul Hiemstra Oct 25 '16 at 09:43
  • hi, thanks for the solution, but still faster than my solution :) thank u ! – ZAWD Oct 25 '16 at 20:28