
I have a data.table with two columns (date and status), and I want to add new columns derived from the original table.

data rules:

  1. the status column contains only 0 and 1
  2. the date column increases by one second per row

new variables:

  1. group: numbers each group (or cycle) of statuses. A cycle follows the order (0, 1): it starts with status 0, and when the status becomes 0 again, the cycle is complete and a new one begins.
  2. cycle_time: the duration of each group
  3. group_0: the time spent in status 0 within a group
  4. group_1: the time spent in status 1 within a group

For example, a simple input:


the code to generate the data:

 library(data.table)

 dd <- data.table(
   date = c("2015-07-01 00:00:12", "2015-07-01 00:00:13", "2015-07-01 00:00:14",
            "2015-07-01 00:00:15", "2015-07-01 00:00:16", "2015-07-01 00:00:17",
            "2015-07-01 00:00:18", "2015-07-01 00:00:19", "2015-07-01 00:00:20",
            "2015-07-01 00:00:21", "2015-07-01 00:00:22", "2015-07-01 00:00:23",
            "2015-07-01 00:00:24", "2015-07-01 00:00:25"),
   status = c(0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0)
 )

the desired output includes the four new columns described above.


actually I have already solved this with a basic approach:

  1. the main idea: if the current status is 0 and the next status is 1, mark it as the start of a group.
  2. this works, but the calculation takes far too long because of the many loops.

I suppose there is an easier solution for this case.
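Putting the rules together, here is a minimal sketch of all four columns (my own sketch, not a tested production answer): it numbers groups by accumulating the 1 → 0 transitions and counts rows per group, assuming each row represents exactly one second.

```r
library(data.table)

dd <- data.table(
  date   = as.POSIXct(sprintf("2015-07-01 00:00:%02d", 12:25), tz = "UTC"),
  status = c(0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0)
)

# a 1 -> 0 transition (diff == -1) starts a new cycle
dd[, group := cumsum(c(1, diff(status) == -1))]

# each row covers one second, so durations are just row counts
dd[, `:=`(cycle_time = .N,
          group_0    = sum(status == 0),
          group_1    = sum(status == 1)),
   by = group]
```

For irregular timestamps you would sum the differences of date within each group instead of counting rows.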

ZAWD

2 Answers


So a transition from 1 to 0 marks the boundary of a group. You can use cumsum and diff to get this working. For the x example in @zx8754's answer:

data.frame(x, group_id = c(1, cumsum(diff(x) == -1) + 1))
  x group_id
1 0        1
2 0        1
3 0        1
4 1        1
5 1        1
6 0        2
7 0        2
8 1        2
9 0        3

For a more realistically sized example:

res = data.frame(status = sample(c(0,1), 10e7, replace = TRUE))
system.time(res$group_id <- c(1, cumsum(diff(res$status) == -1) + 1))
   user  system elapsed 
  2.770   1.680   4.449 
head(res, 20)
   status group_id
1       0        1
2       0        1
3       1        1
4       0        2
5       0        2
6       0        2
7       1        2
8       1        2
9       0        3
10      1        3
11      1        3
12      0        4
13      1        4
14      0        5
15      0        5
16      1        5
17      0        6
18      0        6
19      1        6
20      0        7

Under 5 seconds for 100 million (10e7) records is quite fast (although that depends on your definition of fast :)).
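The same idea can be written with data.table's shift(), which avoids the c(1, …) concatenation (an equivalent reformulation I'm suggesting, not part of the original answer):

```r
library(data.table)

res <- data.table(status = c(0, 0, 1, 0, 0, 0, 1, 1, 0))

# status drops below its predecessor exactly at the 1 -> 0 boundaries;
# filling the first comparison with status[1] makes row 1 a non-transition
res[, group_id := cumsum(status < shift(status, fill = status[1])) + 1]
```

This produces the same group numbering as c(1, cumsum(diff(status) == -1) + 1).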


Benchmarking

set.seed(1)
res = data.frame(status = sample(c(0,1), 10e4, replace = TRUE))

microbenchmark::microbenchmark(
  rleid = {
    gr <- data.table::rleid(res$status)
    x1 <- as.numeric(as.factor(ifelse(gr %% 2 == 0, gr - 1, gr)))
    # removing "as.numeric(as.factor" helps, but still not as fast as cumsum
    #x1 <- ifelse(gr %% 2 == 0, gr - 1, gr)
  },
  cumsum = { x2 <- c(1, cumsum(diff(res$status) == -1) + 1) }
)

# Unit: milliseconds
#    expr        min         lq       mean     median         uq       max neval cld
#   rleid 118.161287 120.149619 122.673747 121.736122 123.271881 168.88777   100   b
# cumsum   1.511811   1.559563   2.221273   1.826404   2.475402   6.88169   100  a 

identical(x1, x2)
# [1] TRUE
Paul Hiemstra

Try this:

#dummy data
x <- c(0,0,0,1,1,0,0,1,0)

#get group id using rleid from data.table
gr <- data.table::rleid(x)

#merge separated 0,1 groups
gr <- ifelse(gr %% 2 == 0, gr - 1, gr)

#result
cbind(x, gr)
#      x gr
# [1,] 0  1
# [2,] 0  1
# [3,] 0  1
# [4,] 1  1
# [5,] 1  1
# [6,] 0  3
# [7,] 0  3
# [8,] 1  3
# [9,] 0  5

#if we need to have group names sequential then
cbind(x, gr = as.numeric(as.factor(gr)))
#      x gr  
# [1,] 0  1
# [2,] 0  1
# [3,] 0  1
# [4,] 1  1
# [5,] 1  1
# [6,] 0  2
# [7,] 0  2
# [8,] 1  2
# [9,] 0  3
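Since the merged ids are always odd (1, 3, 5, …), the factor round-trip can be replaced by integer division — a cheaper renumbering I am suggesting here, not part of the original answer:

```r
x  <- c(0, 0, 0, 1, 1, 0, 0, 1, 0)
gr <- data.table::rleid(x)
gr <- ifelse(gr %% 2 == 0, gr - 1, gr)

# odd ids 1, 3, 5, ... map to 1, 2, 3, ... via (gr + 1) %/% 2
cbind(x, gr = (gr + 1) %/% 2)
```

This keeps the whole pipeline vectorized, with no sorting or factor construction.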
zx8754
    +1, although my solution above using `diff` and `cumsum` is much shorter in code, and probably faster (although I have not tested this)? – Paul Hiemstra Oct 25 '16 at 09:43
  • hi, thanks for the solution, but still faster than my solution :) thank u ! – ZAWD Oct 25 '16 at 20:28