5

I have a data.table that looks like this

ID, Order, Segment
1, 1, A
1, 2, B
1, 3, B
1, 4, C
1, 5, B
1, 6, B
1, 7, B
1, 8, B

Basically by ordering the data using the Order column. I would like to understand the number of consecutive B's for each of the ID's. Ideally the output I would like is

ID, Consec
1, 2
1, 4

Because the segment B appears consecutively in row 2 and 3 (2 times), and then again in row 5,6,7,8 (4 times).

The loop solution is quite obvious but would also be very slow.

Are there elegant solutions in data.table that is also fast?

P.S. The data I am dealing with has ~20 million rows.

xiaodai
  • 14,889
  • 18
  • 76
  • 140

1 Answers1

10

Try

 library(data.table)#v1.9.5+
  DT[order(ID, Order)][, indx:=rleid(Segment)][Segment=='B',
    list(Consec=.N), by = list(indx, ID)][,indx:=NULL][]

 #    ID Consec
 #1:  1      2
 #2:  1      4

Or as @eddi suggested

 DT[order(ID, Order)][, .(Consec = .N), by = .(ID, Segment, 
              rleid(Segment))][Segment == 'B', .(ID, Consec)]
 #    ID Consec
 #1:  1      2
 #2:  1      4

A more memory efficient method would be to use setorder instead of order (as suggested by @Arun)

  setorder(DT, ID, Order)[, .(Consec = .N), by = .(ID, Segment, 
                rleid(Segment))][Segment == 'B', .(ID, Consec)]
  #   ID Consec
  #1:  1      2
  #2:  1      4
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 6
    you don't really need to explicitly create a new column, and can do it on the fly: `DT[order(ID, Order)][, .(Consec = .N), by = .(ID, Segment, rleid(Segment))][Segment == 'B', .(ID, Consec)]` – eddi Apr 15 '15 at 06:14
  • Why do you need `order()`? – Arun Apr 16 '15 at 00:15
  • 1
    @Arun It gives a different answer without the order here `set.seed(42); d1 <- data.frame(ID= sample(1:4, 30, replace=TRUE), Order= sample(1:10, 30, replace=TRUE), Segment= sample(LETTERS[1:3], 30, replace=TRUE), stringsAsFactors=FALSE); DT <- as.data.table(d1)` – akrun Apr 16 '15 at 05:03
  • 1
    @akrun, thanks. In that case I'd use `setorder()` if possible as `DT[order()][..]` would create the ordered data entirely in a new location in memory. – Arun Apr 16 '15 at 08:21
  • I asked a similar question that drew several useful responses. Worth a review: http://stackoverflow.com/questions/27280948/how-can-i-analyze-the-consecutive-persistence-of-signals-1s-and-0s-in-dataset [1]: http://stackoverflow.com/questions/27280948/how-can-i-analyze-the-consecutive-persistence-of-signals-1s-and-0s-in-dataset – James Picerno Apr 17 '15 at 00:17