
Yesterday I asked a question: "Complex sequence based on a condition".

Thank you to those who helped me solve it. My minimal example was

    library(dplyr)
    ID = c(101, rep(102, 2), rep(103, 5))
    start = as.Date(c('2/1/2010', rep('5/17/2011', 2), rep('5/17/2011', 5)), '%m/%d/%Y')
    end = as.Date(c('3/5/2010', rep('1/4/2012', 2), rep('8/4/2013', 5)), '%m/%d/%Y')
    data = data.frame(ID = ID, start = start, end = end)

    v = c(0, 1)
    data = data %>% group_by(ID) %>% mutate(PolYr = rep_len(v, length(ID)))
    data
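
For reference, running the example should print something like the following (a grouped tibble; the exact header formatting varies a little across dplyr versions):

    # A tibble: 8 x 4
    # Groups:   ID [3]
         ID start      end        PolYr
      <dbl> <date>     <date>     <dbl>
    1   101 2010-02-01 2010-03-05     0
    2   102 2011-05-17 2012-01-04     0
    3   102 2011-05-17 2012-01-04     1
    4   103 2011-05-17 2013-08-04     0
    5   103 2011-05-17 2013-08-04     1
    6   103 2011-05-17 2013-08-04     0
    7   103 2011-05-17 2013-08-04     1
    8   103 2011-05-17 2013-08-04     0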

Now I am hoping someone can help me with this part of the code.

    v = c(0, 1)
    data = data %>% group_by(ID) %>% mutate(PolYr = rep_len(v, length(ID)))

The code runs. However, on my real data, with more than 2 million rows and hundreds of thousands of IDs, the elapsed time was 2297.74 seconds. I am hoping someone can suggest a faster method, perhaps with data.table, which I am just starting to learn. The goal is for each ID to start PolYr at 0 and then alternate: 1 (if there is a second row), then back to 0, then 1, and so on.
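
The comments below confirm the timings came from system.time(); as a sketch, the grouped mutate can be timed in isolation like this (on the toy data above the numbers will of course be tiny):

    library(dplyr)
    v = c(0, 1)
    # wrap the pipeline in system.time() to get user/system/elapsed seconds
    system.time(
        result <- data %>% group_by(ID) %>% mutate(PolYr = rep_len(v, length(ID)))
    )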

user2738483

1 Answer


In data.table, we can use `.N` in `rep_len` and assign (`:=`) to create the new column after grouping by `ID`.

    library(data.table)
    # convert to data.table by reference, then recycle v = c(0, 1) to each
    # group's row count (.N) and assign the new column in place
    setDT(data)[, PolYr := rep_len(v, .N), by = ID]
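
A variant worth knowing (a sketch of my own, not part of the answer above): rep_len(c(0, 1), .N) is just an alternating 0/1 counter, so data.table's rowid() plus a modulo should produce the same column without an explicit by = ID:

    library(data.table)
    # rowid(ID) numbers the rows within each ID in order of appearance
    # (1, 2, 3, ...), so subtracting 1 and taking mod 2 gives 0, 1, 0, 1, ...
    setDT(data)[, PolYr2 := (rowid(ID) - 1) %% 2]
    identical(data$PolYr, data$PolYr2)   # should be TRUE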
akrun
  • 1
    Thanks. The data.table and dplyr methods produced the same result. I started with a tibble with 2,395,812 rows and 417,438 values for the grouping variable. Using system.time() data.table took elapsed time of 0.39 seconds; and dplyr took elapsed time of 1862.56 seconds. The alternate method suggested by @Akrun reduced the time to 2.31 seconds. The data.table solution really helps because the code needs to be run on over 20 datasets. – user2738483 May 11 '20 at 23:35
  • you meant `length` is taking less time than `.N`? – akrun May 11 '20 at 23:36
  • 1
    To clarify. data.table 0.39 seconds; dplyr with length 1862.56 seconds; dplyr with n() 2.31 seconds – user2738483 May 11 '20 at 23:55
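
For completeness, the "dplyr with `n()`" variant mentioned in the comments presumably swaps length(ID) for dplyr's group-size helper n(), along these lines:

    library(dplyr)
    v = c(0, 1)
    # n() returns the current group's row count directly, which avoids
    # materializing the ID vector per group just to measure its length
    data %>% group_by(ID) %>% mutate(PolYr = rep_len(v, n()))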