
Yesterday I asked a question: "Complex sequence based on a condition".

Thank you to those who helped me solve it. My minimal example was

    library(dplyr)
    ID = c(101, rep(102, 2), rep(103, 5))
    start = as.Date(c('2/1/2010', rep('5/17/2011', 2), rep('5/17/2011', 5)), '%m/%d/%Y')
    end = as.Date(c('3/5/2010', rep('1/4/2012', 2), rep('8/4/2013', 5)), '%m/%d/%Y')
    data = data.frame(ID = ID, start = start, end = end)

    v = c(0, 1)
    data = data %>% group_by(ID) %>% mutate(PolYr = rep_len(v, length(ID)))
    data
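
For reference, running the example should print something like the following (a grouped tibble; the exact header formatting varies a little across dplyr versions):

    # A tibble: 8 x 4
    # Groups:   ID [3]
         ID start      end        PolYr
      <dbl> <date>     <date>     <dbl>
    1   101 2010-02-01 2010-03-05     0
    2   102 2011-05-17 2012-01-04     0
    3   102 2011-05-17 2012-01-04     1
    4   103 2011-05-17 2013-08-04     0
    5   103 2011-05-17 2013-08-04     1
    6   103 2011-05-17 2013-08-04     0
    7   103 2011-05-17 2013-08-04     1
    8   103 2011-05-17 2013-08-04     0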

Now I am hoping someone can help me with this part of the code.

    v = c(0, 1)
    data = data %>% group_by(ID) %>% mutate(PolYr = rep_len(v, length(ID)))

The code runs. However, on my real data, with more than 2 million rows and hundreds of thousands of IDs, the elapsed time was 2297.74 seconds. I am hoping someone can suggest a faster method, perhaps with data.table, which I am just starting to learn. The goal is for each ID to start PolYr at 0 and then alternate: 1 (if there is a second row), then back to 0, then 1, and so on.
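
The comments below confirm the timings came from system.time(); as a sketch, the grouped mutate can be timed in isolation like this (on the toy data above the numbers will of course be tiny):

    library(dplyr)
    v = c(0, 1)
    # wrap the pipeline in system.time() to get user/system/elapsed seconds
    system.time(
        result <- data %>% group_by(ID) %>% mutate(PolYr = rep_len(v, length(ID)))
    )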

user2738483

1 Answer


In data.table, we can use `.N` in `rep_len` and assign (`:=`) to create the new column after grouping by `ID`.

    library(data.table)
    # convert to data.table by reference, then recycle v = c(0, 1) to each
    # group's row count (.N) and assign the new column in place
    setDT(data)[, PolYr := rep_len(v, .N), by = ID]
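
A variant worth knowing (a sketch of my own, not part of the answer above): rep_len(c(0, 1), .N) is just an alternating 0/1 counter, so data.table's rowid() plus a modulo should produce the same column without an explicit by = ID:

    library(data.table)
    # rowid(ID) numbers the rows within each ID in order of appearance
    # (1, 2, 3, ...), so subtracting 1 and taking mod 2 gives 0, 1, 0, 1, ...
    setDT(data)[, PolYr2 := (rowid(ID) - 1) %% 2]
    identical(data$PolYr, data$PolYr2)   # should be TRUE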
akrun
  • 1
    Thanks. The data.table and dplyr methods produced the same result. I started with a tibble with 2,395,812 rows and 417,438 values for the grouping variable. Using system.time() data.table took elapsed time of 0.39 seconds; and dplyr took elapsed time of 1862.56 seconds. The alternate method suggested by @Akrun reduced the time to 2.31 seconds. The data.table solution really helps because the code needs to be run on over 20 datasets. – user2738483 May 11 '20 at 23:35
  • you meant `length` is taking less time than `.N`? – akrun May 11 '20 at 23:36
  • 1
    To clarify. data.table 0.39 seconds; dplyr with length 1862.56 seconds; dplyr with n() 2.31 seconds – user2738483 May 11 '20 at 23:55
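
For completeness, the "dplyr with `n()`" variant mentioned in the comments presumably swaps length(ID) for dplyr's group-size helper n(), along these lines:

    library(dplyr)
    v = c(0, 1)
    # n() returns the current group's row count directly, which avoids
    # materializing the ID vector per group just to measure its length
    data %>% group_by(ID) %>% mutate(PolYr = rep_len(v, n()))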