4

I have a following data frame:

library(tidyverse)
df <- data_frame(
  id = c(1, 1, 2, 2), 
  date1 = as.Date(c("2013-01-01", "2013-02-01", "2015-04-01", "2015-05-01")), 
  date2 = as.Date(c("2012-12-09", "2012-12-09", "2015-03-10", "2015-03-10"))
)

# A tibble: 4 x 3
     id      date1      date2
  <dbl>     <date>     <date>
1     1 2013-01-01 2012-12-09
2     1 2013-02-01 2012-12-09
3     2 2015-04-01 2015-03-10
4     2 2015-05-01 2015-03-10

And I want to complete this data frame such that for each id, there will be another date1 value. This another date1 value is computed as the next month. Also there is a date2 value which is same for all id's. With tidyr::complete this action can be done like this:

df %>% 
  group_by(id) %>% 
  complete(date1 = seq.Date(from = min(date1), length.out = 3, by = "month"), date2 = date2[1])

# A tibble: 6 x 3
# Groups:   id [2]
     id      date1      date2
  <dbl>     <date>     <date>
1     1 2013-01-01 2012-12-09
2     1 2013-02-01 2012-12-09
3     1 2013-03-01 2012-12-09
4     2 2015-04-01 2015-03-10
5     2 2015-05-01 2015-03-10
6     2 2015-06-01 2015-03-10

Since I have about 150K groups in my original data, the tidyr solution is taking more than hour to complete. I am assuming that speed would be gained using data.table. Can the same thing be done in data.table?

Similar questions has been asked in data.table equivalent of tidyr::complete() but without group_by clause.

Sotos
  • 51,121
  • 6
  • 32
  • 66
rToO
  • 75
  • 3
  • 2
    Using `data.table` `setDT(df)[, .(date1 = seq(min(date1), length.out = 3, by = 'month'), date2 = date2[1]), id]` – akrun Nov 03 '17 at 12:14
  • See @Uwe's answer here https://stackoverflow.com/questions/46330683/inserting-rows-into-data-frame-when-values-missing-in-category – IceCreamToucan Nov 03 '17 at 17:15

2 Answers2

3

Based on some initial benchmarking the data.table approach seems to be faster

library(data.table)
setDT(df)[, .(date1 = seq(min(date1), length.out = 3, by = 'month'), date2 = date2[1]), id]

Benchmarks

 df <- data_frame(
  id = rep(1:3000, each = 2), 
  date1 = rep(as.Date(c("2013-01-01", "2013-02-01", "2015-04-01", "2015-05-01")),
  length.out = 6000), 
  date2 = rep(as.Date(c("2012-12-09", "2012-12-09", "2015-03-10", "2015-03-10")), 
   length.out = 6000))

system.time({
df %>% 
  group_by(id) %>% 
  complete(date1 = seq.Date(from = min(date1), 
          length.out = 3, by = "month"), date2 = date2[1])
})
#user  system elapsed 
#64.05   21.27   86.05 

system.time({
setDT(df)[, .(date1 = seq(min(date1), length.out = 3, by = 'month'), date2 = date2[1]), id]
})
#user  system elapsed 
#  0.14    0.00    0.14 
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 3
    Is that data.table code equivalent? Seems like you'd take an object like your result and then left join to the main table to "complete"..? – Frank Nov 03 '17 at 12:23
  • 2
    @Frank It is a different way, but `system.time({ + setDT(df)[df[,.(date1 = seq(min(date1), length.out = 3, by = 'month'), date2 = date2[1]), id], on = .(id, date1, date2)] + })# user system elapsed 0.31 0.03 0.59` is still giving me less time – akrun Nov 03 '17 at 14:37
0

If you need the speed, keeping the by as lean as possible:

library(data.table)
library(lubridate)

> dt[, .SD
     ][, .(date1=max(date1)), .(id, date2)
     ][, date1Inc := date1 + months(1)
     ][, rbind(dt, .SD[, .(id, date1=date1Inc, date2)])
     ][order(id, date1)
     ]

   id      date1      date2
1:  1 2013-01-01 2012-12-09
2:  1 2013-02-01 2012-12-09
3:  1 2013-03-01 2012-12-09
4:  2 2015-04-01 2015-03-10
5:  2 2015-05-01 2015-03-10
6:  2 2015-06-01 2015-03-10
>   
> 
Clayton Stanley
  • 7,513
  • 9
  • 32
  • 46