2

Let see the following example in dplyr style.

# 1. Data set
df <- data.table(
  g1 = c(1, 1, 2, 1, 2, 2, 1),
  g2 = c(2, 1, 3, 3, 1, 1, 2),
  status = c(1, 0, 1, 0, 0, 1, 1),
  date_obs = as.Date(c("2019-01-01", "2019-01-02", "2019-01-12", "2019-01-15",
           "2019-01-20", "2019-01-24", "2019-01-30")))

# 2. Arrange data
df <- df %>% 
  arrange(g1, g2, date_obs)

# 3. Populate missing 'date_obs' and 'status' values
df_filled <- df %>%
  group_by(g1, g2) %>%
  complete(date_obs = seq.Date(min(date_obs), max(date_obs), by = "day")) %>%
  fill(status) %>%
  arrange(g1, g2, date_obs) %>% 
  ungroup()

How can I make the same operations by using data.table R syntax?

Thanks!

NelsonGon
  • 13,015
  • 7
  • 27
  • 57
Andrii
  • 2,843
  • 27
  • 33

2 Answers2

5

Another option using a rolling join.

setkey(DT, g1, g2, date_obs)

out <- DT[DT[, .(date_obs = seq(first(date_obs),
                                last(date_obs),
                                by = "day")), by=.(g1, g2)],
          on=.(g1, g2, date_obs),
          roll = TRUE]
out
#    g1 g2 status   date_obs
# 1:  1  1      0 2019-01-02
# 2:  1  2      1 2019-01-01
# 3:  1  2      1 2019-01-02
# 4:  1  2      1 2019-01-03
# 5:  1  2      1 2019-01-04
# 6:  1  2      1 2019-01-05
# 7:  1  2      1 2019-01-06
# 8:  1  2      1 2019-01-07
# 9:  1  2      1 2019-01-08
#10:  1  2      1 2019-01-09
#11:  1  2      1 2019-01-10
#12:  1  2      1 2019-01-11
#13:  1  2      1 2019-01-12
#14:  1  2      1 2019-01-13
#15:  1  2      1 2019-01-14
#16:  1  2      1 2019-01-15
#17:  1  2      1 2019-01-16
#18:  1  2      1 2019-01-17
#19:  1  2      1 2019-01-18
#20:  1  2      1 2019-01-19
#21:  1  2      1 2019-01-20
#22:  1  2      1 2019-01-21
#23:  1  2      1 2019-01-22
#24:  1  2      1 2019-01-23
#25:  1  2      1 2019-01-24
#26:  1  2      1 2019-01-25
#27:  1  2      1 2019-01-26
#28:  1  2      1 2019-01-27
#29:  1  2      1 2019-01-28
#30:  1  2      1 2019-01-29
#31:  1  2      1 2019-01-30
#32:  1  3      0 2019-01-15
#33:  2  1      0 2019-01-20
#34:  2  1      0 2019-01-21
#35:  2  1      0 2019-01-22
#36:  2  1      0 2019-01-23
#37:  2  1      1 2019-01-24
#38:  2  3      1 2019-01-12
#    g1 g2 status   date_obs

data

DT <- data.table(
  g1 = c(1, 1, 2, 1, 2, 2, 1),
  g2 = c(2, 1, 3, 3, 1, 1, 2),
  status = c(1, 0, 1, 0, 0, 1, 1),
  date_obs = as.Date(c("2019-01-01", "2019-01-02", "2019-01-12", "2019-01-15",
                       "2019-01-20", "2019-01-24", "2019-01-30")))
markus
  • 25,843
  • 5
  • 39
  • 58
  • 1
    Does the ```setkey``` provide any improvements? Could you skip the ```on = .(...)``` argument? – Cole Jun 01 '19 at 11:11
  • 1
    @Cole I thought that it allows to skip the `on` but then I ended up with an error. I am still in the phase of learning all the `data.table` tricks. – markus Jun 01 '19 at 11:16
  • 2
    I really like data.table but there are still some mysteries for me. Thanks for your comments and solution - it's a cleaner way to fill the NAs with the rolling join. – Cole Jun 01 '19 at 11:22
2

The method is to generate a "complete" data.table with all of the date ranges by variables g1 and g2. Then we do a left join to the original data.table and then fill in the NA status.

This still relies on another package to do the fill method. I couldn't get tidyr::fill to work but zoo::na.locf worked like a charm. There are a lot of other options for filling NA values.

Replace NA with last non-NA in data.table by using only data.table

R data.table join/ subsetting/ match by group and by a condition

Replacing NAs with latest non-NA value

setorder(dt, g1, g2, date_obs)

dt_complete <- dt[, .(date_obs = seq.Date(min(date_obs), max(date_obs), by = "day")), by = .(g1, g2)]

dt[dt_complete
    , on = c('date_obs', 'g1', 'g2')
    , .(g1, g2, date_obs, status = zoo::na.locf(status))
    ]

    g1 g2   date_obs status
 1:  1  1 2019-01-02      0
 2:  1  2 2019-01-01      1
 3:  1  2 2019-01-02      1
 4:  1  2 2019-01-03      1
 5:  1  2 2019-01-04      1
 6:  1  2 2019-01-05      1
 7:  1  2 2019-01-06      1
 8:  1  2 2019-01-07      1
 9:  1  2 2019-01-08      1
... 38 total rows...

Data which is simply OP:

library(data.table)
library(tidyverse)
# 1. Data set
dt <- data.table(
  g1 = c(1, 1, 2, 1, 2, 2, 1),
  g2 = c(2, 1, 3, 3, 1, 1, 2),
  status = c(1, 0, 1, 0, 0, 1, 1),
  date_obs = as.Date(c("2019-01-01", "2019-01-02", "2019-01-12", "2019-01-15",
                       "2019-01-20", "2019-01-24", "2019-01-30")))
Cole
  • 11,130
  • 1
  • 9
  • 24
  • 1
    The current `dev` version of `data.table` provides the new function `nafill`. See [NEWS](https://github.com/Rdatatable/data.table/blob/master/NEWS.md) – markus Jun 01 '19 at 11:02