1

Is there an elegant way of filling in missing time periods as timetk::pad_by_time and tsibble::fill_gaps in data.table?

The data might look like this

library(data.table)
data<-data.table(Date = c("2020-01-01","2020-01-01","2020-01-01","2020-02-01","2020-02-01","2020-03-01","2020-03-01","2020-03-01"),
             Card = c(1,2,3,1,3,1,2,3),
             A = rnorm(8)
)

The implicitly missing observation of card 2 at 2020-02-01.

In tsibble package, you can do the following

library(tsibble)
data <- data[, .(Date = yearmonth(ymd(Date)), 
               Card = as.character(Card),
              A= as.numeric(A))]
data<-as_tsibble(data, key = Card, index = Date)
data<-fill_gaps(data)

In timetk package, you can do the following

library(timetk)
data <- data[, .(Date = ymd(Date), 
             Card = as.character(Card),
             A= as.numeric(A))]
data<-data %>%
  group_by(Card) %>%
  pad_by_time(Date, .by = "month") %>%
  ungroup()
Nick
  • 81
  • 1
  • 6
  • Could you provide a specific example? – Waldi Oct 28 '21 at 04:57
  • What were the issues with using ``timetk`` or ``tsibble``? All these approaches seem to just ``expand`` or do a selfjoin, is the goal here to replicate without the need for external packages? – runr Oct 28 '21 at 05:46
  • ``expand.grid`` on unique values, and ``merge`` with a left join on the existing values, for a base R solution. Alternatively, can also look into ``tidyr::complete`` – runr Oct 28 '21 at 06:43
  • The data.table analogue of what runr described is covered here: https://stackoverflow.com/q/43483497/1191259 If you don't want to write the key cols multiple times, could also do `setkey(data, Date, Card); data[do.call(CJ, c(mget(key(data)), unique=TRUE))]` – Frank Oct 28 '21 at 07:03
  • @runr I was a bit worried about their performance and stability. Not sure if they work well in big data. – Nick Oct 28 '21 at 15:21

1 Answers1

4

Just data.table:

If no key is set, then

data2 <- data[CJ(Date, Card, unique = TRUE), on = .(Date, Card)]
data2
#          Date  Card           A
#        <char> <num>       <num>
# 1: 2020-01-01     1  1.37095845
# 2: 2020-01-01     2 -0.56469817
# 3: 2020-01-01     3  0.36312841
# 4: 2020-02-01     1  0.63286260
# 5: 2020-02-01     2          NA
# 6: 2020-02-01     3  0.40426832
# 7: 2020-03-01     1 -0.10612452
# 8: 2020-03-01     2  1.51152200
# 9: 2020-03-01     3 -0.09465904

(updated/simplified, thanks to @sindri_baldur!)

If a key is set, then you can use @Frank's method:

data2 <- data[ do.call(CJ, c(mget(key(data)), unique = TRUE)), ]

And from here, you can use nafill as desired, perhaps

data2[, A := nafill(A, type = "locf"), by = .(Card)]
#          Date  Card           A
#        <char> <num>       <num>
# 1: 2020-01-01     1  1.37095845
# 2: 2020-01-01     2 -0.56469817
# 3: 2020-01-01     3  0.36312841
# 4: 2020-02-01     1  0.63286260
# 5: 2020-02-01     2 -0.56469817
# 6: 2020-02-01     3  0.40426832
# 7: 2020-03-01     1 -0.10612452
# 8: 2020-03-01     2  1.51152200
# 9: 2020-03-01     3 -0.09465904

(How to fill is based on your knowledge of the context of the data; it might just as easily be by=.(Date), or some form of imputation.)


Update: the above does an expansion of possible combinations, which might fill outside of a particular Card's span, in which case one might see:

data <- data[-1,]
data[CJ(Date, Card, unique = TRUE), on = .(Date, Card)]
#          Date  Card           A
#        <char> <num>       <num>
# 1: 2020-01-01     1          NA
# 2: 2020-01-01     2 -0.42225588
# 3: 2020-01-01     3 -0.12235017
# 4: 2020-02-01     1  0.18819303
# 5: 2020-02-01     2          NA
# 6: 2020-02-01     3  0.11916096
# 7: 2020-03-01     1 -0.02509255
# 8: 2020-03-01     2  0.10807273
# 9: 2020-03-01     3 -0.48543524

I think there are two approaches to this:

  1. Doing the above code and then removing leading (and trailing) NAs per group:

    data[CJ(Date, Card, unique = TRUE), on = .(Date, Card)
      ][, .SD[ !is.na(A) | !seq_len(.N) %in% c(1, .N),], by = Card]
    #     Card       Date           A
    #    <num>     <char>       <num>
    # 1:     1 2020-02-01  0.18819303
    # 2:     1 2020-03-01 -0.02509255
    # 3:     2 2020-01-01 -0.42225588
    # 4:     2 2020-02-01          NA
    # 5:     2 2020-03-01  0.10807273
    # 6:     3 2020-01-01 -0.12235017
    # 7:     3 2020-02-01  0.11916096
    # 8:     3 2020-03-01 -0.48543524
    
  2. Completely different approach (assuming Date-class, not strictly required above):

    data[,Date := as.Date(Date)]
    data[data[, .(Date = do.call(seq, c(as.list(range(Date)), by = "month"))), 
              by = .(Card)],
         on = .(Date, Card)]
    #          Date  Card           A
    #        <Date> <num>       <num>
    # 1: 2020-01-01     2 -0.42225588
    # 2: 2020-02-01     2          NA
    # 3: 2020-03-01     2  0.10807273
    # 4: 2020-01-01     3 -0.12235017
    # 5: 2020-02-01     3  0.11916096
    # 6: 2020-03-01     3 -0.48543524
    # 7: 2020-02-01     1  0.18819303
    # 8: 2020-03-01     1 -0.02509255
    
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • 3
    Without key you can simplify to `data[CJ(Date, Card, unique = TRUE), on = .(Date, Card)]`. – s_baldur Oct 28 '21 at 11:08
  • Thank you @sindri_baldur, that is indeed simpler. – r2evans Oct 28 '21 at 11:24
  • Thank you both. However, I tried this on a slightly different situation. It will mistakenly fill in gaps. Example: Missing observation Card 1 Date 2020-01-01 will be filled as well. But it should not. – Nick Oct 29 '21 at 15:38
  • Nick, see my edit, I think either of the add-on blocks may suffice. – r2evans Oct 29 '21 at 16:11
  • Thanks. Both work! And I guess the second approach will be more efficient in performance? – Nick Oct 29 '21 at 17:24
  • I haven't benchmarked it, tbh, that's best done with data of representative size. The only concern I'd have with the second is if it is not a safe assumption to use `seq.Date(..., by="month")`. (Perhaps a hole in the original/first implementation: if a date is missing for *all* `Card`s, then it won't appear after `CJ`, whereas the `seq.Date` implementation will catch it. Tradeoffs :-) – r2evans Oct 29 '21 at 17:42