I have a data table like this, just much bigger:

customer_id <- c("1","1","1","2","2","2","2","3","3","3")
account_id <- as.character(c(11,11,11,55,55,55,55,38,38,38))
time <- as.Date(c("2017-01-01", "2017-05-01", "2017-06-01",
                  "2017-02-01", "2017-04-01", "2017-05-01", "2017-06-01",
                  "2017-01-01", "2017-04-01", "2017-05-01"))


tenor <- c(1,2,3,1,2,3,4,1,2,3)
variable_x <- c(87,90,100,120,130,150,12,13,15,14)

library(data.table)
my_data <- data.table(customer_id,account_id,time,tenor,variable_x)

customer_id account_id       time tenor variable_x
          1         11 2017-01-01     1         87
          1         11 2017-05-01     2         90
          1         11 2017-06-01     3        100
          2         55 2017-02-01     1        120
          2         55 2017-04-01     2        130
          2         55 2017-05-01     3        150
          2         55 2017-06-01     4         12
          3         38 2017-01-01     1         13
          3         38 2017-04-01     2         15
          3         38 2017-05-01     3         14

Each customer_id, account_id pair should have monthly observations from 2017-01-01 to 2017-06-01, but for some pairs some of these 6 months are missing. I would like to fill in the missing dates so that every customer_id, account_id pair has a row for all 6 months, with tenor and variable_x set to NA in the added rows. That is, it should look like this:

    customer_id account_id       time tenor variable_x
           1         11    2017-01-01     1         87
           1         11    2017-02-01    NA         NA
           1         11    2017-03-01    NA         NA
           1         11    2017-04-01    NA         NA
           1         11    2017-05-01     2         90
           1         11    2017-06-01     3        100
           2         55    2017-01-01    NA         NA
           2         55    2017-02-01     1        120
           2         55    2017-03-01    NA         NA
           2         55    2017-04-01     2        130
           2         55    2017-05-01     3        150
           2         55    2017-06-01     4         12
           3         38    2017-01-01     1         13
           3         38    2017-02-01    NA         NA
           3         38    2017-03-01    NA         NA
           3         38    2017-04-01     2         15
           3         38    2017-05-01     3         14
           3         38    2017-06-01    NA         NA

I tried creating a sequence of dates from 2017-01-01 to 2017-06-01 by using

ts = seq(as.Date("2017/01/01"), as.Date("2017/06/01"), by = "month")

and then merging it with the original data using

ts = data.table(ts)
colnames(ts) = "time"
merged <- merge(ts, my_data, by="time", all.x=TRUE)

but this does not give the result above. How can I add such rows with dates for each customer_id, account_id pair?

doremi

2 Answers

We can do a join. Create the sequence of 'time' from min to max by '1 month', expand it for each 'customer_id', 'account_id' group, and then join the original data to that expanded grid on those two columns plus 'time'

ts1 <- seq(min(my_data$time), max(my_data$time), by = "1 month")
my_data[my_data[, .(time =ts1 ), .(customer_id, account_id)], 
             on = .(customer_id, account_id, time)]
#    customer_id account_id       time tenor variable_x
# 1:           1         11 2017-01-01     1         87
# 2:           1         11 2017-02-01    NA         NA
# 3:           1         11 2017-03-01    NA         NA
# 4:           1         11 2017-04-01    NA         NA
# 5:           1         11 2017-05-01     2         90
# 6:           1         11 2017-06-01     3        100
# 7:           2         55 2017-01-01    NA         NA
# 8:           2         55 2017-02-01     1        120
# 9:           2         55 2017-03-01    NA         NA
#10:           2         55 2017-04-01     2        130
#11:           2         55 2017-05-01     3        150
#12:           2         55 2017-06-01     4         12
#13:           3         38 2017-01-01     1         13
#14:           3         38 2017-02-01    NA         NA
#15:           3         38 2017-03-01    NA         NA
#16:           3         38 2017-04-01     2         15
#17:           3         38 2017-05-01     3         14
#18:           3         38 2017-06-01    NA         NA
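
As a side note on why the merge in the question falls short, here is a minimal sketch that puts both attempts side by side. Merging on 'time' alone can only add a row for a month that no customer has at all, while the grid built per pair adds the missing months for every pair. The names `by_time`, `grid` and `filled` are made up for this illustration.

```r
library(data.table)

# Same example data as in the question
my_data <- data.table(
  customer_id = c("1","1","1","2","2","2","2","3","3","3"),
  account_id  = as.character(c(11,11,11,55,55,55,55,38,38,38)),
  time        = as.Date(c("2017-01-01", "2017-05-01", "2017-06-01",
                          "2017-02-01", "2017-04-01", "2017-05-01", "2017-06-01",
                          "2017-01-01", "2017-04-01", "2017-05-01")),
  tenor       = c(1,2,3,1,2,3,4,1,2,3),
  variable_x  = c(87,90,100,120,130,150,12,13,15,14)
)

ts1 <- seq(min(my_data$time), max(my_data$time), by = "1 month")

# Attempt from the question: merge on 'time' only.
# Only 2017-03-01 is absent for every customer, so only one NA row is added.
by_time <- merge(data.table(time = ts1), my_data, by = "time", all.x = TRUE)
nrow(by_time)   # 11

# Grid of all months for each customer/account pair, then join on all three keys
grid   <- my_data[, .(time = ts1), by = .(customer_id, account_id)]
filled <- my_data[grid, on = .(customer_id, account_id, time)]

# Every pair should now have one row per month
filled[, .N, by = .(customer_id, account_id)]
```

The last line is a cheap sanity check worth keeping on the real data: if any pair does not show N equal to the number of months, the keys probably contain duplicates.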

Or using tidyverse

library(tidyverse)
distinct(my_data, customer_id, account_id) %>%
      mutate(time = list(ts1)) %>% 
      unnest(time) %>% 
      left_join(my_data, by = c("customer_id", "account_id", "time"))

Or with complete from tidyr

my_data %>% 
     complete(nesting(customer_id, account_id), time = ts1)
akrun

A different data.table approach:

my_data2 <- my_data[, .(time = seq(as.Date("2017/01/01"), as.Date("2017/06/01"), 
                              by = "month")), by = list(customer_id, account_id)]

merge(my_data2, my_data, all.x = TRUE)

     customer_id account_id       time tenor variable_x
 1:           1         11 2017-01-01     1         87
 2:           1         11 2017-02-01    NA         NA
 3:           1         11 2017-03-01    NA         NA
 4:           1         11 2017-04-01    NA         NA
 5:           1         11 2017-05-01     2         90
 6:           1         11 2017-06-01     3        100
 7:           2         55 2017-01-01    NA         NA
 8:           2         55 2017-02-01     1        120
 9:           2         55 2017-03-01    NA         NA
10:           2         55 2017-04-01     2        130
11:           2         55 2017-05-01     3        150
12:           2         55 2017-06-01     4         12
13:           3         38 2017-01-01     1         13
14:           3         38 2017-02-01    NA         NA
15:           3         38 2017-03-01    NA         NA
16:           3         38 2017-04-01     2         15
17:           3         38 2017-05-01     3         14
18:           3         38 2017-06-01    NA         NA
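
If the window should not depend on which months happen to be present in the data, the same idea can be wrapped in a small function with explicit start and end dates. This is only a sketch: `fill_months` is a hypothetical helper (not part of data.table), and it assumes the key columns are named customer_id, account_id and time as in the question.

```r
library(data.table)

# Same example data as in the question
my_data <- data.table(
  customer_id = c("1","1","1","2","2","2","2","3","3","3"),
  account_id  = as.character(c(11,11,11,55,55,55,55,38,38,38)),
  time        = as.Date(c("2017-01-01", "2017-05-01", "2017-06-01",
                          "2017-02-01", "2017-04-01", "2017-05-01", "2017-06-01",
                          "2017-01-01", "2017-04-01", "2017-05-01")),
  tenor       = c(1,2,3,1,2,3,4,1,2,3),
  variable_x  = c(87,90,100,120,130,150,12,13,15,14)
)

# Hypothetical helper: one row per month in [start, end] for each
# customer_id/account_id pair, with NA where the data has no observation
fill_months <- function(dt, start, end) {
  month_seq <- seq(as.Date(start), as.Date(end), by = "month")
  pairs <- unique(dt[, .(customer_id, account_id)])
  grid  <- pairs[, .(time = month_seq), by = .(customer_id, account_id)]
  merge(grid, dt, by = c("customer_id", "account_id", "time"), all.x = TRUE)
}

res  <- fill_months(my_data, "2017-01-01", "2017-06-01")
# A wider window simply adds more NA rows per pair
res2 <- fill_months(my_data, "2016-12-01", "2017-06-01")
```

With a fixed window, every pair gets the full set of months even when no customer has an observation in the first or last month, which a min/max range taken from the data would not guarantee.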
tmfmnk