R Padding multiple time series from within an unbalanced panel data set

Question

I have a panel data set for daily revenue (and other variables) by ID, where the day with 0 revenue go unreported. I want to fill in these blanks with 0 for my analysis, meaning that for each ID's time series, I need to make sure there is an observation for each day. Each series can begin or end on a date distinct from the other series. I have been attempting to use the "padr" package, but I keep getting an "unused argument" error using the following sample code:

library(padr)
library(dplyr)
#unbalanced panel data
ID <- c(1,1,1,1,
        2,2,2,2,2,2,
        3,3,3,3,3,3,3,
        4,4,4)
DT <-  today() + c(1,3,4,5, #ID = 1
                  3,4,7,8,9,10, #ID = 2
                  2,5,6,7,8,9,10, #ID = 3
                  8,10,11) #ID = 4

#The end date denote the max date for each ID
EndDT <-  today() + c(5,5,5,5, #ID = 1
                      13,13,13,13,13,13, #ID = 2
                      10,10,10,10,10,10,10, #ID = 3
                      15,15,15) #ID = 4

#random variables v1 and v2 to represent revenue and other variables
set.seed(1)
v1 <- rnorm(20,mean = 10000, sd = 5)
v2 <- rnorm(20,mean = 5000, sd = 1.5)

df <- as.data.frame(cbind(ID,DT,EndDT,v1,v2))

#format to simpler date
df$DT <- as.Date(DT, origin="1970-01-01")
df$EndDT <- as.Date(EndDT, origin="1970-01-01")

df_padded <- arrange(df,ID,DT) %>%
          pad(by='DT',group='ID', end_val='EndDT') %>%
          fill_by_value(v1,v2, value=0)

My error message:

Error in pad(., by = "DT", group = "ID", end_val = "EndDT") : 
  unused argument (group = "ID")

Answers not involving padr are also highly welcome.

Analyst Guy · Answer 1 · 2018-04-25T21:33:10.670

After tussling with padr for a while, I decided to write my own function. This function works for the example set, but quickly ran into issues with real data. Either way, I figured this may be of use to someone else, so here it is:

date.pad <- function(df, date.var, group, replace.vars, new.val=0){
  require("dplyr")
  require("lazyeval")
  require("lubridate")
  tempdf1 <- arrange_(df,group,date.var)
  finaldf <- tempdf1[0,]
  unique.id <- unique(tempdf1[,group])
  nonreplaced.vars <- setdiff(colnames(tempdf1),replace.vars)
  nonreplaced.vars <- nonreplaced.vars[!nonreplaced.vars==date.var]

  for(i in seq_along(unique.id)){
    filter_criteria <- interp(~y==x, .values=list(y=as.name(group),x=i)) #necessary for NSE
    tempdf2 <- filter_(tempdf1,filter_criteria) 
    min.date <- min(tempdf2[[date.var]])
    max.date <- max(tempdf2[[date.var]])
    all.days <- as.Date(seq(min.date,max.date,by="days"),origin="1970-01-01")
    distinct.days <- unique(tempdf2[,date.var])
    app.days <- as.Date(setdiff(all.days,distinct.days),origin="1970-01-01")
    tempdf3 <- tempdf2[0,]

      for(n in seq_along(app.days)){
          tempdf3[n,date.var] <- app.days[n]
      }
      for(j in seq_along(nonreplaced.vars)){
          tempdf3[1:nrow(tempdf3),nonreplaced.vars[j]] <- tempdf2[1,nonreplaced.vars[j]]
      }
      finaldf <- bind_rows(finaldf,tempdf3)
  }
  finaldf[replace.vars] <-new.val
  finaldf <- bind_rows(finaldf,df) %>% arrange_(group,date.var)
  return(finaldf)
} 

for.exmpl <- date.pad(df=df1, date.var="DT", group="ID", replace.vars=c("v1","v2"), new.val=0)
for.exmpl

score 0 · Accepted Answer · answered May 03 '18 at 15:05

Here is a new answer I've devised that is far more applicable outside my one application, and uses way less code:

library(tidyverse)

temp <- group_by(df1,ID) %>%
        complete(DT = seq.Date(min(DT),max(EndDT),by="day")) %>%
        fill(EndDT,sometext) %>%
        arrange(ID,DT)
temp[is.na(temp)] <- 0
View(temp)

Which results in:

# A tibble: 33 x 6
# Groups:   ID [4]
      ID DT         EndDT          v1    v2 sometext
   <dbl> <date>     <date>      <dbl> <dbl> <chr>   
 1    1. 2018-05-04 2018-05-08  9997. 5001. textvar
 2    1. 2018-05-05 2018-05-08     0.    0. textvar
 3    1. 2018-05-06 2018-05-08 10001. 5001. textvar
 4    1. 2018-05-07 2018-05-08  9996. 5000. textvar
 5    1. 2018-05-08 2018-05-08 10008. 4997. textvar
 6    2. 2018-05-06 2018-05-16 10002. 5001. textvar
 7    2. 2018-05-07 2018-05-16  9996. 5000. textvar
 8    2. 2018-05-08 2018-05-16     0.    0. textvar
 9    2. 2018-05-09 2018-05-16     0.    0. textvar
10    2. 2018-05-10 2018-05-16 10002. 5000. textvar
# ... with 23 more rows

(Please ignore the "sometext" variable. I created that while testing out my function below.)

score 0 · Answer 3 · answered May 08 '18 at 11:00

Your code does not run because you specify a character at the end_val argument. This should be a Date, you can only specify a single date over all the groups.

In order to do what you want with padr, you should combine the DT and EndDT columns. This way for each ID its final date is present in the DT column:

df %>% 
  group_by(ID) %>% 
  summarise(DT = max(EndDT)) %>% 
  mutate(v1 = NA, v2 = NA) %>% 
  bind_rows(df %>% select(-EndDT), .) %>% 
  group_by(ID, DT) %>% 
  filter(row_number() == 1) %>% 
  group_by(ID) %>% 
  pad()

R Padding multiple time series from within an unbalanced panel data set

3 Answers3