0

I am trying to create a dummy variable in row A based on if rows within 1 year of row A, x occurs. I believe this is probably a common issue, and there are similar questions already posted (this is the most similar I found). Unfortunately the zoo package doesn't fit well since it doesn't deal well with irregular spaced dates (I don't want to aggregate rows and my data is too large to deal with this well) and I have been trying unsuccessfully to figure out a datatable way to do this, though I would prefer tidyverse given my experience.

dates <- rep(as.Date(c('2015-01-01', '2015-02-02', '2015-03-03', '2016-02-02'), '%Y-%m-%d'), 3)

names <- c(rep('John', 4), rep('Phil', 4), rep('Ty', 4))

df <- data.frame(Name = names, Date = dates,
             did_y = c(0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0),
             did_x = c(1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1))

Name       Date       did_y   did_x
John      2015-01-01    0      1
John      2015-02-02    1      0
John      2015-03-03    1      0
John      2016-02-02    0      0
Phil      2015-01-01    1      0
Phil      2015-02-02    1      1
Phil      2015-03-03    0      1
Phil      2016-02-02    0      0       
Ty        2015-01-01    0      0
Ty        2015-02-02    0      0
Ty        2015-03-03    0      0
Ty        2016-02-02    0      1

What I'd like is

dffinal <- data.frame(Name = names, Date = dates, 
                  did_y = c(0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0),
                  did_x = c(1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1),
                  did_x_within_year = c(1, 1, 1, NA, 1, 1, 1, 1, 0, 1, 1, 1),
                  did_x_next_year = c(0, 0, 0, NA, 1, 1, 0, NA, 0, 1, 1, NA))

Name       Date       did_y   did_x   did_x_within_year   did_x_next_year
John      2015-01-01    0      1            1                   0
John      2015-02-02    1      0            1                   0
John      2015-03-03    1      0            1                   0 
John      2016-02-02    0      0            NA                  NA
Phil      2015-01-01    1      0            1                   1
Phil      2015-02-02    1      1            1                   1
Phil      2015-03-03    0      1            1                   0
Phil      2016-02-02    0      0            1                   NA   
Ty        2015-01-01    0      0            0                   0
Ty        2015-02-02    0      0            1                   1
Ty        2015-03-03    0      0            1                   1
Ty        2016-02-02    0      1            1                   NA

So I'd like two columns, one for if x occurred within 1 year of row A (regardless of before or after) and another if it occurred within 1 year in the future.

I experimented with RcppRoll, but it seems to only look backward in dates, i.e. if something happened a year before it will dummy, but not if it will occur 1 year in the future.

df$did_x_next_year <- roll_max(df$did_x, 365, fill = NA)

EDIT: Attempted solution based on other question

I have tried to implement this solution (1b), unfortunately nothing in my dataframe/datatable actually changes. Even if I leave the function as is from the example when applied to my data, it does not update.

library(zoo)
library(data.table)
df$Year <- lubridate::year(df$Date)
df$Month <- lubridate::month(df$Date)
df$did_x_next_year <- df$did_x

DT <- as.data.table(df)

k <- 12 # prior 12 months

# inputs zoo object x, subsets it to specified window and sums
Max2 <- function(x) {
  w <- window(x, start = end(x) - k/12, end = end(x) - 1/12)
  if (length(w) == 0 || all(is.na(w))) NA_real_ else max(w, na.rm = TRUE)
}

nms <- names(DT)[7]

setkey(DT, Name, Year, Month) # sort

# create zoo object from arguments and run rollapplyr using Sum2
roll2 <- function(x, year, month) {
  z <- zoo(x, as.yearmon(year + (month - 1)/12))
  coredata(rollapplyr(z, k+1, Max2, coredata = FALSE, partial = TRUE))
}

DT <- DT[, nms := lapply(.SD, roll2, Year, Month), .SDcols = nms, by = "Name"]
vino88
  • 163
  • 13
  • row A means row 1?? – i.n.n.m Aug 11 '17 at 21:31
  • Well I am grouping the data based on the Name column and I am looking for the time window to roll forward with each row so the calculation would be done look forward and backward from the date of each row. – vino88 Aug 11 '17 at 21:35
  • So you want a rolling mean or interpolate? – i.n.n.m Aug 11 '17 at 21:41
  • Sorry if I am missing your meaning, but I am trying to 1. group by Name, 2. for each row then look forward from the date of the row to all subsequent rows within a year, 3. if in those subsequent rows, did_x = 1, then to give the did_x_within_year = 1 – vino88 Aug 11 '17 at 21:48
  • in `did_x_within_year` `Name=Ty` should it be 0 rather than 1 ? – BENY Aug 11 '17 at 22:09
  • @Wen, well I'm counting the date up to a year so I'm including the edge case. – vino88 Aug 11 '17 at 22:13
  • This is a duplicate of https://stackoverflow.com/questions/45549475/how-can-i-roll-up-lagged-time-data-given-conditions-in-a-data-table-in-r/45553386#45553386 – G. Grothendieck Aug 12 '17 at 11:01
  • Thank you @G.Grothendieck, I'm very new to data.table and zoo so trying to parse that other answer together. I wonder if there is a tidyverse solution to this? – vino88 Aug 15 '17 at 20:03

1 Answers1

0

After a suggestion from a friend, I came up with the following:

# Filtering to the obs I care about
dfadd <- df %>% filter(did_x == 1) %>% select(Name, Date) %>% rename(x_date = Date)

# Converting to character since in dcast it screws up the dates
dfadd$x_date <- as.character(dfadd$x_date)

# Merging data
df <- plyr::join(df, dfadd, by = 'Name')

# Creating new column used for dcasting
df <- df %>% group_by(Name, Date) %>% mutate(x_date_index = seq(from = 1, to = n()))
df$x_date_index <- paste0('x_date_',df$x_date_index)

#casting the data wide
df <- reshape2::dcast(df,
                  Name + Date + did_y + did_x ~ x_date_index,
                  value.var = "x_date",
                  fill = NA)

# Converting to back to date
df$x_date_1 <- as.Date(df$x_date_1)
df$x_date_2 <- as.Date(df$x_date_2)

# Creating dummy variables
df$did_x_within_year <- 0
df$did_x_within_year <- ifelse((df$x_date_1 - df$Date) <= 366, 1, 
df$did_x_within_year)

df$did_x_next_year <- 0
df$did_x_next_year <- ifelse(((df$x_date_1 > df$Date) & (df$x_date_1 - df$Date<= 365)), 
                         1, df$did_x_next_year)

# Can extend to account for x_date_2, x_date_3, etc

# Changing the last entry to NA as desired
df <- df %>% group_by(Name) %>% mutate(did_x_next_year = c(did_x_next_year[-n()], NA))
vino88
  • 163
  • 13