0

I would like to create a two flags

order.last.30.days

order.anytime.in.past

on the following data.

library(data.table)
library(lubridate)

my.data <- data.table(

  supplier = c("a","a","a","a","a","a","b","b","b","b","b","b"),
  date = rep(c("2017-06-01","2017-03-01","2017-02-01","2017-01-12",
                "2017-05-01","2017-04-01"), 2), 
  order = c(1,0,0,1,1,0,0,1,0,0,1,0)

)

my.data[,date := ymd(date)]
setorder(my.data, supplier, date)

my.data[, prev.date := shift(date, type = c("lag")),
        by = .(supplier)]

my.data[, days.btw.dates := time_length( interval(prev.date, date), 
                                          unit = "days")]

How can I do this using shift in the data.table package?

iboboboru
  • 1,112
  • 2
  • 10
  • 21
  • 2
    Note that `data.table` has a fairly thick set of date functions, so that you don't need to use `lubridate`. – lmo Aug 11 '17 at 11:00

1 Answers1

0

Came up with rough solution in two parts.

order.anytime.in.past

#Calculate cumsum of order for supplier and set to 1 if greater then zero.
#Then remove the first cumsum as should be NA as no historic data

my.data[, order.any.time.in.past := ifelse(cumsum(order) > 0, 1,0), 
        by = .(supplier)]

my.data[, order.any.time.in.past := replace(order.any.time.in.past,
                                             seq_len(.N)==1,NA) ,
        by=supplier]

order.last.30.days

Inspired by this answer Conditional rolling mean (moving average) on irregular time series

## Solution using dplyr, tidyr, RccpRoll
## add extra row with the min(date) per group less time.window-2 days
## use tidyr::complete and tidyr::full_seq to seq along day per group
## use Rccp::roll_sum() to count up across the time window. The padding 
## of earliest date make sures counts start happening for earlier dates
## (i.e.) don't wait until hit 30 days.
## drop any of the dates that were not actual dates
## make data.table and set first values to NA as no real historic data

time.window = 30

my.data <- my.data %>% 
  group_by(supplier) %>% 
  do(add_row(.,
             supplier = unique(.$supplier), 
             date = (min(.$date) - (time.window-2)), 
             .before=0)) %>%
  tidyr::complete(date = tidyr::full_seq(date,1)) %>%
  mutate(count.orders = RcppRoll::roll_sum(order, 
                                           time.window, align = c("right"), 
                                           fill = NA, na.rm = TRUE)) %>%
  mutate(order.last.30.days = ifelse(count.orders >0, 1,0)) %>%
  select(-count.orders) %>%
  filter(!is.na(order)) %>%
  data.table()

Nice to come up with good data.table answer. Suggestions welcome.

iboboboru
  • 1,112
  • 2
  • 10
  • 21
  • Fyi, `ifelse()` is discouraged: https://stackoverflow.com/questions/16275149/does-ifelse-really-calculate-both-of-its-vectors-every-time-is-it-slow and `do()` is notoriously slow. Also, I guess most people (like me) would discourage mixing dplyr and data.table syntax. If you prefer dplyr's, use dtplyr. – Frank Aug 11 '17 at 14:43
  • Thanks @Frank - I know do is slow. Trying to solve using nest() and purrr package. Do you have any suggestions how to solve add_row with purrr ? – iboboboru Aug 11 '17 at 14:49
  • Nope, sorry; I get everything done with data.table and magrittr in my work. Btw, generally you should use a question to ask for one thing and explain what it is, showing desired output, in the question. – Frank Aug 11 '17 at 14:54