I'm looking for a more efficient way to do the following. I have monthbucket as a helper data frame, and df:

library(dplyr)
set.seed(123)

monthbucket <- data.frame(
  startmonth = seq(as.Date("2010-01-01"),as.Date("2011-05-01"),by="months"),
  endmonth = seq(as.Date("2010-02-01"),as.Date("2011-06-01"),by="months")-1)


df <- data.frame(
  start = sample(seq(as.Date("2010-01-01"), as.Date("2011-01-01"), by = "months"), 10, replace = TRUE),
  end = sample(seq(as.Date("2011-02-01"), as.Date("2011-05-01"), by = "months"), 10, replace = TRUE),
  sex = sample(c('F', 'M'), 10, replace = TRUE),
  group = sample(1:8, 10, replace = TRUE))

I want to get counts per month in monthbucket for the different features in df. The following code works but gets tedious when a feature has more than 2 levels. For instance, doing the same for df$group would be pretty painful.


monthbucket %>% 
  group_by(startmonth) %>% 
  summarise(c.active= sum(df$start <=startmonth),
            c.termed= sum(df$end < endmonth),
            active= c.active-c.termed,
            c.active.F= sum(df$start <=startmonth & df$sex=='F'),
            c.termed.F= sum(df$end <endmonth & df$sex =='F'),
            active.F= c.active.F-c.termed.F,
            c.active.M= sum(df$start <=startmonth & df$sex=='M'),
            c.termed.M= sum(df$end < endmonth & df$sex =='M'),
            active.M= c.active.M-c.termed.M
  )

Two questions. First, I use monthbucket as a helper data frame to check whether the records fall within the respective timespan. Is it possible to get rid of that extra step? Second, how can I change my code to make it easier to get counts for multiple levels per feature?
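To make both questions concrete, here is a minimal sketch of one possible direction, not a definitive solution. It assumes that start and end always fall on the first of a month (as in the sample data), and it derives the month sequence from the range of df itself, so its months run from the earliest start to the latest end rather than over the fixed monthbucket range. The helper name active_by_level is hypothetical. It keeps the same definitions as the summarise() call above (started on or before the month start, minus ended before the month end) and returns one "active" column per level of whichever feature is named.

```r
# Same simulated data as in the question
set.seed(123)
df <- data.frame(
  start = sample(seq(as.Date("2010-01-01"), as.Date("2011-01-01"), by = "months"), 10, replace = TRUE),
  end = sample(seq(as.Date("2011-02-01"), as.Date("2011-05-01"), by = "months"), 10, replace = TRUE),
  sex = sample(c("F", "M"), 10, replace = TRUE),
  group = sample(1:8, 10, replace = TRUE))

# Hypothetical helper: one row per month, one "active.<level>" column per
# level of `feature`. No separate month helper data frame is required.
active_by_level <- function(df, feature) {
  # Month starts derived from the data (assumes dates are first-of-month)
  startmonth <- seq(min(df$start), max(df$end), by = "month")
  # Last day of each month: first day of the following month minus one day
  endmonth <- seq(min(df$start), by = "month",
                  length.out = length(startmonth) + 1)[-1] - 1
  # A factor makes table() report every level, including zero counts
  f <- factor(df[[feature]])
  counts <- do.call(rbind, lapply(seq_along(startmonth), function(i) {
    c.active <- table(f[df$start <= startmonth[i]])  # started by this month
    c.termed <- table(f[df$end < endmonth[i]])       # ended before month end
    as.vector(c.active - c.termed)
  }))
  colnames(counts) <- paste("active", levels(f), sep = ".")
  data.frame(startmonth, counts, check.names = FALSE)
}

active_by_level(df, "sex")
active_by_level(df, "group")
```

The same call works for any column of df, so adding a feature with many levels does not require writing out one line per level.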

CER
  • A good first step would be to use one of the answers [here](https://stackoverflow.com/q/4560459/903061) to convert `sex` and `group` to dummy encoding. Then I think you can get the rest by cross joining `df` to `monthbucket` and filtering (or fuzzyjoining, or using non-equi joins in data.table), and then doing a couple `summarize_at`s. I don't have time to write up a complete answer now, but I think that general idea should work. – Gregor Thomas Jul 11 '19 at 20:26
  • Yeah, I stumbled upon one-hot encoding, which might get things on the road for the dummy part – CER Jul 11 '19 at 20:32
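
The approach outlined in the comments (dummy-encode the features, cross join to the month buckets, then summarise column-wise) could be sketched roughly as follows. This is only an illustration under a few assumptions: one_hot is a hypothetical hand-rolled helper rather than a function from any package, the cross join uses base merge() with by = NULL, and across() (dplyr 1.0.0 or later) stands in for summarize_at. Instead of filtering rows, it keeps every month and applies the question's own definition, started minus termed, to each dummy column.

```r
library(dplyr)

# Same simulated data as in the question
set.seed(123)
monthbucket <- data.frame(
  startmonth = seq(as.Date("2010-01-01"), as.Date("2011-05-01"), by = "months"),
  endmonth = seq(as.Date("2010-02-01"), as.Date("2011-06-01"), by = "months") - 1)
df <- data.frame(
  start = sample(seq(as.Date("2010-01-01"), as.Date("2011-01-01"), by = "months"), 10, replace = TRUE),
  end = sample(seq(as.Date("2011-02-01"), as.Date("2011-05-01"), by = "months"), 10, replace = TRUE),
  sex = sample(c("F", "M"), 10, replace = TRUE),
  group = sample(1:8, 10, replace = TRUE))

# Hypothetical helper: one 0/1 column per observed level of x,
# named "<prefix>.<level>"
one_hot <- function(x, prefix) {
  lv <- sort(unique(x))
  m <- do.call(cbind, lapply(lv, function(l) as.integer(x == l)))
  colnames(m) <- paste(prefix, lv, sep = ".")
  as.data.frame(m)
}

df_wide <- cbind(df, one_hot(df$sex, "sex"), one_hot(df$group, "group"))
dummy_cols <- setdiff(names(df_wide), names(df))

result <- merge(monthbucket, df_wide, by = NULL) %>%  # cross join: month x record
  group_by(startmonth) %>%
  summarise(
    # Per dummy column: records started by the month start,
    # minus records that ended before the month end
    across(all_of(dummy_cols),
           ~ sum(.x[start <= startmonth]) - sum(.x[end < endmonth]),
           .names = "active.{.col}"),
    .groups = "drop")

result
```

The cross join produces nrow(monthbucket) * nrow(df) rows, which is fine at this size; for large data, the non-equi join mentioned in the comment is likely the more economical route.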

0 Answers