2

I'm stuck trying to find a relatively simple way to count occurrences within a date range by group using R. I get the idea there has to be an easier way than what I'm trying.

I have over 6,000 groups, each group has anywhere from 1 to 100 IDs within, each with a start date and an end date anywhere from Jan 1, 1990 to today. I want to make a dataframe, one group per column, and one day per row, counting the number of IDs active per day from April 1, 2013 until March 31, 2018. For obvious reasons, using countifs in excel will not cut it.

I was trying to use this question as a starting point, as such:

df1 <- data.frame(group = c(1,1,2,3,3),
              id = c(1,2,1,1,2),
              startdate = c("2016-01-01","2016-04-04","2016-03-02","2016-08-01","2016-04-01"), 
              enddate = c("2016-04-04","2999-01-01","2016-05-02","2016-08-05","2999-01-01"))

report <- data.frame(date = seq(from = as.Date("2016-04-01"),by="1 day", length.out = 7))
report <- cbind(report,matrix(data=NA,nrow=7,ncol=3))
names(report) <- c('date',as.vector(unique(df1$group)))

daily <- function(i,...){
    report[,i+1] <- sapply(report$date, function(x)
    sum(as.Date(df1$startdate) < as.Date(x) &
        as.Date(df1$enddate) > as.Date(x) & 
       df1$group == unique(df1$group)[i]))
        }

for (i in unique(df1$group))
  daily(i)

However, this doesn't seem to do anything (nor does it throw errors). Is there an easier way to do this? Am I way off base? Any help is appreciated for this non-programmer!

Additional help requested: I'm trying to modify Jaap's code in the answer below to include group start and group end times, so that the data table displays an NA when the group is not active.

Example data:

df2 <- data.frame(group = c(1,1,2,3,3),
                  groupopendate = c("2016-04-02","2016-04-02","2016-04-01","2016-04-02","2016-04-02"),
                  groupclosedate = c("2016-04-08","2016-04-08","2016-04-10","2016-04-09","2016-04-09"),
                  id = c(1,2,1,1,2),
                  startdate = c("2016-04-02","2016-04-04","2016-04-03","2016-04-02","2016-04-05"), 
                  enddate = c("2016-04-04","2016-04-06","2016-04-10","2016-04-08","2016-04-08"))

Jaap's solution gives me this:

       active grp1 grp2 grp3
1: 2016-04-02    1    0    1
2: 2016-04-03    1    1    1
3: 2016-04-04    1    1    1
4: 2016-04-05    1    1    2
5: 2016-04-06    0    1    2
6: 2016-04-07    0    1    2

However, what I want is such:

        active grp1 grp2 grp3
1:  2016-04-01   NA    0   NA
2:  2016-04-02    1    0    1
3:  2016-04-03    1    1    1
4:  2016-04-04    1    1    1
5:  2016-04-05    1    1    1
6:  2016-04-06    1    1    2
7:  2016-04-07    0    1    2
8:  2016-04-08   NA    1    0
9:  2016-04-09   NA    1   NA
10: 2016-04-10   NA   NA   NA

Any help is appreciated!

A. Mullins
  • 55
  • 6

2 Answers2

4

A possible alternative solution using :

# load the package & convert 'df1' to a data.table
library(data.table)
setDT(df1)

# convert the date columns to a date format
# not needed if they are 
df1[, `:=` (startdate = as.Date(startdate), enddate = as.Date(enddate))]

# create a new data.table with the 'active' days
DT <- data.table(active = seq(from = as.Date("2016-04-01"), by = "day", length.out = 7))

# use a join and dcast to get the desired result
DT[df1
   , on = .(active > startdate, active < enddate)
   , allow = TRUE
   , nomatch = 0
   , .(active = x.active, group, id)
   ][, dcast(.SD, active ~ paste0("grp",group), value.var = "id", fun = length)]

which gives:

       active grp1 grp2 grp3
1: 2016-04-01    1    1    0
2: 2016-04-02    1    1    1
3: 2016-04-03    1    1    1
4: 2016-04-04    0    1    1
5: 2016-04-05    1    1    1
6: 2016-04-06    1    1    1
7: 2016-04-07    1    1    1

NOTE: I've used paste0("grp",group) instead of just group in the dcast step as it leads to better columnnames (it is better not to use just numeric values as columnnames)


With regard to your additional example, you could solve that as follows:

setDT(df2)

df2[, c(2:3,5:6) := lapply(.SD, as.Date), .SDcols = c(2:3,5:6)]

DT <- data.table(active = seq(from = min(df2$groupopendate),
                              to = max(df2$groupclosedate),
                              by = "day"))

df2new <- df2[, .(active = seq.Date(startdate, enddate, by = "day"))
              , by = .(group, id)
              ][, .N, by = .(group, active)
                ][df2[, .(active = seq.Date(groupopendate[1], groupclosedate[.N] - 1, by = "day"))
                      , by = .(group)]
                  , on = .(group, active)
                  ][is.na(N), N := 0
                    ][, dcast(.SD, active ~ paste0("grp",group))]

nms <- setdiff(names(df2new), "active")

DT[df2new
   , on = .(active)
   , (nms) := mget(paste0("i.",nms))][]

which gives:

> DT
        active grp1 grp2 grp3
 1: 2016-04-01   NA    0   NA
 2: 2016-04-02    1    0    1
 3: 2016-04-03    1    1    1
 4: 2016-04-04    2    1    1
 5: 2016-04-05    1    1    2
 6: 2016-04-06    1    1    2
 7: 2016-04-07    0    1    2
 8: 2016-04-08   NA    1    2
 9: 2016-04-09   NA    1   NA
10: 2016-04-10   NA    1   NA
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • 1
    This is great solution! It is **much** faster than using the for loop. Thanks! – A. Mullins Oct 03 '18 at 17:22
  • My groups also start and end at different times (grpstart and grpend are example date variables). Would there be a way to modify the above code so that it gives a value of NA when a group is not active, while keeping the 0 values when the group is active but has no observation on the date? I've been messing with it for a couple days but have had no luck. – A. Mullins Oct 09 '18 at 16:32
  • @A.Mullins Will try to look at it later today. Could you include example data in your question of the situation you described? – Jaap Oct 10 '18 at 06:25
  • I've added some example data in the question. Thank you again for all your assistance! – A. Mullins Oct 12 '18 at 15:12
  • It's perfect! Thank you so much! And now I have more to go off of when learning the data.table package. – A. Mullins Oct 15 '18 at 17:01
1

I've figured it out! As usual, as soon as you post a question, you figure out the answer. I was overcomplicating it by putting in the function, when I could just put the sapply in the for loop.

If anyone is interested:

for (i in unique(df1$group))
  {report[,i+1] <- 
  sapply(report$date, function(x)
      sum(as.Date(df1$startdate) < as.Date(x) &
      as.Date(df1$enddate) > as.Date(x) & 
      df1$group == unique(df1$group)[i]))}
MrFlick
  • 195,160
  • 17
  • 277
  • 295
A. Mullins
  • 55
  • 6