R: How to sum based on multiple criteria and summarize table

Question

Here is my original data frame:

df <- read.table(text="
  Date         Index  Event
  2014-03-31   A      x
  2014-03-31   A      x
  2014-03-31   A      y
  2014-04-01   A      y
  2014-04-01   A      x
  2014-04-01   B      x
  2014-04-02   B      x
  2014-04-03   A      x
  2014-09-30   B      x", header = T, stringsAsFactors = F)

date_range <- seq(as.Date(min(df$Date)), as.Date(max(df$Date)), 'days')
indices <- unique(df$Index)
events_table <- unique(df$Event)

I want my desired output to summarise my dataframe and have a unique record for each index in indices and each date in date_range while providing a cumulative value of each event in events_table in a new column for all dates prior to the value in the Date column. Sometimes there are no records for each index or every date.

Here is my desired output:

Date        Index  cumsum(Event = x) cumsum(Event = y)
2014-03-31  A      0                 0
2014-03-31  B      0                 0
2014-04-01  A      2                 1
2014-04-01  B      0                 0
2014-04-02  A      3                 2
2014-04-02  B      1                 0
...  
2014-09-29  A      4                 2
2014-09-29  B      2                 0
2014-09-30  A      4                 2
2014-09-30  B      2                 0

FYI -- this is a simplified version of the data frame. There are ~200,000 records per year with hundreds of different Index fields for each Date.

I've done this in the past before my hard drive fried using by and maybe aggregate, but the process was very slow and I'm not able to get it worked out this time around. I've also tried ddply, but I'm not able to get the cumsum function to work with it. Using ddply, I tried something like:

ddply(xo1, .(Date,Index), summarise, 
      sum.x = sum(Event == 'x'), 
      sum.y = sum(Event == 'y'))

to no avail.
Through searching, I've found Replicating an Excel SUMIFS formula which gets me the cumulative part of my project, but with this I wasn't able to figure out how to summarize it down to only one record per date/index combo. I also came across sum/aggregate data based on dates, R but here I wasn't able to work out the dynamic date aspect.

Thanks for anyone that can help!

I am confused by your expected output. In one row of your expected output, you expect `cumsum(Event = x)` to be `0`? Even though your original `df` has 2 rows with `Date = 2014-03-31`, `Index = A`, and `Event = x`? — davechilders, Jan 28 '15 at 16:47
Adding emphasis "I want my desired output to summarise my dataframe and have a unique record for each index in indices and each date in date_range while providing a cumulative value of each event in events_table in a new column **for all dates prior to the Date column** " ...... For background, I'm trying to build a model using information that would have been available to me on that morning. So on the morning of 2014-03-31, i have no data available. Throughout the day, data is collected, and on 2014-04-01, data from 2014-03-31 is what i will have available to predict events on 2014-04-01 — exhoosier10, Jan 28 '15 at 16:59
Thanks for the clarification. I missed that on my initial read. — davechilders, Jan 28 '15 at 18:38

score 3 · Accepted Answer · answered Jan 28 '15 at 17:38

library(dplyr)
library(tidyr)

df$Date <- as.Date(df$Date)

Step 1: Generate a full list of {Date, Index} pairs

full_dat <- expand.grid(
  Date = date_range, 
  Index = indices,
  stringsAsFactors = FALSE
  ) %>% 
  arrange(Date, Index) %>%
  tbl_df

Step 2: Define a cumsum() function that ignores NA

cumsum2 <- function(x){

  x[is.na(x)] <- 0
  cumsum(x)

}

Step 3: Generate totals per {Date, Index}, join with full {Date, Index} data, and compute the lagged cumulative sum.

df %>%
  group_by(Date, Index) %>%
  summarise(
    totx = sum(Event == "x"),
    toty = sum(Event == "y")
    ) %>%
  right_join(full_dat, by = c("Date", "Index")) %>% 
  group_by(Index) %>%
  mutate(
    cumx = lag(cumsum2(totx)),
    cumy = lag(cumsum2(toty))
    ) %>%
  # some clean up.
  select(-starts_with("tot")) %>%
  mutate(
    cumx = ifelse(is.na(cumx), 0, cumx),
    cumy = ifelse(is.na(cumy), 0, cumy)
    )

brilliant. took < 10 seconds for these 200k rows. Appreciate your help! — exhoosier10, Jan 28 '15 at 18:39

score 1 · Answer 2 · answered Jan 28 '15 at 17:09

1

Would something like this using dplyr and tidyr work?

library(dplyr)
library(tidyr)

df %>%
  group_by(Date, Index, Event) %>%
  summarise(events = n()) %>%
  group_by(Index, Event) %>%
  mutate(cumsum_events = cumsum(events)) %>%
  select(-events) %>%
  spread(Event, cumsum_events) %>%
  rename(sum.x = x,
         sum.y = y)

#        Date Index sum.x sum.y
#1 2014-03-31     A     2     1
#2 2014-04-01     A     3     2
#3 2014-04-01     B     1    NA
#4 2014-04-02     B     2    NA
#5 2014-04-03     A     4    NA
#6 2014-09-30     B     3    NA

answered Jan 28 '15 at 17:09

maloneypatr

3,562
4
23
33

This is useful to summarize everything up, thanks.... My goal was to have one record for each unique Index (all indices) and Date (dates between 2014-03-31 and 2014-09-30) combo so that i can export that to an individual file for each day of all of the year's data up to that point. Some days we don't collect info on Index A, so if i subset by date = '2014-04-02' using this method, i'd only see data for Index B and not A. Is there any quick way to get that without adding additional nulled out records for each unique Date+Index key?? – exhoosier10 Jan 28 '15 at 17:28
@maloneypatr - I don't think your solution matches the OP's desired output. – davechilders Jan 28 '15 at 17:33

R: How to sum based on multiple criteria and summarize table

2 Answers2