16

I'm trying to learn R and there are a few things I've done for 10+ years in SAS that I cannot quite figure out the best way to do in R. Take this data:

 id  class           t  count  desired
 --  -----  ----------  -----  -------
  1      A  2010-01-15      1        1
  1      A  2010-02-15      2        3
  1      B  2010-04-15      3        3
  1      B  2010-09-15      4        4
  2      A  2010-01-15      5        5
  2      B  2010-06-15      6        6
  2      B  2010-08-15      7       13
  2      B  2010-09-15      8       21

I want to calculate the column desired as a rolling sum by id, class, and within a 4 months rolling window. Notice that not all months are present for each combination of id and class.

In SAS I'd typically do this in one of 2 ways:

  1. RETAIN plus a by id & class.
  2. PROC SQL with a left join from df as df1 to df as df2 on id, class and the df1.d-df2.d within the appropriate window

What is the best R approach to this type of problem?

t <- as.Date(c("2010-01-15","2010-02-15","2010-04-15","2010-09-15",
               "2010-01-15","2010-06-15","2010-08-15","2010-09-15"))
class <- c("A","A","B","B","A","B","B","B")
id <- c(1,1,1,1,2,2,2,2)
count <- seq(1,8,length.out=8)
desired <- c(1,3,3,4,5,6,13,21)
df <- data.frame(id,class,t,count,desired)
andrey_sz
  • 751
  • 1
  • 13
  • 29
ADJ
  • 4,892
  • 10
  • 50
  • 83
  • what's `d`? its definition isn't included in your setup code. – Matthew Plourde May 30 '13 at 15:37
  • Take a look at the `zoo` package. It can do these rolling summaries on time aligned data fairly easily. If you're comfortable with `sql`, you can use the `sqldf` package. – Justin May 30 '13 at 15:38
  • @MatthewPlourde, I think @ADJ mean `df <- data.frame(t,class, id, count ,desired)` – David Marx May 30 '13 at 15:41
  • One of the things with learning R from SAS (or the other way around) is to remember that they work sufficiently differently that things you need PROCs for in SAS (like PROC MEANS for a sum or mean across rows) are trivial to do in 'data step' R (ie, direct matrix programming); but the other way around as well. Things that are easy to do with the automatic looping through SAS rows are not easy to do with 'data step' R, but have packages (ie, procs) to handle. – Joe May 30 '13 at 15:54
  • if you're comfortable with `proc sql` then check [videos #027 and #028](http://www.twotorials.com/) for an intro to `sqldf` and `merge` with R :) – Anthony Damico May 30 '13 at 16:42
  • @Justin, this question finally made me interested enough to look into zoo, but it wasn't obvious to me how to use it here to get a rolling sum in the desired fashion, specifically how to handle either the missing values or the first few data points in the desired way. Any suggestions? – Aaron left Stack Overflow May 30 '13 at 21:07
  • @Aaron off hand, I don't know exactly. Generally, I find myself using `zoo` and `data.table` together for the `split-apply-combine` strategy when I have sparse dates... – Justin May 30 '13 at 21:16
  • Yes, sorry, I started with t and changed it to d... – ADJ May 31 '13 at 02:52
  • 1
    I'm already using the sqldf package. For one thing, I still haven't quite figured out how to use R functions within a sqldf query, the same way I could use one of many SAS functions within PROC SQL. In this case, my preferred solution would involve using an R version of SAS' INTCK function to do dates arithmetic in a more sophisticated way than described in my original example. But since I'm learning R from zero, I'd have a preference for learning how to do things the R way (see Joe's comment) – ADJ May 31 '13 at 02:59
  • Hi @ADJ, I'm glad my solution worked for you, but I'd really like to see if anyone has any better ideas. I'm going to change the title and add a bounty; if you have objection to my new title, feel free to change it back. – Aaron left Stack Overflow Jun 04 '13 at 15:56
  • 2
    Honestly, for stuff like this, I'd just keep using SAS. :) – Hong Ooi Jun 04 '13 at 16:06

4 Answers4

18

Here are a few solutions:

1) zoo Using ave, for each group create a monthly series, m, by merging the original series, z, with a grid, g. Then calculate the rolling sum and retain only the original time points:

library(zoo)
f <- function(i) { 
    z <- with(df[i, ], zoo(count, t))
    g <- zoo(, seq(start(z), end(z), by = "month"))
    m <- merge(z, g)
    window(rollapplyr(m, 4, sum, na.rm = TRUE, partial = TRUE), time(z))
}
df$desired <- ave(1:nrow(df), df$id, df$class, FUN = f)

which gives:

> df
  id class          t count desired
1  1     A 2010-01-15     1       1
2  1     A 2010-02-15     2       3
3  1     B 2010-04-15     3       3
4  1     B 2010-09-15     4       4
5  2     A 2010-01-15     5       5
6  2     B 2010-06-15     6       6
7  2     B 2010-08-15     7      13
8  2     B 2010-09-15     8      21

Note We have assumed the times are ordered within each group (as in the question). If that is not so then sort df first.

2) sqldf

library(sqldf)
sqldf("select id, class, a.t, a.'count', sum(b.'count') desired 
   from df a join df b 
   using(id, class) 
   where a.t - b.t between 0 and 100
   group by id, class, a.t")

which gives:

  id class          t count desired
1  1     A 2010-01-15     1       1
2  1     A 2010-02-15     2       3
3  1     B 2010-04-15     3       3
4  1     B 2010-09-15     4       4
5  2     A 2010-01-15     5       5
6  2     B 2010-06-15     6       6
7  2     B 2010-08-15     7      13
8  2     B 2010-09-15     8      21

Note: If the merge should be too large to fit into memory then use sqldf("...", dbname = tempfile()) to cause the intermediate results to be stored in a database which it creates on the fly and automatically destroys afterwards.

3) Base R The sqldf solution motivates this base R solution which just translates the SQL into R:

m <- merge(df, df, by = 1:2)
s <- subset(m, t.x - t.y >= 0 & t.x - t.y <= 100)
ag <- aggregate(count.y ~ t.x + class + id, s, sum)
names(ag) <- c("t", "class", "id", "count", "desired")

The result is:

> ag
           t class id count desired
1 2010-01-15     A  1     1       1
2 2010-02-15     A  1     2       3
3 2010-04-15     B  1     3       3
4 2010-09-15     B  1     4       4
5 2010-01-15     A  2     5       5
6 2010-06-15     B  2     6       6
7 2010-08-15     B  2     7      13
8 2010-09-15     B  2     8      21

Note: This does do a merge in memory which might be a problem if the data set is very large.

UPDATE: Minor simplifications of first solution and also added second solution.

UPDATE 2: Added third solution.

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
5

I'm almost embarrassed to post this. I'm usually pretty good as these, but there's got to be a better way.

This first uses zoo's as.yearmon to get the dates in terms of just month and year, then reshapes it to get one column for each id/class combination, then fills in with zeros before, after, and for missing months, then uses zoo to get the rolling sum, then pulls out just the desired months and merges back with the original data frame.

library(reshape2)
library(zoo)
df$yearmon <- as.yearmon(df$t)
dfa <- dcast(id + class ~ yearmon, data=df, value.var="count")
ida <- dfa[,1:2]
dfa <- t(as.matrix(dfa[,-c(1:2)]))
months <- with(df, seq(min(yearmon)-3/12, max(yearmon)+3/12, by=1/12))
dfb <- array(dim=c(length(months), ncol(dfa)), 
             dimnames=list(paste(months), colnames(dfa)))
dfb[rownames(dfa),] <- dfa
dfb[is.na(dfb)] <- 0
dfb <- rollsumr(dfb,4, fill=0)
rownames(dfb) <- paste(months)
dfb <- dfb[rownames(dfa),]
dfc <- cbind(ida, t(dfb))
dfc <- melt(dfc, id.vars=c("class", "id"))
names(dfc)[3:4] <- c("yearmon", "desired2")
dfc$yearmon <- as.yearmon(dfc$yearmon)
out <- merge(df,dfc)

> out
  id class  yearmon          t count desired desired2
1  1     A Feb 2010 2010-02-15     2       3        3
2  1     A Jan 2010 2010-01-15     1       1        1
3  1     B Apr 2010 2010-04-15     3       3        3
4  1     B Sep 2010 2010-09-15     4       4        4
5  2     A Jan 2010 2010-01-15     5       5        5
6  2     B Aug 2010 2010-08-15     7      13       13
7  2     B Jun 2010 2010-06-15     6       6        6
8  2     B Sep 2010 2010-09-15     8      21       21
Aaron left Stack Overflow
  • 36,704
  • 7
  • 77
  • 142
4

A farily efficient answer to this problem could be found using the data.table library.

##Utilize the data.table package
library("data.table")
data <- data.table(t,class,id,count,desired)[order(id,class)]

##Assign each customer an ID
data[,Cust_No:=.GRP,by=c("id","class")]

##Create "list" of comparison dates and values
Ref <- data[,list(Compare_Value=list(I(count)),Compare_Date=list(I(t))), by=c("id","class")]

##Compare two lists and see of the compare date is within N days
data$Roll.Val <- mapply(FUN = function(RD, NUM) {
  d <- as.numeric(Ref$Compare_Date[[NUM]] - RD)
  sum((d <= 0 & d >= -124)*Ref$Compare_Value[[NUM]])
}, RD = data$t,NUM=data$Cust_No)

##Print out data
data <- data[,list(id,class,t,count,desired,Roll.Val)][order(id,class)]
data

id class          t count desired Roll.Val
1:  1     A 2010-01-15     1       1        1
2:  1     A 2010-02-15     2       3        3
3:  1     B 2010-04-15     3       3        3
4:  1     B 2010-09-15     4       4        4
5:  2     A 2010-01-15     5       5        5
6:  2     B 2010-06-15     6       6        6
7:  2     B 2010-08-15     7      13       13
8:  2     B 2010-09-15     8      21       21
Mike.Gahan
  • 4,565
  • 23
  • 39
  • This is a 124 days rolling time period. Obviosuly this isn't exactly 4 months, but the code can be easily modified. – Mike.Gahan Apr 17 '14 at 14:25
0

With runner package one can calculate everything on rolling windows. Below example of using sum_run

library(runner)
df %>%
  group_by(id) %>%
  mutate(
    output = sum_run(count, k = 30*4, idx = t)   
  )

# <dbl> <fct> <date>     <dbl>   <dbl>  <dbl>
#     1 A     2010-01-15     1       1      1
#     1 A     2010-02-15     2       3      3
#     1 B     2010-04-15     3       3      6
#     1 B     2010-09-15     4       4      4
#     2 A     2010-01-15     5       5      5
#     2 B     2010-06-15     6       6      6
#     2 B     2010-08-15     7      13     13
#     2 B     2010-09-15     8      21     21
GoGonzo
  • 2,637
  • 1
  • 18
  • 25