
This is what my data frame looks like:

library(data.table)

df <- fread('
                Name  EventType  Date  SalesAmount RunningTotal Runningtotal(prior365Days)
                John    Email      1/1/2014      0          0            0
                John    Sale       2/1/2014     10          10           10
                John    Sale       7/1/2014     20          30           30
                John    Sale       4/1/2015     30          60           50 
                John    Webinar    5/1/2015      0          60           50
                Tom     Email      1/1/2014      0          0            0
                Tom     Sale       2/1/2014     15          15           15
                Tom     Sale       7/1/2014     10          25           25
                Tom     Sale       4/1/2015     25          50           35 
                Tom     Webinar    5/1/2015      0          50           35
                ')
df[, Date := as.Date(Date, format = "%m/%d/%Y")]

The last column is my desired column: the cumulative sum of SalesAmount (per Name) over a rolling 365-day window. I computed it with the help of @6pool, whose solution was:

library(dplyr)

df$Date <- as.Date(df$Date, format = "%m/%d/%Y")
df <- df %>%
  group_by(Name) %>%
  arrange(Date) %>%
  mutate(day = Date - Date[1])  # days since this Name's first event

# For each row i, sum SalesAmount over rows of the same Name whose
# date falls within the 365 days up to and including row i's date
f <- Vectorize(function(i)
  sum(df[df$Name[i] == df$Name & df$day[i] - df$day >= 0 &
           df$day[i] - df$day <= 365, "SalesAmount"]),
  vectorize.args = "i")
df$RunningTotal365 <- f(1:nrow(df))

However, `df$RunningTotal365 <- f(1:nrow(df))` is taking a very long time (over 1.5 days so far), as my data frame has over 1.5 million rows. "rollapply" was suggested in my initial question, but I have struggled to figure out how to use it here. Kindly help.
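For reference, the intended computation can also be written as a per-group loop in data.table; this is only a sketch of the logic (still quadratic within each Name, so it would not fix the speed problem), assuming df is still the data.table built from fread above, before the dplyr pipeline:

# Sketch only: for each row, sum SalesAmount over rows of the same Name
# dated within the trailing 365 days (inclusive of the row's own date)
df[, RunningTotal365 := sapply(seq_len(.N), function(i)
    sum(SalesAmount[Date >= Date[i] - 365 & Date <= Date[i]])),
  by = Name]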

gibbz00
  • I wonder if you've tried implementing something like [this](http://stackoverflow.com/questions/30443844/extracting-event-types-from-last-21-day-window) – David Arenburg Jun 02 '15 at 15:37
  • @DavidArenburg Hi David, I implemented your solution (from the link) for a different problem. A cumulative running total over the last 365-day window for a given Name is a totally separate problem. – gibbz00 Jun 02 '15 at 15:41
  • Yeah, I meant I wonder if you tried modifying it to solve your new problem. That would probably be the fastest solution by far – David Arenburg Jun 02 '15 at 15:42
  • @DavidArenburg Thanks David, I never really thought of that. I was wondering what will be the two temporary datasets in this case that I can join later. Any help is much appreciated! – gibbz00 Jun 02 '15 at 15:47
  • I feel like you accepted an answer awfully quickly. A 30-second runtime is great, but there's also the RcppRoll package; someone familiar with its (probably faster) methods might now be disinclined to answer. (Not me; I just found it by googling.) – Frank Jun 02 '15 at 17:15

3 Answers


Give this a try:

DF <- read.table(text = "Name  EventType  EventDate  SalesAmount RunningTotal Runningtotal(prior365Days)
John    Email      1/1/2014      0          0            0
John    Sale       2/1/2014     10          10           10
John    Sale       7/1/2014     20          30           30
John    Sale       4/1/2015     30          60           50 
John    Webinar    5/1/2015      0          60           50
Tom     Email      1/1/2014      0          0            0
Tom     Sale       2/1/2014     15          15           15
Tom     Sale       7/1/2014     10          25           25
Tom     Sale       4/1/2015     25          50           35 
Tom     Webinar    5/1/2015      0          50           35", header = TRUE)


fun <- function(x, date, thresh) {
  D <- as.matrix(dist(date)) # pairwise distance (in days) between all dates
  D <- D <= thresh           # TRUE where the gap is within the window
  D[lower.tri(D)] <- FALSE   # zero out later rows so future sales don't count
  R <- D * x                 # FALSE is treated as 0
  colSums(R)                 # running total for each date's column
}
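As a quick sanity check of the masking logic, on a made-up toy input (the dates must be sorted ascending, which the setkey below ensures for the real data):

x <- c(0, 10, 20)
d <- as.Date(c("2014-01-01", "2014-02-01", "2015-04-01"))
fun(x, d, 365)
# expected: 0 10 20 -- the first two dates are more than 365 days before
# 2015-04-01, so only the third row contributes to its own total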


library(data.table)
setDT(DF)
DF[, EventDate := as.Date(EventDate, format = "%m/%d/%Y")]
setkey(DF, Name, EventDate)

DF[, RT365 := fun(SalesAmount, EventDate, 365), by = Name]

#    Name EventType  EventDate SalesAmount RunningTotal Runningtotal.prior365Days. RT365
# 1: John     Email 2014-01-01           0            0                          0     0
# 2: John      Sale 2014-02-01          10           10                         10    10
# 3: John      Sale 2014-07-01          20           30                         30    30
# 4: John      Sale 2015-04-01          30           60                         50    50
# 5: John   Webinar 2015-05-01           0           60                         50    50
# 6:  Tom     Email 2014-01-01           0            0                          0     0
# 7:  Tom      Sale 2014-02-01          15           15                         15    15
# 8:  Tom      Sale 2014-07-01          10           25                         25    25
# 9:  Tom      Sale 2015-04-01          25           50                         35    35
#10:  Tom   Webinar 2015-05-01           0           50                         35    35
Roland
  • Is it feasible for 1.5 million rows? – ExperimenteR Jun 02 '15 at 15:57
  • @ExperimenteR 1.5MM is a very small data set. Any vectorised code can handle it easily. And by "vectorized" I don't mean the "Vectorize" function. – David Arenburg Jun 02 '15 at 15:59
  • @DavidArenburg `D <- as.matrix(dist(date))` is about 1.5M by 1.5M. I have no idea how much memory that takes. – ExperimenteR Jun 02 '15 at 16:11
  • @DavidArenburg Thank you so much! This worked like magic, Roland. It took less than 30 seconds. Can one of you kindly shed some light on why the initial solution I employed is so slow? – gibbz00 Jun 02 '15 at 16:35
  • That's what's called vectorized code, as opposed to the for loop that for some reason was called `Vectorize` by the R devs. See "Circle 3" in [that book](http://www.burns-stat.com/pages/Tutor/R_inferno.pdf) – David Arenburg Jun 02 '15 at 17:05
  • @DavidArenburg Thank you, David. Let's say I have one more column called Fund. It has funds A and B, which appear only when EventType == Sale. How do I modify Roland's solution if I want the cumulative sum only for Fund A, for example? – gibbz00 Jun 02 '15 at 17:08
  • @gibbz00 Hm, I'm not sure. If you pass `SalesAmount*(Fund=="A")` to Roland's `fun` instead of `SalesAmount`, it seems like that should be the correct cumulative sum, since it will have zeros wherever the Fund is not A. – Frank Jun 02 '15 at 17:30
  • @ExperimenteR The function is obviously not suitable for really long input vectors. However, I would have been very surprised if there were time series longer than a few years per user in the dataset, and sales are usually much less frequent than daily. – Roland Jun 02 '15 at 17:44
  • @Frank `R <- D * x * (df$Fund == "A")` gives me a "longer object length is not a multiple of shorter object length" error. Kindly help. – gibbz00 Jun 02 '15 at 18:49
  • @gibbz00 I don't really follow. I meant `DF[, newvar := fun(SalesAmount*(Fund=="A"), EventDate, 365), by = Name]` If that's not what you were looking for and Roland can't answer it briefly, you might need to form a new question. – Frank Jun 02 '15 at 18:52
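Pulling Frank's suggestion out of the comments into a runnable form (Fund is a hypothetical column from the follow-up question, present only on Sale rows):

# Hypothetical Fund column; zeroing other funds' amounts before calling
# fun() keeps the window logic unchanged. If Fund is NA on non-Sale rows,
# use !is.na(Fund) & Fund == "A" instead, since 0 * NA is NA in R.
DF[, RT365_FundA := fun(SalesAmount * (Fund == "A"), EventDate, 365), by = Name]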

Here's an approach using the `foverlaps()` function from the data.table package:

require(data.table)
setDT(df)[, end := as.Date(Date, format = "%m/%d/%Y")
        ][, start := end - 365L]   # each row's 365-day lookback window
setkey(df, Name, start, end)
olaps = foverlaps(df, df, nomatch = 0L, which = TRUE)  # overlapping windows, as row-index pairs
olaps = olaps[xid >= yid, .(ans = sum(df$SalesAmount[yid])), by = xid]  # sum past/current rows only

df[olaps$xid, Runningtotal := olaps$ans]

You can remove the start and end columns, if necessary, by doing:

df[, c("start", "end") := NULL]

It would be nice to know how fast/slow it is.

Arun
  • It is very fast; it took me less than 45 seconds. Thank you, Arun. However, I am struggling to extend the solution to a minor variation. Let's say I have one more column called Fund. It has funds A and B, which appear only when EventType == Sale. How do I modify Roland's solution or your solution if I want the cumulative sum only for Fund A, for example? – gibbz00 Jun 02 '15 at 19:17
  • 1. Could you provide a link to your bigger data (if you can't share it, could you generate artificial data with similar settings)? 2. What extension are you struggling with? – Arun Jun 02 '15 at 19:19
  • Please edit your post with this and show the desired output (or better yet, make it a new post, as it seems different from this Q). Could you share the bigger data? I'd like to see if any improvements are possible. – Arun Jun 02 '15 at 19:22
  • @Frank I got the solution from Frank. Thank you guys a lot. – gibbz00 Jun 02 '15 at 19:54
  • @gibbz00 So, Arun's solution (less than 45 seconds according to you) is slower than my solution (30 seconds according to you)? Can you provide exact timings (use `system.time`)? Can you tell us how many unique names you have in your data? – Roland Jun 03 '15 at 05:35
  • @Roland I had about 15K unique names and 1.5 million rows, with many more event types than sales, webinars, and emails. I modified my data a lot after I implemented the solution (including subsetting) and did not keep the original data frame object. Next time, I will use system.time rather than my subjective reading of the clock. – gibbz00 Jun 03 '15 at 16:18

Using the newer non-equi join feature in data.table:

# self-join: match each row to same-Name rows dated within the 365-day
# window ending at that row's Date, then sum SalesAmount per window
df1 = df[.(iName = Name, start = Date - 365L, end = Date),
         on = .(Name = iName, Date >= start, Date <= end),
         nomatch = 0L, allow.cartesian = TRUE
        ][, .(MyTotal = sum(SalesAmount)), by = .(Name, Date = Date.1)]

df[df1, on = .(Name, Date)]
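A quick check on the sample data above: the joined MyTotal column should reproduce the question's last column (0, 10, 30, 50, 50 for John; 0, 15, 25, 35, 35 for Tom):

res <- df[df1, on = .(Name, Date)]
all(res$MyTotal == res$`Runningtotal(prior365Days)`)
# should print TRUE (fread keeps the original column name verbatim)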
gibbz00