
I'm working on a moving average calculator that works as intended, but it takes a long time to run because the calculation currently loops through the formula instead of being vectorized. The data set is ~16,000 rows. The MA formula filters out the top quartile of loan_price values appearing in the previous 45 days. Any tips/changes to get this running more efficiently as a vectorized calculation?

The dput output is:

    > dput(data)
structure(list(loan_price = c(50000, 60000, 40000, 35000, 1e+05, 
95000), cad_price = c(62500, 75000, 50000, 43750, 125000, 118750
), day = structure(c(1642118400, 1641772800, 1639958400, 1639785600, 
1638316800, 1640995200), tzone = "UTC", class = c("POSIXct", "POSIXt")), fourtyfive_avg = c(251435.529507523, 251435.529507523, 
251435.529507523, 251435.529507523, 251435.529507523, 251435.529507523
), Loan = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame")) 


library(readxl)
library(tidyverse)
library(zoo)
library(dplyr)
library(lubridate) # needed for ddays() below

data <- data %>% mutate(day = lubridate::parse_date_time(day, "ymd"))

myfunc <- function(x){
  # start of the trailing 45-day window ending on day x
  fourtyfive_days <- as.Date(x - ddays(45))
  
  data <-
    data %>%
    filter(day <= x) %>%
    filter(day >= fourtyfive_days) %>%
    # drop the top quartile of loan_price within the window
    filter(loan_price <= quantile(loan_price, probs = 0.75)) %>% 
    summarize(fourtyfive_avg = mean(loan_price))
  
  return(data$fourtyfive_avg)
}
data$fourtyfive_avg <- sapply(data$day, simplify = TRUE,  FUN = myfunc)
  • I think the stats::filter function can help you because it is vectorised. See here https://stackoverflow.com/questions/743812/calculating-moving-average; you can tweak it a little to achieve what you need. To get a true moving average you can use the sides argument (sides = 1). – Guillaume Jan 29 '22 at 13:31
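
For reference, a minimal sketch of the stats::filter idea from the comment above. It assumes a regular series with exactly one observation per day, and it computes a plain trailing 45-day mean without the top-quartile cutoff, so it only illustrates the vectorised mechanics rather than the full formula in the question:

## trailing 45-day simple moving average with stats::filter, sides = 1
## assumes one row per day in chronological order; returns NA for the first 44 rows
x <- data$loan_price[order(data$day)]
ma45 <- stats::filter(x, rep(1 / 45, 45), sides = 1)
head(ma45)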

1 Answer


Functions in the tidyverse carry a fair amount of per-call overhead, so moving away from tidy functions inside a function that runs once per row will improve the speed significantly. You could also try the data.table package, but 16,000 rows isn't a lot of data, so I'm not sure how necessary this would be; I typically only reach for it when the number of rows is in the millions. You also have to account for the time it takes to build the keyed data.table, so I've included that in the benchmark below.

The functions we'll be comparing:

library(data.table)
library(tidyverse)
library(lubridate)
library(microbenchmark)

## dplyr
myfunc <- function(x){
  fourtyfive_days <- as.Date(x - ddays(45))
  
  data <-
    data %>%
    filter(day <= x) %>%
    filter(day >= fourtyfive_days) %>%
    filter(loan_price <= quantile(loan_price, probs = 0.75)) %>% 
    summarize(fourtyfive_avg = mean(loan_price))
  
  return(data$fourtyfive_avg)
}
# data$fourtyfive_avg <- sapply(data$day, simplify = TRUE,  FUN = myfunc)

## base
mybasefunc <- function(x){
  fortyfive_days <- as.Date(x - ddays(45))
  data <- data[data$day >= fortyfive_days & data$day <= x, ]
  q75 <- quantile(data$loan_price, probs = 0.75)
  data <- data[data$loan_price <= q75,]
  fortyfive_avg <- mean(data$loan_price)
  
  return(fortyfive_avg)
}
# data$fortyfive_avg <- sapply(data$day, simplify = TRUE,  FUN = mybasefunc)


## data.table
# dat <- data.table(data, key = c("day", "loan_price"))
mydtfunc <- function(x){
  
  fortyfive_days <- as.Date(x - ddays(45))
  dat <- dat[day >= fortyfive_days & day <= x]
  q75 <- quantile(dat$loan_price, probs = 0.75)
  dat <- dat[loan_price <= q75]
  fortyfive_avg <- mean(dat$loan_price)

  return(fortyfive_avg)
}
# dat[ , fortyfive_avg := sapply(day, mydtfunc), ]

And benchmarking:

set.seed(1)
m <- 
  microbenchmark("dplyr" = {data$fourtyfive_avg <- sapply(data$day, simplify = TRUE,  FUN = myfunc)},
               "base" = {data$fortyfive_avg <- sapply(data$day, simplify = TRUE,  FUN = mybasefunc)},
               "dt" = {dat <- data.table(data, key = c("day", "loan_price")); dat[ , fortyfive_avg := sapply(day, mydtfunc), ]})

m
#> Unit: milliseconds
#>   expr     min       lq      mean   median       uq     max neval cld
#>  dplyr 29.6519 30.71400 32.594319 32.42595 33.68590 44.0838   100   c
#>   base  4.2316  4.37210  4.631541  4.42665  4.58615 12.3656   100 a  
#>     dt  5.6883  5.83755  6.254143  5.97100  6.11905 15.6615   100  b
autoplot(m)

The benchmark seems pretty conclusive: you can see a significant improvement by moving away from dplyr to either base or data.table. It's worth noting that I don't use data.table often, so there may be a more efficient way to accomplish what I've done (one possibility is sketched after the reprex note below), but it's still much faster than dplyr.

Created on 2022-01-31 by the reprex package (v2.0.1)
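
On the "more efficient way" point above, one possibility is a non-equi self-join in data.table, which builds every 45-day window in a single call so the per-row sapply() disappears entirely. This is only a hedged sketch (not benchmarked here), and window_start is an illustrative column name:

## sketch only: each row's trailing 45-day, top-quartile-filtered mean
## via a non-equi self-join instead of sapply()
library(data.table)
library(lubridate)

dat <- data.table(data)
dat[, window_start := day - ddays(45)]

## for each row of the inner dat, collect all rows whose day falls in
## [window_start, day], then average the loan prices at or below that
## window's 75th percentile
dat[, fortyfive_avg := dat[dat,
                           on = .(day >= window_start, day <= day),
                           mean(loan_price[loan_price <= quantile(loan_price, 0.75)]),
                           by = .EACHI]$V1]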
