Alternatives to MAPPLY in R

Question

I have a data frame with the following:

1) Store 2) DayOfWeek
3) Date 4) Sales
5) Customers
6) Open 7) Promo
8) StateHoliday 9) SchoolHoliday
10) StoreType
11) Assortment
12) CompetitionDistance 13) CompetitionOpenSinceMonth
14) CompetitionOpenSinceYear
15) Promo2
16) Promo2SinceWeek 17) Promo2SinceYear 18) PromoInterval
19) CompanyDistanceBin
20) CompetitionOpenSinceDate
21) DaysSinceCompetionOpen

I am trying to calculate the Average Sales for the Previous Quarter based on the date (basically date - 3 months). But, I need to also subset based on DayOfWeek and Promo. I have written a function and am using mapply.

quarter.store.sales.func <- function(storeId, storeDate, dayofweekvar, promotion)
{   
    storeDate = as.Date(storeDate,"%Y-%m-%d")
    EndDate = ymd(as.Date(storeDate)) + ddays(-1)
    EndDate = as.Date(storeDate,"%Y-%m-%d")
    StartDate = ymd(storeDate + months(-3))
    StartDate = as.Date(StartDate)

    quarterStoresales <- subset(saleswithstore, Date >= StartDate & Date <= EndDate & Store == storeId & DayOfWeek == dayofweekvar & Promo == promotion)
    quarterSales = 0
    salesDf <- ddply(quarterStoresales,.(Store),summarize,avgSales=mean(Sales))  

    if (nrow(salesDf)>0)
      quarterSales = as.numeric(round(salesDf$avgSales,digits=0))     

    return(quarterSales)
}

saleswithstore$QuarterSales <- mapply(quarter.store.sales.func, saleswithstore$Store, saleswithstore$Date, saleswithstore$DayOfWeek, saleswithstore$Promo)

 head(exampleset)
           Store         DayOfWeek Date               Sales           Promo
186            1                3  2013-06-05         5012            1
296            1                3  2013-04-10         4903            1
337            1                3  2013-05-29         5784            1
425            1                3  2013-05-08         5230            0
449            1                3  2013-04-03         4625            0
477            1                3  2013-03-27         6660            1

saleswithstore is a dataframe that has 1,000,000 rows. So, this solution is not workable because it performing badly and taking forever. Is there a better, more efficient way to have a specific subset on a dataframe like this and then and then take an average like I am trying to do here?

I am open to any suggestions. I admittedly am new to R.

The bottleneck here would probably be the multiple assignments of dataframes to signals. you try dplyr with its group_by function? Alternatively a data.table might do the job — Robin Gertenbach, Jul 01 '16 at 15:25
Some notes: (1) move the "Date" manipulation functions out of your function and use `as.Date` etc on your columns as a whole (as these functions accept vectors of `length > 1`) and not `mapply` in each element, (2) the way you create `quarterStoresales` suggests for an operation grouped by `c(Store, DayOfWeek, Promo)` and R provides built-in functionality and packages for these operations. Also, could you provide an [example dataset](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with only the nesseccary columns and the expected output? — alexis_laz, Jul 01 '16 at 15:32
Alexis, I have added the exampleset to my originial post and the following QuarterStartDate and QuarterEndDate to the dataframe outside the function. saleswithstore$QuarterStartDate <- mondate(saleswithstore$Date) - 3 saleswithstore$QuarterEndDate = as.Date(ymd(saleswithstore$Date + ddays(-1))) Is there a way within the ddply to subset something like: Date >= QuarterStartDate and Date <= QuarterEndDate. Then I could just group on the rest. — maubin0316, Jul 01 '16 at 17:52

score 1 · Answer 1 · answered Jul 01 '16 at 18:14

@maubin0316, your intuition is right in the comment that you can just group by the rest of the variables. I put together this example using data.table

library(data.table)
set.seed(343)

# Create sample data
dt <- data.table('Store' = sample(1:10, 100, replace=T),
                 'DayOfWeek' = sample(1:7, 100, replace=T),
                 'Date' = sample(as.Date('2013-01-01'):as.Date('2013-06-30'), 100, replace=T),
                 'Sales' = sample(1000:10000, 100),
                 'Promo' = sample(c(0,1), 10, replace=T))

QuarterStartDate <- as.Date('2013-01-01')
QuarterEndDate <- as.Date('2013-03-31')

# Function to calculate your quarterly sales
QuarterlySales <- function(startDate, endDate, data){
  # Limit between your dates, group by your variables of interest
  data <- data[between(Date,startDate,endDate),list(TotalSales=sum(Sales)), by=list(Store,DayOfWeek,Promo)]
  # Sort in an order that makes sense
  data <- data[order(Store, DayOfWeek, Promo)]
  return(data)
}

salesSummary <- QuarterlySales(QuarterStartDate, QuarterEndDate, dt)
salesSummary

Alternatives to MAPPLY in R

1 Answers1