How to Aggregate Dates By Difference And Average Distance in Days?

Question

I have a database of cash register transactions. The records are split by Products in a Basket:

     Date    Hour  Cust  Prod Basket Spend
1| 20160416    8    C1    P1    B2     10
2| 20160416    8    C1    P2    B2     20
3| 20160115   15    C1    P3    B1     30
4| 20160115   15    C1    P2    B1     50
5| 20161023   11    C1    P4    B3     60

I would like to see:

DaysSinceLastVisit  Cust Basket Spend
      NULL           C1    B1     30
        92           C1    B2     80
       190           C1    B3     60

AND

AvgDaysBetweenVisits Cust AvgSpent
          141         C1    56.57

I can't figure out how to perform aggregate functions on Dates during a GROUP BY. All the other posts on SO seem to have 2 for start/end dates [1] [2] [3].

Here's what I have tried so far:

SELECT SUM(DATE(Date)), Cust, Basket, SUM(Spend) FROM 'a' GROUP BY CUST_CODE,BASKET # Sums the numeric values
SELECT DIFF(DATE(Date)), Cust, Basket, AVG(Spend) FROM 'a' GROUP BY CUST_CODE,BASKET # DIFF/DIFFERENCE not a function

Also, it should be noted that I'm running this on r with sqldf, which uses SQLite syntax. However, I'd prefer an SQLite solution.

Please provide 1. the output of the commands you ran and 2. a `dput()` of the sample data. I believe the problem is probably the class and format of the date column. Note also that you could skip the SQL and do this with `data.table` if you wanted. — Hack-R, Dec 02 '16 at 23:26
@Hack-R I didn't add a dput because the data set is too big to fit, even with droplevels and head: http://stackoverflow.com/questions/40919844/how-to-make-dput-remove-superfluous-data — Travis Heeter, Dec 03 '16 at 00:00

score 1 · Accepted Answer · answered Dec 02 '16 at 23:36

Try this-

df <- data.frame("Date"=c("20160416","20160416","20160115","20160115","20161023"),
             "Hour"=c(8,8,15,15,11), "Cust"=c("C1","C1","C1","C1","C1"),
             "Prod"=c("P1","P2","P3","P2","P4"), "Basket"=c("B2","B2","B1","B1","B3"),
             "Spend"=c(10,20,30,50,60))

df$Date <- as.Date(df$Date, format = "%Y%m%d")

# Aggregate the data first
df2 <- aggregate(Spend ~ Date + Cust + Basket, data = df, FUN = sum)

# Now get days since last visit
df2$Date <- c(0, diff(df2$Date, 1))

# And finally
df3 <- aggregate(cbind(Date, Spend) ~ Cust, data = df2, FUN = mean)

score 1 · Answer 2 · 2016-12-03T06:47:34.597

day_since_last_visit is with respect to today's date+time , as it is more practical. However if you get the difference btween 1st and 2nd and 2nd and 3rd, it will be 92 and 190, which is similar to your data. Best way to handle that part will be in cursor, can be done in query too, but will be bit more complex..

   select   round( julianday('now')  - min (   julianday (substr(date,1,4)  || "-"||substr(date,5,2)  || "-"|| substr(date,7) )  ) ,2 )      days_since_last_visit,
           date, cust, basket, sum(spend) total_spend 
     from customer
 group by  cust, basket, date

Average for date visited and today's date for each record

   select  round(avg( julian_days) ,2)  average_days , cust,   round(avg(total_spend) ,2)  average_spent
     from 
           ( select   julianday('now')  - min (   julianday (substr(date,1,4)  || "-"||substr(date,5,2)  || "-"|| substr(date,7) )  )      julian_days, date,
                      cust, basket, sum(spend) total_spend
               from customer
           group by  cust, basket, date )
 group by cust

create and insert script only for reference

 create table customer ( date text , hour  integer, cust text, prod text, basket text, spend integer )

 insert into customer ( date, hour, cust, prod, basket, spend ) values ( "20161023", 11, "C1", "P4", "B3",60)

G. Grothendieck · Answer 3 · 2016-12-04T23:07:57.837

This uses SQLite via sqldf as in the question.

We first define three tables (only for the duration of the SQL statement) in the with clause:

aa is table a with an additional julian date column suitable for differencing
tab_days is a table that uses aa for defining the differenced days via an appropriately aggregated join
tab_sum_spend is a table that constains the Spend sums

Finally we join the last two and sort appropriately.

library(sqldf) 
# see note at end for a in reproducible form

t1 <- sqldf("
WITH aa AS (SELECT julianday(substr(Date, 1, 4) || '-' ||
                             substr(Date, 5, 2) || '-' ||
                             substr(Date, 7, 2)) juldate, 
                   * 
            FROM a),     
     tab_days AS (SELECT a1.Date, min(a1.juldate - a2.juldate) Days, a1.Cust, a1.Basket
                  FROM   aa a1
                          LEFT JOIN aa a2 ON a1.Date > a2.Date AND a1.Cust = a2.Cust
                  GROUP  BY a1.Cust, a1.Date, a1.Basket),
     tab_sum_spend AS (SELECT Cust, Date, Basket, sum(Spend) Spend
                       FROM   aa
                       GROUP  BY Cust, Date, Basket) 
SELECT Days, Cust, Basket, Spend
FROM tab_days
JOIN tab_sum_spend USING(Cust, Date, Basket)
ORDER  BY Cust, Date, Basket
")
t1

##    Days Cust Basket Spend
## 1  <NA>   C1     B1    80
## 2  92.0   C1     B2    30
## 3 190.0   C1     B3    60

and for the second question:

sqldf("SELECT avg(Days)  AvgDays, Cust, avg(Spend) AvgSpend FROM   t1")
##   AvgDays Cust AvgSpend
## 1     141   C1 56.66667

Note: The data.frame a in reproducible form is:

Lines <- "Date Hour Cust Prod Basket Spend
1 20160416    8   C1   P1     B2    10
2 20160416    8   C1   P2     B2    20
3 20160115   15   C1   P3     B1    30
4 20160115   15   C1   P2     B1    50
5 20161023   11   C1   P4     B3    60"
a <- read.table(text = Lines, as.is = TRUE)

How to Aggregate Dates By Difference And Average Distance in Days?

3 Answers3