Calculating sum based on differences of dates by grouping 2 or more columns

Question

Let's say I have a dataset which resembles the one below:

| id   |    Date   | Buyer | diff | Amount | ConsecutiveSum |
|------|:---------:|------:|------|--------|----------------|
| 334  | 6/15/2018 | Simon | NA   | 1948   | 0              |
| 334  | 6/20/2018 | Simon | 5    | 4290   | 6238           |
| 334  | 8/17/2018 | Simon | 58   | 4260   | 8550           |
| 334  | 8/20/2018 | Simon | 3    | 79     | 4339           |
| 334  | 8/7/2018  | Wang  | NA   | 2145   | 0              |
| 334  | 8/9/2018  | Wang  | 2    | 4192   | 6337           |
| 5006 | 3/4/2019  | Wang  | NA   | 1700   | 0              |
| 5006 | 3/7/2019  | Wang  | 3    | 335    | 2035           |
| 5006 | 5/5/2019  | Wang  | 59   | 4400   | 4735           |
| 5006 | 5/9/2019  | Wang  | 4    | 2700   | 7100           |
| 5006 | 5/14/2019 | Wang  | 5    | 4355   | 7055           |
| 5006 | 5/17/2019 | Wang  | 3    | 3100   | 7455           |

I need to get the transactions whose sum of the amount for consecutive rows >=5000 for the same Buyer and same id but the ones that are within a difference of 5 days (<=5 days). For example in the above dataset, Simon has transactions on 6/15/2018 and 6/20/2018 within a difference of 5 days and the ConsecutiveSum is also >=5000 whereas for the transactions done on 8/17/2018 and 8/20/2018 are also within 5 days difference but the ConsecutiveSum is not greater than or equal to 5000 (So, I don't want these transactions in the output). Also, the transactions done by Wang on 5/5/2019 and 5/9/2019 are within 5 days difference, but I'm able to get only the transaction on 5/9/2019 and not 5/5/2019 based on this post Calculate sum of a column if the difference between consecutive rows meets a condition. How can I restructure the code to include such transactions?

Below is the code followed:

df <- data.frame(id = c("334","334","334","334","334","334","5006","5006","5006","5006","5006","5006"),
      Date = c("6/15/2018","6/20/2018","8/17/2018","8/20/2019","8/7/2018","8/9/2018","3/4/2019",
             "3/7/2019","5/5/2019","5/9/2019","5/14/2019","5/17/2019"), 
      Buyer = c("Simon", "Simon", "Simon", "Simon", "Chang", "Chang", "Chang", "Chang", "Chang",
              "Chang","Chang","Chang"), 
      diff = c("NA","5","58","3","NA","2","NA","3","59","4","5","3"),
      Amount = c("1948","4290","4260","79","2145","4192","1700","335","4400","2700","4355","3100"), 
      ConsecutiveSum = c("0","6238","8550","4339","0","6337","0","2035","4735","7100","7055","7455"),stringsAsFactors = F)  

df$Date <- as.Date(df$Date, '%m/%d/%Y')
df$Amount <- as.numeric(df$Amount)
df$diff <- as.numeric(df$diff)
df$ConsecutiveSum <- as.numeric(df$ConsecutiveSum)

df_sum = df %>% group_by(Buyer,id) %>%
         mutate(rank=dense_rank(Date)) %>%
         mutate(ConsecutiveSum = ifelse(is.na(lag(Amount)),0,Amount  + lag(Amount , default = 0))) %>% 
         filter(diff<=5 & ConsecutiveSum>=5000 | ConsecutiveSum==0 & lead(ConsecutiveSum)>=5000)

My expected output should be like the one below:

| id   |    Date   | Buyer | diff | Amount | ConsecutiveSum |
|------|:---------:|------:|------|--------|----------------|
| 334  | 6/15/2018 | Simon | NA   | 1948   | 0              |
| 334  | 6/20/2018 | Simon | 5    | 4290   | 6238           |
| 334  |  8/7/2018 |  Wang | NA   | 2145   | 0              |
| 334  | 8/9/2018  | Wang  | 2    | 4192   | 6337           |
| 5006 | 5/5/2019  | Wang  | 59   | 4400   | 4735           |
| 5006 | 5/9/2019  | Wang  | 4    | 2700   | 7100           |
| 5006 | 5/14/2019 | Wang  | 5    | 4355   | 7055           |
| 5006 | 5/17/2019 | Wang  | 3    | 3100   | 7455           |

@akrun I think so because it would rank in different ways in which each buyer has ordered an item — hk2, Sep 18 '19 at 19:31
In your expected output the line 5 has a `8592` value in `ConsecutiveSum`. Where did it come from? — PavoDive, Sep 19 '19 at 02:16

score 1 · Accepted Answer · answered Sep 19 '19 at 01:31

Here is a possibility using hidden variables keep1 and keep2. First repeat all the lines in your example until df$ConsecutiveSum <- as.numeric(df$ConsecutiveSum) and then:

df %>% replace_na(list(diff=0)) %>% 
    mutate(keep1=ifelse((ConsecutiveSum>=5000 & diff<=5), 1, 0)) %>% 
    mutate(keep2=ifelse(lead(keep1)==1, 1, 0)) %>% 
    filter(keep1==1|keep2==1) %>% select(-keep1,-keep2)

The result is:

    id       Date Buyer diff Amount ConsecutiveSum
1  334 2018-06-15 Simon    0   1948              0
2  334 2018-06-20 Simon    5   4290           6238
3  334 2018-08-07 Chang    0   2145              0
4  334 2018-08-09 Chang    2   4192           6337
5 5006 2019-05-05 Chang   59   4400           4735
6 5006 2019-05-09 Chang    4   2700           7100
7 5006 2019-05-14 Chang    5   4355           7055
8 5006 2019-05-17 Chang    3   3100           7455

score 0 · Answer 2 · answered Sep 19 '19 at 02:24

Following what you want, I came up with a very simple idea that does indeed follow your logic and provides the expected result (please notice that row 5 of your expected result doesn't come from the toy data.frame provided).

library(data.table)

setDT(df)

# create a column with day differences between consecutive dates of Buyer AND id:

df[, lagdays := c(NA, diff(Date)), by = .(id, Buyer)]

# Filter the cases in which: lagdays are either less than 5 or NA (first row in a Buyer-id combination) AND consecutiveSum is either greater than 5000 OR 0 (first row in a buyer-id combination).

# lagdays := NULL removes the helper variable

df[(lagdays <= 5 | is.na(lagdays)) & (ConsecutiveSum == 0 | ConsecutiveSum >= 5000), ][, lagdays := NULL][]

     id       Date Buyer diff Amount ConsecutiveSum
1:  334 2018-06-15 Simon   NA   1948              0
2:  334 2018-06-20 Simon    5   4290           6238
3:  334 2018-08-07 Chang   NA   2145              0
4:  334 2018-08-09 Chang    2   4192           6337
5: 5006 2019-03-04 Chang   NA   1700              0
6: 5006 2019-05-09 Chang    4   2700           7100
7: 5006 2019-05-14 Chang    5   4355           7055
8: 5006 2019-05-17 Chang    3   3100           7455

Calculating sum based on differences of dates by grouping 2 or more columns

2 Answers2