Let's say I have a dataset which resembles the one below:
| id | Date | Buyer | diff | Amount | ConsecutiveSum |
|------|:---------:|------:|------|--------|----------------|
| 334 | 6/15/2018 | Simon | NA | 1948 | 0 |
| 334 | 6/20/2018 | Simon | 5 | 4290 | 6238 |
| 334 | 8/17/2018 | Simon | 58 | 4260 | 8550 |
| 334 | 8/20/2018 | Simon | 3 | 79 | 4339 |
| 334 | 8/7/2018 | Wang | NA | 2145 | 0 |
| 334 | 8/9/2018 | Wang | 2 | 4192 | 6337 |
| 5006 | 3/4/2019 | Wang | NA | 1700 | 0 |
| 5006 | 3/7/2019 | Wang | 3 | 335 | 2035 |
| 5006 | 5/5/2019 | Wang | 59 | 4400 | 4735 |
| 5006 | 5/9/2019 | Wang | 4 | 2700 | 7100 |
| 5006 | 5/14/2019 | Wang | 5 | 4355 | 7055 |
| 5006 | 5/17/2019 | Wang | 3 | 3100 | 7455 |
I need to get the transactions whose sum of the amount for consecutive rows >=5000 for the same Buyer and same id but the ones that are within a difference of 5 days (<=5 days). For example in the above dataset, Simon has transactions on 6/15/2018 and 6/20/2018 within a difference of 5 days and the ConsecutiveSum is also >=5000 whereas for the transactions done on 8/17/2018 and 8/20/2018 are also within 5 days difference but the ConsecutiveSum is not greater than or equal to 5000 (So, I don't want these transactions in the output). Also, the transactions done by Wang on 5/5/2019 and 5/9/2019 are within 5 days difference, but I'm able to get only the transaction on 5/9/2019 and not 5/5/2019 based on this post Calculate sum of a column if the difference between consecutive rows meets a condition. How can I restructure the code to include such transactions?
Below is the code followed:
df <- data.frame(id = c("334","334","334","334","334","334","5006","5006","5006","5006","5006","5006"),
Date = c("6/15/2018","6/20/2018","8/17/2018","8/20/2019","8/7/2018","8/9/2018","3/4/2019",
"3/7/2019","5/5/2019","5/9/2019","5/14/2019","5/17/2019"),
Buyer = c("Simon", "Simon", "Simon", "Simon", "Chang", "Chang", "Chang", "Chang", "Chang",
"Chang","Chang","Chang"),
diff = c("NA","5","58","3","NA","2","NA","3","59","4","5","3"),
Amount = c("1948","4290","4260","79","2145","4192","1700","335","4400","2700","4355","3100"),
ConsecutiveSum = c("0","6238","8550","4339","0","6337","0","2035","4735","7100","7055","7455"),stringsAsFactors = F)
df$Date <- as.Date(df$Date, '%m/%d/%Y')
df$Amount <- as.numeric(df$Amount)
df$diff <- as.numeric(df$diff)
df$ConsecutiveSum <- as.numeric(df$ConsecutiveSum)
df_sum = df %>% group_by(Buyer,id) %>%
mutate(rank=dense_rank(Date)) %>%
mutate(ConsecutiveSum = ifelse(is.na(lag(Amount)),0,Amount + lag(Amount , default = 0))) %>%
filter(diff<=5 & ConsecutiveSum>=5000 | ConsecutiveSum==0 & lead(ConsecutiveSum)>=5000)
My expected output should be like the one below:
| id | Date | Buyer | diff | Amount | ConsecutiveSum |
|------|:---------:|------:|------|--------|----------------|
| 334 | 6/15/2018 | Simon | NA | 1948 | 0 |
| 334 | 6/20/2018 | Simon | 5 | 4290 | 6238 |
| 334 | 8/7/2018 | Wang | NA | 2145 | 0 |
| 334 | 8/9/2018 | Wang | 2 | 4192 | 6337 |
| 5006 | 5/5/2019 | Wang | 59 | 4400 | 4735 |
| 5006 | 5/9/2019 | Wang | 4 | 2700 | 7100 |
| 5006 | 5/14/2019 | Wang | 5 | 4355 | 7055 |
| 5006 | 5/17/2019 | Wang | 3 | 3100 | 7455 |