I have a file (appx 9K records) that I want to aggregate based on the group first, and then on dates that are within seven days of each other. However, I'm not understanding why the results look the way they do. I realize there are other ways I could achieve the same results with this particular example, but it's going to be much more complicated and there are other reasons I'm interested in using tibbletime. Here's a reproducible example:
library(tidyverse)
library(lubridate)
library(tibbletime) #devtools::install_github("business-science/tibbletime")
TEST_ROLL <- as_tibble(list(
CITY_ID = c("1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2",
"2", "2", "2"),
CAFE_ID = c("1001", "1001", "1001", "1001", "2002", "2002", "2002", "2002",
"3003", "3003", "3003", "3003", "4004", "4004", "4005", "4005"),
HEALTH_REPORT_ID = c("1425", "1532", "1666", "1989", "2166", "2457", "2776",
"2999", "3409", "3497", "3595", "3786", "4105", "4511", "4567", "4580"),
REPORT_SUBMIT_DATE = ymd( c( "2013-05-26", "2013-05-27", "2013-05-31",
"2013-05-31", "2016-05-26", "2016-05-27", "2016-05-31", "2016-05-31", "2018-
05-26", "2018-05-27", "2018-05-31", "2018-05-31", "2017-01-01", "2017-01-
05", "2017-02-04", "2017-02-10"))))
What I want to do is, starting at the first report for each cafe, count the health report submissions that occur within seven days of each other as only one report so they're not being over counted.
First I tried using "7 day" as the period:
TEST_ROLL %>%
group_by(CAFE_ID) %>%
as_tbl_time(REPORT_SUBMIT_DATE) %>%
mutate(ROLL_DATE = collapse_index(REPORT_SUBMIT_DATE, "7 day"))
# A time tibble: 16 x 5
# Index: REPORT_SUBMIT_DATE
# Groups: CAFE_ID [5]
CITY_ID CAFE_ID HEALTH_REPORT_ID REPORT_SUBMIT_DATE ROLL_DATE
<chr> <chr> <chr> <date> <date>
1 1 1001 1425 2013-05-26 2013-05-27
2 1 1001 1532 2013-05-27 2013-05-27
3 1 1001 1666 2013-05-31 2013-05-31
4 1 1001 1989 2013-05-31 2013-05-31
5 1 2002 2166 2016-05-26 2016-05-27
6 1 2002 2457 2016-05-27 2016-05-27
7 1 2002 2776 2016-05-31 2016-05-31
8 1 2002 2999 2016-05-31 2016-05-31
9 1 3003 3409 2018-05-26 2018-05-27
10 1 3003 3497 2018-05-27 2018-05-27
11 1 3003 3595 2018-05-31 2018-05-31
12 1 3003 3786 2018-05-31 2018-05-31
13 2 4004 4105 2017-01-01 2017-01-05
14 2 4004 4511 2017-01-05 2017-01-05
15 2 4005 4567 2017-02-04 2017-02-04
16 2 4005 4580 2017-02-10 2017-02-10
Which is not what I want. If it worked, all four of the reports for cafe 1001 would have the same roll date, because they are all within 7 days. So why the split into two dates in the result column?
Just playing around with it I tried using "weekly" instead of "7 days" and then I get this result:
TEST_ROLL %>%
group_by(CAFE_ID) %>%
as_tbl_time(REPORT_SUBMIT_DATE) %>%
mutate(ROLL_DATE = collapse_index(REPORT_SUBMIT_DATE, "weekly"))
# A time tibble: 16 x 5
# Index: REPORT_SUBMIT_DATE
# Groups: CAFE_ID [5]
CITY_ID CAFE_ID HEALTH_REPORT_ID REPORT_SUBMIT_DATE ROLL_DATE
<chr> <chr> <chr> <date> <date>
1 1 1001 1425 2013-05-26 2013-05-31
2 1 1001 1532 2013-05-27 2013-05-31
3 1 1001 1666 2013-05-31 2013-05-31
4 1 1001 1989 2013-05-31 2013-05-31
5 1 2002 2166 2016-05-26 2016-05-27
6 1 2002 2457 2016-05-27 2016-05-27
7 1 2002 2776 2016-05-31 2016-05-31
8 1 2002 2999 2016-05-31 2016-05-31
9 1 3003 3409 2018-05-26 2018-05-26
10 1 3003 3497 2018-05-27 2018-05-31
11 1 3003 3595 2018-05-31 2018-05-31
12 1 3003 3786 2018-05-31 2018-05-31
13 2 4004 4105 2017-01-01 2017-01-05
14 2 4004 4511 2017-01-05 2017-01-05
15 2 4005 4567 2017-02-04 2017-02-04
16 2 4005 4580 2017-02-10 2017-02-10
Cafe 1001 is exactly what I wanted, but cafe 2002 and 3003 have the same dates (different year), yet the result is different.
And cafe 4004 is combined how I want, but there are only 6 days in between for cafe 4005, so those should have been combined as well. (I will summarise/count them later)
Any ideas why this might be happening? Thanks!!