1

I have a file (appx 9K records) that I want to aggregate based on the group first, and then on dates that are within seven days of each other. However, I'm not understanding why the results look the way they do. I realize there are other ways I could achieve the same results with this particular example, but it's going to be much more complicated and there are other reasons I'm interested in using tibbletime. Here's a reproducible example:

library(tidyverse)
library(lubridate)  
library(tibbletime) #devtools::install_github("business-science/tibbletime")

TEST_ROLL <- as_tibble(list(
CITY_ID = c("1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", 
"2", "2", "2"),
CAFE_ID = c("1001", "1001", "1001", "1001", "2002", "2002", "2002", "2002", 
"3003", "3003", "3003", "3003", "4004", "4004", "4005", "4005"),
HEALTH_REPORT_ID = c("1425", "1532", "1666", "1989", "2166", "2457", "2776", 
"2999", "3409", "3497", "3595", "3786", "4105", "4511", "4567", "4580"),
REPORT_SUBMIT_DATE = ymd( c( "2013-05-26", "2013-05-27", "2013-05-31", 
"2013-05-31", "2016-05-26", "2016-05-27", "2016-05-31", "2016-05-31", "2018- 
05-26", "2018-05-27", "2018-05-31", "2018-05-31", "2017-01-01", "2017-01- 
05", "2017-02-04", "2017-02-10"))))

What I want to do is, starting at the first report for each cafe, count the health report submissions that occur within seven days of each other as only one report so they're not being over counted.

First I tried using "7 day" as the period:

TEST_ROLL %>% 
  group_by(CAFE_ID) %>% 
  as_tbl_time(REPORT_SUBMIT_DATE) %>% 
  mutate(ROLL_DATE = collapse_index(REPORT_SUBMIT_DATE, "7 day"))

# A time tibble: 16 x 5
# Index:  REPORT_SUBMIT_DATE
# Groups: CAFE_ID [5]
CITY_ID CAFE_ID HEALTH_REPORT_ID REPORT_SUBMIT_DATE ROLL_DATE 
<chr>   <chr>   <chr>            <date>             <date>    
 1 1       1001    1425             2013-05-26         2013-05-27
 2 1       1001    1532             2013-05-27         2013-05-27
 3 1       1001    1666             2013-05-31         2013-05-31
 4 1       1001    1989             2013-05-31         2013-05-31
 5 1       2002    2166             2016-05-26         2016-05-27
 6 1       2002    2457             2016-05-27         2016-05-27
 7 1       2002    2776             2016-05-31         2016-05-31
 8 1       2002    2999             2016-05-31         2016-05-31
 9 1       3003    3409             2018-05-26         2018-05-27
10 1       3003    3497             2018-05-27         2018-05-27
11 1       3003    3595             2018-05-31         2018-05-31
12 1       3003    3786             2018-05-31         2018-05-31
13 2       4004    4105             2017-01-01         2017-01-05
14 2       4004    4511             2017-01-05         2017-01-05
15 2       4005    4567             2017-02-04         2017-02-04
16 2       4005    4580             2017-02-10         2017-02-10

Which is not what I want. If it worked, all four of the reports for cafe 1001 would have the same roll date, because they are all within 7 days. So why the split into two dates in the result column?

Just playing around with it I tried using "weekly" instead of "7 days" and then I get this result:

TEST_ROLL %>% 
  group_by(CAFE_ID) %>% 
  as_tbl_time(REPORT_SUBMIT_DATE) %>%
  mutate(ROLL_DATE = collapse_index(REPORT_SUBMIT_DATE, "weekly"))

# A time tibble: 16 x 5
# Index:  REPORT_SUBMIT_DATE
# Groups: CAFE_ID [5]
   CITY_ID CAFE_ID HEALTH_REPORT_ID REPORT_SUBMIT_DATE ROLL_DATE 
   <chr>   <chr>   <chr>            <date>             <date>    
 1 1       1001    1425             2013-05-26         2013-05-31
 2 1       1001    1532             2013-05-27         2013-05-31
 3 1       1001    1666             2013-05-31         2013-05-31
 4 1       1001    1989             2013-05-31         2013-05-31
 5 1       2002    2166             2016-05-26         2016-05-27
 6 1       2002    2457             2016-05-27         2016-05-27
 7 1       2002    2776             2016-05-31         2016-05-31
 8 1       2002    2999             2016-05-31         2016-05-31
 9 1       3003    3409             2018-05-26         2018-05-26
10 1       3003    3497             2018-05-27         2018-05-31
11 1       3003    3595             2018-05-31         2018-05-31
12 1       3003    3786             2018-05-31         2018-05-31
13 2       4004    4105             2017-01-01         2017-01-05
14 2       4004    4511             2017-01-05         2017-01-05
15 2       4005    4567             2017-02-04         2017-02-04
16 2       4005    4580             2017-02-10         2017-02-10

Cafe 1001 is exactly what I wanted, but cafe 2002 and 3003 have the same dates (different year), yet the result is different.

And cafe 4004 is combined how I want, but there are only 6 days in between for cafe 4005, so those should have been combined as well. (I will summarise/count them later)

Any ideas why this might be happening? Thanks!!

Steffen Moritz
  • 7,277
  • 11
  • 36
  • 55
kpress
  • 136
  • 6
  • `collapse_index(REPORT_SUBMIT_DATE, "weekly")` collapses dates that belong to the same week which starts on a Sunday. IMHO, though, it will be much easier to just accept the default behavior of `collapse_index` than what you are trying to achieve. – hpesoj626 Apr 06 '18 at 01:39
  • Yeah I've since realized why the "weekly" is behaving this way, but it's interesting that using "7 days" still doesn't get the result I want. I actually tried 8 days, and that seems to work on my larger dataset, although I'm still case checking. – kpress Apr 06 '18 at 02:05
  • See my answer if it helps, @Knachman – hpesoj626 Apr 06 '18 at 07:23
  • The 7-day period is important, I don't care at all about calendar weeks.This is sort of the first step in the analysis, and I will have to then group by date and identify those reports that are within 45 days of the first report date, then after the report date as outcomes, and before the report date as prior history. – kpress Apr 06 '18 at 15:23
  • I wonder if this is happening because of the way collapse_index uses a "start date", if it is calculating every 7 day period based on the first date in the index that might not work with what I'm trying to do. – kpress Apr 06 '18 at 16:08

1 Answers1

0

I don't know if what you are trying to accomplish is a wise choice. I think that going with the default behavior of collapse_index(REPORT_SUBMIT_DATE, "weekly") is the sane thing to do.

Here is one approach if you want to still go with what you are trying to do, though. I think you need to have a very good overview of which days are within 7 days of each other first.

Date <- TEST_ROLL$REPORT_SUBMIT_DATE
truth_mat <- abs(sapply(Date, 'difftime', Date, unit = 'day')) < 7
indices <- which(truth_mat, arr.ind = TRUE)
as_tibble(indices) %>% group_by(row) %>%
  summarise_at(vars(col), paste, collapse = ', ')

# # A tibble: 16 x 2
#      row col          
#    <int> <chr>        
#  1     1 1, 2, 3, 4   
#  2     2 1, 2, 3, 4   
#  3     3 1, 2, 3, 4   
#  4     4 1, 2, 3, 4   
#  5     5 5, 6, 7, 8   
#  6     6 5, 6, 7, 8   
#  7     7 5, 6, 7, 8   
#  8     8 5, 6, 7, 8   
#  9     9 9, 10, 11, 12
# 10    10 9, 10, 11, 12
# 11    11 9, 10, 11, 12
# 12    12 9, 10, 11, 12
# 13    13 13, 14       
# 14    14 13, 14       
# 15    15 15, 16       
# 16    16 15, 16 

We can see that {1,2,3,4}, {5,6,7,8}, {9,10,11,12}, {13,14} and {15,16} are forming clusters. Let us if hclust can detect these clusters.

hc <- hclust(dist(Date))
plot(hc)

enter image description here

Here, we can see that we can cut the tree into five branches and we get the desired grouping. We see that the dendogram suggests what we have observed so far. The good thing about going the hclust route is that we can easily specify these groupings.

TEST_ROLL$Group <- cutree(hc, 5)

TEST_ROLL
# # A tibble: 16 x 5
#    CITY_ID CAFE_ID HEALTH_REPORT_ID REPORT_SUBMIT_DATE  Date
#    <chr>   <chr>   <chr>            <date>             <int>
#  1 1       1001    1425             2013-05-26             1
#  2 1       1001    1532             2013-05-27             1
#  3 1       1001    1666             2013-05-31             1
#  4 1       1001    1989             2013-05-31             1
#  5 1       2002    2166             2016-05-26             2
#  6 1       2002    2457             2016-05-27             2
#  7 1       2002    2776             2016-05-31             2
#  8 1       2002    2999             2016-05-31             2
#  9 1       3003    3409             2018-05-26             3
# 10 1       3003    3497             2018-05-27             3
# 11 1       3003    3595             2018-05-31             3
# 12 1       3003    3786             2018-05-31             3
# 13 2       4004    4105             2017-01-01             4
# 14 2       4004    4511             2017-01-05             4
# 15 2       4005    4567             2017-02-04             5
# 16 2       4005    4580             2017-02-10             5

Note that hclust uses method = 'complete' as default Euclidean distance. You can experiment with other methods as you wish. Check ?hclust for details.

Edit

I just realized that you can also directly use the groupings found in truth_mat and indices this way.

groups <- as_tibble(indices) %>% group_by(row) %>%
  summarise_at(vars(col), paste, collapse = ', ') 
TEST_ROLL$group <- groups$col

Then you can group_by the group column without the need for hclust.

hpesoj626
  • 3,529
  • 1
  • 17
  • 25
  • Yeah that's kind of the problem - this is a small example, but we regularly do this on files of hundreds of thousands of records, so what I'm really looking for is a simple (preferably tidy) way to do this date roll. I will check out using hclust though and see if I can make it work. – kpress Apr 06 '18 at 15:20
  • I just realized that you can use the grouping found with `truth_mat` and `indices` above without using `hclust`. – hpesoj626 Apr 07 '18 at 01:59