0

I am trying to create a program that selects for the closest day within a 30-day range of up to 900 days(1-30,31-60,61-90......871-900). I am using R version 3.3.3.

Here is an example of the dataset I have:

    have <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 
5L, 5L, 6L, 6L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 
9L, 9L, 9L, 9L, 9L, 9L), time.to.first = c(0L, 78L, 293L, 0L, 
63L, 0L, 89L, 0L, 11L, 27L, 0L, 28L, 0L, 29L, 0L, 31L, 381L, 
778L, 0L, 28L, 69L, 96L, 466L, 0L, 28L, 56L, 98L, 154L, 220L, 
294L, 395L, 507L), visit = c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 1L, 
2L, 3L, 1L, 2L, 1L, 2L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L)), .Names = c("id", "time.to.first", 
"visit"), row.names = c(NA, 32L), class = "data.frame")

Here is what I would like:

want <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 
5L, 5L, 6L, 6L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 
9L, 9L, 9L, 9L, 9L, 9L), time.to.first = c(0L, 78L, 293L, 0L, 
63L, 0L, 89L, 0L, 11L, 27L, 0L, 28L, 0L, 29L, 0L, 31L, 381L, 
778L, 0L, 28L, 69L, 96L, 466L, 0L, 28L, 56L, 98L, 154L, 220L, 
294L, 395L, 507L), visit = c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 1L, 
2L, 3L, 1L, 2L, 1L, 2L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L), time.window = structure(c(1L, 
11L, 5L, 1L, 11L, 1L, 11L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 6L, 
7L, 12L, 1L, 2L, 11L, 13L, 9L, 1L, 2L, 6L, 13L, 3L, 4L, 5L, 8L, 
10L), .Label = c("", "1-30", "151-180", "211-240", "271-300", 
"31-60", "361-390", "391-420", "451-480", "481-510", "61-90", 
"751-780", "91-120"), class = "factor")), .Names = c("id", "time.to.first", 
"visit", "time.window"), row.names = c(NA, 32L), class = "data.frame")

I was able to figure out how to create the date range using a series of ifelse statements and filter with a left_join for the first date range (1-30 days):

x <- 1
y <- 30

df <- have %>% group_by(id) %>% 
               mutate(flag  = ifelse(time.to.first >= x & time.to.first <= y,max(visit),""),
                      flag2 = ifelse(flag  == max(flag)  & flag != "",1,"")) %>%
                      filter(flag > 0 & flag2 == 1) %>%
                      filter(visit == max(visit)) %>%
                      mutate(time = paste(x,"-", y, sep = "")) %>% 
                      dplyr::select(time, id, visit) %>%
                      left_join(have, ., by = c("id","visit"))

I was thinking I could use a double nested for loop for the x and y variables in order to create a program that would do the rest of the date ranges, but I understand that nested loops might not be the most efficient way to go about this.

I was trying to think of a way to make the program a little more robust so I could change the timing of the window (form 30 days to 90,180,360 etc...) but I am not sure how to approach this.

I do not want the code written for me but would love with ideas on function or examples that you think might be helpful. I have been having a difficult time finding more information this type of program so any additional links would be helpful!

daszlosek
  • 1,366
  • 10
  • 19
  • 1
    `have %>% mutate(x = cut(time.to.first, seq(0, ceiling(max(time.to.first)/30)*30, by=30)))`? Related: https://stackoverflow.com/q/5746544/1191259 – Frank Aug 08 '18 at 21:24
  • 1
    wow, this is amazing! This gets me 90% of the way there with much more elegant code. I am going to have to read up on cut and ceiling. – daszlosek Aug 08 '18 at 21:31
  • 1
    library `lubridate` has useful `ceiling_date` and `floor_date` functions, and lets you work with durations as days or months etc. – Andrew Lavers Aug 08 '18 at 22:01
  • 1
    how do you want `0` to be handled if the range is 1-900? should it be counted in `1-30` or be `NA`? – andrew_reece Aug 09 '18 at 05:49
  • 1
    Na would be fine @Frank seems to have answered the majority of my question. I just wrapped up touching it up so I will add it as the answer – daszlosek Aug 09 '18 at 13:48

1 Answers1

1

Using the code snipped from @Frank I am able to do the following:

want1 <- have %>% mutate(x = as.character(cut(time.to.first, seq(0, ceiling(max(time.to.first)/30)*30, by=30)))) %>% group_by(id,x) %>%  filter(visit == max(visit)) %>% mutate(y = x) %>% ungroup()

In the last step, I take out the cases where more than one visit is selected for a specific time range and I take the latest visit (which corresponds to the date closest to the top of the time range). All I have to do now it a simple str_extract_all and maybe a grepl to doll up the time range code.

daszlosek
  • 1,366
  • 10
  • 19