1

I have a set of data with ID's and dates.

For each overlapping date, I would like to replace the overlap with the next row (ie. combine the overlapping dates).

NB: There are ID's that have only one row and therefore do not need to be altered. There are those that don't have an overlap and need to remain as they are (ie. with two rows).

Example data:

ID Start      End
1  2007-02-01 2007-03-03  
1  2007-03-01 2007-03-31  
1  2007-09-01 2008-07-31  
6  2011-02-05 2011-03-12  
5  2012-11-16 2012-12-26  
4  2015-01-03 2015-02-14  
3  2008-08-02 2008-09-11  
7  2010-09-22 2010-10-22  
7  2010-09-24 2010-10-24  
7  2010-09-26 2010-10-26  
7  2010-09-28 2010-10-28


ID Start      End
1  2007-02-01 2007-03-31  
1  2007-09-01 2008-07-31  
6  2011-02-05 2011-03-12  
5  2012-11-16 2012-12-26  
4  2015-01-03 2015-02-14  
3  2008-08-02 2008-09-11  
7  2010-09-22 2010-10-28
sar
  • 182
  • 6
  • 26
  • 1
    Can you share your data using `dput()`? See more here https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Tung Mar 18 '18 at 02:20
  • 1
    Try `library(dplyr); df1 %>% group_by(ID) %>% summarise(start = first(start), end = last(end))` – akrun Mar 18 '18 at 02:30
  • I have looked at this question here: https://stackoverflow.com/questions/38617840/how-to-remove-rows-from-a-dataframe-that-have-overlapping-start-and-end-dates-in?rq=1 But these solutions unfortunately don't work for me – sar Mar 18 '18 at 03:52
  • setDT(df) df.1 <- df[, gr := cumsum(start - shift(end, fill=1) != 1), ][, list(start=min(start), end=max(end)), by=.(ID)] – sar Mar 27 '18 at 10:05
  • Solved with solution from @David Arunberg https://stackoverflow.com/questions/28938147/how-to-flatten-merge-overlapping-time-periods-in-r – sar Mar 28 '18 at 04:07

1 Answers1

2

Based on the example, after grouping by 'ID', we take the first of 'start' and last of 'end'

library(dplyr)
df1 %>%
   group_by(ID) %>%
   summarise(start = first(start), end = last(end)) 

Update

Based on the updated example in the OP's post

library(data.table)
df1 %>% 
    mutate_at(2:3, as.Date, format = "%d/%m/%y") %>%
    group_by(ID) %>% 
    group_by(grp = rleid(lead(start, default = last(start)) < end), add = TRUE) %>% 
    summarise(start = first(start), end = last(end)) %>%
    ungroup %>% 
    select(-grp)  %>% 
    mutate_at(2:3, format, format = "%d/%m/%y")
# A tibble: 7 x 3
#     ID start    end     
#  <int> <chr>    <chr>   
#1    84 27/03/09 21/07/17
#2    92 20/04/12 25/01/17
#3   108 12/12/14 25/08/17
#4   111 31/01/14 18/11/16
#5   114 10/04/13 15/07/13
#6   130 05/01/11 04/03/12
#7   130 15/05/12 27/09/13

data

df1 <- structure(list(ID = c(84L, 84L, 92L, 92L, 92L, 108L, 111L, 114L, 
130L, 130L), start = c("27/03/09", "23/02/13", "20/04/12", "18/07/14", 
"5/12/15", "12/12/14", "31/01/14", "10/04/13", "5/01/11", "15/05/12"
), end = c("24/03/13", "21/07/17", "17/08/14", "4/01/16", "25/01/17", 
"25/08/17", "18/11/16", "15/07/13", "4/03/12", "27/09/13")), .Names = c("ID", 
"start", "end"), class = "data.frame", row.names = c(NA, -10L
 ))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • thank you for the suggestion. When I run this I get just one start date and one end date. – sar Mar 18 '18 at 03:04
  • Also I am not sure this approach would work for ID's that need to end up with several rows. I have changed my example data above to reflect this option – sar Mar 18 '18 at 03:13
  • thank you @akrun. When I try your solution I get this error Error in UseMethod("tbl_vars") : no applicable method for 'tbl_vars' applied to an object of class "character" – sar Mar 27 '18 at 10:04
  • perhaps the problem is that my ID is a combination of letters and numbers and is in character format – sar Mar 27 '18 at 10:21
  • 1
    @sar I tried by converting the 'ID' to character, it is still working. What is your `dplyr` version? – akrun Mar 27 '18 at 10:34
  • yes @akrun, you are right, it does work with character. apologies, it was an error in my transcription of your code – sar Mar 27 '18 at 10:39
  • now i get this error - Error in FUN(X[[i]], ...) : object 'grp' not found – sar Mar 27 '18 at 10:40
  • With your example data I get this result # A tibble: 8 x 3 ID start end 1 84 NA NA 2 92 NA NA 3 92 NA 04/01/0016 4 92 05/12/0015 NA 5 108 12/12/0014 NA 6 111 NA NA 7 114 10/04/0013 NA 8 130 05/01/0011 NA – sar Mar 27 '18 at 10:57
  • 1
    @sar I don't get any NAs. what is your dplyr version. I use `0.7.4` – akrun Mar 27 '18 at 11:05
  • thank you @akrun. I found the problems was that I had loaded plyr prior to dplyr. After removing plyr it ran, however, it doesn't actually combine the overlapping rows per ID. – sar Mar 27 '18 at 22:29
  • ID start end 4 27/03/09 23/02/13 4 23/02/13 21/07/17 2 20/04/12 05/12/15 2 05/12/15 25/01/17 8 12/12/14 25/08/17 1 31/01/14 18/11/16 7 10/04/13 15/07/13 9 19/03/14 31/07/15 3 27/02/15 20/10/17 10 11/04/14 18/08/17 – sar Mar 27 '18 at 22:29
  • 1
    Solved! Just needed >= simple! Thank you @akrun~ – sar Mar 27 '18 at 23:41
  • Unfortunately, I thought this was working but it when I have tried it on other sets of data it doesn't work. I will upload some more data as an example – sar Mar 28 '18 at 03:22
  • Solved with a solution from @David Arunberg https://stackoverflow.com/questions/28938147/how-to-flatten-merge-overlapping-time-periods-in-r – sar Mar 28 '18 at 04:06
  • Also the additional comment from @Zachary Keller was also very useful. I needed to use the min and max commands in his solution – sar Apr 05 '18 at 03:43