3

I am looking for a method that will look at each date in "Date A" and find the next nearest date after that value in "Date B" by ID (group_by). I then want to calculate the difference in days. Below is the table that I would like.

ID    |  Date A    |  Date B     | Difference| 
11111 |  09/01/21  |  09/03/21   | 2         |
22222 |  09/06/21  |  09/20/21   | 11        | 
11111 |  09/08/21  |  09/18/21   | 10        |  
44444 |  09/04/21  |  NA         | 11        | 
44444 |  09/10/21  |  09/15/21   | 5         |
22222 |  NA        |  09/17/21   | NA        | 
77777 |  NA        |  10/16/21   | NA        |
77777 |  09/04/21  |  10/17/21   | 24        |
77777 |  09/01/21  |  09/28/21   | 27        |

If you could please help me out with this, I would greatly appreciate it!

Cheers

brandooo23
  • 105
  • 7
  • How big is this dataset? –  May 05 '22 at 16:57
  • It is about 1500 rows. I do have access to python and excel as well. But I am trying to get away from Excel as it tends to bog my computer way down. – brandooo23 May 05 '22 at 17:07
  • Do you ever have the same value for Date A or Date b within the same ID? Said another way, within each ID group is Date A unique? Is Date B unique as well? – jmuhlenkamp May 06 '22 at 19:28
  • Also, is Date B always > Date A? – jmuhlenkamp May 06 '22 at 19:31
  • Likely some relevant/useful answers in this question: https://stackoverflow.com/questions/37289405/dplyr-left-join-by-less-than-greater-than-condition – jmuhlenkamp May 06 '22 at 20:23
  • Date A and Date B are not necessarily unique. Date A are scheduled events and Date B are when the event actually happened. – brandooo23 May 06 '22 at 20:46
  • Date B is not necessarily > Date A either. If one person has two events scheduled under Date A but 3 weeks apart, then the first event for that person under Date B will be < the second scheduled event under Date A. – brandooo23 May 06 '22 at 20:47

1 Answers1

1

A dplyr solution via group_by solution is not obvious to me here, but here is a relatively straightforward sqldf solution. Presumably this could be translated into a dplyr solution if you really wanted.

First mock up the data within R

df <- dplyr::tribble(
  ~'ID', ~'Date A', ~'Date B',
  11111,  '09/01/21', '09/03/21',
  22222,  '09/06/21', '09/20/21',
  11111,  '09/08/21', '09/18/21',
  44444,  '09/04/21', NA        ,
  44444,  '09/10/21', '09/15/21',
  22222,  NA        , '09/17/21',
  77777,  NA        , '10/16/21',
  77777,  '09/04/21', '10/17/21',
  77777,  '09/01/21', '09/28/21'
)
df$`Date A` <- lubridate::mdy(df$`Date A`)
df$`Date B` <- lubridate::mdy(df$`Date B`)
df

Which looks like

# A tibble: 9 x 3
     ID `Date A`   `Date B`  
  <dbl> <date>     <date>    
1 11111 2021-09-01 2021-09-03
2 22222 2021-09-06 2021-09-20
3 11111 2021-09-08 2021-09-18
4 44444 2021-09-04 NA        
5 44444 2021-09-10 2021-09-15
6 22222 NA         2021-09-17
7 77777 NA         2021-10-16
8 77777 2021-09-04 2021-10-17
9 77777 2021-09-01 2021-09-28

Then do an inequality join combined with a group by. The column I is added to allow for nuances of the data such as multiple of the same Date A within each ID

df$I <- 1:nrow(df)

df <- sqldf::sqldf('
SELECT a.I, a.ID, a."Date A", a."Date B",
  MIN(b."Date B") AS NextB
FROM df a
LEFT JOIN df b
  ON a.ID = b.ID
  AND a."Date A" < b."Date B"
GROUP BY a.I, a.ID, a."Date A", a."Date B"
ORDER BY a.I
')

df$Difference = df$NextB - as.integer(df$`Date A`)
df$I <- NULL
df$NextB <- NULL
df

Which matches your example data (and should generalize well for edge cases not in your example data). Unclear how well it might scale up to non-trivial data.

     ID     Date A     Date B Difference
1 11111 2021-09-01 2021-09-03          2
2 22222 2021-09-06 2021-09-20         11
3 11111 2021-09-08 2021-09-18         10
4 44444 2021-09-04       <NA>         11
5 44444 2021-09-10 2021-09-15          5
6 22222       <NA> 2021-09-17         NA
7 77777       <NA> 2021-10-16         NA
8 77777 2021-09-04 2021-10-17         24
9 77777 2021-09-01 2021-09-28         27
jmuhlenkamp
  • 2,102
  • 1
  • 14
  • 37
  • I don't have access to the data currently, but will definitely give this one a go when I return on Monday. What package is allowing you to use SQL queries? – brandooo23 May 06 '22 at 20:52
  • `sqldf` package. `sqldf::sqldf(...)` is similar to `library(sqldf)` followed by `sqldf(...)` – jmuhlenkamp May 07 '22 at 22:00