0
> df <- data.frame('unique_ref' = c("a_2016","a_2017","a_2017","a_2016","a_2016"),
+                  'trans_type' = c('NB','NB','CANC','MTA','MTA'),
+                  'incept_dt' = c('01/01/2016','01/01/2017','01/01/2017','01/01/2016','01/01/2016'),
+                  'exp_dt' = c('31/12/2016','31/12/2017','31/12/2017','31/12/2016','31/12/2016'),
+                  'trans_dt' = c('01/01/2016','01/01/2017','01/03/2017','01/07/2016','01/09/2016'))
> df
  unique_ref trans_type  incept_dt     exp_dt   trans_dt
1     a_2016         NB 01/01/2016 31/12/2016 01/01/2016
2     a_2017         NB 01/01/2017 31/12/2017 01/01/2017
3     a_2017       CANC 01/01/2017 31/12/2017 01/03/2017
4     a_2016        MTA 01/01/2016 31/12/2016 01/07/2016
5     a_2016        MTA 01/01/2016 31/12/2016 01/09/2016

I have the above dataset format which has a unique_ref and a few dates. I want to be able to sort this dataset by the unique ref and incept_dt and trans_dt:

> df %>% arrange(unique_ref,incept_dt,trans_dt)
  unique_ref trans_type  incept_dt     exp_dt   trans_dt
1     a_2016         NB 01/01/2016 31/12/2016 01/01/2016
2     a_2016        MTA 01/01/2016 31/12/2016 01/07/2016
3     a_2016        MTA 01/01/2016 31/12/2016 01/09/2016
4     a_2017         NB 01/01/2017 31/12/2017 01/01/2017
5     a_2017       CANC 01/01/2017 31/12/2017 01/03/2017

Now with this sorted dataset I want to create a new column called trans_end_dt which looks at the row below and picks up that rows trans_dt less 1 day. It should do this for every unique_ref, but stop once it has reached the final unique_ref in that grouping and pick up the exp_dt. I.e. the result should be:

> df %>% arrange(unique_ref,incept_dt,trans_dt)
  unique_ref trans_type  incept_dt     exp_dt   trans_dt trans_end_dt
1     a_2016         NB 01/01/2016 31/12/2016 01/01/2016 30/06/2016 #this is 01/07/2016 minus one day
2     a_2016        MTA 01/01/2016 31/12/2016 01/07/2016 31/08/2016 #same logic as above
3     a_2016        MTA 01/01/2016 31/12/2016 01/09/2016 31/12/2016 #next row is a new unique_ref so the value should just be the exp_dt which is 31/12/2016
4     a_2017         NB 01/01/2017 31/12/2017 01/01/2017 28/02/2017
5     a_2017       CANC 01/01/2017 31/12/2017 01/03/2017 31/12/2017

Does anyone know how I can do this? Preferably using dplyr but I am struggling to get this to work so any solution would be great

user33484
  • 740
  • 2
  • 9
  • 36

2 Answers2

1

My approach was to:

  • Change columns ending in dt to Date format
  • Sort as previously done by unique_ref,incept_dt,trans_dt
  • For each unique_ref (using group_by) check if last row in group; if it is, use exp_dt, and if not, use lead(trans_dt) to get the following row data element and subtract 1

    library(dplyr)
    
    df %>% 
      mutate_at(vars(ends_with("_dt")), as.Date, format = "%d/%m/%Y") %>%
      arrange(unique_ref,incept_dt,trans_dt) %>%
      group_by(unique_ref) %>%
      mutate(trans_end_dt = if_else(row_number() == n(), exp_dt,  lead(trans_dt) - 1))
    
      unique_ref trans_type incept_dt  exp_dt     trans_dt   trans_end_dt
      <fct>      <fct>      <date>     <date>     <date>     <date>      
    1 a_2016     NB         2016-01-01 2016-12-31 2016-01-01 2016-06-30  
    2 a_2016     MTA        2016-01-01 2016-12-31 2016-07-01 2016-08-31  
    3 a_2016     MTA        2016-01-01 2016-12-31 2016-09-01 2016-12-31  
    4 a_2017     NB         2017-01-01 2017-12-31 2017-01-01 2017-02-28  
    5 a_2017     CANC       2017-01-01 2017-12-31 2017-03-01 2017-12-31 
    

Edit: As suggested by @Gregor, you can alternatively replace if/else with coalesce (which will fill in NA values):

mutate(trans_end_dt = coalesce(lead(trans_dt) - 1, exp_dt))

Edit: The group_by statement is needed so that the following operations after the %>% pipe are performed "by group." Even though the data has been sorted via arrange, the following mutate statement would consider all of the rows at one time, instead of subgroups a_2016, a_2017, etc. individually.

This is necessary for row_number() == n() where it compares the row number evaluated within the group to the total number of rows in the group (see Select first and last row from grouped data regarding selecting last row of grouped data).

For example, when unique_ref is grouped, there are 3 rows (n() = 3) for a_2016, so if row_number() is 3, essentially the third line in the a_2016 group (or last in the group), then use exp_dt.

Ben
  • 28,684
  • 5
  • 23
  • 45
  • Thanks for this. Could you explain how row_number() == n() works? Why do we need to group_by here when it is already grouped? – user33484 Aug 22 '19 at 19:14
  • Please see edited answer and let me know if helpful. – Ben Aug 22 '19 at 19:29
  • 1
    One suggestion, I think you could replace the `ifelse` with `trans_end_dt = coalesce(lead(train_dt) - 1, exp_dt)`. `lead` will give an `NA` if there is nothing in the next row, and `coalesce` is useful for filling in `NA`s, a little more cleanly than `ifelse` – Gregor Thomas Aug 22 '19 at 19:31
  • Thank you - will edit and provide as alternative in answer. Much appreciated! – Ben Aug 22 '19 at 19:39
  • This is great. Thank you for the explanation. – user33484 Aug 22 '19 at 19:43
0

You can use lead with the default being the last day of exp_dt

library(tidyverse)
library(lubridate)

df %>% 
  arrange(unique_ref,incept_dt,trans_dt)%>%
  group_by(unique_ref)%>%
  mutate(trans_end_dt = strftime(lead(dmy(trans_dt)-1,1,dmy(exp_dt[n()])),"%d/%m/%Y"))#or format instead of strptime

# A tibble: 5 x 6
# Groups:   unique_ref [2]
  unique_ref trans_type incept_dt  exp_dt     trans_dt   trans_end_dt
  <fct>      <fct>      <fct>      <fct>      <fct>      <chr>       
1 a_2016     NB         01/01/2016 31/12/2016 01/01/2016 30/06/2016  
2 a_2016     MTA        01/01/2016 31/12/2016 01/07/2016 31/08/2016  
3 a_2016     MTA        01/01/2016 31/12/2016 01/09/2016 31/12/2016  
4 a_2017     NB         01/01/2017 31/12/2017 01/01/2017 28/02/2017  
5 a_2017     CANC       01/01/2017 31/12/2017 01/03/2017 31/12/2017 
Onyambu
  • 67,392
  • 3
  • 24
  • 53