
I hope this makes sense - it's my first post here so I'm sorry if the question is badly formed.

I have tables OldData and NewData:

OldData

ID DateFrom   DateTo      Priority
1  2018-11-01 2018-12-01* 5
1  2018-12-01 2019-02-01  5
2  2017-06-01 2018-03-01  5
2  2018-03-01 2018-04-05* 5

NewData

ID DateFrom   DateTo      Priority
1  2018-11-13 2018-12-01  6
2  2018-03-21 2018-05-01  6

I need to merge these tables as below. Where IDs match, dates overlap, and Priority is higher in NewData, I need to update the dates in OldData to reflect NewData.

ID DateFrom   DateTo     Priority
1  2018-11-01 2018-11-13 5
1  2018-11-13 2018-12-01 6
1  2018-12-01 2019-02-01 5
2  2017-06-01 2018-03-01 5
2  2018-03-01 2018-03-21 5
2  2018-03-21 2018-05-01 6

I first tried to run nested for loops through each table, matching the criteria and making changes one at a time, but I'm sure there is a much better way, e.g. possibly using SQL in R?
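The "SQL in R" idea could look something like the sketch below, using the sqldf package. This is not the full update logic, just the overlap join I have in mind (find every OldData row that overlaps a higher-priority NewData row):

```r
# Rough sketch of the "SQL in R" idea with sqldf -- only the overlap join,
# not the date-truncation step.
library(sqldf)

OldData <- read.table(header = TRUE, text = "
ID DateFrom   DateTo     Priority
1  2018-11-01 2018-12-01 5
1  2018-12-01 2019-02-01 5
2  2017-06-01 2018-03-01 5
2  2018-03-01 2018-04-05 5")
NewData <- read.table(header = TRUE, text = "
ID DateFrom   DateTo     Priority
1  2018-11-13 2018-12-01 6
2  2018-03-21 2018-05-01 6")

# ISO dates compare correctly as strings, so this yields one row per
# overlapping (old, new) pair of spans:
sqldf("
  SELECT o.ID, o.DateFrom, o.DateTo, n.DateFrom AS NewFrom, n.DateTo AS NewTo
  FROM   OldData o
  JOIN   NewData n
    ON   o.ID = n.ID
   AND   n.Priority > o.Priority
   AND   n.DateFrom < o.DateTo
   AND   n.DateTo   > o.DateFrom
")
```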

keepalex
  • If you're looking for a `merge`/join operation, then having 4 rows + 2 rows go to 6 rows doesn't make sense. Is this just a filtered `rbind` instead? – r2evans Sep 29 '20 at 14:18
  • Maybe merge is not right, but the important point is that I need to edit the existing rows of OldData. `rbind` would simply be adding in the NewData without affecting the old, right? – keepalex Sep 29 '20 at 14:21
  • There are two fields that are "updated" (`DateTo`, two rows), but it's not clear if that is because of a typo or due to actual update-logic. (ID 1 first row is `11-01` to `12-01` in OldData, but in the desired output that same row is changed to be `11-01` to `11-13`.) – r2evans Sep 29 '20 at 14:46
  • That's desired. NewData has higher priority, so the timespan for ID 1 (2018-11-13 to 2018-12-01) 'overwrites' the OldData timespan, and cuts it short. – keepalex Sep 29 '20 at 14:51

2 Answers


To my understanding, this should be helpful:

library(data.table)

df_new <- setDT(df_new)
df_old <- setDT(df_old)

df_all <- rbind(df_old, df_new)

df_all[, .SD[.N], by = .(ID, DateFrom, DateTo)] 

You simply rbind both data frames, then group the result by ID, DateFrom and DateTo. Within each group you extract the last row (i.e. the latest). This results in a data frame that is basically equal to df_old, except that in some cases the values are 'updated' with the values from df_new. Should df_new contain new groups (i.e. combinations of ID, DateFrom and DateTo), then those rows are included as well.
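To see the mechanics on a toy example (my own two one-row frames, not the asker's data): where the group key repeats, `.SD[.N]` keeps the later row, i.e. the one from df_new:

```r
library(data.table)

# Toy frames with the same group key (ID, DateFrom, DateTo) but
# different Priority values.
df_old <- data.table(ID = 1L, DateFrom = "2018-11-01",
                     DateTo = "2018-12-01", Priority = 5L)
df_new <- data.table(ID = 1L, DateFrom = "2018-11-01",
                     DateTo = "2018-12-01", Priority = 6L)

df_all <- rbind(df_old, df_new)   # df_new rows come last
df_all[, .SD[.N], by = .(ID, DateFrom, DateTo)]
# one row per group; the df_new row (Priority 6) survives
```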

Edit: (after your comment)

df_all[, .SD[.N], by = .(ID, DateFrom)] 
peter
  • Thanks - I tried this but it does not update the OldData rows as required. e.g. the first record for ID 1 should have DateTo changed to 2018-11-13, because that's when the first record for ID 1 in NewData begins. – keepalex Sep 29 '20 at 15:00
  • Ah now I understand better. You simply have to take `DateTo` out of the group: `df_all[, .SD[.N], by = .(ID, DateFrom)]` – peter Sep 29 '20 at 15:22
  • Your edit doesn't seem to affect the output for me. I edited the post with stars to show the DateFrom columns that should be 'overwritten' – keepalex Sep 29 '20 at 15:39
  • Ah, I think your description regarding the dates is not very understandable. But I now guess what you mean with overlapping is when df_new$DateFrom > df_old$DateFrom and df_new$DateFrom < df_old$DateTo. Is that correct? – peter Sep 29 '20 at 16:07
  • I'm sure you're right that it's not a good description, and maybe the sample data doesn't explain it very well either. But yes, overlapping can be OldData$DateFrom < NewData$DateFrom < OldData$DateTo and/or OldData$DateFrom < NewData$DateTo < OldData$DateTo – keepalex Sep 29 '20 at 16:22
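Spelled out as code, the overlap rule agreed on in the comments above might look like the sketch below (`overlaps` is a hypothetical helper name, not something from the posts):

```r
# Sketch of the overlap rule from the comment thread: two date ranges
# overlap when each one starts before the other ends.
overlaps <- function(from1, to1, from2, to2) {
  from1 < to2 & from2 < to1
}

# ID 1: OldData 2018-11-01..2018-12-01 vs NewData 2018-11-13..2018-12-01
overlaps(as.Date("2018-11-01"), as.Date("2018-12-01"),
         as.Date("2018-11-13"), as.Date("2018-12-01"))
# TRUE
```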

In general, I interpret this to be an rbind operation with some cleanup: per-ID, if there is any overlap in the date ranges, then the lower-priority date range is truncated to match. Though not shown in the data, if you have situations where two higher-priority rows may completely negate a middle row, then you might need to add to the logic (it might then turn into an iterative process).

tidyverse

library(dplyr)
out_tidyverse <- bind_rows(OldData, NewData) %>%
  arrange(ID, DateFrom) %>%
  group_by(ID) %>%
  mutate(
    DateTo = if_else(row_number() < n() &
                       DateTo > lead(DateFrom) & Priority < lead(Priority),
                     lead(DateFrom), DateTo),
    DateFrom = if_else(row_number() > 1 &
                         DateFrom < lag(DateTo) & Priority < lag(Priority),
                       lag(DateTo), DateFrom)
  ) %>%
  ungroup()
out_tidyverse
# # A tibble: 6 x 4
#      ID DateFrom   DateTo     Priority
#   <int> <chr>      <chr>         <int>
# 1     1 2018-11-01 2018-11-13        5
# 2     1 2018-11-13 2018-12-01        6
# 3     1 2018-12-01 2019-02-01        5
# 4     2 2017-06-01 2018-03-01        5
# 5     2 2018-03-01 2018-03-21        5
# 6     2 2018-03-21 2018-05-01        6

### confirm it is the same as your expected output
all(mapply(`==`, FinData, out_tidyverse))
# [1] TRUE

data.table

I am using magrittr here in order to break out the flow in a readable fashion, but it is not required. If you're comfortable with data.table by itself, then translating from the magrittr::`%>%` to native data.table piping should be straightforward.

Also, I am using as.data.table instead of the often-preferred side-effect setDT, primarily so that you don't run it on your production frames and then not realize that many data.frame operations on those two frames now behave somewhat differently (setDT converts in place, by reference). If you're up for using data.table, then feel free to step around this precaution.
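To illustrate the difference (a minimal sketch, independent of the frames in this answer):

```r
library(data.table)

df <- data.frame(x = 1:3)

dt <- as.data.table(df)   # copy: df itself is untouched
class(df)
# [1] "data.frame"

setDT(df)                 # by reference: df itself is now converted
class(df)
# [1] "data.table" "data.frame"
```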

library(data.table)
library(magrittr)
OldDT <- as.data.table(OldData)
NewDT <- as.data.table(NewData)
out_DT <- rbind(OldDT, NewDT) %>%
  .[ order(ID, DateFrom), ] %>%
  .[, .i := seq_len(.N), by = .(ID) ] %>%
  .[, DateTo := fifelse(.i < .N &
                          DateTo > shift(DateFrom, type = "lead") &
                          Priority < shift(Priority, type = "lead"),
                        shift(DateFrom, type = "lead"), DateTo),
    by = .(ID) ] %>%
  .[, DateFrom := fifelse(.i > 1 &
                            DateFrom < shift(DateTo) &
                            Priority < shift(Priority),
                          shift(DateTo), DateFrom),
    by = .(ID) ] %>%
  .[, .i := NULL ]

out_DT[]
#    ID   DateFrom     DateTo Priority
# 1:  1 2018-11-01 2018-11-13        5
# 2:  1 2018-11-13 2018-12-01        6
# 3:  1 2018-12-01 2019-02-01        5
# 4:  2 2017-06-01 2018-03-01        5
# 5:  2 2018-03-01 2018-03-21        5
# 6:  2 2018-03-21 2018-05-01        6

all(mapply(`==`, FinData, out_DT))
# [1] TRUE

Data:

OldData <- read.table(header = TRUE, text="
ID DateFrom   DateTo     Priority
1  2018-11-01 2018-12-01 5
1  2018-12-01 2019-02-01 5
2  2017-06-01 2018-03-01 5
2  2018-03-01 2018-04-05 5")
NewData <- read.table(header = TRUE, text="
ID DateFrom   DateTo     Priority
1  2018-11-13 2018-12-01 6
2  2018-03-21 2018-05-01 6")

FinData <- read.table(header = TRUE, text="
ID DateFrom   DateTo     Priority
1  2018-11-01 2018-11-13 5
1  2018-11-13 2018-12-01 6
1  2018-12-01 2019-02-01 5
2  2017-06-01 2018-03-01 5
2  2018-03-01 2018-03-21 5
2  2018-03-21 2018-05-01 6")
r2evans
  • This is really helpful - thank you. The logic of binding rows and then using `lead()` and `lag()` is exactly what I needed. I still need to do some tweaking for my main dataset, but now I have a good base to build on. Thanks again. – keepalex Sep 29 '20 at 17:38