0

Let's assume we are confronted with two different sized dataframes:

x = data.frame(c("6732", "2312", "3310", "0412", "0312"), c("3310", "0219", "0011", "3324", "3411"), 
               c("2019-06-01", "2013-01-01", "2019-03-01", "2018-02-01", "2018-01-01"))

colnames(x) = c("ID", "Case_ID", "Date")
x$Date <- as.Date(x$Date)
y = data.frame(c("1243", "6732", "1102", "0412", "2312", "3401", "1123", "3310"), c("0010", "3310", "0123", "3324", "0219", "3210", "0021", "1134"), 
               c("2014-06-01", "2015-01-01", "2016-01-01", "2015-03-01", "2014-01-01", "2014-01-01", "2014-06-01", "2013-01-01"), 
               c("2019-11-01", "2019-10-01", "2020-10-01", "2020-01-01", "2017-11-01", "2020-02-01", "2020-02-01", "2020-01-01"))
colnames(y) = c("ID", "Case_ID", "From", "To")
y$From <- as.Date(y$From)
y$To <- as.Date(y$To)

What I now want is to create a new column in dataframe x that is based on the "From" column of dataframe y if some conditions are fulfilled.

I have two different desired outputs:

The first is based on the condition that the "From" entry of the corresponding "ID" and "Case_ID" are filled into the correct row of the new column in x. If the ID or Case_ID is not in y the respective entry should be NA, such that:

ID Case_ID Date From
6732 3310 2019-06-01 2015-01-01
2312 0219 2013-01-01 2014-01-01
3310 0011 2019-03-01 NA
0412 3324 2018-02-01 2015-03-01
0312 3411 2018-01-01 NA

For the second output, I extended the conditon such that besides the first two conditions of the same "ID" and "Case-ID" the "From" entries of y should only be taken if smaller than the respective "Date" in x, such that:

ID Case_ID Date From
6732 3310 2019-06-01 2015-01-01
2312 0219 2013-01-01 NA
3310 0011 2019-03-01 NA
0412 3324 2018-02-01 2015-03-01
0312 3411 2018-01-01 NA

Due to efficiency reasons (huge dataset) I want to avoid for loops and would like to solve this problem with dplyr.

What I tried for the first case:

library(dplyr)

z <- x %>%
  mutate(from = ifelse(ID %in% y$ID & Case_ID %in% y$Case_ID , y$From, NA))
 z$from <- as.Date(z$from, origin = "1970-01-01")

  # using == instead of %in% leads to an error

But in this case I don't get the correct "From" values here, instead I get this:

ID Case_ID Date from
6732 3310 2019-06-01 2014-06-01
2312 0219 2013-01-01 2015-01-01
3310 0011 2019-03-01 NA
0412 3324 2018-02-01 2016-01-01
0312 3411 2018-01-01 NA

For the second case with the extended condition I get following error:

library(dplyr)

z <- x %>%
  mutate(from = ifelse(ID %in% y$ID & Case_ID %in% y$Case_ID & Date >= y$From, y$from, NA))

Error in `mutate()`:
i In argument: `from = ifelse(...)`.
Caused by error:
! `from` must be size 5 or 1, not 8.

I already did some research and I understand the error here. However, I couldn't find any solution for my problem.

Hans109
  • 3
  • 2
  • Your first case would be a `dplyr::left_join()` using `ID` and `CaseID` as keys. Your second case is the same, followed by `dplyr::filter(From < Date)`. See [How to join (merge) data frames (inner, outer, left, right)](https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right). – zephryl Feb 24 '23 at 12:29
  • 1
    Nice, thank you! For the second case I extended the code, such that: z <- x %>% left_join(y, by = c("ID", "Case_ID")) %>% mutate(From = as.Date(ifelse(From < Date, From, NA), origin = "1970-01-01")) – Hans109 Feb 24 '23 at 13:04
  • 1
    Yes, thanks. But what do I do in the case where multiple ID's and Case_ID's match and the only way to find the correct entry is by expanding the condition as in the second case? – Hans109 Feb 24 '23 at 13:27
  • I don’t quite follow, can you elaborate? Or post as a new question, if you don’t think it’s covered in the linked thread about joins. – zephryl Feb 24 '23 at 14:52
  • It’s possible that some of the advanced join features introduced in dplyr 1.1.0 would help — see [this post](https://www.tidyverse.org/blog/2023/01/dplyr-1-1-0-joins/), particularly the sections on inequality joins and rolling joins. – zephryl Feb 24 '23 at 15:00

1 Answers1

0

First dataframe:

df1 <- x %>% 
  left_join(y, by = c("ID", "Case_ID")) %>% 
  select(-To)

Second dataframe:

df2 <- df1 %>% 
  mutate(
    From2 = as.Date(ifelse(From < Date, From, NA), origin = "1970-01-01")
  )

Same approach as zephryl pointed out, just without filtering df2.

thesixmax
  • 164
  • 1
  • 9
  • Thanks, I now did it like this: z <- x %>% left_join(y, by = c("ID", "Case_ID")) %>% mutate(From = as.Date(ifelse(From < Date, From, NA), origin = "1970-01-01")) – Hans109 Feb 24 '23 at 13:06
  • That works. Please mark as accepted if you believe it answered your question. – thesixmax Feb 24 '23 at 13:14
  • Another question: what do I do in the case where multiple ID's and Case_ID's match and the only way to find the correct entry is by expanding the condition as in the second case? – Hans109 Feb 24 '23 at 13:28
  • 1
    That depends on the specific case. I suggest you open a new question if that issue pops up. But probably read up on the different joins beforehand. – thesixmax Feb 24 '23 at 13:34