Let's assume we are confronted with two different sized dataframes:
x = data.frame(c("6732", "2312", "3310", "0412", "0312"), c("3310", "0219", "0011", "3324", "3411"),
c("2019-06-01", "2013-01-01", "2019-03-01", "2018-02-01", "2018-01-01"))
colnames(x) = c("ID", "Case_ID", "Date")
x$Date <- as.Date(x$Date)
y = data.frame(c("1243", "6732", "1102", "0412", "2312", "3401", "1123", "3310"), c("0010", "3310", "0123", "3324", "0219", "3210", "0021", "1134"),
c("2014-06-01", "2015-01-01", "2016-01-01", "2015-03-01", "2014-01-01", "2014-01-01", "2014-06-01", "2013-01-01"),
c("2019-11-01", "2019-10-01", "2020-10-01", "2020-01-01", "2017-11-01", "2020-02-01", "2020-02-01", "2020-01-01"))
colnames(y) = c("ID", "Case_ID", "From", "To")
y$From <- as.Date(y$From)
y$To <- as.Date(y$To)
What I now want is to create a new column in dataframe x that is based on the "From" column of dataframe y if some conditions are fulfilled.
I have two different desired outputs:
The first is based on the condition that the "From" entry of the corresponding "ID" and "Case_ID" are filled into the correct row of the new column in x. If the ID or Case_ID is not in y the respective entry should be NA, such that:
ID | Case_ID | Date | From |
---|---|---|---|
6732 | 3310 | 2019-06-01 | 2015-01-01 |
2312 | 0219 | 2013-01-01 | 2014-01-01 |
3310 | 0011 | 2019-03-01 | NA |
0412 | 3324 | 2018-02-01 | 2015-03-01 |
0312 | 3411 | 2018-01-01 | NA |
For the second output, I extended the conditon such that besides the first two conditions of the same "ID" and "Case-ID" the "From" entries of y should only be taken if smaller than the respective "Date" in x, such that:
ID | Case_ID | Date | From |
---|---|---|---|
6732 | 3310 | 2019-06-01 | 2015-01-01 |
2312 | 0219 | 2013-01-01 | NA |
3310 | 0011 | 2019-03-01 | NA |
0412 | 3324 | 2018-02-01 | 2015-03-01 |
0312 | 3411 | 2018-01-01 | NA |
Due to efficiency reasons (huge dataset) I want to avoid for loops and would like to solve this problem with dplyr.
What I tried for the first case:
library(dplyr)
z <- x %>%
mutate(from = ifelse(ID %in% y$ID & Case_ID %in% y$Case_ID , y$From, NA))
z$from <- as.Date(z$from, origin = "1970-01-01")
# using == instead of %in% leads to an error
But in this case I don't get the correct "From" values here, instead I get this:
ID | Case_ID | Date | from |
---|---|---|---|
6732 | 3310 | 2019-06-01 | 2014-06-01 |
2312 | 0219 | 2013-01-01 | 2015-01-01 |
3310 | 0011 | 2019-03-01 | NA |
0412 | 3324 | 2018-02-01 | 2016-01-01 |
0312 | 3411 | 2018-01-01 | NA |
For the second case with the extended condition I get following error:
library(dplyr)
z <- x %>%
mutate(from = ifelse(ID %in% y$ID & Case_ID %in% y$Case_ID & Date >= y$From, y$from, NA))
Error in `mutate()`:
i In argument: `from = ifelse(...)`.
Caused by error:
! `from` must be size 5 or 1, not 8.
I already did some research and I understand the error here. However, I couldn't find any solution for my problem.