This may be very complicated and I suspect requires advanced knowledge. I have now two different types of data.frames I need to combine:
The data:
Dataframe A:
lists all transfusion dates by patient ID. Every transfusion is represented by a separate row, patients can have multiple transfusions. Different patients can have transfusions on the same date.
Patient ID Transfusion.Date
1 01/01/2000
1 01/30/2000
2 04/01/2003
3 04/01/2003
Dataframes of Type B contain test results at other dates, also by patient ID:
Patient ID Test.Date Test.Value
1 11/30/1999 negative
1 01/15/2000 700 copies/uL
1 01/27/2000 900 copies/uL
2 03/30/2003 negative
What I would like to have is Dataframe A with the same number of rows (1 for each transfusion), and with the most recent Test.Value as a separate column. Each transfusion date should have the test result from the test performed most closely (prior) to the transfusion.
desired output:
-->
Patient ID Transfusion.Date Pre.Transfusion.Test
1 01/01/2000 negative
1 01/30/2000 900 copies/ul
2 04/01/2003 negative
3 04/01/2003 NA
I think the general strategy would be to subset the data.frames by patient IDs. Then take all transfusion dates for patient 1, check which result is closest to all available test_dates for each element and then return the value closest.
How can I explain R to do that?
Edit 1: Here is the R-code for these examples
df_A <- data.frame(MRN = c(1,1,2,3),
Transfusion.Date = as.Date(c('01/01/2000', '01/30/2000',
'04/01/2003','04/01/2003'),'%m/%d/%Y'))
df_B <- data.frame(MRN = c(1,1,1,2),
Test.Date = as.Date(c('11/30/1999', '01/15/2000', '01/27/2000',
'03/30/2003'),'%m/%d/%Y'), Test.Result = c('negative',
'700 copies/ul','900 copies/ul','negative'))
Edit 2:
To clarify, the resulting data should be: Patient A received transfusions on Day X and Day Y. (for df_A). Prior to the transfusion on day X, his most recent test result was X (closest test date to first transfusion, in df_B). Prior to the transfusion on day Y, his most recent test result was Y (prior to the second transfusion, also in df_B. df_B also contains a bunch of other test dates, which are not needed for the final output.