4

This may be very complicated and I suspect requires advanced knowledge. I have now two different types of data.frames I need to combine:

The data:

Dataframe A:

lists all transfusion dates by patient ID. Every transfusion is represented by a separate row, patients can have multiple transfusions. Different patients can have transfusions on the same date.

Patient ID Transfusion.Date
1          01/01/2000
1          01/30/2000
2          04/01/2003
3          04/01/2003

Dataframes of Type B contain test results at other dates, also by patient ID:

Patient ID  Test.Date   Test.Value
1           11/30/1999   negative
1           01/15/2000   700 copies/uL
1           01/27/2000   900 copies/uL
2           03/30/2003   negative

What I would like to have is Dataframe A with the same number of rows (1 for each transfusion), and with the most recent Test.Value as a separate column. Each transfusion date should have the test result from the test performed most closely (prior) to the transfusion.

desired output:

-->

Patient ID Transfusion.Date Pre.Transfusion.Test
1          01/01/2000       negative
1          01/30/2000       900 copies/ul
2          04/01/2003       negative
3          04/01/2003       NA

I think the general strategy would be to subset the data.frames by patient IDs. Then take all transfusion dates for patient 1, check which result is closest to all available test_dates for each element and then return the value closest.

How can I explain R to do that?

Edit 1: Here is the R-code for these examples

df_A <- data.frame(MRN = c(1,1,2,3), 
                   Transfusion.Date = as.Date(c('01/01/2000', '01/30/2000', 
                   '04/01/2003','04/01/2003'),'%m/%d/%Y')) 

df_B <- data.frame(MRN = c(1,1,1,2), 
                   Test.Date = as.Date(c('11/30/1999', '01/15/2000', '01/27/2000', 
                   '03/30/2003'),'%m/%d/%Y'), Test.Result = c('negative', 
                   '700 copies/ul','900 copies/ul','negative'))

Edit 2:

To clarify, the resulting data should be: Patient A received transfusions on Day X and Day Y. (for df_A). Prior to the transfusion on day X, his most recent test result was X (closest test date to first transfusion, in df_B). Prior to the transfusion on day Y, his most recent test result was Y (prior to the second transfusion, also in df_B. df_B also contains a bunch of other test dates, which are not needed for the final output.

Arun
  • 116,683
  • 26
  • 284
  • 387
col. slade
  • 451
  • 4
  • 13
  • 1
    This should get you started `merge(df_A, df_B, by.x = "Patient.ID", by.y = "Patient.ID", all.x = TRUE)` then try using package `lubridate` to add column for the date differences. – JasonAizkalns Nov 26 '14 at 15:52
  • @jaysunice3401 Thank you for your answer, this is where I am now, but this way I get a lot of additional rows for all the times there are more test results than blood transfusions. I think what I need to do is reduce df_B to the same amount of pt ID rows, using the package you told me... looking into that now! – col. slade Nov 26 '14 at 16:02
  • 2
    Also check out `dplyr` for easy `group_by()` summaries via `summarise()` -- group by the combination of Patient ID and Date and then grab the records corresponding to the minimum distances. – JasonAizkalns Nov 26 '14 at 16:04
  • @jaysunice3401, can you help me some more, I have no idea how I need to combine this with BondedDust's code, (?which does not seem to need lubridate?) to come up with a working solution. – col. slade Nov 26 '14 at 19:51
  • @jaysunice3401 thank you for mentioning dplyr, as you suggested, I used merge: df_AB <- merge(df_A, df_B, by.x = "Patient.ID", by.y = "Patient.ID", all.x = TRUE) df_AB %>% mutate(Date.difference = Test.Date - Transfusion.Date) gives me all the date differences, which is very helpful. Now trying to subset all rows with the minimal negative Date.difference – col. slade Nov 27 '14 at 15:40

3 Answers3

6

Here's using data.table's rolling joins:

require(data.table)
setkey(setDT(df_A), MRN, Transfusion.Date)
setkey(setDT(df_B), MRN, Test.Date)

df_B[df_A, roll=TRUE]
#    MRN  Test.Date   Test.Result
# 1:   1 2000-01-01      negative
# 2:   1 2000-01-30 900 copies/ul
# 3:   2 2003-04-01      negative
# 4:   3 2003-04-01            NA
  • setDT converts data.frame to data.table by reference (without any additional copying). That'll result in df_A and df_B now being data.tables.

  • setkey sorts the data.table by the columns we provided, and marks those columns as key columns, which allows us to use binary search based joins.

  • We perform a join of the form x[i] on the key columns, where for each row of i, the matching rows of x (if any, else NA) along with i's rows are returned. This is what we call an equi-join. By adding roll = TRUE, in the event of a mismatch, the last observation is carried forward (LOCF). This is what we call a rolling join. The sorting in increasing order (due to setkey()) ensures that the last observation is the most recent date.

HTH

Arun
  • 116,683
  • 26
  • 284
  • 387
  • wow thank you seems I really need to look into data.table as well, it is very concise and the rolling join / LOCF feature looks awesome. Especially thanks for the explanation what data.table does with the code. Is there a way to specify 'boundaries' for the roll feature, e.g. to only rolling join if the test is on the day or no more than 10 days prior to Transfusion? Again, thanks a lot! – col. slade Dec 08 '14 at 14:49
  • 1
    Thanks, glad to know. Yes, `roll = TRUE` is equivalent to `roll=Inf` - LOCF. You can do `roll = 10` or -10 as well. `roll = -Inf` does NOCB - next observation carried backwards. `roll = "nearest"` rolls to the nearest value. Check `?data.table` and examples as well. We're also in the process of [creating detailed vignettes](https://github.com/Rdatatable/data.table/issues/944). – Arun Dec 08 '14 at 14:59
0
 dfLast <- df_B[ df_B$Test.Date %in% 
  as.Date( tapply(df_B$Test.Date, df_B$MRN, tail,1),"1970-01-01"), ]
 merge(df_A, dfLast, by=c(1:2,1:2) ,all.y=TRUE)
  MRN Transfusion.Date   Test.Result
1   1       2000-01-27 900 copies/ul
2   2       2003-03-30      negative

Edited. Had some logical errors and some sytactic errors. tapply returned the integer values of the Dates and as you pointed out I was using the wrong column name in the data reduction step.

IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • BondedDust, can you explain this solution? It seems to subset df_B but I am not sure does it also work to get the test value that is closest prior to transfusion date? Also there is no Transfusion.Date in df_B, only Test.Date ?!? – col. slade Nov 26 '14 at 16:36
  • I was trying to create a reduced dataset for the transfusion records that had only the last date. We then only merge on `'Patient ID'` so that we can keep both the date values. I'll test it if you add the dput output from your examples as an edit to your question. – IRTFM Nov 26 '14 at 16:46
  • Thank you, I cannot dput the actual data, but I will try to recreate the examples – col. slade Nov 26 '14 at 17:00
  • OK I pasted the R code for the examples above. In reality, patients can have hundreds of test results, which is why it is important to me to filter the number of rows down to the closest test result prior to every transfusion – col. slade Nov 26 '14 at 17:17
  • Thanks, that is close, but can I need to retain all transfusion dates, and I need the pt's test results for the date closest to every transfusion. (this is why in 'desired output', there are two rows for MRN1 and two different test results – col. slade Nov 26 '14 at 17:35
  • So to clarify, the resulting data should be: Patient A received transfusions on Day X and Day Y. (for df_A). Prior to the transfusion on day X, his most recent test result was X (closest test date to first transfusion, in df_B). Prior to the transfusion on day Y, his most recent test result was Y (prior to the second transfusion, also in df_B. df_B also contains a bunch of other test dates, which are not needed for the final output – col. slade Nov 26 '14 at 17:45
  • 1
    Use @jaysunice3401's suggestion and then split before each transfusion and grap the last test. – IRTFM Nov 26 '14 at 17:49
  • OK so I guess dfLast currently gives me the last test result and test value for each _patient_, correct? And I need to modify it to give me the last test result for each patient's _transfusion_. So I need to split df_A by transfusion date?? – col. slade Nov 26 '14 at 18:25
  • can you elaborate a bit more on that? I just started using R and I am not familiar with these functions – col. slade Nov 26 '14 at 20:04
  • The `split` function has a data.frame method. – IRTFM Nov 26 '14 at 22:11
0

OK thanks for everyone's help. It took me a lot of toil, blood, sweat, and tears, but this is the solution I came up with:

  1. Merge both data frames:

df_AB <- merge(df_A, df_B, all.x = T)

df_AB:

  MRN Transfusion.Date  Test.Date   Test.Result
1   1       2000-01-01 1999-11-30      negative
2   1       2000-01-01 2000-01-15 700 copies/ul
3   1       2000-01-01 2000-01-27 900 copies/ul
4   1       2000-01-30 1999-11-30      negative
5   1       2000-01-30 2000-01-15 700 copies/ul
6   1       2000-01-30 2000-01-27 900 copies/ul
7   2       2003-04-01 2003-03-30      negative
8   3       2003-04-01       <NA>          <NA>

Using dplyr

df_tests <- df_AB %>% 
  group_by(MRN, Transfusion.Date) %>%
  mutate(Time.Difference = Transfusion.Date - Test.Date) %>%
  filter(Time.Difference > 0) %>%
  arrange(Time.Difference) %>%
  summarize(Test.Date = Test.Date[1], Test.Result = Test.Result[1])

df_tests:

  MRN Transfusion.Date  Test.Date Test.Result
1   1       2000-01-01 1999-11-30    negative
2   1       2000-01-30 1999-11-30    negative
3   2       2003-04-01 2003-03-30    negative

using merge again for MRN3:

df_desired <- merge(df_A, df_tests, all.x = T)

  MRN Transfusion.Date  Test.Date   Test.Result
1   1       2000-01-01 1999-11-30      negative
2   1       2000-01-30 2000-01-27 900 copies/ul
3   2       2003-04-01 2003-03-30      negative
4   3       2003-04-01       <NA>          <NA>
col. slade
  • 451
  • 4
  • 13