0

I am trying to bring rows from a list within the tibble I have based on conditions on different dates I have. I am hoping to solve this with tidyverse library.

Here is a type of a data I have:

 table_age <- structure(list(id = c(1, 1, 2, 3, 4, 5, 6), age_band = c("5_9", 
"5_9", "10_14", "15-19", "20-24", "5_9", "10_14"), start_date = c("2020-01-01 00:08", 
"2020-02-01 00:00", "2020-01-08 10:08", "2020-01-02 17:08", "2020-01-08 16:08", 
"2020-01-10 08:08", "2020-01-03 09:08"), end_date = c("2020-01-04 10:08", 
"2020-02-11 00:00", "2020-01-09 10:08", "2020-01-03 19:08", "2020-01-11 16:08", 
"2019-01-30 08:08", "2020-01-05 09:08")), row.names = c(NA, -7L
), class = c("tbl_df", "tbl", "data.frame"))

And it looks like this:

table_age
# A tibble: 7 x 4
     id age_band start_date       end_date        
  <dbl> <chr>    <chr>            <chr>           
1     1 5_9      2020-01-01 00:08 2020-01-04 10:08
2     1 5_9      2020-02-01 00:00 2020-02-11 00:00
3     2 10_14    2020-01-08 10:08 2020-01-09 10:08
4     3 15-19    2020-01-02 17:08 2020-01-03 19:08
5     4 20-24    2020-01-08 16:08 2020-01-11 16:08
6     5 5_9      2020-01-10 08:08 2019-01-30 08:08
7     6 10_14    2020-01-03 09:08 2020-01-05 09:08
> 

And a second data type I have is:

structure(list(id = c(1, 1, 2, 2, 3, 4, 5, 6), med_name_one = c("Co-amoxiclav", 
"doxycycline", "Gentamicin", "Co-trimoxazole", "Gentamicin", 
"Co-trimoxazole", "Sodium Chloride", "Piperacillin"), med_name_two = c(NA, 
"Gentamicin", "Co-trimoxazole", NA, NA, NA, NA, NA), mg_one = c("411 mg", 
"120 mg", "11280 mg", "8 mg", "11280 mg", "8 mg", "411 mg", "120 mg"
), mg_two = c(NA, "11280 mg", "8 mg", NA, NA, NA, NA, NA), administration_datetime = c("2020-01-03 10:08", 
"2020-01-01 11:08", "2020-01-02 19:08", "2020-01-08 20:08", "2020-01-02 19:08", 
"2020-01-08 20:08", "2019-01-30 08:08", "2020-01-03 09:08")), row.names = c(NA, 
-8L), class = c("tbl_df", "tbl", "data.frame"))

And the look of it:

    table
# A tibble: 8 x 6
     id med_name_one    med_name_two   mg_one   mg_two   administration_datetime
  <dbl> <chr>           <chr>          <chr>    <chr>    <chr>                  
1     1 Co-amoxiclav    NA             411 mg   NA       2020-01-03 10:08       
2     1 doxycycline     Gentamicin     120 mg   11280 mg 2020-01-01 11:08       
3     2 Gentamicin      Co-trimoxazole 11280 mg 8 mg     2020-01-02 19:08       
4     2 Co-trimoxazole  NA             8 mg     NA       2020-01-08 20:08       
5     3 Gentamicin      NA             11280 mg NA       2020-01-02 19:08       
6     4 Co-trimoxazole  NA             8 mg     NA       2020-01-08 20:08       
7     5 Sodium Chloride NA             411 mg   NA       2019-01-30 08:08       
8     6 Piperacillin    NA             120 mg   NA       2020-01-03 09:08  

Now the conditions in which I have to merge these 2 datasets are:

  1. if the administration_datetime in table is within the intervals of start_date and end_date from table_age then merge all the info from both tibles

  2. if the administration_datetime in table is outside of the intervals of start_date and end_date from table_age, keep id info from table_age and give NA for the rows in table. The reason for keeping this types of rows is because they occur in table_age and do not want to loose the information from this specific table. It seems the opposite condition in point 3, down bellow but it is not.

  3. yet, if the administration_datetimein table is outside of both date time intervals in the table_age, then do not merge those rows at all. Because if I do, then the interval date_times will repeat itself. And I do not want this happening.

This is the type of table I want:

table_answer
# A tibble: 8 x 9
     id med_name_one   med_name_two mg_one   mg_two   administration_datetime age_band start_date       end_date        
  <dbl> <chr>          <chr>        <chr>    <chr>    <chr>                   <chr>    <chr>            <chr>           
1     1 Co-amoxiclav   NA           411 mg   NA       2020-01-03 10:08        5_9      2020-01-01 00:08 2020-01-04 10:08
2     1 NA             NA           NA       NA       NA                      5_9      2020-02-01 00:00 2020-02-11 00:00
3     1 doxycycline    Gentamicin   120 mg   11280 mg 2020-01-01 11:08        5_9      2020-01-01 00:08 2020-01-04 10:08
4     2 Co-trimoxazole NA           8 mg     NA       2020-01-08 20:08        10_14    2020-01-08 10:08 2020-01-09 10:08
5     3 Gentamicin     NA           11280 mg NA       2020-01-02 19:08        15-19    2020-01-02 17:08 2020-01-03 19:08
6     4 Co-trimoxazole NA           8 mg     NA       2020-01-08 20:08        20-24    2020-01-08 16:08 2020-01-11 16:08
7     5 NA             NA           NA       NA       NA                      5_9      2020-01-10 08:08 2019-01-30 08:08
8     6 Piperacillin   NA           120 mg   NA       2020-01-03 09:08        10_14    2020-01-03 09:08 2020-01-05 09:08
  1. I have tried to use the inner_join from tidyverse and did not succeed table_joined <- inner_join(table_age, table)

Check this out:

table_joined
# A tibble: 10 x 9
      id age_band start_date       end_date        med_name_one    med_name_two   mg_one  mg_two  administration_dateti…
   <dbl> <chr>    <chr>            <chr>           <chr>           <chr>          <chr>   <chr>   <chr>                 
 1     1 5_9      2020-01-01 00:08 2020-01-04 10:… Co-amoxiclav    NA             411 mg  NA      2020-01-03 10:08      
 2     1 5_9      2020-01-01 00:08 2020-01-04 10:… doxycycline     Gentamicin     120 mg  11280 … 2020-01-01 11:08      
 3     1 5_9      2020-02-01 00:00 2020-02-11 00:… Co-amoxiclav    NA             411 mg  NA      2020-01-03 10:08      
 4     1 5_9      2020-02-01 00:00 2020-02-11 00:… doxycycline     Gentamicin     120 mg  11280 … 2020-01-01 11:08      
 5     2 10_14    2020-01-08 10:08 2020-01-09 10:… Gentamicin      Co-trimoxazole 11280 … 8 mg    2020-01-02 19:08      
 6     2 10_14    2020-01-08 10:08 2020-01-09 10:… Co-trimoxazole  NA             8 mg    NA      2020-01-08 20:08      
 7     3 15-19    2020-01-02 17:08 2020-01-03 19:… Gentamicin      NA             11280 … NA      2020-01-02 19:08      
 8     4 20-24    2020-01-08 16:08 2020-01-11 16:… Co-trimoxazole  NA             8 mg    NA      2020-01-08 20:08      
 9     5 5_9      2020-01-10 08:08 2019-01-30 08:… Sodium Chloride NA             411 mg  NA      2019-01-30 08:08      
10     6 10_14    2020-01-03 09:08 2020-01-05 09:… Piperacillin    NA             120 mg  NA      2020-01-03 09:08   
  1. I have tried to nest the table into table_age and then try to get the rows wanted with a function I have created and then apply the map function from purrr library:

    table_nested <- nest_join(table_age, table)
    
    
    get_medication_name <- function(medication_name_df) {
    
     medication_name <- medication_name_df %>%
     dplyr::group_by(id) %>%
     dplyr::arrange(administered_datetime) %>%
     pull(med_name_one)
    
    }
    
    
    table_answer <- mutate(medication_name = purrr::map(table_nested, get_medication_name))
    

And yet as I am aware I do not put any conditions there, I have at least hoped to get a bit closer with this function. On top of this I get all sorts of errors.

Is there a way to achieve this, yet I will favor an outcome along the lines of my 2nd solution. Of course I can go with along other lines but with second I can build on something more extensive I have already.

GaB
  • 1,076
  • 2
  • 16
  • 29

1 Answers1

1

You could use data.table which allows non-equi joins.

Note that the character dates need to be converted to POSIXct so that non-equi joins work.

library(data.table)                                                                                                                                                                                                                                                                                                                                  -8L), class = c("tbl_df", "tbl", "data.frame"))
setDT(table_age)
setDT(table_)
table_[,administration_datetime_join:=as.POSIXct(administration_datetime)]
table_[,administration_datetime_join:=as.POSIXct(administration_datetime)]
table_age[,start_date_join:=as.POSIXct(start_date)]
table_age[,end_date_join:=as.POSIXct(end_date)]
table_[table_age,on=.(id=id,administration_datetime_join>=start_date_join,administration_datetime_join<=end_date_join)][
                     ,.(id, age_band, start_date, end_date, med_name_one, med_name_two, mg_one, mg_two, administration_datetime)]

   id age_band       start_date         end_date   med_name_one med_name_two   mg_one   mg_two administration_datetime
1:  1      5_9 2020-01-01 00:08 2020-01-04 10:08   Co-amoxiclav         <NA>   411 mg     <NA>        2020-01-03 10:08
2:  1      5_9 2020-01-01 00:08 2020-01-04 10:08    doxycycline   Gentamicin   120 mg 11280 mg        2020-01-01 11:08
3:  1      5_9 2020-02-01 00:00 2020-02-11 00:00           <NA>         <NA>     <NA>     <NA>                    <NA>
4:  2    10_14 2020-01-08 10:08 2020-01-09 10:08 Co-trimoxazole         <NA>     8 mg     <NA>        2020-01-08 20:08
5:  3    15-19 2020-01-02 17:08 2020-01-03 19:08     Gentamicin         <NA> 11280 mg     <NA>        2020-01-02 19:08
6:  4    20-24 2020-01-08 16:08 2020-01-11 16:08 Co-trimoxazole         <NA>     8 mg     <NA>        2020-01-08 20:08
7:  5      5_9 2020-01-10 08:08 2019-01-30 08:08           <NA>         <NA>     <NA>     <NA>                    <NA>
8:  6    10_14 2020-01-03 09:08 2020-01-05 09:08   Piperacillin         <NA>   120 mg     <NA>        2020-01-03 09:08
Waldi
  • 39,242
  • 6
  • 30
  • 78
  • why administration_table? – GaB Apr 09 '21 at 18:47
  • table is a reserved keyword, I'll take table_ – Waldi Apr 09 '21 at 18:48
  • I got this error: Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 17 rows; more than 15 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice. – GaB Apr 09 '21 at 18:48
  • Did you run this on another dataset? – Waldi Apr 09 '21 at 18:53
  • it does not work. Also, it should not give me 17 rows, it should give me only 8 rows. As you can see in chunk code , the fifth, right after conditions - table_answer – GaB Apr 09 '21 at 18:59
  • I forgot to join the ids, see my edit. To make it work on a larger dataset, just set: `options(datatable.allow.cartesian=TRUE)`. – Waldi Apr 09 '21 at 19:02
  • as I am trying to fit your code into the wider program I've written, I struggle with it as it does not give me what I want. Is there a way to help me with tidyverse library along the lines of the second method? Or at least with tidyverse library? – GaB Apr 09 '21 at 21:28
  • @Gab, `data.table` allows non-equi joins as this is the case here, and is known to be very efficient, [see](https://stackoverflow.com/a/27718317/13513328). I don't know of an easy equivalent solution in `tidyverse`. If you elaborate on the problems you're encountering, I could perhaps help you – Waldi Apr 10 '21 at 05:59
  • thank you a lot for your solution - it did work nicely and sooo fast but on its own rather than adding this into another wider program written before with tidyverse. Yet I need to understand where the code does what it does. More specifically, I need to understand what on =. and ., does – GaB Apr 11 '21 at 18:36
  • 1
    `.()` is a shorthand for `list()` in `data.table`. The `on=` is a list describing the non-equijoin : same `id`, `administration_datetime_join>=start_date_join`, ... left are the fields of the LHS table, and right the fields of the RHS table. I hope this is clear, don't hesitate if you need more info. This [link](https://www.r-bloggers.com/2021/02/the-unequalled-joy-of-non-equi-joins/) might also help. – Waldi Apr 11 '21 at 18:51