Merging rows in R while adding conditions according to column values

Question

I am working with the following dataset : library(tidyverse) library(lubridate)

df <- data.frame(
  icustay_id = c(1, 1, 1, 2, 3),
  starttime = as.POSIXct(c("2019-09-10 13:20", "2019-09-11 13:30",  "2019-09-14 16:40", "2019-09-10 12:40", "2019-09-10 01:20")),
  endtime = as.POSIXct(c("2019-09-10 13:20", "2019-09-12 01:20", "2019-09-15 16:40", "2019-09-13 13:20", "2019-09-11 13:20")),
  vaso_rate = sample(1:10, 5, replace = TRUE),
  vaso_amount = runif(5, 0, 1000)
)

df
#       icustay_id           starttime             endtime vaso_rate vaso_amount
# 1          1 2019-09-10 13:20:00 2019-09-11 13:20:00         3    293.0896
# 2          1 2019-09-11 13:30:00 2019-09-12 01:20:00         9    602.9983
# 3          1 2019-09-14 16:40:00 2019-09-15 16:40:00         4    208.9360
# 4          2 2019-09-10 12:40:00 2019-09-13 13:20:00         2    864.1494
# 5          3 2019-09-10 01:20:00 2019-09-11 13:20:00         9    405.2939

Basically, this shows the starttime and endtime of a medication received by different patients in POSIXCT.

I am trying to build a function that will :

For every single unique patient (every unique icustay_id), merge the rows in which the medication has been stopped for less than an hour.
When the row merges : Some columns will retain the same value (i.e. the patient identifiers) Some columns must be modified :
Keep the earlier starttime
Keep the latter endttime
Average the vaso-rate
Sum the vaso-amount
Delete the durations

I am struggling with the second part, I can't seem to find the optimal way to address this conditional "merge".

To obtain something like :

df
#       icustay_id           starttime             endtime vaso_rate vaso_amount
# 1          1 2019-09-10 13:20:00 2019-09-12 01:20:00         3    293.0896
# 2          1 2019-09-14 16:40:00 2019-09-15 16:40:00         4    208.9360
# 3          2 2019-09-10 12:40:00 2019-09-13 13:20:00         2    864.1494
# 4          3 2019-09-10 01:20:00 2019-09-11 13:20:00         9    405.2939

Notice in this example how patient with icustay 1 : only the events in which the the consecutive endtime-starttime difference was < 1 hour were grouped while the third and more distant event (> 1 hour of difference from the others) was kept ungrouped.

This what I have so far. I tried to add an additional group column for patients who meet the condition above and then eventually group_by this condition.

But it does not work...

merge_pressor_doses <- function(df){
  df %>% arrange(icustay_id,starttime)
  a <- 1
  for (i in unique(df$icustay_id))
    {
    for (j in which(df$icustay_id==i) && j < max(which(df$icustay_id==i)))
      {
        df%>%mutate(group = ifelse(df$starttime[j+1]-df$endtime[j] < 60, a, 0))
      }
    }
  df%>%group_by(group) %>% 
    summarise(
      starttime = min(starttime), 
      endtime = max(endtime),
      vaso_rate = mean(vaso_rate),
      sum_vaso_amount = sum(vaso_amount))
    return(df)
}

Hi, screenshots are not a good way to share data. Here's how to create a good [minimal reproducible example](https://stackoverflow.com/a/5963610/2414988) (ie. share your data to SO to receive help faster). — Biblot, Oct 19 '19 at 15:13
I have trouble understanding the grouping condition from your example. Shouldn't you have no grouping at all, since for `icustay_id == 1`, all observation have `endtime - starttime > 1 hour` except for line 1? — Biblot, Oct 19 '19 at 16:21
Yes you are right @SamuelDiebolt, mistake on my part. I modified the example. — Eric Yamga, Oct 19 '19 at 16:34
Do you mean to summarise rows only when the difference between the `endtime` of one row and the `starttime` of an another is less than one hour? Your question and your code say two different things: your code check for `starttime[j] - starttime[j+1]`, whereas your **Notice** says to check for `endtime - starttime`, without specifying for which rows. — Biblot, Oct 19 '19 at 16:40
Yes! I mean to summarise only the consecutive rows [for the same patient id] for which the difference between start and endtime is less than an hour. The reality is that both framing of the problem are similar. My code is more appropriate has it respect the order of the indexing. — Eric Yamga, Oct 19 '19 at 16:56
Can you then edit your post to specify this condition at the top? Also, the example dataset I provided doesn't have any two consecutive rows with same `icustay_id` where `starttime[row + 1] - endtime[row] < 1 hour`, so you should also create a new sample to illustrate this condition. — Biblot, Oct 19 '19 at 16:59
Also, both framing are different. In the case of `starttime[j+1] - starttime[j] < 120`, you are checking if there's less than a two hour difference between the start of two events, but these events could last as long as you want. In the case of `starttime[j+1] - endtime[j] < 60`, you are checking if there's less than one hour between the start of an event and the end of the event just before, meaning that these events could last as long as you want, but couldn't overlap and need to be close to each other. — Biblot, Oct 19 '19 at 17:07
Yes. I changed my example df to illustrate that and modified my *notice* comment! Thanks again and sorry for the unclarity! — Eric Yamga, Oct 19 '19 at 17:07
No problem, I'm trying to better understand your problem so that I can help you find a good solution! :-) — Biblot, Oct 19 '19 at 17:08
You are right, endtime is not interchangeable here. I will change that also! — Eric Yamga, Oct 19 '19 at 17:09
I think I have a better understanding of the grouping condition now. I would create a new grouping variable that would contain IDs for each row that needs to be grouped, then use my first solution to `group_by` this variable and then summarise. Don't have the time to update right now, but I'll do it when I get back if no one comes up with a better solution! — Biblot, Oct 19 '19 at 17:15
Makes a lot of sense! Thank you. I am new to the game as you can see. I appreciate it ! — Eric Yamga, Oct 19 '19 at 18:03

Biblot · Answer 1 · 2019-10-19T16:41:04.967

Answer to original post (`endtime - starttime < 1H` added later)

What you are looking for isn't row merges, but a summary of some of your columns after grouping by patient. I constructed a toy dataset using your screenshot (please see my comment to learn how to share a sample of your data):

library(tidyverse)
library(lubridate)

df <- data.frame(
  icustay_id = c(1, 1, 1, 2, 3),
  starttime = as.POSIXct(c("2019-09-10 13:20", "2019-09-10 15:20",  "2019-09-10 16:40", "2019-09-10 12:40", "2019-09-10 01:20")),
  endtime = as.POSIXct(c("2019-10-10 13:20", "2019-12-10 01:20", "2019-09-15 16:40", "2019-09-13 13:20", "2019-09-11 13:20")),
  vaso_rate = sample(1:10, 5, replace = TRUE),
  vaso_amount = runif(5, 0, 1000)
)
df
#       icustay_id           starttime             endtime vaso_rate vaso_amount
# 1          1 2019-09-10 13:20:00 2019-10-10 13:20:00         3    293.0896
# 2          1 2019-09-10 15:20:00 2019-12-10 01:20:00         9    602.9983
# 3          1 2019-09-10 16:40:00 2019-09-15 16:40:00         4    208.9360
# 4          2 2019-09-10 12:40:00 2019-09-13 13:20:00         2    864.1494
# 5          3 2019-09-10 01:20:00 2019-09-11 13:20:00         9    405.2939

I then group by patient (icustay_id) and summarise the variables you specified using the summarise function from dplyr.

df %>%
  group_by(icustay_id) %>% 
  summarise(
    starttime       = min(starttime), 
    endtime         = max(endtime),
    avg_vaso_rate   = mean(vaso_rate),
    sum_vaso_amount = sum(vaso_amount)
  )
# A tibble: 3 x 5
#   icustay_id starttime           endtime             avg_vaso_rate sum_vaso_amount
#        <dbl> <dttm>              <dttm>                      <dbl>           <dbl>
# 1          1 2019-09-10 13:20:00 2019-12-10 01:20:00          5.33           1105.
# 2          2 2019-09-10 12:40:00 2019-09-13 13:20:00          2               864.
# 3          3 2019-09-10 01:20:00 2019-09-11 13:20:00          9               405.

Thank you @Samuel! That definitely does the trick! In fact, I had not seen as a summarise/grouping problem. Thanks! — Eric Yamga, Oct 19 '19 at 15:30
@EricYamga: Glad I could help! I should mention that all other variables will be dropped. If you want to keep them, you can add them in the `group_by`, making sure they contain unique values for each patient. Otherwise, you need to create a summary of these variables (mean, sum, etc.) — Biblot, Oct 19 '19 at 15:36
I actually just realized a little problem. The function as it is, does not help in the situation in which the start and end time are > 1 hour. What I am hoping to do is really to groupby for specifically for the situations in which the hour separating the 2 consecutive events are < 1 hour. Is it possible to restrict the grouping to a subset of rows for example ? — Eric Yamga, Oct 19 '19 at 16:00
I updated my question to illustrate what I am trying to point at! Thanks again! — Eric Yamga, Oct 19 '19 at 16:14

Merging rows in R while adding conditions according to column values

1 Answers1

Answer to original post (endtime - starttime < 1H added later)

Answer to original post (`endtime - starttime < 1H` added later)