1

I have a table (dt1) of fairly regular time points, like this:

      DATE        TIME    READING
2022-02-02    11:50:23     123.34
2022-02-02    11:50:59     125.66
2022-02-02    11:51:16     159.23
2022-02-02    11:52:34     234.22

etc... Note that the interval between time points can be very irregular.

I have another table (dt2) specifying the start and end times of treatments:

      DATE    RED_START     RED_END    GREEN_START    GREEN_END
2022-02-02     11:50:30    11:51:12       11:52:12     11:53:17
2022-02-02     11:54:10    11:55:09       11:56:30     11:57:15

It specifies when the treatment was "RED" and when it was "GREEN". Note that there are certain time periods that are neither "RED" or "GREEN".

I'm trying to add a column to the first table to say what the treatment was at the time point given for each row. The desired output would look like this:

      DATE        TIME    READING   TREATMENT
2022-02-02    11:50:23     123.34        NULL
2022-02-02    11:50:59     125.66         RED
2022-02-02    11:51:16     159.23        NULL
2022-02-02    11:52:34     234.22       GREEN

I've tried using a non-equi join between the two tables to generate helper columns. e.g.

dt1[, RED := c("TRUE","FALSE")[
    dt2[.SD, on=.(DATE, RED_START<=TIME, RED_END>=TIME),
               by=.EACHI]
]]

But I've not got the hang of the syntax and it's not working. I'm not sure if I'm stating the multiple conditions incorrectly, or if the non-equi join requires additional info. I would prefer answers using data.table if possible, to keep it consistent with the rest of my code.

rw2
  • 1,549
  • 1
  • 11
  • 20
  • Can your `START` and `END` times span midnight? e.g. could you have `RED_START` on `2022-02-02` at `23:59:00` and `RED_END` on `2022-02-03` at `00:00:01`? If not you can do a join on `DATE` and a rolling join on `TIME`, but if so you probably need to use a datetime class like `POSIXct`, which will be slower. – SamR Mar 21 '23 at 15:40
  • It seems odd that you have separate `DATE` and `TIME` fields due to the issue SamR highlighted. Is there a particular reason you don't have a single `POSIXt` field for each? From a data-management perspective, keeping it in one field (for these purposes) would be _far_ simpler. – r2evans Mar 21 '23 at 16:08
  • That's just how the data comes. I can easily combined them into a single datetime column. Time spans don't generally span midnight, the machine is off at night – rw2 Mar 21 '23 at 17:23

1 Answers1

2

This is a case for data.table::foverlaps().

However, your data in an odd format, particularly dt2.

Desired format

You need dt1 to have a combined datetime column, and a dummy end time (to create a window to overlap with dt2).

#          DATE     TIME READING            datetime               dummy
#        <IDat>  <ITime>   <num>              <POSc>              <POSc>
# 1: 2022-02-02 11:50:23  123.34 2022-02-02 11:50:23 2022-02-02 11:50:23
# 2: 2022-02-02 11:50:59  125.66 2022-02-02 11:50:59 2022-02-02 11:50:59
# 3: 2022-02-02 11:51:16  159.23 2022-02-02 11:51:16 2022-02-02 11:51:16
# 4: 2022-02-02 11:52:34  234.22 2022-02-02 11:52:34 2022-02-02 11:52:34

dt2 is in a bizarre half-long, half-wide format. It should look like this:

#     color               start                 end
#    <char>              <POSc>              <POSc>
# 1:    red 2022-02-02 11:50:30 2022-02-02 11:51:12
# 2:  green 2022-02-02 11:52:12 2022-02-02 11:53:17
# 3:    red 2022-02-02 11:54:10 2022-02-02 11:55:09
# 4:  green 2022-02-02 11:56:30 2022-02-02 11:57:15

Getting the data into shape

I love data.table, but this is one of those data wrangling cases where the tidyverse would be easier. This works but it's not beautiful:

as_datetime <- function(x, y) {
    as.POSIXct(paste(x, y))
}

dt1[, datetime := as_datetime(DATE, TIME)][
    ,
    dummy := datetime
]

time_cols <- grep("START|END", names(dt2), value = TRUE)
dt2 <- dt2[, (tolower(time_cols)) := lapply(
    .SD,
    \(y) as_datetime(DATE, y)
), .SDcols = time_cols] |>
    melt(
        id.vars = "DATE",
        measure.vars = patterns("[a-z]_[a-z]")
    )
dt2[
    ,
    c("color", "period") := tstrsplit(variable, "_")
]

dt2 <- dcast(dt2, color ~ period, fun.aggregate = list)[, .(
    color = rep(color, times = lengths(end)),
    start = as.POSIXct(unlist(start), origin = "1970-01-01"),
    end = as.POSIXct(unlist(end), origin = "1970-01-01")
)]

The join

Once the data is in the right format, it's simply a case of:

setkey(dt2, start, end)
foverlaps(dt1, dt2, by.x = c("datetime", "dummy"))[
    ,
    .(datetime, READING, start, end, color)
]

#               datetime READING               start                 end  color
#                 <POSc>   <num>              <POSc>              <POSc> <char>
# 1: 2022-02-02 11:50:23  123.34                <NA>                <NA>   <NA>
# 2: 2022-02-02 11:50:59  125.66 2022-02-02 11:50:30 2022-02-02 11:51:12    red
# 3: 2022-02-02 11:51:16  159.23                <NA>                <NA>   <NA>
# 4: 2022-02-02 11:52:34  234.22 2022-02-02 11:52:12 2022-02-02 11:53:17  green
SamR
  • 8,826
  • 3
  • 11
  • 33
  • Thanks, the join just did what I want. By the way I was able to wrangle the data into the correct format much more simply just by splitting dt2 into a RED table (Date, RED_START, RED_END) and a GREEN table (Date, GREEN_START, GREEN_END), renaming the start and end columns to be consistent and then using rbind to stick them back together. – rw2 Mar 22 '23 at 12:16
  • @rw2 nice! I thought about doing that but I wasn't sure what your real requirements were - this code should work if you have fifty colors. But if you don't then your way is much simpler! – SamR Mar 22 '23 at 12:17
  • 1
    The next data.table version (1.14.9) will have a `measure()` function to use in `melt()`, which should make the data transformation of `dt2` a lot easier aswell. Then you can use `melt(dt2, measure.vars = measure(color, value.name, sep = "_"))`, to get the data in the right format. You still need to combine date and start/end time together, but everything else should be fine. – Gilean0709 Mar 22 '23 at 14:10
  • @Gilean0709 that sounds very useful - do you have a link to any more info about this? I googled it but I couldn't find anything. – SamR Mar 22 '23 at 14:44
  • 1
    You can read up on it in the reshape vignette (https://rdatatable.gitlab.io/data.table/articles/datatable-reshape.html) or the reference site for `measure()` (https://rdatatable.gitlab.io/data.table/reference/measure.html). There are also some issues on the data.table github, which were useful to understand the idea behind it and how to use some of its arguments. – Gilean0709 Mar 22 '23 at 15:05
  • 1
    @SamR Thanks, that solution does look useful for multiple colours. – rw2 Mar 22 '23 at 17:18
  • @Gilean0709 Thanks for the info, I'll check out `measure` and that vignette – rw2 Mar 22 '23 at 17:19