0

I'm trying to find a way of merging overlapping time intervals that can deal with milliseconds.

Three potential options have been posted here: How to flatten / merge overlapping time periods

However, I don't need to group by ID, and so am finding the dplyr and data.table methods confusing (I'm not sure whether they can deal with milliseconds, as I can't get them to work).

I have managed to get the IRanges solution working, but it converts POSIXct objects to as.numeric integers to calculate the overlaps. So, I'm assuming this is why milliseconds are absent from the output?

The lack of milliseconds doesn't seem to be a display issue, as when I subtract the resulting start and end times, I get integer results in seconds.

Here's a sample of my data:

start <- c("2019-07-15 21:32:43.565",
           "2019-07-15 21:32:43.634",
           "2019-07-15 21:32:54.301",
           "2019-07-15 21:34:08.506",
           "2019-07-15 21:34:09.957")

end <- c("2019-07-15 21:32:48.445",
         "2019-07-15 21:32:49.045",
         "2019-07-15 21:32:54.801",
         "2019-07-15 21:34:10.111",
         "2019-07-15 21:34:10.236")

df <- data.frame(start, end)

The output I get from the IRanges solution:

                start                 end
1 2019-07-15 21:32:43 2019-07-15 21:32:49
2 2019-07-15 21:32:54 2019-07-15 21:32:54
3 2019-07-15 21:34:08 2019-07-15 21:34:10

And the desired result:

                    start                     end
1 2019-07-15 21:32:43.565 2019-07-15 21:32:49.045
2 2019-07-15 21:32:54.301 2019-07-15 21:32:54.801
3 2019-07-15 21:34:08.506 2019-07-15 21:34:10.236

Suggestions would be very much appreciated!

pajul
  • 123
  • 9

2 Answers2

0

I've found it is quite easy to preserve milliseconds if you use POSIXlt format. Although there are faster ways to calculate the overlap, it's fast enough for most purposes to just loop through the data frame.

Here's a reproducible example.

start <- c("2019-07-15 21:32:43.565",
           "2019-07-15 21:32:43.634",
           "2019-07-15 21:32:54.301",
           "2019-07-15 21:34:08.506",
           "2019-07-15 21:34:09.957")

end <- c("2019-07-15 21:32:48.445",
         "2019-07-15 21:32:49.045",
         "2019-07-15 21:32:54.801",
         "2019-07-15 21:34:10.111",
         "2019-07-15 21:34:10.236")

df    <- data.frame(start = as.POSIXlt(start), end = as.POSIXlt(end))

i <- 1

df <- data.frame(start = as.POSIXlt(start), end = as.POSIXlt(end))

while(i < nrow(df))
{
  overlaps <- which(df$start < df$end[i] & df$end > df$start[i])
  if(length(overlaps) > 1)
  {
    df$end[i] <- max(df$end[overlaps])
    df <- df[-overlaps[-which(overlaps == i)], ]
    i <- i - 1
  }
  i <- i + 1
}

So now our data frame doesn't have overlaps:

df
#>                 start                 end
#> 1 2019-07-15 21:32:43 2019-07-15 21:32:49
#> 3 2019-07-15 21:32:54 2019-07-15 21:32:54
#> 4 2019-07-15 21:34:08 2019-07-15 21:34:10

Although it appears we have lost the milliseconds, this is just a display issue, as we can show by doing this:

df$end - df$start
#> Time differences in secs
#> [1] 5.48 0.50 1.73

as.numeric(df$end - df$start)
#> [1] 5.48 0.50 1.73

Created on 2020-02-20 by the reprex package (v0.3.0)

Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
0

I think the best thing to do here is to use the clock package (for a true sub-second precision date-time type) along with the ivs package (for merging overlapping intervals).

Using POSIXct for sub-second date-times can be a bit challenging for various reasons, which I've talked about here.

The key here is iv_groups(), which merges all overlapping intervals and returns the intervals that remain after all of the overlaps have been merged. It is also backed by a C implementation that is very fast.

library(clock)
library(ivs)
library(dplyr)

df <- tibble(
  start = c(
    "2019-07-15 21:32:43.565", "2019-07-15 21:32:43.634",
    "2019-07-15 21:32:54.301", "2019-07-15 21:34:08.506",
    "2019-07-15 21:34:09.957"
  ),
  end = c(
    "2019-07-15 21:32:48.445", "2019-07-15 21:32:49.045",
    "2019-07-15 21:32:54.801", "2019-07-15 21:34:10.111",
    "2019-07-15 21:34:10.236"
  )
)

# Parse into "naive time" (i.e. with a yet-to-be-defined time zone)
# using a millisecond precision
df <- df %>%
  mutate(
    start = naive_time_parse(start, format = "%Y-%m-%d %H:%M:%S", precision = "millisecond"),
    end = naive_time_parse(end, format = "%Y-%m-%d %H:%M:%S", precision = "millisecond"),
  )

df
#> # A tibble: 5 × 2
#>   start                   end                    
#>   <tp<naive><milli>>      <tp<naive><milli>>     
#> 1 2019-07-15T21:32:43.565 2019-07-15T21:32:48.445
#> 2 2019-07-15T21:32:43.634 2019-07-15T21:32:49.045
#> 3 2019-07-15T21:32:54.301 2019-07-15T21:32:54.801
#> 4 2019-07-15T21:34:08.506 2019-07-15T21:34:10.111
#> 5 2019-07-15T21:34:09.957 2019-07-15T21:34:10.236

# Now combine these start/end boundaries into a single interval vector
df <- df %>%
  mutate(interval = iv(start, end), .keep = "unused")

df
#> # A tibble: 5 × 1
#>                                             interval
#>                               <iv<tp<naive><milli>>>
#> 1 [2019-07-15T21:32:43.565, 2019-07-15T21:32:48.445)
#> 2 [2019-07-15T21:32:43.634, 2019-07-15T21:32:49.045)
#> 3 [2019-07-15T21:32:54.301, 2019-07-15T21:32:54.801)
#> 4 [2019-07-15T21:34:08.506, 2019-07-15T21:34:10.111)
#> 5 [2019-07-15T21:34:09.957, 2019-07-15T21:34:10.236)

# And use `iv_groups()` to merge all overlapping intervals.
# It returns the remaining intervals after all overlaps have been removed.
df %>%
  summarise(interval = iv_groups(interval))
#> # A tibble: 3 × 1
#>                                             interval
#>                               <iv<tp<naive><milli>>>
#> 1 [2019-07-15T21:32:43.565, 2019-07-15T21:32:49.045)
#> 2 [2019-07-15T21:32:54.301, 2019-07-15T21:32:54.801)
#> 3 [2019-07-15T21:34:08.506, 2019-07-15T21:34:10.236)

Created on 2022-04-05 by the reprex package (v2.0.1)

Davis Vaughan
  • 2,780
  • 9
  • 19