0

I have a dataframe containing the start and end timestamps of events. I would like to create a sequence of timestamps to fill the dates between these events. Here's the structure of my data:

dat <- structure(list(event_id = 1:2,
                      start_time = structure(c(1617346800,1617348000),
                                             class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
                      end_time = structure(c(1617347400, 1617348300),
                                           class = c("POSIXct", "POSIXt"), tzone = "UTC")),
                 class = "data.frame", row.names = c(NA, -2L))

What I'm hoping to do is lengthen the data frame so there are as many rows as there are minutes between the two events. So you'd end up with something like this:

final <- structure(list(event_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2L, 2L, 2L, 2L, 2L, 
2L), start_time = structure(c(1617346800, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1617348000, 
NA, NA, NA, NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    end_time = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, 1617347400, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, 1617348300), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    record_time = c("2/4/2021 7:00", "2/4/2021 7:01", "2/4/2021 7:02", 
    "2/4/2021 7:03", "2/4/2021 7:04", "2/4/2021 7:05", "2/4/2021 7:06", 
    "2/4/2021 7:07", "2/4/2021 7:08", "2/4/2021 7:09", "2/4/2021 7:10", 
    "2/4/2021 7:11", "2/4/2021 7:12", "2/4/2021 7:13", "2/4/2021 7:14", 
    "2/4/2021 7:15", "2/4/2021 7:16", "2/4/2021 7:17", "2/4/2021 7:18", 
    "2/4/2021 7:19", "2/4/2021 7:20", "2/4/2021 7:21", "2/4/2021 7:22", 
    "2/4/2021 7:23", "2/4/2021 7:24", "2/4/2021 7:25")), class = "data.frame", row.names = c(NA, 
-26L))

So far what I've come up with is creating a dataframe of the full sequence:

timeline <- as.POSIXct(
                  seq.POSIXt(from = min(dat$start_time),
                       to = max(dat$end_time), by = "min"))

From here, I'm stuck as to how to join the two. Using a for loop I can make a list of sequences, though at that point, I'm still not sure how to get both these things together (also, I'm sure it could be done better with purrr but I'm not good with purrr yet).

event_timelines <- list()

for (row in 1:nrow(dat)) {
  event_timelines[[row]] <- seq.POSIXt(from = dat[row,]$start_time,
                                       to = dat[row,]$end_time,
                                       by = "min")
}

Thanks!

setty
  • 425
  • 3
  • 18

3 Answers3

1

Try this dplyr approach (with fuzzyjoin).

library(dplyr)

timerange <- range(unlist(dat[,c("start_time", "end_time")]))
attributes(timerange) <- attributes(dat$start_time)

tibble(record_time = seq(timerange[1], timerange[2], by = "min")) %>%
  fuzzyjoin::fuzzy_full_join(
    dat,
    by = c("record_time" = "start_time", "record_time" = "end_time"),
    match_fun = list(`>=`, `<=`)
  ) %>%
  group_by(event_id) %>%
  mutate(
    start_time = start_time[c(1, rep(NA, n()-1))],
    end_time = end_time[c(rep(NA, n()-1), 1)]
  ) %>%
  arrange(record_time) %>%
  ungroup() %>%
  print(n=99)
# # A tibble: 26 x 4
#    record_time         event_id start_time          end_time           
#    <dttm>                 <int> <dttm>              <dttm>             
#  1 2021-04-02 07:00:00        1 2021-04-02 07:00:00 NA                 
#  2 2021-04-02 07:01:00        1 NA                  NA                 
#  3 2021-04-02 07:02:00        1 NA                  NA                 
#  4 2021-04-02 07:03:00        1 NA                  NA                 
#  5 2021-04-02 07:04:00        1 NA                  NA                 
#  6 2021-04-02 07:05:00        1 NA                  NA                 
#  7 2021-04-02 07:06:00        1 NA                  NA                 
#  8 2021-04-02 07:07:00        1 NA                  NA                 
#  9 2021-04-02 07:08:00        1 NA                  NA                 
# 10 2021-04-02 07:09:00        1 NA                  NA                 
# 11 2021-04-02 07:10:00        1 NA                  2021-04-02 07:10:00
# 12 2021-04-02 07:11:00       NA NA                  NA                 
# 13 2021-04-02 07:12:00       NA NA                  NA                 
# 14 2021-04-02 07:13:00       NA NA                  NA                 
# 15 2021-04-02 07:14:00       NA NA                  NA                 
# 16 2021-04-02 07:15:00       NA NA                  NA                 
# 17 2021-04-02 07:16:00       NA NA                  NA                 
# 18 2021-04-02 07:17:00       NA NA                  NA                 
# 19 2021-04-02 07:18:00       NA NA                  NA                 
# 20 2021-04-02 07:19:00       NA NA                  NA                 
# 21 2021-04-02 07:20:00        2 2021-04-02 07:20:00 NA                 
# 22 2021-04-02 07:21:00        2 NA                  NA                 
# 23 2021-04-02 07:22:00        2 NA                  NA                 
# 24 2021-04-02 07:23:00        2 NA                  NA                 
# 25 2021-04-02 07:24:00        2 NA                  NA                 
# 26 2021-04-02 07:25:00        2 NA                  2021-04-02 07:25:00

The reason I chose to use fuzzyjoin is because POSIXt are floating-point numeric-like with some attributes, and sometimes (due to general computing challenges, see Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754), "floating-point equality" is not always guaranteed. In this case, fuzzyjoin is doing a range-based join, effectively is record_time between start_time and end_time.

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Your answer works as does another one given at about the same time. How will I decide which one is the best?! – setty Feb 05 '21 at 17:28
  • It's all relative. The solutions all have their merit (imo), and none of us appear to be starving for rep ;-), so pick the one that fits best into your flow and will be the most robust to what you expect in your data. (That is, sometimes floating-point safeguards like mine may not be necessary if you're confident in your data. \*shrug\*) – r2evans Feb 05 '21 at 17:32
  • 1
    The reason I ended up choosing this answer was because it still worked even if events overlapped. This results in duplicate timestamps, but that ends up working in my favor for plotting purposes. – setty Feb 08 '21 at 22:26
1

A way using dplyr and tidyr :

Expand the sequence using complete, fill the NA's with previous non-NA values and use replace to adjust the output to make it similar to expected output.

library(dplyr)
library(tidyr)

dat %>%
  mutate(record_time = start_time) %>%
  complete(record_time = seq(min(start_time), max(end_time), by = '1 min')) %>%
  fill(everything()) %>%
  mutate(event_id = replace(event_id, record_time >  end_time, NA),
        across(c(start_time, end_time), ~replace(., . != record_time, NA)))


#           record_time event_id          start_time            end_time
#1  2021-04-02 07:00:00        1 2021-04-02 07:00:00                <NA>
#2  2021-04-02 07:01:00        1                <NA>                <NA>
#3  2021-04-02 07:02:00        1                <NA>                <NA>
#4  2021-04-02 07:03:00        1                <NA>                <NA>
#5  2021-04-02 07:04:00        1                <NA>                <NA>
#6  2021-04-02 07:05:00        1                <NA>                <NA>
#7  2021-04-02 07:06:00        1                <NA>                <NA>
#8  2021-04-02 07:07:00        1                <NA>                <NA>
#9  2021-04-02 07:08:00        1                <NA>                <NA>
#10 2021-04-02 07:09:00        1                <NA>                <NA>
#11 2021-04-02 07:10:00        1                <NA> 2021-04-02 07:10:00
#12 2021-04-02 07:11:00       NA                <NA>                <NA>
#13 2021-04-02 07:12:00       NA                <NA>                <NA>
#14 2021-04-02 07:13:00       NA                <NA>                <NA>
#15 2021-04-02 07:14:00       NA                <NA>                <NA>
#16 2021-04-02 07:15:00       NA                <NA>                <NA>
#17 2021-04-02 07:16:00       NA                <NA>                <NA>
#18 2021-04-02 07:17:00       NA                <NA>                <NA>
#19 2021-04-02 07:18:00       NA                <NA>                <NA>
#20 2021-04-02 07:19:00       NA                <NA>                <NA>
#21 2021-04-02 07:20:00        2 2021-04-02 07:20:00                <NA>
#22 2021-04-02 07:21:00        2                <NA>                <NA>
#23 2021-04-02 07:22:00        2                <NA>                <NA>
#24 2021-04-02 07:23:00        2                <NA>                <NA>
#25 2021-04-02 07:24:00        2                <NA>                <NA>
#26 2021-04-02 07:25:00        2                <NA> 2021-04-02 07:25:00
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
0

You may stick to your nice seq.POSIX approach. Use by to follow a split-apply-combine approach. For each event id calculate the sequence. Triple the result and, using match, set additional values to NA in the old columns.

res <- do.call(rbind, by(dat, dat$event_id, function(x) {
  s1 <- s2 <- s3 <- do.call(seq, c(as.list(unname(x[-1])), "min"))
  s1[-match(x[2], s1)] <- NA
  s2[-match(x[3], s2)] <- NA
  data.frame(event_id=as.integer(x[1]), start_time=s1, end_time=s2, record_time=s3)
}))
res
#      event_id          start_time            end_time         record_time
# 1.1         1 2021-04-02 07:00:00                <NA> 2021-04-02 07:00:00
# 1.2         1                <NA>                <NA> 2021-04-02 07:01:00
# 1.3         1                <NA>                <NA> 2021-04-02 07:02:00
# 1.4         1                <NA>                <NA> 2021-04-02 07:03:00
# 1.5         1                <NA>                <NA> 2021-04-02 07:04:00
# 1.6         1                <NA>                <NA> 2021-04-02 07:05:00
# 1.7         1                <NA>                <NA> 2021-04-02 07:06:00
# 1.8         1                <NA>                <NA> 2021-04-02 07:07:00
# 1.9         1                <NA>                <NA> 2021-04-02 07:08:00
# 1.10        1                <NA>                <NA> 2021-04-02 07:09:00
# 1.11        1                <NA> 2021-04-02 07:10:00 2021-04-02 07:10:00
# 2.1         2 2021-04-02 07:20:00                <NA> 2021-04-02 07:20:00
# 2.2         2                <NA>                <NA> 2021-04-02 07:21:00
# 2.3         2                <NA>                <NA> 2021-04-02 07:22:00
# 2.4         2                <NA>                <NA> 2021-04-02 07:23:00
# 2.5         2                <NA>                <NA> 2021-04-02 07:24:00
# 2.6         2                <NA> 2021-04-02 07:25:00 2021-04-02 07:25:00
jay.sf
  • 60,139
  • 8
  • 53
  • 110