I'm trying to fill in a time sequence between each start time and end time. Here is the dataset.

structure(list(type = structure(c(3L, 2L, 3L, 2L, 3L, 2L, 1L), .Label = c("", "end_time", "start_time"), class = "factor"), col1 = structure(c(2L, 3L, 3L, 4L, 4L, 5L, 1L), .Label = c("", "2019-07-07 21:53:00", "2019-07-07 21:53:40", "2019-07-07 21:54:40", "2019-07-07 21:56:00" ), class = "factor"), value = structure(c(2L, 2L, 3L, 3L, 4L, 4L, 1L), .Label = c("", "VALUE_1", "VALUE_2", "VALUE_3"), class = "factor")), class = "data.frame", row.names = c(NA, -7L))

        type                col1   value
1 start_time 2019-07-07 21:53:00 VALUE_1
2   end_time 2019-07-07 21:53:40 VALUE_1
3 start_time 2019-07-07 21:53:40 VALUE_2
4   end_time 2019-07-07 21:54:40 VALUE_2
5 start_time 2019-07-07 21:54:40 VALUE_3
6   end_time 2019-07-07 21:56:00 VALUE_3

After choosing an interval, I want to generate the filled sequence between each start and end time. For example, with a 20-second interval, this is my expected table.

col1                 value
2019-07-07 21:53:00 VALUE_1
2019-07-07 21:53:20 VALUE_1
2019-07-07 21:53:40 VALUE_1
2019-07-07 21:53:40 VALUE_2
2019-07-07 21:54:00 VALUE_2
2019-07-07 21:54:20 VALUE_2
2019-07-07 21:54:40 VALUE_2
2019-07-07 21:54:40 VALUE_3
2019-07-07 21:55:00 VALUE_3
2019-07-07 21:55:20 VALUE_3
2019-07-07 21:55:40 VALUE_3
2019-07-07 21:56:00 VALUE_3

How should I make the table using dplyr?

jhyeon

2 Answers

I think the "close-as-dupe" might win, but I think the solutions there are a little off in that they don't address the grouping. (Note that I adjusted the data, because (1) the times aren't POSIXt, and (2) there's an empty bottom row.)

Here's a solution (that starts with the assumption that you have no other columns to preserve, just the start/end times and the value):

library(dplyr)
x %>%
  group_by(value) %>%
  do( data.frame(time = do.call(seq.POSIXt, c(as.list(sort(range(.$col1))), by = 20))) ) %>%
  ungroup()
# # A tibble: 12 x 2
#    value   time                   
#    <fct>   <dttm>                 
#  1 VALUE_1 2019-07-07 21:53:00.000
#  2 VALUE_1 2019-07-07 21:53:20.000
#  3 VALUE_1 2019-07-07 21:53:40.000
#  4 VALUE_2 2019-07-07 21:53:40.000
#  5 VALUE_2 2019-07-07 21:54:00.000
#  6 VALUE_2 2019-07-07 21:54:20.000
#  7 VALUE_2 2019-07-07 21:54:40.000
#  8 VALUE_3 2019-07-07 21:54:40.000
#  9 VALUE_3 2019-07-07 21:55:00.000
# 10 VALUE_3 2019-07-07 21:55:20.000
# 11 VALUE_3 2019-07-07 21:55:40.000
# 12 VALUE_3 2019-07-07 21:56:00.000

Unraveling just a little:

  • group_by runs the do block once per group: once for the VALUE_1 times, once for the VALUE_2 times, etc;
  • sort(range(.$col1)) just finds the two outer times (which happen to be start/end times, but I'm ignoring that for this solution) ... sorting it is just defensive, as I don't know if you assume and can guarantee that the order is always correct;
  • c(...) combines all arguments into a list (the two times returned by range(), plus a third argument by=20 for 20-second spacing), and do.call passes this list of arguments to seq.POSIXt ... For example, if the range returns time1 and time2, then the whole do.call(seq.POSIXt, c(...)) is analogous to seq.POSIXt(time1, time2, by=20).
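
To make the do.call step concrete, here is a small base-R sketch. The times vector is made up for illustration (it mirrors the VALUE_1 group, deliberately out of order); it is not part of the original data.

```r
# Hypothetical endpoints mirroring the VALUE_1 group, given in the "wrong" order.
times <- as.POSIXct(c("2019-07-07 21:53:40", "2019-07-07 21:53:00"), tz = "UTC")

# sort(range(...)) yields the two outer times in ascending order;
# as.list() keeps each element a POSIXct, and c(..., by = 20) appends the named step.
args <- c(as.list(sort(range(times))), by = 20)

via_do_call <- do.call(seq.POSIXt, args)
direct      <- seq.POSIXt(times[2], times[1], by = 20)

all(via_do_call == direct)                      # TRUE
format(via_do_call, "%H:%M:%S", tz = "UTC")     # "21:53:00" "21:53:20" "21:53:40"
```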

In dplyr's grouping, the do block effectively replaces the incoming frame with what we generate inside the do block, which is a frame of "some length" (based on the start/end times and the by= interval). Because grouping always preserves the grouping variable, value persists and is repeated across all returned rows.
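
As a side note, do() is marked as superseded in current dplyr documentation. If you have dplyr 1.1.0 or later (an assumption on my part, it did not exist when this was written), reframe() should express the same "replace each group with a frame of some length" idea. A minimal sketch on a cut-down, made-up frame named demo:

```r
library(dplyr)  # reframe() assumes dplyr >= 1.1.0

# Hypothetical two-group subset of the question's data, already as POSIXct.
demo <- data.frame(
  value = rep(c("VALUE_1", "VALUE_2"), each = 2),
  col1 = as.POSIXct(
    c("2019-07-07 21:53:00", "2019-07-07 21:53:40",
      "2019-07-07 21:53:40", "2019-07-07 21:54:40"),
    tz = "UTC"
  )
)

# One sequence per group; the grouping column is carried along automatically.
filled <- demo %>%
  group_by(value) %>%
  reframe(time = seq(min(col1), max(col1), by = 20))
```

Here min()/max() play the role of sort(range(...)), and reframe() returns an ungrouped tibble, so no ungroup() is needed.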

Alternatively, a data.table solution:

library(data.table)
xDT <- copy(x)
setDT(xDT)
xDT[, .(time = do.call(seq.POSIXt, c(as.list(sort(range(col1))), by = 20))), by = "value"]

Data:

x <- structure(list(type = structure(c(3L, 2L, 3L, 2L, 3L, 2L, 1L), .Label = c("", "end_time", "start_time"), class = "factor"), col1 = structure(c(2L, 3L, 3L, 4L, 4L, 5L, 1L), .Label = c("", "2019-07-07 21:53:00", "2019-07-07 21:53:40", "2019-07-07 21:54:40", "2019-07-07 21:56:00" ), class = "factor"), value = structure(c(2L, 2L, 3L, 3L, 4L, 4L, 1L), .Label = c("", "VALUE_1", "VALUE_2", "VALUE_3"), class = "factor")), class = "data.frame", row.names = c(NA, -7L))
x$col1 <- as.POSIXct(x$col1, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
x <- x[!is.na(x$col1),]
r2evans
  • An alternative to the `do` construct is to use `tidyr::complete`: `df %>% mutate(col1 = as.POSIXct(col1, format = "%Y-%m-%d %H:%M:%S")) %>% group_by(value) %>% complete(col1 = seq(col1[1], col1[2], by = 20))` – Maurits Evers Jul 29 '19 at 00:21
  • Thanks @MauritsEvers, and that's even one of the answers in the other ... I am not as familiar with `complete`, I'll try to absorb it into my habits. Thanks! (The fact that I don't know it is why I did not recognize that it was applicable here.) – r2evans Jul 29 '19 at 00:22
  • Ah yes, I saw that too late:-) – Maurits Evers Jul 29 '19 at 00:24

This uses @r2evans's dataset - the original has an NA value which messes things up.

library(dplyr)
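library(tidyr)
# SIMPLIFY = FALSE always returns a list of POSIXct vectors,
# even when every sequence happens to have the same length
vec_seq <- Vectorize(seq.POSIXt, vectorize.args = c("from", "to"), SIMPLIFY = FALSE)

x %>%
  spread(key = 'type', value = 'col1') %>%
  transmute(value, col1 = vec_seq(start_time, end_time, 20)) %>%
  unnest(col1)  # name the list-column; a bare unnest() warns in tidyr >= 1.0.0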
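
A note on Vectorize(): its SIMPLIFY argument defaults to TRUE, which only stays out of the way when the generated sequences differ in length (as they do in this dataset). The base-R sketch below uses made-up start/end times of equal span to show the difference; the names simplifying and keeping are just for illustration.

```r
# Hypothetical ranges that both span 40 seconds, so both sequences have length 3.
starts <- as.POSIXct(c("2019-07-07 21:53:00", "2019-07-07 21:54:00"), tz = "UTC")
ends   <- starts + 40

simplifying <- Vectorize(seq.POSIXt, vectorize.args = c("from", "to"))
keeping     <- Vectorize(seq.POSIXt, vectorize.args = c("from", "to"), SIMPLIFY = FALSE)

m <- simplifying(starts, ends, 20)  # collapsed into a plain numeric matrix
l <- keeping(starts, ends, 20)      # list of two POSIXct vectors

is.matrix(m)                  # TRUE
inherits(m, "POSIXct")        # FALSE: the date-time class is lost
inherits(l[[1]], "POSIXct")   # TRUE
```

With SIMPLIFY = FALSE the result is always a list-column, which is what unnest() expects.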

Data, again from @r2evans:

x <- structure(list(type = structure(c(3L, 2L, 3L, 2L, 3L, 2L, 1L), .Label = c("", "end_time", "start_time"), class = "factor"), col1 = structure(c(2L, 3L, 3L, 4L, 4L, 5L, 1L), .Label = c("", "2019-07-07 21:53:00", "2019-07-07 21:53:40", "2019-07-07 21:54:40", "2019-07-07 21:56:00" ), class = "factor"), value = structure(c(2L, 2L, 3L, 3L, 4L, 4L, 1L), .Label = c("", "VALUE_1", "VALUE_2", "VALUE_3"), class = "factor")), class = "data.frame", row.names = c(NA, -7L))
x$col1 <- as.POSIXct(x$col1, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
x <- x[!is.na(x$col1),]
Cole