0

This is a timeline (from timestamp 1 to 66) for commands (ON or OFF) that control the state of a machine (ONLNIE or OFFLINE).

Timeline  ------------|-------------|--------|------|---------|------------>

Command               ON            OFF      OFF    ON        OFF

Note the second OFF command was useless.

Accordingly, the machine state should be:

State     ---OFFLINE--|----ONLINE---|----OFFLINE----|--ONLINE-|--OFFLINE----

Note that the default state is OFFLINE.

I have the following data for 100,000s of machines:

Machine Timestamp    Command
======= =========    =======
1       13           ON
1       27           OFF 
1       36           OFF
1       43           ON
1       53           OFF
...
n       ...          ...

I want to create the following table:

Machine   From    To    State
=======   ====    ==    =====
1         1       13    OFFLINE 
1         13      27    ONLINE
1         27      43    OFFLINE
1         43      53    ONLINE
1         53      66    OFFLINE
...
n         ...     ...   ...

(From is inclusive, and To is exclusive, thanks to @MatBailie)

What is an efficient way to do that in SQL (DB2) or R?

Orion
  • 1,104
  • 3
  • 16
  • 40
  • Don't use inclusive boundaries. Imagine if at some point it becomes important to move to milliseconds instead of seconds? Now your data has gaps in it *(there was no state from 12.001 to 12.999)*. Instead, use inclusive start and exclusive end: `0 to 13 = OFF, 13 to 27 = ON, 27 to 43 = OFF, etc, etc`. – MatBailie May 16 '18 at 13:13
  • @TimBiegeleisen I didn't downvote, I didn't vote at all, I simply made a comment that I thought would be helpful advice to the OP, but isn't an answer to the question. – MatBailie May 16 '18 at 13:15
  • @MatBailie I appreciate your very good advice! – Orion May 16 '18 at 13:17
  • I suggest you look at the tag [gaps-and-islands](https://stackoverflow.com/questions/tagged/gaps-and-islands). That is a common pattern in SQL. – Stavr00 May 16 '18 at 13:33
  • Try the DB2 SQL in this answer.. you should be able to adapt it to your DDL https://stackoverflow.com/questions/50290619/merge-consecutive-duplicate-records-including-time-range/50298205#50298205 – Paul Vernon May 16 '18 at 13:57

3 Answers3

1

If your data is in a data.table object in R you can do this

Example Data

library(data.table)
dt <- fread("
Timestamp    Command
13           ON
27           OFF 
36           OFF
43           ON
53           OFF
")
dt <- rbind(dt, dt)
dt[, Machine := rep(1:2, each = 5)][]

#     Timestamp Command Machine
#  1:        13      ON       1
#  2:        27     OFF       1
#  3:        36     OFF       1
#  4:        43      ON       1
#  5:        53     OFF       1
#  6:        13      ON       2
#  7:        27     OFF       2
#  8:        36     OFF       2
#  9:        43      ON       2
# 10:        53     OFF       2

Output

library(magrittr)

dt[,  .(From = Timestamp 
      , To   = shift(Timestamp, 1, type = 'lead') - 1
      , Command)
   , by = Machine] %>% 
  .[, .(From  = first(From)
      , To    = last(To)
      , State = first(Command))
    , by = .(Machine, rleid(Command))] %>% 
  .[, .(From  = c(1, From)
      , To    = c(From[1] - 1, To)
      , State = c(ifelse(State[1] == 'ON', 'OFF', 'ON'), State))
    , by = Machine]


#     Machine From To State
#  1:       1    1 12   OFF
#  2:       1   13 26    ON
#  3:       1   27 42   OFF
#  4:       1   43 52    ON
#  5:       1   53 NA   OFF
#  6:       2    1 12   OFF
#  7:       2   13 26    ON
#  8:       2   27 42   OFF
#  9:       2   43 52    ON
# 10:       2   53 NA   OFF

Benchmark show it takes ~200-300 milliseconds for this example with 1 million rows. Did this on an HP Laptop.

n = 2
duration = 66

dt <- data.table(Machine = rep(1:n, each = 5),
    Timestamp  = c(c(13,27,36,43,53), c(1,13,27,36,66)),
    Command = c(c(1, 0, 1, 1, 0), c(0, 0, 1, 1, 1)))


dt <- rbindlist(replicate(1e5, dt, simplify = F))
nrow(dt) == 1e6
# TRUE

microbenchmark(useDT(dt))
# Unit: milliseconds
#       expr      min       lq     mean   median       uq      max neval
#  useDT(dt) 159.6124 171.5623 265.3555 186.8661 232.0942 1157.086   100
IceCreamToucan
  • 28,083
  • 2
  • 22
  • 38
1

Base R solution, OFF/ON are taken as 0/1. It takes ~ 5 sec when n = 500000.

n = 2
duration = 66

# sample data    
df_base = data.frame(Machine = rep(1:n, each = 5),
    Timestamp  = c(c(13,27,36,43,53), c(1,13,27,36,66)),
    Command = c(c(1, 0, 1, 1, 0), c(0, 0, 1, 1, 1)))

# auxiliary data frames
df_start = data.frame(Machine = 1:n,
                    Timestamp = rep(1, n),
                    Command = 0)

df_end = data.frame(Machine = 1:n,
    Timestamp = rep(duration, n),
    Command = NA)

df = rbind.data.frame(df_start, df_base, df_end)
df = df[order(df$Machine, df$Timestamp),]

# remove useless commands
df = df[c(TRUE, diff(df$Command) != 0 | is.na(diff(df$Command))),]

# create to and from columns
output = data.frame(head(df, -1), To = tail(df$Timestamp, -1))

# remove rows where from and to refers to different machines
output = output[!is.na(output$Command),]

# reformat output
output = output[,c("Machine", "Timestamp", "To", "Command")]
colnames(output)[2] = "From"

# deal with rows where state is changed in time 1 or time = duration
output = output[!output$From == output$To,]

head(output, 10)

#   Machine From To Command
#1        1    1 13       0
#3        1   13 27       1
#4        1   27 36       0
#5        1   36 53       1
#7        1   53 66       0
#2        2    1 27       0
#10       2   27 66       1
Ape
  • 1,159
  • 6
  • 11
1

A SQL Answer...

http://sqlfiddle.com/#!18/5606d/2 # QUERY syntax should be the same in DB2

SELECT
  machine_id,
  MAX(       command)    AS state,
  MIN(     timestamp)    AS start_timestamp,
  LEAD(MIN(timestamp))
    OVER (PARTITION BY machine_id
              ORDER BY GroupID
         )
                         AS cease_timestamp  
FROM
(
  SELECT
    *,
    SUM(CASE WHEN prev_command = command THEN 0 ELSE 1 END)
      OVER (PARTITION BY machine_id
                ORDER BY timestamp
           )
        AS GroupID
  FROM
  (
    SELECT
      *,
      LAG(command)
        OVER (PARTITION BY machine_id
                  ORDER BY timestamp
             )
               AS prev_command
    FROM
    (
      SELECT machine_id, timestamp, command FROM machine_events
      UNION ALL
      SELECT machine_id,         1,   'OFF' FROM machine_events GROUP BY machine_id HAVING MIN(timestamp) > 1
    )
      expanded_events
  )
    look_back
)
  grouped
GROUP BY
  machine_id,
  GroupID
ORDER BY
  machine_id,
  GroupID
MatBailie
  • 83,401
  • 18
  • 103
  • 137