I would like to use dplyr to add a "delta" column to a dataset. The delta would be computed as the difference between the current row's value and the value from a previous row. The challenge is that the immediately preceding row is not necessarily the right one, because some filtering is needed.
Consider this dataset:
LEVEL, TIME
3, 0000
2, 0010
2, 0020
1, 0030
2, 0040
3, 0050
I want to add a new column, DELTA, containing the difference between the TIME value and the previous TIME value for a row with the same or greater LEVEL. That is, instead of comparing with the immediately preceding row, I would like to search backwards and skip over any rows with a lower LEVEL.
For this example the expected output would be:
LEVEL, TIME, DELTA
3, 0000, NA
2, 0010, 10
2, 0020, 10
1, 0030, 10
2, 0040, 20
3, 0050, 50
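For reference, a naive base R implementation that produces exactly this output by scanning backwards row by row (`delta_naive` is just an illustrative name; it is O(n²), so it is only usable for checking candidate solutions against small examples):

```r
# For each row, scan backwards for the most recent row whose
# LEVEL is >= the current row's LEVEL; NA if there is none.
delta_naive <- function(level, time) {
  n <- length(level)
  delta <- rep(NA_real_, n)
  for (i in seq_len(n)) {
    prev <- which(level[seq_len(i - 1L)] >= level[i])
    if (length(prev)) delta[i] <- time[i] - time[max(prev)]
  }
  delta
}

print(delta_naive(c(3, 2, 2, 1, 2, 3), c(0, 10, 20, 30, 40, 50)))
# NA 10 10 10 20 50
```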
Can this be done straightforwardly with dplyr? (Or otherwise?)
I would like an efficient solution because my real dataset is approximately one billion rows and has seven timestamp columns (but only one level).
(Background: The data comes from a software application's log file, using the many time sources available from the CPU, e.g. cycles, instructions, and L1/L2/L3/DRAM access counters. I want to measure the elapsed time between events. The messages with lower levels are not separate preceding events but rather finer-grained details.)
EDIT WITH NEW INFORMATION:
None of the solutions I have tried with dplyr actually work with my million-element data set. They seem to be slow and to blow up the R process.
I have fallen back to learning some base R and writing a reasonably practical (~2 seconds for a 1M-row data frame) implementation like this:
level <- c(3, 2, 2, 1, 2, 3, 6, 4, 7, 8, 2)  # recycled to 1M elements, below
time <- seq(0, 10000000, 10)

# Reference-timestamp accumulator, updated inside the closure:
# index is log level, value is the reference timestamp for the delta.
ref <- numeric(9)

f <- function(level, time) {
  delta <- time - ref[level]   # time since the last row at this level or higher
  ref[1:level] <<- time        # this row becomes the reference for all levels <= its own
  delta
}

delta <- mapply(f, level, time)
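As a sanity check, the closure restated compactly (so this snippet runs standalone) reproduces the expected output on the six-row example, except that the first row gets 0 rather than NA, since ref starts at zero:

```r
ref <- numeric(9)  # per-level reference timestamps, all start at 0
f <- function(level, time) {
  delta <- time - ref[level]   # time since the last row at this level or higher
  ref[1:level] <<- time        # this row is now the reference for levels <= its own
  delta
}

delta <- mapply(f, c(3, 2, 2, 1, 2, 3), c(0, 10, 20, 30, 40, 50))
print(delta)
# 0 10 10 10 20 50  (first element is 0, not NA, because ref starts at zero)
```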
Is this reasonable? Is there a comparable dplyr solution?
I am basically satisfied. I do feel like this should be ~10x faster; ~5000 CPU cycles per vector element seems a bit excessive. But it works for me, and it is perhaps reasonable in the context of an interpreter that copies the ref accumulator on each step.
EDIT2: On reflection, the performance of this formulation is a bit of a drag. I would like a 10x speedup if possible!
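One way to vectorize this, sketched here under the assumption that levels are small positive integers (at most 9, as in the ref accumulator above; `delta_fast` is just an illustrative name): make one pass per level k, and for all rows at level k use findInterval() to locate the most recent earlier row whose level is >= k. This replaces the per-row interpreted closure with a handful of whole-vector operations per level, which should be substantially faster:

```r
# One pass per level k: rows at level k take their reference timestamp
# from the most recent earlier row whose level is >= k.
delta_fast <- function(level, time, max_level = 9L) {
  n <- length(level)
  delta <- rep(NA_real_, n)
  for (k in seq_len(max_level)) {
    at_k <- which(level == k)     # rows whose delta we fill on this pass
    if (length(at_k) == 0L) next
    ge_k <- which(level >= k)     # candidate reference rows (already sorted)
    # Index into ge_k of the last candidate strictly before each at_k row;
    # findInterval returns 0 when there is no such row, leaving NA.
    j <- findInterval(at_k - 1L, ge_k)
    ok <- j >= 1L
    delta[at_k[ok]] <- time[at_k[ok]] - time[ge_k[j[ok]]]
  }
  delta
}

print(delta_fast(c(3, 2, 2, 1, 2, 3), c(0, 10, 20, 30, 40, 50)))
# NA 10 10 10 20 50
```

Note that a row at level k is itself a member of ge_k, but searching at position at_k - 1 excludes it, so each row only ever references strictly earlier rows.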