
I would like to use dplyr to add a "delta" column to a dataset. The delta would be computed as the difference between the current row's value and the value from a previous row. The challenge is that the immediately preceding row is not necessarily the right one, because some filtering is needed.

Consider this dataset:

LEVEL, TIME
3,     0000
2,     0010
2,     0020
1,     0030
2,     0040
3,     0050

I want to add a new column, DELTA, containing the difference between the TIME value and the previous TIME value for a row with the same or a greater LEVEL. That is, instead of comparing with the immediately preceding row, I would like to search backwards and skip over any rows with a lower LEVEL (a naive sketch of this rule follows the expected output below).

For this example the expected output would be:

LEVEL, TIME, DELTA
3,     0000, NA
2,     0010, 10
2,     0020, 10
1,     0030, 10
2,     0040, 20
3,     0050, 50
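
To make the rule concrete, here is a naive quadratic sketch of it in base R (illustration only, not meant to scale):

LEVEL <- c(3, 2, 2, 1, 2, 3)
TIME  <- c(0, 10, 20, 30, 40, 50)

# For each row, look backwards for the nearest row at the same or a
# greater LEVEL; DELTA is the TIME difference, or NA if no such row exists.
DELTA <- sapply(seq_along(TIME), function(i) {
  j <- which(LEVEL[seq_len(i - 1)] >= LEVEL[i])
  if (length(j) == 0) NA_real_ else TIME[i] - TIME[max(j)]
})
DELTA
# NA 10 10 10 20 50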

Can this be done straightforwardly with dplyr? (Or otherwise?)

I would like an efficient solution because my real dataset is approximately one billion rows and has seven timestamp columns (but only one level).

(Background: The data is from a software application's log file, using the many time sources available from the CPU, e.g. cycles, instructions, and L1/L2/L3/DRAM access counters. I want to measure the elapsed time between events. The messages with lower levels are not separate preceding events but rather finer-grained details.)

EDIT WITH NEW INFORMATION:

None of the solutions I have tried with dplyr actually work with my million-element data set. They seem to be slow and to blow up the R process.

I have fallen back to learning some base R and writing a reasonably practical implementation (~2 seconds for a 1M-row data frame) like this:

time <- seq(0, 10000000, 10)
level <- rep_len(c(3,2,2,1,2,3,6,4,7,8,2), length(time))  # pattern recycled to ~1M elements

# Reference timestamp accumulator, updated inside the closure:
# the index is the log level and the value is the reference timestamp for the delta.
ref <- numeric(9)
f <- function(level, time) {
  delta <- time - ref[level]  # delta vs. the most recent row at the same or a greater level
  ref[1:level] <<- time       # this row becomes the reference for all levels up to its own
  delta
}

delta <- mapply(f, level, time)

Is this reasonable? Is there a comparable dplyr solution?

I am basically satisfied. I do feel like this should be ~10x faster (~5000 CPU cycles per vector element seems a bit insane), but it works for me and is perhaps reasonable in the context of an interpreter that is copying the ref accumulator on each step.

EDIT2: On reflection, the performance of this formulation is a bit of a drag. I would like a 10x speedup if possible!
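
A compiled version of the same accumulator loop seems like the natural route to that; here is a minimal sketch using Rcpp (the name delta_cpp is mine, and I have not benchmarked this at scale):

library(Rcpp)
cppFunction('
NumericVector delta_cpp(NumericVector level, NumericVector time) {
  int n = time.size();
  NumericVector delta(n);
  std::vector<double> ref(9, 0.0);  // same accumulator: index = level, value = timestamp
  for (int i = 0; i < n; ++i) {
    int l = (int) level[i];
    delta[i] = time[i] - ref[l - 1];               // delta vs. last same-or-higher-level row
    for (int j = 0; j < l; ++j) ref[j] = time[i];  // this row is now the reference for levels <= l
  }
  return delta;
}')

delta <- delta_cpp(level, time)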


1 Answer


I join the data.frame to itself, select all rows that meet your criteria, then pick the closest matching row. To get the same number of rows in the result (with NA in the first row), I join the base data.frame back on at the end (right_join).

library(dplyr)

LEVEL <- c(3,2,2,1,2,3)
TIME <- c('0000','0010','0020','0030','0040','0050')

df <- data.frame(LEVEL, TIME, stringsAsFactors = F)

df %>%
  merge(df, by = NULL, all = T) %>%                 # cross join of df with itself
  filter(LEVEL.y >= LEVEL.x & TIME.x > TIME.y) %>%  # earlier rows at the same or a greater level
  group_by(TIME.x, LEVEL.x) %>%
  filter(row_number(desc(TIME.y)) == 1) %>%         # keep only the most recent matching row
  mutate(delta = as.numeric(TIME.x) - as.numeric(TIME.y)) %>%
  rename(LEVEL = LEVEL.x, TIME = TIME.x) %>%
  select(TIME, LEVEL, delta) %>%
  right_join(df)                                    # restore unmatched rows (delta = NA)

Another approach would be to calculate the min(delta) for every group, instead of ordering and selecting the first row. I prefer the solution above, because you can then use the other information from the matching row as well.

df %>%
  merge(df, by = NULL, all = T) %>%
  filter(LEVEL.y >= LEVEL.x & TIME.x > TIME.y) %>%
  group_by(TIME.x, LEVEL.x) %>%
  summarise(delta = min(as.numeric(TIME.x) - as.numeric(TIME.y))) %>%  # smallest gap = nearest row
  rename(LEVEL = LEVEL.x, TIME = TIME.x) %>%
  select(TIME, LEVEL, delta) %>%
  right_join(df)