2

I have a millions row long data.table with about 20 counter-style columns. Those columns display various storage system operations' counters increasing in time. Sometimes however, the counter gets reset on the monitored system and the single observation value is lower than a previous one.

I need to calculate a opsdiff column, which would contain an arithmetical difference (expected to be positive) of subsequent values of same type based on column type. When a counter reset situation is identified - ie. the difference is negative, the actual value of the counter should be used.

> dt <- data.table(type=rep(c("a","b"),each=6),opscounter=c(105609,106316,106705,489,723,1250))
> dt
    type opscounter
 1:    a     105609
 2:    a     106316
 3:    a     106705
 4:    a        489
 5:    a        723
 6:    a       1250
 7:    b     105609
 8:    b     106316
 9:    b     106705
10:    b        489
11:    b        723
12:    b       1250

The result I wish to get:

> dt
    type opscounter     opsdiff
 1:    a     105609          NA
 2:    a     106316         707
 3:    a     106705         389
 4:    a        489         489
 5:    a        723         234
 6:    a       1250         527
 7:    b     105609          NA
 8:    b     106316         707
 9:    b     106705         389
10:    b        489         489
11:    b        723         234
12:    b       1250         527
> 

I tried to build a construction based on ifelse() but I don't know how to address the current row's opscounter value in else part and neither the dual call of diff() is perhaps an efficient usage:

dt[,opsdiff:=c(NA, ifelse(diff(opscounter)>0, diff(opscounter), opscounter)), by=type]

How can I correctly calculate the opsdiff column?

amonk
  • 1,769
  • 2
  • 18
  • 27
  • 2
    It is a very strange behaviour IMO, you could try `dt[, opsdiff := c(NA, diff(opscounter)), by = type][opsdiff < 0, opsdiff := opscounter]`, although I don't think this is the best approach... – David Arenburg Oct 07 '14 at 15:20
  • The following approach might be a little simpler that David Arenburgs approach, but yields the same result: opsdiff <- tapply(dt$opscounter, INDEX=dt$type, FUN=function(x) { diffvalues <- diff(x); diffvalues[diffvalues < 0] <- x[which(diffvalues < 0)+1]; return(c(NA, diffvalues)) }); dt$opsdiff <- unlist(opsdiff) – CL. Oct 07 '14 at 15:28
  • 1
    You should pass the corresponding elements to `opscounter`. Try `dt[,opsdiff:=c(NA, ifelse(diff(opscounter)>0, diff(opscounter), tail(opscounter, -1L))), by=type]`. One more reason to move away from `ifelse` :-). – Arun Oct 07 '14 at 15:34
  • 1
    David's approach is the way to go imo. And lol @user2706569 - in what world is that simpler?! – eddi Oct 07 '14 at 16:01
  • @eddi: True, in a comment, without any newlines, my approach doesn't look quite clear. But that's just a matter of formatting. However, in my opinion defining a function that step by step calculates the desired result is simpler (easier to understand) than David's solution. This does not mean that it is *better*, but I would say, it's *simpler*. – CL. Oct 07 '14 at 16:39
  • Thanks all, David's solutions seems most clean to me. @DavidArenburg: what does actually mean that second `[]` bracket statement? Does it call for second pass of the vector (column)? Can I theoretically create a chain of 2,3,n of such? – Lukáš Kubín Oct 07 '14 at 16:41
  • @user2706569 here's the `data.table` equivalent of your expression: `dt[, opsdiff := {diffvalues = diff(opscounter); diffvalues[diffvalues < 0] = opscounter[which(diffvalues < 0) + 1]; c(NA, diffvalues)}, by = type]`. So this simplifies your formatting quite a bit and removes all of that terrible `tapply` mess, but still seems to me to be the much more complicated solution - too many moving pieces that one has to understand. – eddi Oct 07 '14 at 17:05
  • @LukášKubín `[.data.table` returns a `data.table`, so you can string together as many `[]`'s as you like. In the case of assignment, i.e. `:=` in the `j`-expression, it returns the modified `data.table`, so you can keep modifying it via consecutive `[]`'s. – eddi Oct 07 '14 at 17:09

2 Answers2

2

The preferred solution would be:

dt[, opsdiff := c(NA, diff(opscounter)), 
   by = type][opsdiff < 0, opsdiff := opscounter][]
#     type opscounter opsdiff
#  1:    a     105609      NA
#  2:    a     106316     707
#  3:    a     106705     389
#  4:    a        489     489
#  5:    a        723     234
#  6:    a       1250     527
#  7:    b     105609      NA
#  8:    b     106316     707
#  9:    b     106705     389
# 10:    b        489     489
# 11:    b        723     234
# 12:    b       1250     527

Note that I've added additional [] in order to print results on the fly and in order to illustrate that you can add more than one of these.

In general, it would be better to avoid ifelse (especially in your case with such big data set) as it can to be slow (although vectorized) due to it evaluating both yes and no cases. In your case, you found another "flaw", that you need to tell it the exact locations where you want to pull opscounter from, which adds to the complexity (see @Aruns comment for possible override).

Regarding your question in comments, a data.table operation of the form DT[...] just calls the function [.data.table(DT, ...). It's not any different on a data.frame; there is a similar function [.data.frame.

Note that a data.table is also a data.frame. See class(dt) and also read ?data.table.

To make it even clearer, in data.table, tacking [...] one after another is called chaining. This is something that comes for free. You can do the same in a data.frame as well (as shown below), but the operations you can perform on a data.frame are limited, and therefore the use of chaining itself, unlike data.table.

df <- as.data.frame(dt) # or `setDF(dt)` in 1.9.4+ to do this by reference
df[df$type == "a", ][2:3, ]
#   type opscounter
# 2    a     106316
# 3    a     106705

Finally, in order to illustrate the ineffectiveness of ifelse, here is a benchmark:

set.seed(123)
n <- 1e6
dt <- data.table(type = rep(c("a","b"), each = n), 
                 opscounter = sample(1:1e5, n*2, replace = TRUE))


library(microbenchmark)
microbenchmark(
dt[, opsdiff := c(NA, diff(opscounter)), by = type][opsdiff < 0, opsdiff := opscounter],
dt[, opsdiff := c(NA, ifelse(diff(opscounter) > 0, diff(opscounter), tail(opscounter, -1L))), by=type]
)

# Unit: milliseconds
#                                                                                                         expr
#             dt[, `:=`(opsdiff, c(NA, diff(opscounter))), by = type][opsdiff <  0, `:=`(opsdiff, opscounter)]
# dt[, `:=`(opsdiff, c(NA, ifelse(diff(opscounter) > 0, diff(opscounter), tail(opscounter, -1L)))), by = type]
#      min       lq      mean    median        uq       max neval
# 228.0445 255.4006  285.8163  281.1388  307.4195  508.3841   100
# 899.1222 990.1478 1085.5492 1048.3704 1095.7179 1740.5704   100

The ifelse solution is slower by a factor of ~4.

Community
  • 1
  • 1
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • Thanks a lot David, Your solution is really fast and elegant after tested on my real data. Now, when I need to process more columns of my table in the same way - ie. create column opsdiff2 based on values in opscounter2 etc. - what is a best method to follow? Do I just repeatedly execute the `dt[,...` processing with different column being created each time, or is there even something smarter? – Lukáš Kubín Oct 07 '14 at 20:50
  • That will be a bit more complected, for the first step, you can do something like: `n <- 10 # assuming you have 10 columns to modify; dt[, paste0("opsdiff", 1:n) := lapply(.SD, function(x) c(NA, diff(x))), by = type, .SDcols = paste0("opscounter", 1:n)]` when you can modify `n` to whatever you like, and I assume here that your first column called `opscounter1` instead of `opscounter` (for convince) – David Arenburg Oct 07 '14 at 21:00
0

Since this is not an approach with data.table, this would not be the ideal one. But, the following would be an alternative approach.

library(dplyr)
df <- data.frame(type=rep(c("a","b"),each=6),opscounter=c(105609,106316,106705,489,723,1250))

df %>%
    group_by(type) %>%
    mutate(opsdiff = opscounter - lag(opscounter)) %>%
    mutate(opsdiff = ifelse(opsdiff < 0, opscounter, opsdiff))

#   type opscounter opsdiff
#1     a     105609      NA
#2     a     106316     707
#3     a     106705     389
#4     a        489     489
#5     a        723     234
#6     a       1250     527
#7     b     105609      NA
#8     b     106316     707
#9     b     106705     389
#10    b        489     489
#11    b        723     234
#12    b       1250     527
jazzurro
  • 23,179
  • 35
  • 66
  • 76
  • No need to use `data.frame`, you can use OPs original data as `dplyr` knows how to handle `data.table` objects. Not to metnion `dplyr` uses `data.table` under the hood either way – David Arenburg Oct 07 '14 at 18:23