17

Here is a good SO explanation about row operations in data.table

One alternative that came to my mind is to use a unique id for each row and then apply a function using the by argument. Like this:

library(data.table)

dt <- data.table(V0 =LETTERS[c(1,1,2,2,3)],
                 V1=1:5,
                 V2=3:7,
                 V3=5:1)

# create a column with row positions
dt[, rowpos := .I]

# calculate standard deviation by row
dt[ ,  sdd := sd(.SD[, -1, with=FALSE]), by = rowpos ] 

Questions:

  1. Is there a good reason not to use this approach? perhaps other more efficient alternatives?

  2. Why does using by = .I doesn't work the same?

    dt[ , sdd := sd(.SD[, -1, with=FALSE]), by = .I ]

Community
  • 1
  • 1
rafa.pereira
  • 13,251
  • 6
  • 71
  • 109
  • 2
    For this case, you can use `Reduce("+", dt[, 2:4, with = FALSE])` to (1) _not_ loop by rows and (2) _not_ convert to "matrix". For other by-row operations, you could consider either similar to `Reduce` operations to avoid applying a function to each row or -perhaps- storing data as a "matrix" and use "matrix" - specific/efficient functions – alexis_laz Jun 07 '16 at 08:41
  • 2
    ..with `sd`, too, looking at [here](http://stackoverflow.com/questions/25099825/row-wise-variance-of-a-matrix-in-r) and [here](http://stackoverflow.com/questions/17549762/is-there-such-colsd-in-r), an option seems to be `sqrt(rowSums((dt[, 2:4, with = FALSE] - Reduce("+", dt[, 2:4, with = FALSE]) / 3) ^ 2) / (3 - 1))` – alexis_laz Jun 07 '16 at 09:22
  • I don't know why `by=.I` doesn't error out, but is not equivalent to `1:nrow(dt)` - I'd file a bug report if I were you – eddi Jun 07 '16 at 20:18
  • Thank @eddi, I've just filed it it. https://github.com/Rdatatable/data.table/issues/1732 – rafa.pereira Jun 07 '16 at 20:49

1 Answers1

26

UPDATE:

Since data.table version 1.4.3 or later, by=.I has been implemented to work as expected by OP for row-wise grouping. Note using by=.I will create a new column in the data.table called I that has the row numbers. The row number column can then be kept or deleted according to preference.

The following parts of this answer records an earlier version that pertains to older versions of data.table. I keep it here for reference in case someone still uses legacy versions.


Note: section (3) of this answer updated in April 2019, due to many changes in data.table over time redering the original version obsolete. Also, use of the argument with= removed from all instances of data.table, as it has since been deprecated.

1) Well, one reason not to use it, at least for the rowsums example is performance, and creation of an unnecessary column. Compare to option f2 below, which is almost 4x faster and does not need the rowpos column (Note that the original question used rowSums as the example function, to which this part of the answer responds. OP edited the question afterwards to use a different function, for which part 3 of this answer is more relevant`):

dt <- data.table(V0 =LETTERS[c(1,1,2,2,3)], V1=1:5, V2=3:7, V3=5:1)
f1 <- function(dt){
  dt[, rowpos := .I] 
  dt[ ,  sdd := rowSums(.SD[, 2:4]), by = rowpos ] }
f2 <- function(dt) dt[, sdd := rowSums(.SD), .SDcols= 2:4]

library(microbenchmark)
microbenchmark(f1(dt),f2(dt))
# Unit: milliseconds
#   expr      min       lq     mean   median       uq      max neval cld
# f1(dt) 3.669049 3.732434 4.013946 3.793352 3.972714 5.834608   100   b
# f2(dt) 1.052702 1.085857 1.154132 1.105301 1.138658 2.825464   100  a 

2) On your second question, although dt[, sdd := sum(.SD[, 2:4]), by = .I] does not work, dt[, sdd := sum(.SD[, 2:4]), by = 1:NROW(dt)] works perfectly. Given that according to ?data.table ".I is an integer vector equal to seq_len(nrow(x))", one might expect these to be equivalent. The difference, however, is that .I is for use in j, not in by. NB the value of .I is calculated internally in data.table, so is not available beforehand to be passed in as a parameter value as in by=.I.

It might also be expected that by = .I should just throw an error. But this does not occur, because loading the data.table package creates an object .I in the data.table namespace that is accessible from the global environment, and whose value is NULL. You can test this by typing .I at the command prompt. (Note, the same applies to .SD, .EACHI, .N, .GRP, and .BY)

.I
# Error: object '.I' not found
library(data.table)
.I
# NULL
data.table::.I
# NULL

The upshot of this is that the behaviour of by = .I is equivalent to by = NULL.

3) Although we have already seen in part 1 that in the case of rowSums, which already loops row-wise efficiently, there are much faster ways than creating the rowpos column. But what about looping when we don't have a fast row-wise function?

Benchmarking the by = rowpos and by = 1:NROW(dt) versions against a for loop with set() is informative here. We find that looping over set in a for loop is slower than either of the methods that use data.table's by argument for looping. However there is neglibible difference in timing between the by loop that creates an additional column and the one that uses seq_len(NROW(dt)). Absent any performance difference, it seems that f.nrow is probably preferable, but only on the basis of being more concise and not creating an unnecessary column

dt <- data.table(V0 = rep(LETTERS[c(1,1,2,2,3)], 1e3), V1=1:5, V2=3:7, V3=5:1)

f.rowpos <- function() {
  dt[, rowpos := .I] 
  dt[,  sdd := sum(.SD[, 2:4]), by = rowpos ] 
}

f.nrow <- function() {
  dt[, sdd := sum(.SD[, 2:4]), by = seq_len(NROW(dt)) ]
}

f.forset<- function() {
  for (i in seq_len(NROW(dt))) set(dt, i, 'sdd', sum(dt[i, 2:4]))
}

microbenchmark(f.rowpos(),f.nrow(), f.forset(), times = 5)
# Unit: milliseconds
#       expr       min        lq      mean    median        uq       max neval
# f.rowpos()  559.1115  575.3162  580.2853  578.6865  588.5532  599.7591     5
#   f.nrow()  558.4327  582.4434  584.6893  587.1732  588.6689  606.7282     5
# f.forset() 1172.6560 1178.8399 1298.4842 1255.4375 1292.7393 1592.7486     5

So, in conclusion, even in situations where there is not an optimised function such as rowSums that already operates by row, there are alternatives to using a rowpos column that, although not faster, don't require creation of a redundant column.

dww
  • 30,425
  • 5
  • 68
  • 111
  • Great answer! I was curious to know why you used `NROW` instead of `nrow`? I _think_ they're equivalent in this context, but wasn't sure if there is a subtlety I'm missing. – Dan Dec 18 '20 at 13:28
  • 1
    @Lyngbakr no real subtletly - either will work. I tend to prefer `NROW`, because it also generalizes to vectors. I've seen cases in the past where selecting columns using a vector of column names or positions may inadvertently return a vector rather than data.frame or data.table, if only a single column was specified. `NROW` catches these edge cases, but `nrow` does not. So I tend to stick with `NROW` as a general policy unless I specifically want the nrow behaviour. – dww Dec 18 '20 at 14:34
  • Good to know. Thanks for the insights! – Dan Dec 18 '20 at 18:40
  • Anyway, the by=-I is going to be included: https://github.com/Rdatatable/data.table/pull/5235 – skan Mar 19 '22 at 00:40
  • Thanks for the info @skan - once that makes it into the release version, I'll update this answer with a note about the change. – dww Mar 19 '22 at 01:53