-1

Suppose I have a data.table with columns X1,X2,X3,Y. For each row, I would like to treat the entries in X1,X2,X3 as vector of length 3, take the inner product with a fixed vector say beta of length 4, subtract the result from the entry inY, square the result, and either output the final result for every row (or save it as another column).

After much research, I came up with this

dat[, (Y-sum(.SD*beta))^2, .SDcols=c(1:3)]

which does not work as expected.

Bonus point #1: Doing this with 3 replaced by a general n.

Bonus point #2: Suppose I have a grp column with group indices and I want to average these residual squares by group.

passerby51
  • 835
  • 2
  • 9
  • 23

1 Answers1

2

Assuming y is the first column of your data table dat and the rest of the columns are predictors. This works for bonus 1.

mat = as.matrix(dat[, x1:x3, with = F])
pred = cbind(1, mat) %*% beta
dat[, rss := (pred - y)^2]

For bonus 2:

dat[, mean_by_grp := mean(rss), by = grp]

To avoid the matrix conversion, you could do this:

dat[, pred := beta[1] + beta[2] * x1 + beta[3] * x2 + beta[4] * x3]

writing out the inner product.


Complete reproducible example

set.seed(47)
dat = data.table(replicate(4, rnorm(5)))
setnames(dat, c("y", paste0("x", 1:3)))
dat[, grp := c("A", "A", "B", "B", "B")]
beta = 1:4

mat = as.matrix(dat[, x1:x3, with = F])
pred = cbind(1, mat) %*% beta
dat[, rss := (pred - y) ^ 2]

dat[, mean_by_grp := mean(rss), by = grp]
dat
#             y          x1          x2          x3 grp       rss mean_by_grp
# 1:  1.9946963 -1.08573747 -0.92245624  0.67077922   A 10.565250    7.064164
# 2:  0.7111425 -0.98548216  0.03960243 -0.08107805   A  3.563078    7.064164
# 3:  0.1854053  0.01513086  0.49382018  1.26424109   B 54.512843   38.263204
# 4: -0.2817650 -0.25204590 -1.82822917 -0.70338819   B 56.558929   38.263204
# 5:  0.1087755 -1.46575030  0.09147291 -0.04057817   B  3.717840   38.263204
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • OK, thanks. Is there a better way? `pred` is something calculated outside the data.table. It should be a natural way to just pass "x" and "y" for every row to a function that operates row-wise. I don't want to convert data.table to a matrix. – passerby51 Nov 23 '16 at 18:17
  • 2
    This is the better way. Matrix multiplication is highly optimized. Operating row-wise is the slow way to do this. – Gregor Thomas Nov 23 '16 at 18:18
  • OK... I suppose I could do something like `mat = as.matrix(dat[, 1:3, with=F])` for avoid turning the whole thing into a matrix first. – passerby51 Nov 23 '16 at 18:24
  • OK... the non-matrix conversion approach is not generalizable. It makes me somewhat sad if there is no elegant way of doing the row-wise operation. – passerby51 Nov 23 '16 at 18:26
  • 1
    In general though, matrix multiplication is something linear algebra packages are tuned for. You're not going to be able to improve on that. Though you could switch the library R is compiled with. See, e.g., [Faster R through better BLAS](https://www.r-bloggers.com/faster-r-through-better-blas/). – Gregor Thomas Nov 23 '16 at 18:27
  • 1
    Row-wise is completely the wrong way to go. You've heard how R is good with vectorization, right? Doing it row-wise is disabling vectorizaton. – Gregor Thomas Nov 23 '16 at 18:28
  • Yes. There is just too much effort to vectorize and it often leads to rather obscure code . From a high-level programming standpoint, it is natural to have fast row-wise operation. I am at that point in life, where I prefer to have readable natural code (and speed) at the same time. (Developments in the Julia community are appealing for this reason.) Sorry for the rant! – passerby51 Nov 23 '16 at 18:32
  • By the way, I don't seem to be able to combine the two steps into `dat[, mean_by_grps := mean((pred - y)^2), by=idx]`. This produces something else. – passerby51 Nov 23 '16 at 18:43
  • 1
    Well, perhaps you should create a reproducible example. Works fine for me with the one I cooked up. I'll add it. – Gregor Thomas Nov 23 '16 at 18:45
  • Oh, this is in fact what I have tried `dat[, mean((pred - y)^2), by=idx]`. The version without creating the column does not work, the one I wrote seems to work. This is strange. – passerby51 Nov 23 '16 at 18:51
  • Another way of avoiding matrix conversion is `Reduce`, as in eddi's answer to my question: http://stackoverflow.com/a/19279500/ – Frank Nov 24 '16 at 19:43