R data.table dot product with matching column names (for each group)

Question

I have a datatable of data and a datatable of fitted coefficients. I want to calculate the fitted value for each row.

dt = data.table(a = rep(c("x","y"), each = 5), b = rnorm(10), c = rnorm(10), d = rnorm(10))
coefs = data.table(a = c("x","y"), b = c(0, 1), d = c(2,3))
dt
#    a           b          c           d
# 1: x -0.25174915 -0.2130797 -0.67909764
# 2: x -0.35569766  0.6014930  0.35201386
# 3: x -0.31600957  0.4398968 -1.15475814
# 4: x -0.54113762 -2.3497952  0.64503654
# 5: x  0.11227873  0.0233775 -0.96891456
# 6: y  1.24077566 -1.2843439  1.98883516
# 7: y -0.23819626  0.9950835 -0.17279980
# 8: y  1.49353589  0.3067897 -0.02592004
# 9: y  0.01033722 -0.5967766 -0.28536224
#10: y  0.69882444  0.8702424  1.24131062

coefs # NB no "c" column
#   a b d
#1: x 0 2
#2: y 1 3

For each a=="x" row in dt, I want 0*b+2*d; and for each a=="y" row in dt, I want 1*b+3*d.

Is there a datatable way to do this without hardcode the column name? I'm happy to put the column names in a variable cols = colnames(coefs)[-1].

It's easy to loop over groups and rbind together, so if the grouping is causing trouble, please ignore that part.

My similar question: http://stackoverflow.com/q/19279075/ For what it's worth, I think it's natural that one would be concerned about matching names dynamically, and that doesn't make this question a "moving target" at all. — Frank, Aug 03 '16 at 16:06

Roland · Accepted Answer · 2016-08-03T14:15:43.423

Join the data.tables:

dt[coefs, res := b * i.b + d * i.d, on = "a"]
 #   a           b            c          d        res
 #1: x  0.09901786 -0.362080111 -0.5108862 -1.0217723
 #2: x -0.16128422  0.169655945  0.3199648  0.6399295
 #3: x -0.79648896 -0.502279345  1.3828633  2.7657266
 #4: x -0.26121421  0.480548972 -1.1559392 -2.3118783
 #5: x  0.54085591 -0.601323442  1.3833795  2.7667590
 #6: y  0.83662761  0.607666970  0.6320762  2.7328562
 #7: y -1.92510391 -0.050515610 -0.3176544 -2.8780671
 #8: y  1.65639926 -0.167090105  0.6830158  3.7054466
 #9: y  1.48772354 -0.349713539 -1.2736467 -2.3332166
#10: y  1.49065993  0.008198885 -0.1923361  0.9136516

Usually you would use the matrix product here, but that would mean that you had to coerce the respective subset to a matrix. That would result in a copy being made and since data.tables are mainly used for larger data, you want to avoid copies.

If you need dynamic column names, the most simple solution that comes to mind is actually an eval/parse construct:

cols = colnames(coefs)[-1]
expr <- parse(text = paste(paste(cols, paste0("i.", cols), sep = "*"), collapse = "+"))
#expression(b*i.b+d*i.d)

dt[coefs, res := eval(expr), on = "a"]

Maybe someone else can suggest a better solution.

Here is a solution using matrix multiplication:

dt[, res := as.matrix(.SD) %*% unlist(coefs[a == .BY, .SD, .SDcols = cols]), 
  by = "a", .SDcols = cols]

Of course this makes copies, which is potentially less efficient then the eval solution.

Thanks. Is it possible to not hardcode the column name? I'm happy to put them in a variable like `cols = colnames(coefs)[-1]` and then how to go from there? — jf328, Aug 03 '16 at 13:27
Please don't make your question a moving target. Provide all specifications when you ask a question. — Roland, Aug 03 '16 at 13:48

score 0 · Answer 2 · answered Aug 03 '16 at 13:54

0

I found out that data.table of all numerical type columns can do arithmetic operations (+,-,*,/), but no name matching -- just order matching.

> coefs
   a b d
1: x 0 2
2: y 1 3
> coefs[, .(b,d)] * coefs[, .(b,d)]
   b d
1: 0 4
2: 1 9
> coefs[, .(b,d)] * coefs[, .(d,b)]
   b d
1: 0 0
2: 3 3

so a solution based on this

> cols = colnames(coefs)[-1]
> zz = rowSums(coefs[dt[,.(a)], .SD, on = 'a', .SDcols = cols] * dt[, .SD, .SDcols = cols])
> dt[, newcol := zz]

answered Aug 03 '16 at 13:54

jf328

6,841
10
58
82

If you don't mind copies (which `rowSums` makes by coercing the data.table to a matrix), you should use my last suggestion, i.e., matrix multiplication. – Roland Aug 03 '16 at 14:16

Mike H. · Answer 3 · 2016-08-03T14:03:31.063

Another alternative (but slower) approach is:

dt$res <- unsplit(Map(function(x,y){x$b*y$b + x$d*y$d}, split(dt, dt$a=="x"), 
                 split(coefs,coefs$a=="x")),dt$a=="x")

    dt
    a           b          c          d        res
 1: x  0.47859729  1.3479271  0.5691897  1.1383794
 2: x  0.28491505 -0.3291934  1.8621365  3.7242730
 3: x -1.43894695  1.5555413  0.3685772  0.7371544
 4: x  0.04360066  0.1358920  0.5240700  1.0481400
 5: x -1.39897890 -0.0175886 -0.6876451 -1.3752901
 6: y -0.60952146  1.2331907 -0.3582176 -1.6841742
 7: y  0.31777772  1.4090295 -0.4053615 -0.8983067
 8: y  0.42758431 -0.3746061  2.1208417  6.7901094
 9: y -0.60701063 -0.9232092  1.9386482  5.2089341
10: y -1.52042316 -0.8871454 -0.9314232 -4.3146927

This same code would work in base R as well if your data was already data.frames.

Right, just wanted to show a different way of doing it. – Mike H. Aug 03 '16 at 14:03 — Mike H., Aug 03 '16 at 14:03

R data.table dot product with matching column names (for each group)

3 Answers3