11

As I'm learning data.table, I found a situation I can't elegantly work around.

Up front: the absurdity of the lm formula is obvious, I'm trying to determine if this nuance can be easily worked around with a keyword or special operator within the data.table ecosystem.

library(data.table)
mt <- as.data.table(mtcars)
mt[, list(model = list(lm(mpg ~ disp))), by = "cyl"]
#    cyl model
# 1:   6  <lm>
# 2:   4  <lm>
# 3:   8  <lm>
mt[, list(model = list(lm(mpg ~ disp + cyl))), by = "cyl"]
# Error in model.frame.default(formula = mpg ~ disp + cyl, drop.unused.levels = TRUE) : 
#   variable lengths differ (found for 'cyl')

This is because inside the block, cyl is a vector of length 1, not a column like the rest of the values:

mt[, list(model = { browser(); list(lm(mpg ~ cyl+disp)); }), by = "cyl"]
# Called from: `[.data.table`(mt, , list(model = {
#     browser()
#     list(lm(mpg ~ cyl + disp))
#   ...
# Browse[1]> 
# debug at #1: list(lm(mpg ~ cyl + disp))
# Browse[2]> 
disp
# [1] 160.0 160.0 258.0 225.0 167.6 167.6 145.0
# Browse[2]> 
cyl
# [1] 6

The most straight-forward appears to be to manually lengthen it internally as a temporary variable or literally where needed:

mt[, list(model = { cyl2 <- rep(cyl, nrow(.SD)); list(lm(mpg ~ cyl2+disp)); }), by = "cyl"]
mt[, list(model = list(lm(mpg ~ rep(cyl, nrow(.SD))+disp))), by = "cyl"]

Q: Is there a more elegant way to deal with this?


Various loosely-related questions, seeding my curiosity (towards embedding "stuff" in DT objects):


Candidates so far, many good:

mt[, .(model = .(lm(mpg ~ cyl + disp, data = mt[.I]))), by = .(cyl)]
mt[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)]
mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mt)]
mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE]
mt[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • 2
    Not a huge improvement, but to avoid making the column inside `J` you could make duplicate `cyl` column and then use that? – Mike H. Oct 31 '18 at 23:14
  • 4
    You can rename the `by=` on the fly - `mt[, length(cyl), by =.(cylgroup=cyl)]` ? – thelatemail Oct 31 '18 at 23:16
  • @MikeH., that's close but is close to my first working method; yours just requires (encourages?) removing the duplicate column later. Thanks. – r2evans Oct 31 '18 at 23:19
  • @thelatemail, that's probably going to be "it": it leads by code-golf (`by=.(a=cyl)`) and is almost completely self-cleaning, where the only artifact is the returned group column name that does not match the original column name. The same-name-cleanup could be `mt[,...,by=.(a=cyl)][,.(cyl=a,model),]`, detracting from its sheer awesomeness by a little, unless there's a better way to rename inline. Thanks! – r2evans Oct 31 '18 at 23:25
  • Another idea pass the data to lm directly and use your `by` to subset: `mt[, list(model = { list(lm(mpg ~ cyl+disp, data = mt[.I])); }), by = .(cyl)]`? Not ideal because you're subsetting again for each group though – Mike H. Oct 31 '18 at 23:28
  • That's another good thought, I wondered if/how to incorporate `.I` in it. Agreed, this subsets each time, not ideal with larger datasets. – r2evans Oct 31 '18 at 23:33
  • @thelatemail, your method does not work with `.SD`: `mt[,.(data = .({browser();.SD})),by=.(a=cyl)]` does not present `cyl` as available. – r2evans Oct 31 '18 at 23:37
  • 2
    Also, a broader question: Is this about general use or lm's specifically? I don't believe passing a constant (which cyl or whatever your group by will be for each group) makes any sense to an lm – Mike H. Oct 31 '18 at 23:38
  • 1
    `lm` was merely an example for embedding different objects. It might be analogous to a `tidyr`/`broom` way of doing things, though I'm not trying to start a war or reinvent. For instance, `mt[,.(data=.(.SD)),by=.(a=cyl)]` is similar to `tidyr::nest(mt,-cyl)` but it's not always as intuitive to do follow-on per-row model-building from this point on. – r2evans Oct 31 '18 at 23:42
  • 4
    Final thought - you can also reference cyl from the `.SDcols=` argument - `mt[, .(.(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mt)]` or even `mt[, .(.(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE]` – thelatemail Oct 31 '18 at 23:50
  • 2
    "A more elegant way" might be opinion-based. Fwiw, I think thela's way is good, but there's also the data= argument: `mtm = mt[, list(model = list(lm(mpg ~ disp + cyl, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]` – Frank Nov 01 '18 at 10:38
  • 1
    Thanks @Frank, that's another good suggestion. I understand your point about opinion-based, perhaps a better way to offer it would be: efficient, without unintended side-effects, as intended by the devs, perhaps even whatever the DT equivalent of "pythonic" is. – r2evans Nov 01 '18 at 15:56
  • @MikeH. yes, it does and I plan to do that, but I'd like to encourage more visits before I do that. Having an answer tends to slow down page-visits. – r2evans Nov 01 '18 at 18:16
  • 1
    Here is my final thought as well: Since you want this to be generalizable it's probably best to use a method that doesn't rely on pass to the `data` argument of `lm`. thelatemail's strategy of passing all the columns to .SD is good, but I suppose could be cumbersome for large data. Alternatively, `assign` might actually be useful here: `mt[, list(model = { for(i in 1:length(.BY)){assign(names(.BY)[i], rep(.BY[[i]], .N))}; list(lm(mpg ~ cyl+disp)); }), by = .(cyl)] `. This should make all your `by` variables available within `J` – Mike H. Nov 01 '18 at 18:31
  • You could also write a wrapper function so that the code isn't as ugly, but you'd have to be careful about assigning to the correct environment – Mike H. Nov 01 '18 at 18:33

1 Answers1

2

Thanks to all for the candidates.

mt[, .(model = .(lm(mpg ~ cyl + disp, data = mt[.I]))), by = .(cyl)]
mt[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)]
mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mt)]
mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE]
mt[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]

The performance (with this small model) seems to have some small differences:

library(microbenchmark)
microbenchmark(
  c1 = mt[, .(model = .(lm(mpg ~ cyl + disp, data = mt[.I]))), by = .(cyl)],
  c2 = mt[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)],
  c3 = mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mt)],
  c4 = mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE],
  c5 = mt[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]
)
# Unit: milliseconds
#  expr    min      lq     mean  median      uq     max neval
#    c1 3.7328 4.21745 4.584591 4.43485 4.57465  9.8924   100
#    c2 2.6740 3.11295 3.244856 3.21655 3.28975  5.6725   100
#    c3 2.8219 3.30150 3.618646 3.46560 3.81250  6.8010   100
#    c4 2.9084 3.27070 3.620761 3.44120 3.86935  6.3447   100
#    c5 5.6156 6.37405 6.832622 6.54625 7.03130 13.8931   100

With larger data

mtbigger <- rbindlist(replicate(1000, mtcars, simplify=FALSE))
microbenchmark(
  c1 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, data = mtbigger[.I]))), by = .(cyl)],
  c2 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)],
  c3 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mtbigger)],
  c4 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE],
  c5 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]
)
# Unit: milliseconds
#  expr     min       lq     mean  median       uq      max neval
#    c1 27.1635 30.54040 33.98210 32.2859 34.71505  76.5064   100
#    c2 23.9612 25.83105 28.97927 27.5059 30.02720  67.9793   100
#    c3 25.7880 28.27205 31.38212 30.2445 32.79030 105.4742   100
#    c4 25.6469 27.84185 30.52403 29.8286 32.60805  37.8675   100
#    c5 29.2477 32.32465 35.67090 35.0291 37.90410  68.5017   100

(I'm guessing the relative performance scales similarly. A better adjudication might include much wider data.)

By median runtime alone, it looks like the top (by a very small margin) is:

mtbigger[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)]
r2evans
  • 141,215
  • 6
  • 77
  • 149