As I'm learning data.table
, I found a situation I can't elegantly work around.
Up front: the absurdity of the lm
formula is obvious, I'm trying to determine if this nuance can be easily worked around with a keyword or special operator within the data.table
ecosystem.
library(data.table)
mt <- as.data.table(mtcars)
mt[, list(model = list(lm(mpg ~ disp))), by = "cyl"]
# cyl model
# 1: 6 <lm>
# 2: 4 <lm>
# 3: 8 <lm>
mt[, list(model = list(lm(mpg ~ disp + cyl))), by = "cyl"]
# Error in model.frame.default(formula = mpg ~ disp + cyl, drop.unused.levels = TRUE) :
# variable lengths differ (found for 'cyl')
This is because inside the block, cyl
is a vector of length 1, not a column like the rest of the values:
mt[, list(model = { browser(); list(lm(mpg ~ cyl+disp)); }), by = "cyl"]
# Called from: `[.data.table`(mt, , list(model = {
# browser()
# list(lm(mpg ~ cyl + disp))
# ...
# Browse[1]>
# debug at #1: list(lm(mpg ~ cyl + disp))
# Browse[2]>
disp
# [1] 160.0 160.0 258.0 225.0 167.6 167.6 145.0
# Browse[2]>
cyl
# [1] 6
The most straight-forward appears to be to manually lengthen it internally as a temporary variable or literally where needed:
mt[, list(model = { cyl2 <- rep(cyl, nrow(.SD)); list(lm(mpg ~ cyl2+disp)); }), by = "cyl"]
mt[, list(model = list(lm(mpg ~ rep(cyl, nrow(.SD))+disp))), by = "cyl"]
Q: Is there a more elegant way to deal with this?
Various loosely-related questions, seeding my curiosity (towards embedding "stuff" in DT objects):
- Setting column name in "group by" operation with data.table
- Run a function inside data.table in R
- Using data.table to create a column of regression coefficients
Candidates so far, many good:
mt[, .(model = .(lm(mpg ~ cyl + disp, data = mt[.I]))), by = .(cyl)]
mt[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)]
mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mt)]
mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE]
mt[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]