Working with rich objects in data.table columns

Question

Say I have a data.table in which one column contains linear models:

library(data.table)
set.seed(1014)

dt <- data.table(
  g = c(1, 1, 2, 2, 3, 3, 3),
  x = runif(7),
  y = runif(7)
)

models <- dt[, list(mod = list(lm(y ~ x, data = .SD))), by = g]

Now I want to extract the r-squared value from each model. Can I do better than this?

models[, list(rsq = summary(mod[[1]])$r.squared), by = g]

##    g      rsq
## 1: 1 1.000000
## 2: 2 1.000000
## 3: 3 0.004452

Ideally, I'd like to eliminate the [[1]] and not rely on knowing the previous grouping variable (I know I want each row to be its own group).

Maybe you should explain whether there are certain criteria you expect the `data.table` to meet, or why you need this? Given `models` and asking for `r.squared`, other than grouping by `g`, I can only think of using `lapply(...)` and then adding the result as a new column. — Arun, Apr 09 '14 at 22:10
you could group by `1:nrow(models)` to avoid "knowing" about g — eddi, Apr 09 '14 at 23:17
@arun If you know you're working with individual rows, you could internally use `[[` instead of `[`. That's what I'm thinking of for dplyr (with a special row wise grouper) and I was wondering if data table already had similar functionality. — hadley, Apr 10 '14 at 17:53
Just because I don't know any better, why is using `[[1]]` worth avoiding? — Dean MacGregor, Apr 10 '14 at 18:19
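A minimal sketch of eddi's suggestion above, assuming the same `dt` and `models` as in the question. Grouping by row position removes the dependence on `g`, although `[[1]]` is still needed inside `j`. The column name `row` is only an illustrative choice.

```r
library(data.table)
set.seed(1014)

dt <- data.table(
  g = c(1, 1, 2, 2, 3, 3, 3),
  x = runif(7),
  y = runif(7)
)
models <- dt[, list(mod = list(lm(y ~ x, data = .SD))), by = g]

# Each row becomes its own group, so `g` is never mentioned.
# `row` is a hypothetical name for the grouping column.
rsq_by_row <- models[, list(rsq = summary(mod[[1]])$r.squared),
                     by = .(row = seq_len(nrow(models)))]
```

The result has one row per model, keyed by position rather than by the original grouping variable.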

score 4 · Answer 1 · answered Apr 10 '14 at 15:17


This is just `summary` being a bad little function that's not vectorized. So how about vectorizing it manually (this is roughly the same as @mnel's solution):

r.squared = Vectorize(function(x) summary(x)$r.squared)

models[, rsq := r.squared(mod)]
models
#   g  mod         rsq
#1: 1 <lm> 1.000000000
#2: 2 <lm> 1.000000000
#3: 3 <lm> 0.004451631

answered Apr 10 '14 at 15:17

eddi


Most of the functions that work with linear models won't be vectorised, so that approach is a bit painful in general. – hadley Apr 10 '14 at 17:51
@hadley yeah, so you have to apply the function 1 by 1 yourself - how you choose to do it is up to you (explicit `*apply` or `Vectorize` or your original solution), but I don't see a way around that unfortunate fact – eddi Apr 10 '14 at 18:01

score 3 · Answer 2 · edited May 23 '17 at 12:00


My first thought was to use `rapply` with `classes = 'lm'`, but that does not work. `sapply`, however, does (to my surprise):

library(data.table)
set.seed(1014)

dt <- data.table(
  g = c(1, 1, 2, 2, 3, 3, 3),
  x = runif(7),
  y = runif(7)
)

models <- dt[, list(mod = list(lm(y ~ x, data = .SD))), by = g]
models[, rsq := sapply(mod, function(x) summary(x)$r.squared)]

models
#     g  mod         rsq
#  1: 1 <lm> 1.000000000
#  2: 2 <lm> 1.000000000
#  3: 3 <lm> 0.004451631

"Doing other things" to the model within data.table might be problematic because of the way .SD works as an environment.

See "Why is using update on a lm inside a grouped data.table losing its model data?" for an example of what can occur. This is the subject of bug #2590.

edited May 23 '17 at 12:00

Community


answered Apr 10 '14 at 05:49

mnel


using `lapply` makes more sense here imo – eddi Apr 10 '14 at 15:22
@eddi - except that `lapply` will return a list, and thus transpose the response. – mnel Apr 11 '14 at 01:19
you're right, I got confused by how `data.table` outputs the two the same way – eddi Apr 11 '14 at 15:20
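The difference discussed in these comments can be sketched as follows, assuming the same `models` table as in the answer. In `j`, a list is read as a set of columns, so the `lapply` result comes out as one row with one column per model, while the `sapply` result is an atomic vector and becomes a single column.

```r
library(data.table)
set.seed(1014)

dt <- data.table(
  g = c(1, 1, 2, 2, 3, 3, 3),
  x = runif(7),
  y = runif(7)
)
models <- dt[, list(mod = list(lm(y ~ x, data = .SD))), by = g]

get_rsq <- function(m) summary(m)$r.squared

# sapply simplifies to a numeric vector: one column, one row per model.
long <- models[, list(rsq = sapply(mod, get_rsq))]

# lapply returns a list, which j treats as columns: one row, one column per model.
wide <- models[, lapply(mod, get_rsq)]

dim(long)
dim(wide)
```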

score 1 · Answer 3 · answered Apr 09 '14 at 21:38


Would that work?

library(data.table)
set.seed(1014)

dt <- data.table(
  g = c(1, 1, 2, 2, 3, 3, 3),
  x = runif(7),
  y = runif(7)
)
models <- dt[, list(rsq = summary(lm(y ~ x))$r.squared), by = g]
#   g         rsq
#1: 1 1.000000000
#2: 2 1.000000000
#3: 3 0.004451631

answered Apr 09 '14 at 21:38

David Arenburg


No, because as well as extracting the r-squared from the model, I might want to do other things with them. – hadley Apr 10 '14 at 00:32
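To illustrate the point of this comment, here is a rough sketch that keeps the fitted models in a list column (as in the question) so that several quantities can be pulled out later. The column names `slope` and `n` are illustrative, and `vapply` is used here only as a type-checked alternative to `sapply`.

```r
library(data.table)
set.seed(1014)

dt <- data.table(
  g = c(1, 1, 2, 2, 3, 3, 3),
  x = runif(7),
  y = runif(7)
)

# Keep the model objects themselves, not just one summary statistic.
models <- dt[, list(mod = list(lm(y ~ x, data = .SD))), by = g]

# Any number of extractors can then be applied to the stored models.
models[, slope := vapply(mod, function(m) coef(m)[["x"]], numeric(1))]
models[, n := vapply(mod, nobs, integer(1))]
```

Because the models are still in `mod`, further extractors (residuals, predictions, and so on) can be added the same way without refitting.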

score 0 · Answer 4 · answered Mar 10 '17 at 16:01


I know this question has been inactive for more than two years, but a solution already exists that is not described here.

require(purrr)
require(broom)
map_df(models$mod, glance)

answered Mar 10 '17 at 16:01

Jan Kislinger
