Say I have a data.table in which one column contains linear models:

library(data.table)
set.seed(1014)

dt <- data.table(
  g = c(1, 1, 2, 2, 3, 3, 3),
  x = runif(7),
  y = runif(7)
)

models <- dt[, list(mod = list(lm(y ~ x, data = .SD))), by = g]

Now I want to extract the r-squared value from each model. Can I do better than this?

models[, list(rsq = summary(mod[[1]])$r.squared), by = g]

##    g      rsq
## 1: 1 1.000000
## 2: 2 1.000000
## 3: 3 0.004452

Ideally, I'd like to be able to eliminate the [[1]] and not rely on knowing the previous grouping variable (I know I want each row to be its own group).

hadley
  • Maybe you should explain if there's a certain criteria you expect the `data.table` to have or why you need this? Given `models` and asking for `r.squared`, other than grouping by `g`, I can only think of using `lapply(...)` and then adding the result as a new column. – Arun Apr 09 '14 at 22:10
  • you could group by `1:nrow(models)` to avoid "knowing" about g – eddi Apr 09 '14 at 23:17
  • @arun If you know you're working with individual rows, you could internally use `[[` instead of `[`. That's what I'm thinking of for dplyr (with a special row wise grouper) and I was wondering if data table already had similar functionality. – hadley Apr 10 '14 at 17:53
  • Just because I don't know any better, why is using `[[1]]` worth avoiding? – Dean MacGregor Apr 10 '14 at 18:19
  • @DeanMacGregor because in this case it's redundant – hadley Apr 10 '14 at 21:27
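
To make eddi's comment concrete, here is a minimal sketch of grouping by row position rather than by `g`. It rebuilds the question's `models` table so that it runs on its own; the name `rsq_by_row` is only illustrative. Note that it removes the dependence on `g` but still needs the `[[1]]`, because within each one-row group `mod` is a length-1 list.

```r
library(data.table)
set.seed(1014)

dt <- data.table(
  g = c(1, 1, 2, 2, 3, 3, 3),
  x = runif(7),
  y = runif(7)
)
models <- dt[, list(mod = list(lm(y ~ x, data = .SD))), by = g]

# Group by row index instead of g: every row becomes its own group.
# mod is then a list of length one, so [[1]] is still required to unwrap it.
rsq_by_row <- models[, list(rsq = summary(mod[[1]])$r.squared), by = 1:nrow(models)]
rsq_by_row
```

Two of the groups contain only two points, so `summary()` may warn about an essentially perfect fit; that is expected with this toy data.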

4 Answers

This is just `summary` being a bad little function that isn't vectorized. So how about vectorizing it manually (this is roughly the same as @mnel's solution):

r.squared = Vectorize(function(x) summary(x)$r.squared)

models[, rsq := r.squared(mod)]
models
#   g  mod         rsq
#1: 1 <lm> 1.000000000
#2: 2 <lm> 1.000000000
#3: 3 <lm> 0.004451631
eddi
  • Most of the functions that work with linear models won't be vectorised, so that approach is a bit painful in general. – hadley Apr 10 '14 at 17:51
  • @hadley yeah, so you have to apply the function 1 by 1 yourself - how you choose to do it is up to you (explicit `*apply` or `Vectorize` or your original solution), but I don't see a way around that unfortunate fact – eddi Apr 10 '14 at 18:01
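
As one possible form of the explicit `*apply` route mentioned in the comment above, the sketch below uses `vapply`, which also checks that every model yields exactly one numeric value. The helper name `get_rsq` is hypothetical and not part of any package; the setup is repeated from the question so the block runs on its own.

```r
library(data.table)
set.seed(1014)

dt <- data.table(
  g = c(1, 1, 2, 2, 3, 3, 3),
  x = runif(7),
  y = runif(7)
)
models <- dt[, list(mod = list(lm(y ~ x, data = .SD))), by = g]

# Hypothetical helper: extract r-squared from a single fitted model
get_rsq <- function(m) summary(m)$r.squared

# Apply the helper to each element of the list column;
# numeric(1) makes vapply fail loudly if a result is not a single number
models[, rsq := vapply(mod, get_rsq, numeric(1))]
models
```

The same pattern should carry over to other per-model extractors by swapping the helper, with the `FUN.VALUE` template adjusted to the expected return type.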

My first thought was to use `rapply` with `classes = 'lm'`, but that does not work. `sapply`, however, does (to my surprise):

library(data.table)
set.seed(1014)

dt <- data.table(
  g = c(1, 1, 2, 2, 3, 3, 3),
  x = runif(7),
  y = runif(7)
)

models <- dt[, list(mod = list(lm(y ~ x, data = .SD))), by = g]
models[, rsq := sapply(mod, function(x) summary(x)$r.squared)]

models
#     g  mod         rsq
#  1: 1 <lm> 1.000000000
#  2: 2 <lm> 1.000000000
#  3: 3 <lm> 0.004451631

"Doing other things" to the model within data.table might be problematic because of the way `.SD` works as an environment.

See "Why is using update on a lm inside a grouped data.table losing its model data?" for an example of what can occur. This is the subject of bug #2590.

mnel

Would that work?

library(data.table)
set.seed(1014)

dt <- data.table(
  g = c(1, 1, 2, 2, 3, 3, 3),
  x = runif(7),
  y = runif(7)
)
models <- dt[, list(rsq = summary(lm(y ~ x))$r.squared), by = g]
#   g         rsq
#1: 1 1.000000000
#2: 2 1.000000000
#3: 3 0.004451631
David Arenburg
  • No, because as well as extracting the r-squared from the model, I might want to do other things with them. – hadley Apr 10 '14 at 00:32

I know this question has been inactive for more than two years, but a solution already exists that is not described here.

library(purrr)
library(broom)

# glance() returns one row of model-level statistics (including r.squared) per model
map_df(models$mod, glance)
Jan Kislinger