How to update `lm` or `glm` model on same subset of data?

Question

I am trying to fit two nested models and then test those against each other using anova function. The commands used are:

probit <- glm(grad ~ afqt1 + fhgc + mhgc + hisp + black + male, data=dt, 
    family=binomial(link = "probit"))
nprobit <- update(probit, . ~ . - afqt1)
anova(nprobit, probit, test="Rao")

However, the variable afqt1 apparently contains NAs and because the update call does not take the same subset of data, anova() returns error

Error in anova.glmlist(c(list(object), dotargs), dispersion = dispersion, : models were not all fitted to the same size of dataset

Is there a simple way how to achieve refitting the model on the same dataset as the original model?

create a subset of your data with all those covariates and use `na.omit` — rawr, Mar 15 '14 at 20:20
That is of course possible if you do "small scale" analysis, but if you want to fit and test multiple specifications and your dataset can contain many covariates, it would be rather tedious, that is why I am asking for "simpler" way. (Something like option in `update`???) — krhlk, Mar 15 '14 at 20:27
@rawr In my tests `update()` seems to ignore any `na.action` arguments in the `glm()` call. For instance, `mtcars[1,2] <- NA; nobs( xa <- lm(mpg~cyl+disp, mtcars, na.action = na.exclude) ); nobs( update(xa, .~.-cyl) )` will yield different number of observations for the model fits. How should one use the `na.action` in `lm`/`glm` so that `update()` honors it? — landroni, May 13 '16 at 10:29
@landroni in your second example you are not using the covariate with missing data in your updated formula, so there is no missing data in the model being fit. if you do `nobs( update(xa, .~.-disp) )` you get 31. are you trying to remove all rows with NA in the data _before_ the model is being fit? — rawr, May 13 '16 at 14:13
@rawr No. As the original poster, I'm dealing with a moderately complex data set and it would be cumbersome to remove NA rows every time I run a different specification. So I'm trying to find a way in which R transparently deals with `NA`s when updating an existing `lm` fit, and keeps the sample size fixed throughout; i.e. ensure that the `update`d model is indeed *nested*. I would have assumed `na.exclude` to help with that, but from what I see it doesn't... — landroni, May 13 '16 at 14:52
@landroni if your full model is `mtcars[1:5]`, `data <- na.omit(mtcars[1:5])` removes all rows with any NA values so that you can use data on all your models to ensure that they are fit to the same size data. alternatively `xa$model` contains your 31 observation data — rawr, May 13 '16 at 15:03
@rawr The `na.omit(mtcars[1:5])` approach is impractical when dealing with a big data frame (e.g. 1000 cols) and with a large number of vars in the various specifications (e.g ~15 vars); while I *can* do this, having to do this manual bookkeeping of which vars should be sanitized of `NA`s and which shouldn't is precisely what I'd like to avoid. Whereas retrieving the data via `xa$model` would itself be incompatible with the `update()` approach. Unfortunately neither approach would be completely satisfactory under the constraints of the original question... — landroni, May 13 '16 at 16:01
@landroni `nobs( update(xa, .~.-cyl, data = xa$model) )` ? just tried na.omit on 100,000x1,000 data frame and it was almost instant so I don't really know what you mean by impractical, manual bookkeeping, or unsatisfactory — rawr, May 13 '16 at 16:35
@rawr I didn't realize that `update` could overwrite the `data` argument from the original call. This makes it indeed simple and easy to use. I also found a way to easily use the `na.omit` approach (see my answer), where I also explain in a bit more detail why I felt initially that this approach would be impractical. Thanks for the suggestions, very useful. — landroni, May 20 '16 at 08:40

landroni · Answer 1 · 2016-05-22T21:02:40.480

As suggested in the comments, a straightforward approach to this is to use the model data from the first fit (e.g. probit) and update's ability to overwrite arguments from the original call.

Here's a reproducible example:

data(mtcars)
mtcars[1,2] <- NA
nobs( xa <- lm(mpg~cyl+disp, mtcars) ) 
## [1] 31
nobs( update(xa, .~.-cyl) )  ##not nested
## [1] 32
nobs( xb <- update(xa, .~.-cyl, data=xa$model) )  ##nested
## [1] 31

It is easy enough to define a convenience wrapper around this:

update_nested <- function(object, formula., ..., evaluate = TRUE){
    update(object = object, formula. = formula., data = object$model, ..., evaluate = evaluate)
}

This forces the data argument of the updated call to re-use the data from the first model fit.

nobs( xc <- update_nested(xa, .~.-cyl) )
## [1] 31
all.equal(xb, xc)  ##only the `call` component will be different
## [1] "Component “call”: target, current do not match when deparsed"
identical(xb[-10], xc[-10])
## [1] TRUE

So now you can easily do anova:

anova(xa, xc)
## Analysis of Variance Table
## 
## Model 1: mpg ~ cyl + disp
## Model 2: mpg ~ disp
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     28 269.97                              
## 2     29 312.96 -1   -42.988 4.4584 0.04378 *
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The other approach suggested is na.omit on the data frame prior to the lm() call. At first I thought this would be impractical when dealing with a big data frame (e.g. 1000 cols) and with a large number of vars in the various specifications (e.g ~15 vars), but not because of speed. This approach requires manual bookkeeping of which vars should be sanitized of NAs and which shouldn't, and is precisely what the OP seems intent to avoid. The biggest drawback would be that you must always keep in sync the formula with the subsetted data frame.

This however can be overcome rather easily, as it turns out:

data(mtcars)
for(i in 1:ncol(mtcars)) mtcars[i,i] <- NA
nobs( xa <- lm(mpg~cyl + disp + hp + drat + wt + qsec + vs + am + gear + 
                    carb, mtcars) ) 
## [1] 21
nobs( xb <- update(xa, .~.-cyl) )  ##not nested
## [1] 22
nobs( xb <- update_nested(xa, .~.-cyl) )  ##nested
## [1] 21
nobs( xc <- update(xa, .~.-cyl, data=na.omit(mtcars[ , all.vars(formula(xa))])) )  ##nested
## [1] 21
all.equal(xb, xc)
## [1] "Component “call”: target, current do not match when deparsed"
identical(xb[-10], xc[-10])
## [1] TRUE

anova(xa, xc)
## Analysis of Variance Table
## 
## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 2: mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     10 104.08                           
## 2     11 104.42 -1  -0.34511 0.0332 0.8591

How to update `lm` or `glm` model on same subset of data?

1 Answers1

Linked