
I have been trying to use biglm to run linear regressions on a large dataset (approximately 60,000,000 rows). I want to use AIC for model selection. However, while playing with biglm on smaller datasets, I discovered that the AIC values returned by biglm differ from those returned by lm. This even applies to the example in the biglm help.

data(trees)
ff <- log(Volume) ~ log(Girth) + log(Height)

chunk1 <- trees[1:10, ]
chunk2 <- trees[11:20, ]
chunk3 <- trees[21:31, ]

library(biglm)
a <- biglm(ff, chunk1)
a <- update(a, chunk2)
a <- update(a, chunk3)

AIC(a)  # 48.18546

a_lm <- lm(ff, trees)
AIC(a_lm)  # -62.71125

Can someone please explain what is happening here? Are the AIC values generated by biglm safe to use for comparing biglm models fitted to the same dataset?

  • You shouldn't expect them to be the same; different authors may elide constants. Generally, it's not a good idea to compare AICs across model families unless they have been carefully constructed to be comparable. – hadley Feb 12 '14 at 22:12
  • You would have to look at the source code and compare formulae used for AIC. Perhaps differences in AIC could even arise depending on how the parameters are estimated. By the way, your code does not run. If your code actually ran that might help people who are trying to answer your question. – Mark Miller Feb 12 '14 at 22:40
  • Your code runs and returns the same answers you provided if I add the following code: `library(biglm)`. However, I am uncertain whether adding code to a question is an acceptable use of the edit feature. So, I have not added that code to your post. – Mark Miller Feb 13 '14 at 17:20
  • 2
    @MarkMiller, it seems reasonable to me, so I went ahead and did it. – Ben Bolker Feb 13 '14 at 21:12
  • Apologies for the typo and the missing library call; thanks to Gavin Simpson and Ben Bolker for fixing my code. – user3303687 Feb 14 '14 at 14:37

2 Answers


tl;dr it looks to me like there is a pretty obvious bug in the AIC method for biglm-class objects (more specifically, in the update method) in the current (0.9-1) version, but the author of the biglm package is a smart, experienced guy, and biglm is widely used, so perhaps I'm missing something. Googling for "biglm AIC df.resid" suggests this was discussed as far back as 2009. Update: the package author/maintainer reports via e-mail that this is indeed a bug.

Something funny seems to be going on here. Differences in AIC between models should be the same across modeling frameworks, whatever additive constants are used and however parameters are counted, because those choices are consistent within a given framework.
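In symbols: if a framework reports

AIC_i = -2*logLik_i + 2*k_i + C

for some framework-specific constant C, then AIC_1 - AIC_2 = -2*(logLik_1 - logLik_2) + 2*(k_1 - k_2), and C cancels out of the difference.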

Original example:

data(trees)
ff <- log(Volume) ~ log(Girth) + log(Height)
chunk1 <- trees[1:10, ]
chunk2 <- trees[11:20, ]
chunk3 <- trees[21:31, ]
library(biglm)
a <- biglm(ff, chunk1)
a <- update(a, chunk2)
a <- update(a, chunk3)
a_lm <- lm(ff, trees)

Now fit a reduced model:

ff2 <- log(Volume) ~ log(Girth)
a2 <- biglm(ff2, chunk1)
a2 <- update(a2, chunk2)
a2 <- update(a2, chunk3)
a2_lm <- lm(ff2, trees)

Now compare AIC values:

AIC(a)-AIC(a2)
## [1] 1.80222

AIC(a_lm)-AIC(a2_lm)
## [1] -20.50022

Check that we haven't screwed something up:

all.equal(coef(a),coef(a_lm))  ## TRUE
all.equal(coef(a2),coef(a2_lm))  ## TRUE

Look under the hood:

biglm:::AIC.biglm
## function (object, ..., k = 2) 
##    deviance(object) + k * (object$n - object$df.resid)

In principle this is the right formula (number of observations minus residual df should be the number of parameters fitted), but digging in, it looks like the $df.resid component of the objects hasn't been updated properly:

a$n  ## 31, correct
a$df.resid  ## 7, only valid before updating! (should be 31 - 3 = 28,
            ##  so AIC penalizes 31 - 7 = 24 "parameters" instead of 3)

Looking at biglm:::update.biglm, I would add

object$df.resid <- object$df.resid + NROW(mm)

right before or after the line that reads

object$n <- object$n + NROW(mm)

...

This seems like a fairly obvious bug to me, so perhaps I'm missing something obvious, or perhaps it has been fixed.
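In the meantime, one interim fix (a sketch, assuming the stale df.resid component is the only problem) is to repair the fitted objects by hand before calling AIC():

a$df.resid  <- a$n  - length(coef(a))   ## recompute residual df as n minus
a2$df.resid <- a2$n - length(coef(a2))  ## the number of fitted coefficients
AIC(a)  ## now deviance(a) + 2*3, i.e. based on the correct parameter count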

A simple workaround would be to define your own AIC function as

AIC.biglm <- function (object, ..., k = 2) {
    deviance(object) + k * length(coef(object))
}

AIC(a) - AIC(a2)  ## intended to match the lm() comparison, but see the comments below

(although note that AIC(a_lm) is still not equal to AIC(a), because stats:::AIC.default() uses -2*log-likelihood rather than the deviance; the two differ in their additive constants ...)
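To see that last point concretely, here is a small check for the Gaussian lm fit above: deviance() returns the residual sum of squares, while -2*logLik() includes the normalizing constants of the Gaussian likelihood.

n   <- nobs(a_lm)
rss <- deviance(a_lm)
-2 * as.numeric(logLik(a_lm))          ## -2 * log-likelihood
n * (log(2 * pi) + log(rss / n) + 1)   ## the same value, written out
rss                                    ## the deviance is just the RSS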

Ben Bolker
  • Thanks @Ben for the very detailed answer. Just a question for clarification: after running your AIC.biglm function and comparing AIC(a) and AIC(a2) again, I expected to get an answer of -20.50022, to match the difference between AIC(a_lm) and AIC(a2_lm), but I don't; I still get 1.80222. Was my expectation of getting the same results wrong, or should I be looking for a mistake in my code? Thanks again! – user3303687 Feb 14 '14 at 15:52
  • 1
    I think I was wrong. I think my `AIC.biglm` doesn't actually take effect, R probably still looks in the `biglm` namespace for the appropriate accessor method ... will correct when I get a chance. – Ben Bolker Feb 14 '14 at 18:21

I have played around with this a bit. I am not certain, but I think the formula for AIC used by the package biglm is:

2 * (n.parameters  + obs.added - 1) + deviance(a)

where obs.added is the number of observations in chunk2 plus the number of observations in chunk3:

obs.added <- dim(chunk2)[1] + dim(chunk3)[1]

and n.parameters is the number of estimated coefficients returned by summary(a), plus 1 for the error term, and deviance(a) is the deviance of model a.

####################################################

data(trees)

ff  <- log(Volume)~log(Girth)+log(Height)

n.parm  <- 4

chunk1<-trees[1:10,]
chunk2<-trees[11:20,]
chunk3<-trees[21:31,]

obs.added <- dim(chunk2)[1] + dim(chunk3)[1]

library(biglm)

a <- biglm(ff,chunk1)
a <- update(a,chunk2)
a <- update(a,chunk3)
AIC(a)
summary(a)
deviance(a)

2 * (n.parm  + obs.added - 1) + deviance(a)

round(AIC(a), 5) == round(2 * (n.parm  + obs.added - 1) + deviance(a), 5)

# [1] TRUE

####################################################

Since I am not 100% certain my answer is correct, you can play around with the code below and see whether you can find a scenario where the proposed formula for AIC does not hold. If I find any such scenario, I will modify the code below and the formula above as necessary.

#########################################################

# Generate some data

n      <- 118                                      # number of observations
B0     <-  2                                       # intercept
B1     <- -1.5                                     # slope 1
B2     <-  0.4                                     # slope 2
B3     <-  2.0                                     # slope 3
B4     <- -0.8                                     # slope 4

sigma2 <-  5                                       # residual variance

x1     <- round(runif(n, -5 ,  5), digits = 3)     # covariate 1
x2     <- round(runif(n, 10 , 20), digits = 3)     # covariate 2
x3     <- round(runif(n,  2 ,  8), digits = 3)     # covariate 3
x4     <- round(runif(n, 12 , 15), digits = 3)     # covariate 4

eps    <- rnorm(n, mean = 0, sd = sqrt(sigma2))    # error
y      <- B0 + B1 * x1 + B2 * x2 + B3 * x3 + B4 * x4 + eps   # dependent variable

my.data <- data.frame(y, x1, x2, x3, x4)

# analyze data with linear regression

model.1 <- lm(y ~ x1 + x2 + x3 + x4, data = my.data)
summary(model.1)

AIC(model.1)

n.parms <- length(model.1$coefficients) + 1

my.AIC <- 2 * n.parms - 2 * as.numeric(logLik(model.1))
my.AIC

#########################################################

ff0 <- y ~  1
ff1 <- y ~ x1
ff2 <- y ~ x1 + x2
ff3 <- y ~ x1 + x2 + x3
ff4 <- y ~ x1 + x2 + x3 + x4

n.parm0 <- 2
n.parm1 <- 3
n.parm2 <- 4
n.parm3 <- 5
n.parm4 <- 6

n.chunks <- 5

chunk1<-my.data[                                    1:round(((nrow(my.data)/n.chunks)*1)+0),]
chunk2<-my.data[round(((nrow(my.data)/n.chunks)*1)+1):round(((nrow(my.data)/n.chunks)*2)+0),]
chunk3<-my.data[round(((nrow(my.data)/n.chunks)*2)+1):round(((nrow(my.data)/n.chunks)*3)+0),]
chunk4<-my.data[round(((nrow(my.data)/n.chunks)*3)+1):round(((nrow(my.data)/n.chunks)*4)+0),]
chunk5<-my.data[round(((nrow(my.data)/n.chunks)*4)+1):nrow(my.data),]

obs.added <- dim(chunk2)[1] + dim(chunk3)[1] + dim(chunk4)[1] + dim(chunk5)[1]

# check division of data

foo <- list()

foo[[1]]  <- chunk1
foo[[2]]  <- chunk2
foo[[3]]  <- chunk3
foo[[4]]  <- chunk4
foo[[5]]  <- chunk5

all.data.foo <- do.call(rbind, foo)

all.equal(my.data, all.data.foo)

####################################################

library(biglm)

####################################################

# the five fits are identical in structure, so fit them in a loop and
# check the proposed AIC formula against AIC() for each model

formulas <- list(ff0, ff1, ff2, ff3, ff4)
n.parms  <- c(n.parm0, n.parm1, n.parm2, n.parm3, n.parm4)

for (i in seq_along(formulas)) {
    fit <- biglm(formulas[[i]], chunk1)
    for (ch in list(chunk2, chunk3, chunk4, chunk5)) {
        fit <- update(fit, ch)
    }
    my.aic <- 2 * (n.parms[i] + obs.added - 1) + deviance(fit)
    cat(deparse(formulas[[i]]), ":",
        round(AIC(fit), 5) == round(my.aic, 5), "\n")
}

# each line prints TRUE if the proposed formula matches AIC() for that model

####################################################

EDIT

I suggested biglm uses the following equation for AIC:

2 * (n.parameters  + obs.added - 1) + deviance(a)

Ben Bolker pointed out that the equation biglm uses for AIC is:

deviance(object) + k * (object$n - object$df.resid)

Ben also determined that update.biglm was not updating the residual df, which stays at its chunk1 value.

Given that new information, I now see that the two equations are equivalent.
First, restrict the two equations to the following, which is the only place they differ:

(n.parameters  + obs.added - 1) # mine

(object$n    - object$df.resid) # Ben's

Rearrange mine so that the 1 I added to the number of parameters (for the error term) cancels against the subtracted 1:

((n.parameters-1) + obs.added) = ((4-1) + obs.added) = (3 + 21) = 24

Now morph my equation into Ben's:

My 3 is the same as:

(number of observations in chunk1 - object$df.resid) = (10 - 7) = 3

giving:

((number of obs in chunk1 - object$df.resid) + obs.added) = ((10-7) + 21)

or:

(3 + 21) = 24

Re-arrange:

((number of obs in chunk1 + obs.added) - object$df.resid) = ((10 + 21) - 7)

or:

(31 - 7) = 24

And:

((number of observations in chunk1 + obs.added) - object$df.resid)

is the same as:

(total number of observations - object$df.resid)

Which is the same as:

(object$n - object$df.resid) = (31 - 7) = 24

It appears the equation I proposed really is the equation biglm uses for AIC, just expressed in a different form.
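As a quick numeric check of the equivalence, using the trees example above (where a$df.resid is stuck at its chunk1 value of 7):

n.parameters <- 4   # 3 coefficients + 1 for the error term
obs.added    <- 21  # rows in chunk2 plus rows in chunk3
n.parameters + obs.added - 1   ## 24, my count
a$n - a$df.resid               ## 24, biglm's count (with the stale df.resid)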

Of course, I was only able to realize this because Ben provided both the critical code and the critical explanation of the error.

Mark Miller