0

I've built a rudimentary function to extract AIC and BIC values from 3 models I'm interested for several variables. However while it's running my computer often stops and says that it can't allocate 200MB to a vector (I'm using a large dataset- more than 500K cases and yes I've increased the memory limit to the max-4000).

I have actually managed to run it if I select a couple of variables at a time. I'm interested in actually running the function in one go but also improving my function code so that I don't have to delete everything else before running it and possibly not have to wait 30 minutes. I'm likely to use amended AIC and BIC formulas and add other things, so I'd rather keep the AIC and BIC vectorisation as it is and not switch to other logistic regression functions. I've played around with it and added things like rm(model1) but it probably makes very little difference. Would you be able to suggest code which solves the memory allocation problem and possibly speed the up the function?

Many thanks

The function:

myF<-function(mydata,TotScore,group){
  BIC2<-BIC1<-BIC0<-AIC2<-AIC1<-AIC0<-rep(NA,length(ncol(mydata)))
  for (i in (1:ncol(mydata))){
    M0<-glm(mydata[,i] ~ TotScore,family=binomial,data=mydata,x=F,y=F,model=F)
    AIC0[i]<-extractAIC(M0)[2]
    BIC0[i]<-extractAIC(M0,k=log(length(M0$fitted.values)))[2]
    rm(M0)
    M1<-glm(mydata[,i] ~ TotScore+group,family=binomial,data=mydata,x=F,y=F,model=F)
    AIC1[i]<-extractAIC(M1)[2]
    BIC1[i]<-extractAIC(M1,k=log(length(M1$fitted.values)))[2]
    rm(M1)
    M2<-glm(mydata[,i] ~ TotScore+group+TotScore*group,family=binomial,data=mydata,x=F,y=F,model=F)
    AIC2[i]<-extractAIC(M2)[2]
    BIC2[i]<-extractAIC(M2,k=log(length(M2$fitted.values)))[2]
    rm(M2)
  }
  Results<-cbind(AIC0,AIC1,AIC2,BIC0,BIC1,BIC2)
  rownames(Results)<-names(mydata)
  return(Results) 
}

P.S. The model can be tried with

##Random dataset example
v1<-sample(0:1, 500000, replace=TRUE, prob=c(.80,.20))
v2<-sample(0:1, 500000, replace=TRUE, prob=c(.85,.15))
v3<-sample(0:1, 500000, replace=TRUE, prob=c(.95,.05))
mydata<-as.data.frame(cbind(v1,v2,v3))
TotScore=rowSums(mydata)
group<-(rep (1:5,100000))
myF(mydata,TotScore,group)
Marco M
  • 623
  • 1
  • 8
  • 15
  • 3
    Welcome to StackOverflow. Perhaps if you made a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) that demonstrates your question / problem, people would find it easier to answer. – Andrie Aug 07 '12 at 12:55
  • Apologies, small dataset example attached. – Marco M Aug 07 '12 at 13:11

2 Answers2

2

The nice thing about binomial data with discrete predictors is that you can aggregate the data without loss of information

set.seed(12345)
v1<-sample(0:1, 500000, replace=TRUE, prob=c(.80,.20))
v2<-sample(0:1, 500000, replace=TRUE, prob=c(.85,.15))
v3<-sample(0:1, 500000, replace=TRUE, prob=c(.95,.05))
mydata<-as.data.frame(cbind(v1,v2,v3))
mydata$TotScore <- rowSums(mydata)
mydata$group <- rep (1:5,100000)

library(reshape)
myFun2 <- function(Y, dataset){
  tmp <- as.data.frame(table(TotScore = dataset$TotScore, group = dataset$group, Response = dataset[, Y]))
  levels(tmp$Response) <- c("Failure", "Succes")
  tmp <- cast(TotScore + group ~ Response, data  = tmp, value = "Freq")
  tmp$TotScore <- as.numeric(levels(tmp$TotScore))[tmp$TotScore]
  output <- rep(NA, 6)
  names(output) <- paste(rep(c("AIC", "BIC"), 3), rep(0:2, each = 2), sep = "")
  m <- glm(cbind(Succes, Failure) ~ TotScore, data = tmp, family = binomial,
           model = FALSE, x = FALSE, y = FALSE)
  output[1:2] <- c(AIC(m), BIC(m))
  m <- glm(cbind(Succes, Failure) ~ TotScore + group, data = tmp, family = binomial,
           model = FALSE, x = FALSE, y = FALSE)
  output[3:4] <- c(AIC(m), BIC(m))
  m <- glm(cbind(Succes, Failure) ~ TotScore * group, data = tmp, family = binomial,
           model = FALSE, x = FALSE, y = FALSE)
  output[5:6] <- c(AIC(m), BIC(m))
  output
}


system.time({
  sapply(colnames(mydata)[1:3], myFun, dataset = mydata)
})
   user  system elapsed 
  3.10    0.06    3.15 
Thierry
  • 18,049
  • 5
  • 48
  • 66
  • Dear Thierry, thank you for all your efforts. Reducing the data to table format does improve things substantially in terms of time and memory. However, The AIC and BIC values for the reduced (table) data are not the same as the full dataset ones, so I think they're not really comparable to other analyses, rules of thumb etc.If you try the BIC()function on the table data and on the normal data you'll see what I mean. BIC is particularly ("bad") as the difference is not the same with full and table data (while AIC difference between models is the same because it doesn't include the sample size). – Marco M Aug 08 '12 at 10:07
0
library(difR)
data(verbal)
verbal$TotScore <- rowSums(verbal[, c(1:24)])
verbal$group <- with(verbal, factor(Gender):factor(Anger > 20))

myFun <- function(Y, dataset){
  output <- rep(NA, 6)
  names(output) <- paste(rep(c("AIC", "BIC"), 3), rep(0:2, each = 2), sep = "")
  m <- glm(as.formula(paste(Y, "~ TotScore")), data = dataset, family = binomial,
      model = FALSE, x = FALSE, y = FALSE)
  output[1:2] <- c(AIC(m), BIC(m))
  m <- glm(as.formula(paste(Y, "~ TotScore + group")), data = dataset, 
     family = binomial, model = FALSE, x = FALSE, y = FALSE)
  output[3:4] <- c(AIC(m), BIC(m))
  m <- glm(as.formula(paste(Y, "~ TotScore * group")), data = dataset, 
      family = binomial, model = FALSE, x = FALSE, y = FALSE)
  output[5:6] <- c(AIC(m), BIC(m))
  output
}

sapply(colnames(verbal)[1:2], myFun, dataset = verbal)
Thierry
  • 18,049
  • 5
  • 48
  • 66
  • I could not show any speed gains (I did something similar but didn't publish it for this reason). Can you elaborate on how your solution solves the memory or speed issue? – Roman Luštrik Aug 07 '12 at 13:41
  • You are probably right but it's my fault I should have given you a much bigger dataset to try it with so that you can test size and time. I've now amended the question (see above). Unfortunately Thierry's suggestion still makes my computer stall, but thanks for improving the function anyway. – Marco M Aug 07 '12 at 14:05