How to predict a glm using a function to go through different combination of years of data?

Question

For background, I asked this questions a couple of weeks ago: How to create a for loop to go through multiple year combinations for a glm in R?

In summary, I have 7 years of data and am trying to create logistic regression glms using 1 year of data, 2 years of data (every combination of the seven years), 3 years of data (every combination of the seven years), etc. until 7 years of data.

User @Parfait helped me a ton in creating the basis of a code to go through every combination of years (7 years total) and to find the deviance, etc. of the model. I would now like to look at different metrics rather than AIC, deviance, etc. Specifically, I would like to use testing and training data and go through the process of prediction and a confusion matrix to get an overall accuracy value.

Here is some example data:

Blue_allyears <- data.frame(
  Survey_Yea = sample(2005:2014, 500, replace=TRUE),
  Pres_Abs = sample(0:1, 500, replace=TRUE),
 TestData = sample(0:1, 500, replace=TRUE),
  ca_10mbath = runif(500),
  ca_10m_cur = runif(500),
  ca_10m_eas = runif(500),
  ca10_bpi30 = runif(500),
  ca10_bpi24 = runif(500)
)
Blue_allyears

And here is the code I have been trying to adapt. Setting up the function:

run_model <- function(vec, yr) {
  # subset data by years
  
  sub <-blue_test_train[blue_test_train$Survey_Yea %in% vec,]

 
  # dynamically generate formula
        fmla <- Pres_Abs~ca_10mbath+ca_10m_cur+ca_10m_eas+ca10_bpi30+ca10_bpi24
    
        # fit glm model
        fit<-glm(fmla,data=sub[sub$TestData=="0",],family=binomial(link=logit))
      
#get predictions
       
       trainpredict <- predict(fit, newdata=sub[sub$TestData=="1",], type="response")
    
# confusion matrix
cm<- confusionMatrix( trainpredict,reference=sub$Pres_Abs[sub$TestData=="1",])
            overall.accuracy <- cm$overall['Accuracy']



        # create temporary data frame
        df <- data.frame( 
          Survey_Yea = paste(vec, collapse=", "), 
                       overall.accuracy=overall.accuracy,
          stringsAsFactors = F)

  
return(df)
}

running the function:

years <- sort(unique(blue_test_train$Survey_Yea))

# RETURN NESTED LIST OF MANY DATA FRAMES
results_df_list <- lapply(1:7, function(i) combn(
  years, i, run_model, simplify=FALSE, yr=i)
)

# RETURN FLATTENED LIST OF DATA FRAMES AND
# RENAME ELEMENTS
results_df_list <- setNames(
  lapply(results_df_list, function(dfs) do.call(rbind, dfs)),
  c("years_1", "years_2", "years_3", "years_4","years_5","years_6","years_7")
)

# REVIEW EMBEDDED DATA FRAMES
b1<-(results_df_list$years_1)
b2<-(results_df_list$years_2)
b3<-(results_df_list$years_3)
b4<-(results_df_list$years_4)
b5<-(results_df_list$years_5)
b6<-(results_df_list$years_6)
b7<-(results_df_list$years_7)
blue_inyears<-rbind(b1,b2,b3,b4,b5,b6,b7)
blue_inyears

Here is the current error code I et: Error in [.default(sub$Pres_Abs, sub$TestData == "1", ) : incorrect number of dimensions

I've also tried subsetting sub into sub2 and sub3 with just the training and testing data, respectively, as well as using different methods for the confusion matrix. I've also gotten some type of error.

Any help is much appreciated.

Thank you!

Since you are subsetting a vector and not a data frame remove the comma in this expression: `sub$Pres_Abs[sub$TestData=="1",]` — Parfait, Jun 04 '22 at 22:34
Thanks @Parfait! I tried that and the next error I got was: "Error: `data` and `reference` should be factors with the same levels." I think this has to do with the confusion matrix. I tried to change a few things but with no avail. I'll keep working on it though — Kinz, Jun 05 '22 at 01:54
Thanks @Parfait! I kept on getting a bunch of errors still, so I ended up subsetting the data from the beginning which helped. I also made some other adjustments. I posted the code in case you wanted to see. Thanks again for your help! — Kinz, Jun 06 '22 at 18:05

score 2 · Answer 1 · answered Jun 06 '22 at 18:04

Alrighty I think I figured it out... here is the updated code:

run_model <- function(vec, yr) {
  # subset data by years
  sub <-blue_test_train[blue_test_train$Survey_Yea %in% vec,]
sub1<-sub[sub$TestData=="0",]
sub2<-sub[sub$TestData=="1",]

  # dynamically generate formula
        fmla <- Pres_Abs~ca_10mbath+ca_10m_cur+ca_10m_eas+ca10_bpi30+ca10_bpi24
    
        # fit glm model
        fit<-glm(fmla,data=sub1,family=binomial(link=logit))

#get predictions
       sub2$predict <- predict(fit, newdata=sub2, type="response")
        
       threshold <- 0.5
       
      cm <-confusionMatrix(factor(sub2$predict>threshold),  factor(sub2$Pres_Abs==1), positive="TRUE")
    
# confusion matrix
            overall.accuracy <- cm$overall['Accuracy']

        # create temporary data frame
        df <- data.frame( 
          Survey_Yea = paste(vec, collapse=", "), 
                       overall.accuracy=overall.accuracy,
          Years=yr,
          stringsAsFactors = F
          )

return(df)
}

I subsetted my data from the beginning which helped. I'm playing around with determining the "optimal" threshold rather than just using 0.5, but figured I'd share the code since it works now!

How to predict a glm using a function to go through different combination of years of data?

1 Answers1