For background, I asked this question a couple of weeks ago: How to create a for loop to go through multiple year combinations for a glm in R?
In summary, I have 7 years of data and am trying to create logistic regression glms using 1 year of data, 2 years of data (every combination of the seven years), 3 years of data (every combination of the seven years), etc. until 7 years of data.
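To make the scale of this concrete, here is a small base-R sketch of how many models that implies. The year range 2008:2014 is a hypothetical stand-in for my seven survey years, chosen only for illustration.

```r
# Hypothetical seven survey years, for illustration only
years <- 2008:2014

# Number of year combinations for each subset size k = 1..7
n_combos <- sapply(seq_along(years), function(k) choose(length(years), k))
names(n_combos) <- paste0("years_", seq_along(years))
n_combos

# combn() enumerates the combinations themselves; each column is one combination
pairs <- combn(years, 2)
dim(pairs)
```

Summing `n_combos` gives the total number of models to fit across all subset sizes.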
User @Parfait helped me a ton by writing the basis of the code that goes through every combination of years (7 years total) and extracts the deviance, etc. of each model. I would now like to look at a different metric rather than AIC, deviance, etc. Specifically, I would like to fit each model on training data, predict on testing data, and build a confusion matrix to get an overall accuracy value.
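As a rough sketch of the metric I am after, overall accuracy is the share of test observations on the diagonal of the confusion matrix. The vectors below are invented toy values, and the 0.5 cutoff for turning predicted probabilities into classes is an assumption rather than something I have settled on. This uses base R's table() instead of a package function.

```r
# Toy vectors, invented for illustration (not taken from my data)
observed       <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted_prob <- c(0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3)

# Convert probabilities to 0/1 classes; the 0.5 cutoff is an assumption
predicted <- as.integer(predicted_prob > 0.5)

# Fix the levels so the table is always 2 x 2, even if a class is never predicted
cm <- table(
  predicted = factor(predicted, levels = c(0, 1)),
  observed  = factor(observed,  levels = c(0, 1))
)

# Overall accuracy: correctly classified cases divided by all cases
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```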
Here is some example data:
Blue_allyears <- data.frame(
  Survey_Yea = sample(2005:2014, 500, replace=TRUE),
  Pres_Abs = sample(0:1, 500, replace=TRUE),
  TestData = sample(0:1, 500, replace=TRUE),
  ca_10mbath = runif(500),
  ca_10m_cur = runif(500),
  ca_10m_eas = runif(500),
  ca10_bpi30 = runif(500),
  ca10_bpi24 = runif(500)
)
Blue_allyears
# the code below refers to this data frame as blue_test_train
blue_test_train <- Blue_allyears
And here is the code I have been trying to adapt. Setting up the function:
library(caret)  # provides confusionMatrix()

run_model <- function(vec, yr) {
  # subset data by years
  sub <- blue_test_train[blue_test_train$Survey_Yea %in% vec,]
  # model formula (same fixed set of predictors for every subset)
  fmla <- Pres_Abs~ca_10mbath+ca_10m_cur+ca_10m_eas+ca10_bpi30+ca10_bpi24
  # fit glm model on the training rows (TestData == 0)
  fit <- glm(fmla, data=sub[sub$TestData=="0",], family=binomial(link=logit))
  # get predicted probabilities for the testing rows (TestData == 1)
  trainpredict <- predict(fit, newdata=sub[sub$TestData=="1",], type="response")
  # confusion matrix (this is the line that raises the error)
  cm <- confusionMatrix(trainpredict, reference=sub$Pres_Abs[sub$TestData=="1",])
  overall.accuracy <- cm$overall['Accuracy']
  # create temporary data frame
  df <- data.frame(
    Survey_Yea = paste(vec, collapse=", "),
    overall.accuracy=overall.accuracy,
    stringsAsFactors = F)
  return(df)
}
Running the function:
years <- sort(unique(blue_test_train$Survey_Yea))
# RETURN NESTED LIST OF MANY DATA FRAMES
results_df_list <- lapply(1:7, function(i) combn(
  years, i, run_model, simplify=FALSE, yr=i)
)
# RETURN FLATTENED LIST OF DATA FRAMES AND
# RENAME ELEMENTS
results_df_list <- setNames(
  lapply(results_df_list, function(dfs) do.call(rbind, dfs)),
  c("years_1", "years_2", "years_3", "years_4","years_5","years_6","years_7")
)
# REVIEW EMBEDDED DATA FRAMES
b1 <- results_df_list$years_1
b2 <- results_df_list$years_2
b3 <- results_df_list$years_3
b4 <- results_df_list$years_4
b5 <- results_df_list$years_5
b6 <- results_df_list$years_6
b7 <- results_df_list$years_7
blue_inyears <- rbind(b1,b2,b3,b4,b5,b6,b7)
blue_inyears
Here is the current error I get:
Error in `[.default`(sub$Pres_Abs, sub$TestData == "1", ) :
  incorrect number of dimensions
I've also tried splitting sub into sub2 and sub3, holding just the training and testing data respectively, as well as using different methods for the confusion matrix. Each attempt has also produced some type of error.
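For what it's worth, the same message seems to be reproducible in isolation, which makes me suspect the trailing comma when indexing a single column. The sketch below uses a made-up four-row data frame called toy (hypothetical, not my real data). A column pulled out with $ is a plain vector, so it has no second dimension to index.

```r
# Made-up data frame, for illustration only
toy <- data.frame(Pres_Abs = c(0, 1, 1, 0), TestData = c(0, 1, 0, 1))

# Indexing a vector with a trailing comma fails with
# "incorrect number of dimensions"; try() captures the error
res <- try(toy$Pres_Abs[toy$TestData == "1", ], silent = TRUE)
inherits(res, "try-error")

# Without the comma, the same logical index works on the vector
test_obs <- toy$Pres_Abs[toy$TestData == "1"]
test_obs
```

Separately, as far as I understand caret's confusionMatrix(), it expects predicted and observed classes as factors with the same levels rather than raw probabilities, so the output of predict(..., type="response") would probably need to be converted to 0/1 classes first.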
Any help is much appreciated.
Thank you!