I am trying to identify socio-economic variables determining livestock depredation by reintroduced tigers in India. I split my data into a 70-30 train-test set and then ran `gam()`, but there was a huge difference in accuracy between the train and test results. I therefore decided to cross-validate my training data; however, I could not find any way to do this other than using caret.
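For context, the split was along these lines (a rough sketch: `createDataPartition()` from caret, the `p = 0.7` proportion, and the object name `dat` are illustrative assumptions rather than my exact code; `train3` is the training set used below):

```r
library(caret)

set.seed(123)
# createDataPartition() keeps the class proportions of Loss_tiger
# roughly equal in both parts (stratified 70-30 split)
idx    <- createDataPartition(dat$Loss_tiger, p = 0.7, list = FALSE)
train3 <- dat[idx, ]
test3  <- dat[-idx, ]
```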
The problem is that when I use `gam()` directly and when I use the `train` function, I get very different results (deviance, adjusted R-squared, and REML score), with the `train` model explaining variance very poorly compared with the direct `gam()` fit, even when I do not define any tuning or training parameters.
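To see exactly what caret fits behind the scenes for `method = "gam"`, its model definition can be inspected with `getModelInfo()` (a standard caret function; the snippet below is only a diagnostic, not part of my modelling):

```r
library(caret)

# Pull the model definition caret uses for method = "gam"
gam_info <- getModelInfo("gam", regex = FALSE)[["gam"]]

gam_info$parameters   # tuning parameters caret exposes: select and method
gam_info$fit          # the function caret calls internally to fit mgcv::gam
```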
I wonder whether caret is doing some feature selection on its own, because the formula (for the `train` result) shows only some levels of the categorical variables, not all of them. For example, if I have four categorical explanatory variables a, b, c, d, the formula for the direct `gam()` is

`y ~ a + b + c + d`

but for `train` it is (missing many categories, such as b2, b3 and d2):

`a1 + a2 + b1 + b4 + c1 + c2 + d1 + d3`
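My understanding (which may be wrong) is that `train()` builds a design matrix from the formula, so each factor is expanded into 0/1 dummy columns with one reference level dropped; a minimal sketch with made-up factors a, b, c, d (not my real variables) shows what that expansion looks like:

```r
# Toy data only -- factor names and levels are made up for illustration
set.seed(1)
toy <- data.frame(
  a = factor(sample(1:3, 20, replace = TRUE), levels = 1:3),
  b = factor(sample(1:4, 20, replace = TRUE), levels = 1:4),
  c = factor(sample(1:3, 20, replace = TRUE), levels = 1:3),
  d = factor(sample(1:4, 20, replace = TRUE), levels = 1:4)
)

# model.matrix() expands each factor into dummy columns such as a2, a3, b2, ...
# dropping one reference level per factor
colnames(model.matrix(~ a + b + c + d, data = toy))
```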
I am not sure whether I should use the caret results, because I am clearly missing something here. But if not caret, which other package can I use to get the cross-validated accuracy and AUC of my GAM? (cvgam seems to have been removed from the repository.)
I am sorry that I am not able to share my data because of some issues, but my data set is rather small (n = 330) with some class imbalance (loss = 90, no loss = 240). Here are examples of the code I am using for both methods.
Using `gam()` directly from mgcv:

```r
gam_train <- gam(Loss_tiger ~ s(tot_liv) + graze_forest + Commun_code + Att_Tigr,
                 data = train3, family = binomial, method = "REML")
```
Using the `train` function:

```r
gam_train_cv <- train(Loss_tiger ~ tot_liv + graze_forest + Commun_code + Att_Tigr,
                      data = train3, method = "gam", family = "binomial",
                      trControl = trainControl(method = "LOOCV", number = 1),
                      tuneGrid = data.frame(method = "REML", select = FALSE))
```
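For reference, this is roughly the kind of cross-validated accuracy and AUC I am after: a minimal manual k-fold sketch around `mgcv::gam()`. The fold construction, the 0.5 cutoff, and the use of pROC for the AUC are assumptions for illustration only, and it assumes `Loss_tiger` is coded 0/1.

```r
library(mgcv)
library(pROC)

set.seed(42)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(train3)))

acc <- auc_k <- numeric(k)
for (i in 1:k) {
  fit <- gam(Loss_tiger ~ s(tot_liv) + graze_forest + Commun_code + Att_Tigr,
             data = train3[folds != i, ], family = binomial, method = "REML")
  # Note: a fold that drops an entire Commun_code level will make predict() fail,
  # so stratifying folds by community may be needed with a small data set
  p   <- predict(fit, newdata = train3[folds == i, ], type = "response")
  obs <- train3$Loss_tiger[folds == i]
  acc[i]   <- mean((p > 0.5) == (obs == 1))           # 0.5 cutoff is an assumption
  auc_k[i] <- as.numeric(roc(obs, p, quiet = TRUE)$auc)
}
mean(acc); mean(auc_k)
```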