
I am trying to identify the socio-economic variables that determine livestock depredation by reintroduced tigers in India. I split my data 70/30 into training and test sets and then ran gam(), but there was a huge difference in accuracy between the training and test results. I therefore decided to cross-validate on my training data; however, I could not find any way to do it other than using caret.

The problem is that fitting with gam() directly and fitting through caret's train() function give very different results (deviance, adjusted R-squared and REML score). The train() model explains the variance very poorly compared with the direct gam() fit, even when I do not define any tuning or training parameters.

I wonder if caret is doing some feature selection on its own, because the formula shown for the train() result contains some levels of the categorical variables but not all of them. For example, if I have four categorical explanatory variables a, b, c and d, the formula for the direct gam() is

y ~ a+b+c+d 

but for train() it is (missing many levels, such as b2, b3 and d2)

`a1+a2+b1+b4+c1+c2+d1+d3`

I am not sure whether I should use the caret results, because I am clearly missing something here. If not caret, which other package can I use to get the cross-validated accuracy and AUC of my gam model? (cvgam seems to have been removed from the repository.)

I am sorry that I am not able to share my data. My data set is rather small (n = 330) with some imbalance (loss = 90, no loss = 240). Here is an example of the code I am using for each method:

  1. gam() directly from mgcv:

    library(mgcv)
    gam_train <- gam(Loss_tiger ~ s(tot_liv) + graze_forest + Commun_code + Att_Tigr,
                     data = train3, family = binomial, method = "REML")

  2. using the train function:

    library(caret)
    gam_train_cv <- train(Loss_tiger ~ tot_liv + graze_forest + Commun_code + Att_Tigr,
                          data = train3, method = "gam", family = "binomial",
                          trControl = trainControl(method = "LOOCV", number = 1),
                          tuneGrid = data.frame(method = "REML", select = FALSE))
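
For comparison, cross-validation can also be written by hand around the direct gam() call, without caret. What follows is only a minimal sketch under stated assumptions, not a verified solution: the data frame `dat` is simulated (only the column names come from the code above), the factor levels are made up, the choice of 5 stratified folds and of `k = 5` as the basis size in `s()` is arbitrary, and the AUC is computed with the rank-sum (Mann-Whitney) formula instead of a dedicated package.

```r
library(mgcv)

set.seed(42)

# Simulated stand-in for train3 (hypothetical values and factor levels;
# only the column names are taken from the question).
n <- 330
dat <- data.frame(
  tot_liv      = rpois(n, lambda = 20),
  graze_forest = factor(sample(c("no", "yes"), n, replace = TRUE)),
  Commun_code  = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  Att_Tigr     = factor(sample(c("neg", "neu", "pos"), n, replace = TRUE))
)
lin <- -2 + 0.05 * dat$tot_liv + 0.8 * (dat$graze_forest == "yes")
dat$Loss_tiger <- rbinom(n, size = 1, prob = plogis(lin))

# Stratified fold assignment: spread each outcome class evenly over the folds,
# which matters for a small, imbalanced data set.
k_folds <- 5
fold <- integer(n)
for (cls in 0:1) {
  idx <- which(dat$Loss_tiger == cls)
  fold[idx] <- sample(rep_len(seq_len(k_folds), length(idx)))
}

# Out-of-fold predicted probabilities: each row is predicted by a model
# that was fitted without that row's fold.
oof <- numeric(n)
for (i in seq_len(k_folds)) {
  fit <- gam(Loss_tiger ~ s(tot_liv, k = 5) + graze_forest + Commun_code + Att_Tigr,
             data = dat[fold != i, ], family = binomial, method = "REML")
  oof[fold == i] <- as.numeric(
    predict(fit, newdata = dat[fold == i, ], type = "response")
  )
}

# Cross-validated accuracy at a 0.5 probability threshold.
cv_accuracy <- mean((oof > 0.5) == (dat$Loss_tiger == 1))

# Cross-validated AUC via the rank-sum (Mann-Whitney) statistic.
n1 <- sum(dat$Loss_tiger == 1)
n0 <- n - n1
cv_auc <- (sum(rank(oof)[dat$Loss_tiger == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

print(c(accuracy = cv_accuracy, auc = cv_auc))
```

Because the formula is written out explicitly, the smooth term and the factor predictors are treated the same way in every fold as in the direct gam() fit. With imbalanced classes, the 0.5 threshold used for accuracy is itself a choice that may need adjusting.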

Shawn Hemelstrand
  • https://stackoverflow.com/questions/41663516/caret-package-cross-validating-gam-with-both-smooth-and-linear-predictors you might wanna read this to see what method="gam" is doing – StupidWolf Jul 18 '20 at 17:03
  • I don't think you can use the spline on 1 variable. Most likely you have to write something to calculate the CV accuracy – StupidWolf Jul 18 '20 at 17:04
  • Hi, many thanks for your comments! Actually, since in the example code only tot_liv is continuous and the rest are categorical, I have added s() for it alone. I also tried checking the formula, but it is returning NULL. Sadly, the thread you suggested has also remained unresolved. – Manjari Malviya Jul 28 '20 at 11:56

0 Answers