
I am aware of the question GBM: Object 'p' not found; however, it did not contain enough information for the community to answer it. I don't believe this is a duplicate, as I've followed what was indicated in that question and in its linked duplicate, Error in R gbm function when cv.folds > 0, which does not describe the same error.

I have been sure to follow the recommendation of leaving out any columns that were not used in the model.

This error appears whenever cv.folds is greater than 0: object 'p' not found

From what I can see, setting cv.folds to 0 does not produce meaningful output either. I have attempted different distributions, fractions, trees, etc. I'm confident I've parameterized something incorrectly, but I can't for the life of me see what it is.

Model and output:

library(gbm)

model_output <- gbm(formula = ign ~ . , 
                  distribution = "bernoulli",
                  var.monotone = rep(0, 9),
                  data = model_sample,
                  train.fraction = 0.50,
                  n.cores = 1,
                  n.trees = 150,
                  cv.folds = 1,
                  keep.data = TRUE,
                  verbose = TRUE)
Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1           nan             nan     0.1000       nan
     2           nan             nan     0.1000       nan
     3           nan             nan     0.1000       nan
     4           nan             nan     0.1000       nan
     5           nan             nan     0.1000       nan
     6           nan             nan     0.1000       nan
     7           nan             nan     0.1000       nan
     8           nan             nan     0.1000       nan
     9           nan             nan     0.1000       nan
    10           nan             nan     0.1000       nan
    20           nan             nan     0.1000       nan
    40           nan             nan     0.1000       nan
    60           nan             nan     0.1000       nan
    80           nan             nan     0.1000       nan
   100           nan             nan     0.1000       nan
   120           nan             nan     0.1000       nan
   140           nan             nan     0.1000       nan
   150           nan             nan     0.1000       nan
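
For reference, an all-nan deviance trace like the one above can also be produced by a degenerate response (see PKumar's comment below); a quick sanity check, assuming model_sample is the data frame passed in the call above:

table(model_sample$ign)  # a "bernoulli" fit needs both 0s and 1s in the response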

A minimal dataset that reproduced the error used to be here; however, once the suggestion by @StupidWolf is employed, it is too small. The suggestion below will get past the initial error. Subsequent errors are occurring, and solutions will be posted here upon discovery.

Badger
  • There are no zeroes in your data; only 1s are present in the ign column. Please check: across the 100 records, every value of the ign column is 1. – PKumar Feb 19 '20 at 16:37
  • Hmm, I have never seen this kind of error with gbm. Do you mind sharing the data somehow? I tried using your example and randomised the ign column to 0s and 1s, and it runs OK. – StupidWolf Feb 19 '20 at 17:25
  • PKumar and StupidWolf, thank you for your input. I've updated the data to include some real 0's; it's now a balanced dataset, apologies there. StupidWolf, I'm curious how you ended up getting it to run. I've ensured I have 0's now and I am still encountering the error. – Badger Feb 19 '20 at 18:11
  • I'm wondering if you need to scale the predictors. You have one that is really small. – IRTFM Feb 19 '20 at 20:45
  • @Badger, you need to set cv.folds > 1. See below for the explanation. – StupidWolf Feb 19 '20 at 22:26

1 Answer


gbm is not meant to deal with the situation where someone sets cv.folds = 1. By definition, k-fold cross-validation means splitting the data into k parts, training on k-1 parts, and testing on the remaining one, so it is not clear what 1-fold cross-validation would even mean. If you look at the code for gbm, at line 437:

  if (cv.folds > 1) {
    cv.results <- gbmCrossVal(cv.folds = cv.folds, nTrain = nTrain,
    ....
    p <- cv.results$predictions
  }

That branch makes the predictions, and when the results are collected into the gbm object, at line 471:

  if (cv.folds > 0) { 
    gbm.obj$cv.fitted <- p 
  }

So if cv.folds == 1, p is never calculated, but cv.folds is still > 0, hence you get the error.
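
A minimal sketch of that control flow (not the actual gbm source) makes the failure mode easy to see: p is only assigned when the folds exceed 1, but it is read whenever they exceed 0.

cv_folds <- 1
if (cv_folds > 1) {
  p <- "cross-validated predictions would be computed here"
}
if (cv_folds > 0) {
  print(p)  # with cv_folds = 1 this stops with: object 'p' not found
}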

Below is a reproducible example:

library(MASS)
library(gbm)

test <- Pima.tr
test$type <- as.numeric(test$type) - 1  # recode the factor response to 0/1 for "bernoulli"

model_output <- gbm(type~ . , 
                  distribution = "bernoulli",
                  var.monotone = rep(0,7),
                  data = test,
                  train.fraction = 0.5,
                  n.cores = 1,
                  n.trees = 30,
                  cv.folds = 1,
                  keep.data = TRUE,
                  verbose=TRUE)

This gives me the error object 'p' not found.

Set it to cv.folds = 2, and it runs smoothly:

model_output <- gbm(type~ . , 
                  distribution = "bernoulli",
                  var.monotone = rep(0,7),
                  data = test,
                  train.fraction = 0.5,
                  n.cores = 1,
                  n.trees = 30,
                  cv.folds = 2,
                  keep.data = TRUE,
                  verbose=TRUE)
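
Once cross-validation actually runs, the CV error curve can be used to pick the number of trees. A short follow-up sketch using the model_output fitted above (gbm.perf and predict are part of the gbm package):

best_iter <- gbm.perf(model_output, method = "cv", plot.it = FALSE)
preds <- predict(model_output, newdata = test, n.trees = best_iter,
                 type = "response")
head(preds)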
StupidWolf
  • I was going down the same line of inquiry myself, poking at the subroutines to see where things were tipping over. Thank you for the clarification. I have a new issue in that my system doesn't seem to like the way the clusters are being set up and it is crashing R, but that's a different problem. Thank you again! – Badger Feb 21 '20 at 16:35
  • Cool, you're welcome :) Should have figured it out sooner. Are you using gbm in parallel, with quite a few cores? It can be quite unstable with many threads. – StupidWolf Feb 22 '20 at 16:13
  • Strangely enough, no, only a single core is being declared. Interestingly, your example works without issue. I'm going to go out on a limb and say I have some issue within my dataset; when I tease it out, I'll post back here to attempt to help subsequent users! – Badger Feb 24 '20 at 15:28
  • And my own stupidity has gotten the best of me. I had a factor response in the data, which gbm apparently doesn't like very much, so it crashed the whole thing rather than warning about the issue. – Badger Feb 24 '20 at 15:51
  • Cool :) Glad it works for you eventually. I made that mistake sometimes when I first worked with it. Maybe what helped was also using xgboost, which requires this kind of 0/1 response, hence I got used to it. – StupidWolf Feb 24 '20 at 16:10
  • Worked is a relative term ;). Failing differently is more accurate: it's not achieving any predictions, which says to me there is a convergence problem, meaning, once again, my data is bunk haha. – Badger Feb 24 '20 at 16:13