
I am attempting to build a model to predict whether a product will be sold on an ecommerce website, with an output of 1 or 0.

My data consists of a handful of categorical variables (one with a large number of levels), a couple of binary variables, and one continuous variable (the price), with an output variable of 1 or 0 indicating whether or not the product listing sold.

This is my code:

inTrainingset<-createDataPartition(C$Sale, p=.75, list=FALSE)
CTrain<-C[inTrainingset,]
CTest<-C[-inTrainingset,]


gbmfit<-gbm(Sale~., data=C, distribution="bernoulli", n.trees=5, interaction.depth=7, shrinkage=.01)
plot(gbmfit)


gbmTune<-train(Sale~.,data=CTrain, method="gbm")


ctrl<-trainControl(method="repeatedcv",repeats=5)
gbmTune<-train(Sale~.,data=CTrain, 
           method="gbm", 
           verbose=FALSE, 
           trControl=ctrl)


ctrl<-trainControl(method="repeatedcv", repeats=5, classProbs=TRUE, summaryFunction=twoClassSummary)
gbmTune<-trainControl(Sale~., data=CTrain, 
                  method="gbm", 
                  metric="ROC", 
                  verbose=FALSE , 
                  trControl=ctrl)



grid<-expand.grid(.interaction.depth=seq(1,7, by=2), .n.trees=seq(100,300, by=50), .shrinkage=c(.01,.1))

  gbmTune<-train(Sale~., data=CTrain, 
           method="gbm", 
           metric="ROC", 
           tunegrid= grid, 
           verebose=FALSE,
           trControl=ctrl)



  set.seed(1)
  gbmTune <- train(Sale~., data = CTrain,
               method = "gbm",
               metric = "ROC",
               tuneGrid = grid,
               verbose = FALSE,
               trControl = ctrl)

I am running into two issues. The first is that when I attempt to add summaryFunction=twoClassSummary and then tune, I get this:

Error in trainControl(Sale ~ ., data = CTrain, method = "gbm", metric = "ROC",  : 
  unused arguments (data = CTrain, metric = "ROC", trControl = ctrl)

The second problem, if I bypass the summaryFunction, is that when I try to run the model I get this error:

Error in evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels,  : 
  train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()
In addition: Warning message:
In train.default(x, y, weights = w, ...) :
  cannnot compute class probabilities for regression

I tried changing the output variable from a numeric value of 1 or 0 to a text value in Excel, but that didn't make a difference.

Any help would be greatly appreciated, either on fixing the fact that the model is being interpreted as a regression, or on the first error message I am encountering.

Best,

Will will@nubimetrics.com

  • Please check out [how to make a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). You've included a bunch of code but no sample data so we cannot run it to reproduce the same error. This makes it much harder to help you. – MrFlick Oct 15 '14 at 18:56
  • @WillBunker it's pretty close to reproducible if you could just use one of the built in datasets, verify that your errors still exist with that data, and let us know which one to use. You can run `data()` to see the datasets in `caret` like `GermanCredit` – Hack-R Oct 15 '14 at 19:08
  • Okay cool. I'll run it with the GermanCredit and try that. Thank you for the consideration. – Will Bunker Oct 15 '14 at 19:11
  • @WillBunker cool! I am doing the same thing. By the way, what's `gbm`? It's not from the package `caret`. UPDATE: ah, i see it's from the package `gbm` and appears to be gradient boosting – Hack-R Oct 15 '14 at 19:14

2 Answers


Your outcome is:

Sale = c(1L, 0L, 1L, 1L, 0L)

Although gbm expects it this way, it is a pretty unnatural way to encode the data. Almost every other function uses factors.

So if you give train numeric 0/1 data, it thinks that you want to do regression. If you convert this to a factor and use "0" and "1" as the levels (and if you want class probabilities), you should have seen a warning that says "At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to...". That is not an idle warning.

Use factor levels that are valid R variable names and you should be fine.
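As a minimal base-R sketch of that conversion (the column name `Sale` and the labels `NoSale`/`Sold` here are hypothetical, for illustration):

```r
# Recode a numeric 0/1 outcome as a factor whose levels are valid
# R variable names, so classProbs = TRUE / twoClassSummary can work.
# "Sale", "NoSale", and "Sold" are hypothetical names for illustration.
C <- data.frame(Sale = c(1, 0, 1, 1, 0), Price = c(10, 25, 5, 40, 12))
C$Sale <- factor(C$Sale, levels = c(0, 1), labels = c("NoSale", "Sold"))
levels(C$Sale)  # "NoSale" "Sold"
```

With levels like these, train will treat the problem as two-class classification rather than regression.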

Max

topepo
  • Hey Max, thanks for the input. I'm not sure why the data ended up showing up like that after using the dp(head). The data I am actually running the caret package on is coded as Sale:0/1, but I think my issue, as Hack-R pointed out, was thinking I could use trainControl as train. – Will Bunker Oct 16 '14 at 14:40

I was able to reproduce your error using the data(GermanCredit) dataset.

Your error comes from using trainControl as if it were gbm or train.

If you check out the related documentation with ?trainControl, you will see that it expects input quite different from what you're giving it.

This works:

require(caret)
require(gbm)
data(GermanCredit)

# Your dependent variable was Sale and it was binary
#   in place of Sale I will use the binary variable Telephone 

C      <- GermanCredit
C$Sale <- GermanCredit$Telephone

inTrainingset<-createDataPartition(C$Sale, p=.75, list=FALSE)
CTrain<-C[inTrainingset,]
CTest<-C[-inTrainingset,]
set.seed(123)
seeds <- vector(mode = "list", length = 51)
for(i in 1:50) seeds[[i]] <- sample.int(1000, 22)

gbmfit<-gbm(Sale~Age+ResidenceDuration, data=C,
            distribution="bernoulli", n.trees=5, interaction.depth=7, shrinkage=.01)
plot(gbmfit)


gbmTune<-train(Sale~Age+ResidenceDuration,data=CTrain, method="gbm")


ctrl<-trainControl(method="repeatedcv",repeats=5)
gbmTune<-train(Sale~Age+ResidenceDuration,data=CTrain, 
               method="gbm", 
               verbose=FALSE, 
               trControl=ctrl)


ctrl<-trainControl(method="repeatedcv", repeats=5, classProbs=TRUE, summaryFunction=twoClassSummary)

# gbmTune<-trainControl(Sale~Age+ResidenceDuration, data=CTrain, 
#                       method="gbm", 
#                       metric="ROC", 
#                       verbose=FALSE , 
#                       trControl=ctrl)

gbmTune <- trainControl(method = "adaptive_cv", 
                      repeats = 5,
                      verboseIter = TRUE,
                      seeds = seeds)

grid<-expand.grid(.interaction.depth=seq(1,7, by=2), .n.trees=seq(100,300, by=50), .shrinkage=c(.01,.1))

gbmTune<-train(Sale~Age+ResidenceDuration, data=CTrain, 
               method="gbm", 
               metric="ROC", 
               tuneGrid=grid, 
               verbose=FALSE,
               trControl=ctrl)



set.seed(1)
gbmTune <- train(Sale~Age+ResidenceDuration, data = CTrain,
                 method = "gbm",
                 metric = "ROC",
                 tuneGrid = grid,
                 verbose = FALSE,
                 trControl = ctrl)

Depending on what you're trying to accomplish you may want to specify that a little differently, but it all boils down to the fact that you used trainControl as if it were train.
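If the goal is predicting Sale/NoSale on the held-out data, here is a hedged sketch (it reuses the fitted gbmTune model and the CTest split from the code above, so it is not standalone; type = "prob" requires classProbs = TRUE in the control object):

```r
# Sketch, assuming the fitted caret model `gbmTune` and the `CTest`
# split defined above are in scope, and Sale is a two-level factor.
library(caret)
predClass <- predict(gbmTune, newdata = CTest)                 # predicted classes
predProb  <- predict(gbmTune, newdata = CTest, type = "prob")  # class probabilities
confusionMatrix(predClass, CTest$Sale)                         # accuracy, sensitivity, etc.
```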

Hack-R
  • Okay, I guess I was stupid to assume the example I was working off of was similar enough so I could replicate the same problem with my dataset. Thank you very much. When you say re-specify, do you have any suggestions moving forward? The answer to my question is simple, predict product listing outcome: Sale or No Sale, execution is a lot more complicated naturally. Thanks again for the help. – Will Bunker Oct 15 '14 at 19:29
  • @WillBunker You're welcome. By respecify I meant that you could set the options to your liking, such as the method, and choose your X variables. What I have there should work, but I don't have your original dataset so I don't really know which choices are ideal. To do the prediction you're talking about you just need the regression that's being specified in the `train` statements. So take your fitted models and apply the coefficients to the validation data to get predictions. With regression models this can be done with `predict`, `prediction`, `predictOMatic` etc. Gave you +1 for the question. – Hack-R Oct 15 '14 at 20:04
  • Right, I've used the predict function before, not in caret, but I'm assuming it runs in a similar fashion to other r packages. Thanks for clarifying. – Will Bunker Oct 15 '14 at 20:10
  • @WillBunker Np. This may help for the caret-specific prediction: http://www.inside-r.org/packages/cran/caret/docs/extractPrediction – Hack-R Oct 15 '14 at 20:28
  • Thanks for the resource! I am getting an "unused argument" error when I run it with `seeds=seeds`. – Will Bunker Oct 15 '14 at 20:52
  • @WillBunker You're welcome. Are you saying that you put `seeds=seeds` into your `predict` statement? I don't see it listed as an option in that link so I would expect it to be an unused argument. I might be misunderstanding something though? – Hack-R Oct 15 '14 at 20:55
  • No, sorry I should have clarified. When I try and run the code you pasted above, I get the error of "unused argument, seeds=seeds", with the German dataset. I can show you on Skype quickly if that's easier. – Will Bunker Oct 15 '14 at 20:59
  • @WillBunker I would take you up on the Skype idea but I'm about to depart from my current location. I see the problem, it's my fault. You also need to define `seeds` such as `set.seed(123) seeds <- vector(mode = "list", length = 51) for(i in 1:50) seeds[[i]] <- sample.int(1000, 22)` – Hack-R Oct 15 '14 at 21:13
  • Not a problem, I'll try this out and try and work through it. Thanks for sticking with me. I am new to R and the learning curve is killing me right now. – Will Bunker Oct 15 '14 at 21:14
  • @WillBunker No worries. R definitely gets easier with practice. Feel free to upvote my solution if it was helpful :) – Hack-R Oct 15 '14 at 21:16
  • As soon as my rep gets to 15, I'll definitely do it. – Will Bunker Oct 15 '14 at 21:19
  • @WillBunker ah, fair enough! – Hack-R Oct 15 '14 at 21:22