
I've used sklearn for machine learning modelling over the last couple of years and grew accustomed to what seems like a very logical and cohesive framework:

from sklearn.ensemble import RandomForestClassifier

# define a model
clf = RandomForestClassifier()

# fit the model to data
clf.fit(X, y)

# predict positive-class probabilities on a test set
preds = clf.predict_proba(X_test)[:, 1]

I'm now trying to learn some R and want to start doing some of the same things I was doing in sklearn. The first thing you notice coming from the sklearn world is how much the syntax differs across packages, which is understandable but inconvenient. caret seems like a nice solution to that problem, providing a common interface across the different R packages (e.g. randomForest, gbm, ...). I'm still puzzled by some of its default choices, though (e.g. train() seems to default to some sort of grid search). Also, caret seems to use plyr behind the scenes, which breaks some dplyr functions such as summarise. Since I do a lot of data manipulation with dplyr, that's a problem.

Can you help me figure out what caret's equivalent of sklearn's model/fit/predict_proba is? Also, is there a way to deal with the plyr/dplyr issue?


1 Answer


The caret equivalent of predict_proba is to change the type argument of predict (see ?predict.train) from its default, "raw", to "prob":

predict(model, data, type="prob")
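For the full model/fit/predict_proba sequence, here is a minimal sketch rather than a definitive recipe. It assumes the caret and randomForest packages are installed, and it uses a two-class subset of the built-in iris data purely for illustration; the names dat, train_df, test_df, and ctrl, and the value mtry = 2, are arbitrary choices of mine. By default train() resamples over several candidate tuning values, which is the grid-search-like behaviour; passing trainControl(method = "none") with a single-row tuneGrid should make it fit just one model.

```r
library(caret)  # assumes caret and randomForest are installed

# Two-class subset of the built-in iris data (illustrative only)
dat <- droplevels(subset(iris, Species != "setosa"))

set.seed(42)
idx      <- createDataPartition(dat$Species, p = 0.8, list = FALSE)
train_df <- dat[idx, ]
test_df  <- dat[-idx, ]

# method = "none" turns off resampling, so train() fits exactly one model;
# it then needs a single-row tuneGrid (mtry = 2 is an arbitrary choice)
ctrl <- trainControl(method = "none", classProbs = TRUE)
fit  <- train(Species ~ ., data = train_df,
              method    = "rf",
              trControl = ctrl,
              tuneGrid  = data.frame(mtry = 2))

# One probability column per class level; select the class of interest by name
probs <- predict(fit, newdata = test_df, type = "prob")
preds <- probs[, "virginica"]
```

Unlike predict_proba, the result is a data frame with named columns, so selecting by class name is less error-prone than selecting by position. The outcome has to be a factor for caret to treat the problem as classification.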

If you want to mix dplyr and plyr, the easiest way is to call the function explicitly with its namespace:

dplyr::summarise

or

plyr::summarise
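As a small illustration of the namespace-qualified call, the sketch below uses the built-in mtcars data; by_cyl and mean_mpg are names I made up for the example. It loads plyr explicitly to mimic a session in which plyr is attached alongside dplyr, which is an assumption about how the clash arises.

```r
library(plyr)   # attached here only to mimic the clash
library(dplyr)

# dplyr::summarise respects group_by() regardless of which package
# currently masks the bare name `summarise`
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  dplyr::summarise(mean_mpg = mean(mpg))
```

The result has one row per value of cyl. If the bare summarise resolves to plyr's version instead, the grouping is ignored and a single row comes back, which is the usual symptom of the conflict.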

If you have already tried predict(..., type="prob"), hit an error you didn't understand, and gave up, I would recommend reading this thread: Predicting Probabilities for GBM with caret library
