1

Having recently completed DataCamp's course "Machine Learning Toolbox", I wanted to apply something I learned: caret can impute missing values using the argument preProcess = "medianImpute".

If I run table(complete.cases(df)) I get:

FALSE  TRUE 
24429  6042

So I'll need to do something with missing values. The video made it look so simple!

mod.lm.medians <- train(target ~., 
                data = train, 
                trControl = train_control,
                method = "lm",
                preProcess = "medianImpute")

Gives:

Error in na.fail.default(list(target = c(5850000L, 6000000L, 5700000L, : missing values in object

I found another SO answer here which told me to try na.action = na.exclude. That lets my model run, but only on the complete cases, which is not what I want.

Is my understanding of caret's preProcess parameter incorrect? I expected that missing values would be replaced with the median of the corresponding feature for each observation in df. Instead I got this error.

Doug Fir
  • To investigate, you could run the preProcess separately with something like `predict(preProcess(train, method = c("medianImpute")), train)`. The reason for the two steps is that the `preProcess` is learned from the train set but also needs to be applied to the test set. Here we just reapply it to train to see the effect – Andrew Lavers Apr 30 '17 at 03:18
  • Hi, I typed that into the console and it runs. I don't follow though! Please ELI5 and use crayons and lego if possible. I.e. I don't know what I just ran. How could I apply this to my model? – Doug Fir Apr 30 '17 at 03:27
  • Here is [a pretty good article on caret pre processing](http://machinelearningmastery.com/pre-process-your-dataset-in-r/). It is a little repetitive for each kind of transformation. – Andrew Lavers Apr 30 '17 at 03:34
  • Thanks for the link. I actually stumbled across it while googling for solutions. Will read through. Hoping someone that recognises this caret error sees this too and can comment – Doug Fir Apr 30 '17 at 03:56
  • I don't know if you found any solution to this question. I was trying the exact same thing (datacamp) with practicing on a real dataset and got the same error. – User2321 Oct 17 '17 at 20:55
  • @User2321 it was a while ago now, but if my memory serves, what I ended up doing was creating a separate training data set before training and cross validation, then training with that. So I applied medians to missing values manually before passing to caret. Not a solution but a workaround. If you figure it out please do share – Doug Fir Oct 18 '17 at 14:08
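
To make the two-step approach from the comments concrete, here is a minimal sketch on a made-up data frame (`toy` and its columns are hypothetical, not the asker's data). It assumes the caret package is installed and only illustrates what `preProcess` followed by `predict` does with `method = "medianImpute"`:

```r
library(caret)

# Hypothetical toy data with missing values in the predictors
toy <- data.frame(
  target = c(10, 20, 30, 40, 50),
  x1     = c(1, NA, 3, 4, 100),
  x2     = c(NA, 2, 2, 8, 10)
)

# Step 1: learn the column medians from the training data
pp <- preProcess(toy, method = "medianImpute")

# Step 2: apply the learned medians; the same `pp` object could later be
# applied to a held-out test set with predict(pp, test)
toy_imputed <- predict(pp, toy)

# No missing values should remain in the numeric columns
anyNA(toy_imputed)
```

Each NA is filled with the median of the observed values in its own column, so an outlier such as the 100 in `x1` has little influence on the filled-in value.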

2 Answers

0

I am having the exact same issue. I took the datacamp course called "Machine Learning Toolbox" and it was working on their console, but it does not work on mine.

BData
  • Let me know if you figure it out. As I recall, I imputed the medians before passing to caret (which is actually more efficient if you are experimenting with several algorithms, since you don't want R to impute the medians each time it moves on to the next algorithm). Not sure if it will work, but in the past I found that switching away from the caret formula interface sometimes magically removes issues: change ```target ~ var1 + var2``` to ```x = select(df, var1, var2), y = df$target```. Not sure if that will make a difference or not though – Doug Fir Jan 04 '18 at 15:54
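
As a sketch of the non-formula interface mentioned in the comment above, the following passes the predictors as `x` and the outcome as `y`. The data are simulated and every name (`sim`, `x1`, `x2`, `ctrl`, `fit_xy`) is made up for illustration; it assumes caret is installed. With this interface the data are not run through the formula machinery's NA check first, so the imputation requested via `preProcess` gets a chance to run:

```r
library(caret)

set.seed(1)
n <- 60

# Simulated data (hypothetical): outcome built before any NAs are injected
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$target <- 3 * sim$x1 - 2 * sim$x2 + rnorm(n)

# Knock out some predictor values
sim$x1[sample(n, 10)] <- NA

ctrl <- trainControl(method = "cv", number = 5)

# x/y interface: predictors in x, outcome in y
fit_xy <- train(
  x = sim[, c("x1", "x2")],
  y = sim$target,
  method = "lm",
  trControl = ctrl,
  preProcess = "medianImpute"
)

fit_xy
```

Because the imputation is part of the `train` call, the medians are re-estimated inside each resampling fold rather than once on the full data set.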
0

I believe the issue is with using a formula together with the preProcess argument. Try breaking out the preprocessing first and then train:

# First, learn the medians from the training data and impute
preproc <- preProcess(train, method = "medianImpute")
trainPreProc <- predict(preproc, train)

# Then train on the imputed data; no preProcess argument is needed here
mod.lm.medians <- train(target ~ .,
                data = trainPreProc,
                trControl = train_control,
                method = "lm")
Brian
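
A further option that keeps the formula interface, offered as a sketch rather than as the method from either answer: the error in the question comes from `na.fail`, which is the default `na.action` when `train` is called with a formula. Passing `na.action = na.pass` lets the rows with missing values through so that `preProcess = "medianImpute"` can fill them in. The data below are simulated and all names are hypothetical; it assumes caret is installed:

```r
library(caret)

set.seed(42)
n <- 60

# Simulated data (hypothetical): outcome built before any NAs are injected
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$target <- 3 * sim$x1 - 2 * sim$x2 + rnorm(n)
sim$x1[sample(n, 10)] <- NA

ctrl <- trainControl(method = "cv", number = 5)

# Formula interface: na.pass hands the NAs on to the preProcess step
fit_formula <- train(
  target ~ .,
  data = sim,
  method = "lm",
  trControl = ctrl,
  preProcess = "medianImpute",
  na.action = na.pass
)

fit_formula
```

Note that median imputation applies to numeric predictors only, so missing values in factor columns would still need separate handling.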