I'm trying to fit an xgbTree model using the train() function from the caret package.

EDIT: Here is a sample dataset to make the example reproducible. I've also converted all variables to numeric, as suggested:

df<-data.frame(
x1=c(-231,5,-166,-158,170,-243,-184,25,-130,-209,453,-46,-13,-247,-74,-209,-130,-118,10,40),
x2=c(2,48,6,7,24,2,5,7,12,48,48,24,2,8,4,1,8,5,50,6),
x3=c(6, 3, 2, 1, 2, 6, 0, 6, 2, 4, 5, 5, 2, 4, 1, 2, 3, NA, 0, 1),
x4=c(0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 3, 0, 0, 0, 0, 0, 0, 1, 0, 0),
x5=c(45.1, 58.6, 41.3, 58.6, 45.1, 60.8, 44.1, 58.6, 38.8, 40.5, 60.8, 45.1, 41.3, 45.1, 41.3, 45.1, 39, 41.3, 51.7, 51.7),
x6=c(0, 2, 4, 0, NA, 0, 1, 0, NA, 0, 3, 0, 0, 0, 0, 0, 0, NA, 0, 0),
x7=c(NA, 6, 6, NA, 6, NA, 3, NA, 6, NA, 6, NA, NA, NA, NA, NA, NA, 1, NA, NA),
x8=c(0, 1, 4, 0, 4, 0, 2, 0, 1, 0, 4, 0, 0, 0, 0, 0, 0, 1, 0, 0),
x9=c(0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
x10=c(NA, NA, NA, NA, 0, NA, 0, NA, NA, NA, 0, NA, NA, NA, NA, NA, NA, NA, NA, NA),
y=c(0.00272609554964902, 0.00196386488609584, 0.0169606512890095, 0, 0.00978263953223331, 0.00310850075796128, 0.0225595119926366, 0.00456053067993367, 0.00980320074504326, 0.0116718460483506, 0.0618914994405961, 0.0420972062763108, 0.00139303482587065, 0.0426927149151269, 0.0248756218905473, 0, 0, 0.000855672497463542, 0.0287026406429392, 0.00190374657325617))

When I'm using the formula interface everything works fine:

EDIT: used libraries added

library(caret)
library(doParallel)

registerDoParallel(cores = n)  # n is a placeholder for the number of cores

xgb_model <- train(y ~ .,
                   data = df,
                   method = "xgbTree",
                   na.action = na.pass)

But the model training fails when I'm using the non-formula interface:

xgb_model <- train(x = df[, -ncol(df)],
                   y = df[, ncol(df)],
                   data = df,
                   method = "xgbTree",
                   na.action = na.pass)

I've already tried omitting all NAs, as well as using only specific variables to narrow down the problem, but I couldn't find any issues with the input data.

The actual data.frame looks like this:

'data.frame':   433 obs. of  30 variables:
 $ x1      : int  -231 5 -166 -158 170 -243 -184 25 -130 -209 ...
 $ x2      : int  2 48 6 7 24 2 5 7 12 48 ...
 $ x3      : Ord.factor w/ 7 levels "0"<"1"<"2"<"3"<..: 4 3 2 3 7 1 7 3 5 6 ...
 $ x4      : Ord.factor w/ 8 levels "0"<"1"<"2"<"3"<..: 1 2 2 1 2 1 2 1 2 2 ...
 $ x5      : num  45.1 58.6 41.3 58.6 45.1 60.8 44.1 58.6 38.8 40.5 ...
 $ x6      : int  0 2 4 0 NA 0 1 0 NA 0 ...
 $ x7      : Ord.factor w/ 6 levels "1"<"2"<"3"<"4"<..: NA 6 6 NA 6 NA 3 NA 6 NA ...
 $ x8      : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 1 2 5 1 5 1 3 1 2 1 ...
 $ x9      : Ord.factor w/ 5 levels "0"<"2"<"4"<"6"<..: 1 1 5 1 1 1 1 1 1 1 ...
 $ x10     : int  NA NA NA NA 0 NA 0 NA NA NA ...
 $ x11     : Ord.factor w/ 10 levels "0"<"2"<"4"<"5"<..: 7 5 1 5 4 4 9 7 5 8 ...
 $ x12     : Ord.factor w/ 32 levels "0"<"1"<"2"<"3"<..: 10 2 1 13 1 10 6 6 1 1 ...
 $ x13     : Ord.factor w/ 13 levels "0"<"0.7"<"1.4"<..: 1 1 1 8 1 1 13 6 1 6 ...
 $ x14     : Factor w/ 4 levels "1","2","3","4": 2 1 1 4 1 2 4 1 4 4 ...
 $ x15     : int  1 2 3 1 2 1 1 9 2 2 ...
 $ x16     : int  180 200 160 250 120 160 300 600 180 150 ...
 $ x17     : Ord.factor w/ 6 levels "1"<"2"<"3"<"4"<..: 2 6 5 3 2 2 1 3 2 2 ...
 $ x18     : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 4 1 2 3 5 3 4 4 5 5 ...
 $ x19     : num  366825 509200 353760 502500 306666 ...
 $ x20     : num  2 2 2 2 2.83 ...
 $ x21     : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 2 1 1 1 ...
 $ x22     : int  50 70 32 48 20 56 57 51 53 55 ...
 $ x23     : int  5 2 5 5 2 3 3 2 4 1 ...
 $ x24     : int  0 0 3 0 0 0 0 0 0 0 ...
 $ x25     : int  0 2 0 0 0 0 0 0 0 0 ...
 $ x26     : Factor w/ 3 levels "1","2","3": 3 3 3 3 3 3 3 3 3 3 ...
 $ x27     : Ord.factor w/ 5 levels "12"<"13"<"14"<..: NA NA 3 3 1 5 1 5 5 5 ...
 $ x28     : Ord.factor w/ 9 levels "4"<"6"<"7"<"8"<..: 7 7 2 NA 4 6 8 NA 4 9 ...
 $ x29     : num  -0.3211 -0.0462 -0.8133 0.3825 -0.5475 ...
 $ y       : num  0.00273 0.00196 0.01696 0 0.00978 ...
viktor_r
  • What exactly is the error you are getting? Please provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Sharing data from `str()` isn't helpful, perhaps use `dput()` instead. Do you have the variable `y` defined? Does `dataset[,y]` actually return the column of interest for you? – MrFlick Mar 21 '17 at 14:08
  • I've added a reproducible example. The error I'm getting from caret is not very helpful, so I hadn't added it here. The error is: 'Error in train.default(x = df[, -ncol(df)], y = df[, ncol(df)], : Stopping' – viktor_r Mar 22 '17 at 10:24

3 Answers

From the ?train help, referring to the train(x, y, ...) interface:

The predictors in x can be most any object as long as the underlying model fit function can deal with the object class.

The underlying xgb.DMatrix function (you can see all the functions in caret's xgboost wrapper from getModelInfo('xgbTree')) expects a numeric matrix as input, so you get an error. The formula train interface works because it uses model.matrix under the hood to convert your formula into a numeric matrix, including the encoding of factor variables. To use the (x,y) interface, you first have to convert your data.frame into a matrix. Either model.matrix or caret::dummyVars are popular options to help with that.
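As a rough illustration of the conversion step, here is a minimal base-R sketch. The data frame `toy` and the names `mf`, `x_mat` and `y_vec` are made up for this example and are not the asker's data. It assumes you want to keep rows containing NA, which is why the model frame is built with na.action = na.pass first (by default, rows with missing values are dropped).

```r
# Hypothetical toy data: a numeric column with an NA, an unordered factor,
# and the response.
toy <- data.frame(
  x1 = c(1.5, NA, 3.0, 4.2),
  x2 = factor(c("a", "b", "a", "c")),
  y  = c(0.1, 0.2, 0.3, 0.4)
)

# Build the model frame explicitly so that rows with NA are kept.
mf <- model.frame(y ~ ., data = toy, na.action = na.pass)

# Expand factors into dummy columns, then drop the intercept column.
x_mat <- model.matrix(attr(mf, "terms"), data = mf)[, -1, drop = FALSE]
y_vec <- model.response(mf)

# x_mat is now a numeric matrix with columns x1, x2b, x2c
```

The resulting `x_mat` and `y_vec` are the kind of objects that could then be passed as x and y to train().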

Of note: you have many ordinal factor variables in your data. Since trees can easily handle non-uniform intervals between ordinal levels, for tree-based non-parametric algorithms it is better to convert each ordinal factor into a single numeric column than to create multiple dummy variables from it.
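A small base-R sketch of that conversion follows. The data frame `toy` and the variable names are hypothetical. One thing to watch for: as.integer() on a factor gives the position of the level, not the label, so the two differ whenever the labels are not 1, 2, 3, ...

```r
# Hypothetical toy data with one ordered factor (including an NA).
toy <- data.frame(
  x3 = factor(c("0", "2", "4", "2", NA),
              levels = c("0", "2", "4"), ordered = TRUE),
  x5 = c(45.1, 58.6, 41.3, 58.6, 45.1)
)

# Level positions: 1 2 3 2 NA
rank_codes <- as.integer(toy$x3)

# Numeric labels (go through character first): 0 2 4 2 NA
label_values <- as.numeric(as.character(toy$x3))

# Replace every ordered factor by its integer rank, then build the matrix.
is_ord <- vapply(toy, is.ordered, logical(1))
toy[is_ord] <- lapply(toy[is_ord], as.integer)
x_mat <- as.matrix(toy)
```

For a tree-based model either encoding preserves the ordering, so the rank codes are usually enough; the label version only matters if the label values themselves carry meaning.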

  • I've tried converting my data set to numeric values and also converted the data into a matrix using 'as.matrix(sapply(df, as.numeric))', but I'm getting the same error. – viktor_r Mar 22 '17 at 10:22

The data= and na.action= parameters should only be used with the formula version of train(). This means you should use either

xgb_model <- train(y ~.,
                 data = df,
                 method = "xgbTree",
                 na.action = na.pass)

or

xgb_model <- train(x=df[,-ncol(df)],
                  y=df[,ncol(df)],
                  method = "xgbTree")
MrFlick

I've found out that the problem was not caused by any mistake in the train() command, but by the attempt to parallelize the model using registerDoParallel(cores=n) from the doParallel package. doParallel works fine with all other models I've tested so far in caret (namely treebag, cforest and gbm).

So the following code works fine, given that you don't use doParallel:

xgb_model <- train(x=df[,-ncol(df)],
                  y=df[,ncol(df)],
                  method = "xgbTree")
viktor_r