
Having walked through several tutorials, I have managed to write a script that successfully uses XGBoost to predict categorical prices on the Boston housing dataset.

However, I cannot successfully tune the model's parameters using CV, even after trying several solutions from tutorials and postings here on Stack Overflow.

My best outcome so far is very 'hacky' and only tunes a single parameter:

steps <- seq(75, 90, 5) / 100
for (i in steps) {
  .....
}
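
For reference, a runnable version of such a loop might look like the sketch below; the parameter being tuned (subsample), the multi-class objective, and the dtrain object are my assumptions, since the loop body above is elided.

library(xgboost)

# dtrain is assumed to be an existing xgb.DMatrix with a 3-class label
steps <- seq(75, 90, 5) / 100
cv_errors <- numeric(length(steps))

for (i in seq_along(steps)) {
  cv <- xgb.cv(params = list(objective = "multi:softmax",
                             num_class = 3,
                             subsample = steps[i]),
               data = dtrain, nrounds = 50, nfold = 5,
               verbose = FALSE)
  # keep the best mean CV error seen for this subsample value
  cv_errors[i] <- min(cv$evaluation_log$test_merror_mean)
}

steps[which.min(cv_errors)]  # subsample value with the lowest CV error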

But I see all of these fancy setups that run through several parameters automatically using mlr, caret, or NMOF. However, I haven't gotten any of them to work on these data. I suspect that it is because most are set up for binary classification, but even when I address this as best I can, I have no success. I could provide you with hundreds of lines of code that do not work, but I think it is easiest to post my code as far as it works, and to hear how you would proceed from there, rather than getting swamped in my poor code.

Edit: As I have not had any success even running other people's scripts, here are some additional details:

 > packageVersion("mlr")
 ‘2.11’

 > packageVersion("xgboost")
 ‘0.6.4.1’
  • You can find a short answer here on how to tune xgboost parameters: https://stackoverflow.com/questions/34469038/understanding-python-xgboost-cv/47795090#47795090 – Eran Moshe Mar 01 '18 at 11:26
  • Thank you for your answer. If I read this correctly, it is a Python rather than an R package. I will see if I can work it out, though. – Simon Hviid Del Pin Mar 01 '18 at 11:29
  • Maybe try this tutorial. I wrote it almost a year ago using the mlr package: https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/ – YOLO Mar 01 '18 at 11:30
  • Maybe you get better results using mlrMBO (see https://github.com/mlr-org/mlrMBO)? @ManishSaraswat already posted a good tutorial on how to tune with mlr. Another short code example can be found here: https://www.kaggle.com/casalicchio/tuning-with-mlr – Giuseppe Mar 01 '18 at 11:40
  • Thank you for your response, Manish. I have been working with that exact tutorial and it was very helpful, but I encountered an error that I could not overcome. Maybe you have some idea what is wrong? I get the first error at line 95, running traintask <- makeClassifTask(data = train, target = "target"): Warning in makeTask(type = type, data = data, weights = weights, blocking = blocking, : Provided data is not a pure data.frame but from class data.table, hence it will be converted. [code](https://pastebin.com/NPjvP5Ju) – Simon Hviid Del Pin Mar 01 '18 at 11:47
  • Giuseppe, thanks for your response. When I run the script verbatim, I get the following error: '00001: Error in checkLearnerBeforeTrain(task, learner, weights) : Task 'train' has factor inputs in 'season, holiday, workingday, weather, hour, wee...', but learner 'regr.xgboost' does not support that!' – Simon Hviid Del Pin Mar 01 '18 at 12:07
  • Convert them to numeric variables. You can make several binary variables out of one categorical variable, indicating whether a specific category is present or not (see the sketch after these comments). – PhilippPro Mar 01 '18 at 12:18
  • But why is there specifically a script that converts the features into factors, if this is not supported by xgboost? Apparently even the outcome variable is a factor in this case (as far as I understand it correctly). – Simon Hviid Del Pin Mar 01 '18 at 12:43
  • @Simon Hviid Del Pin Could you provide details on your data set? How many features, how many categorical features, what is the target variable, and what are the dimensions of the train set? I find it very odd that you haven't found any code that works for you; xgboost is super popular and there are hundreds of tutorials. If none of them works, what is the chance that some code anyone posts here would? The problem is most likely in your data set and not in the tutorials; unfortunately, no one can help without seeing the data. – missuse Mar 01 '18 at 13:17
  • @missuse I cannot get this parameter tuning to work currently on ANY data set or with any code I have tried thus far. The data that I am trying to work out is the Boston {MASS} dataset, in which I am trying to predict one of 3 arbitrary price ranges ("cheap", "medium", "high"); a sketch of this setup follows below. Can you not reproduce everything from the code posted in my main post? – Simon Hviid Del Pin Mar 01 '18 at 14:09
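
To make that setup reproducible, here is a minimal sketch of the three-class Boston problem in mlr; the cut() breakpoints are illustrative assumptions, not the original poster's exact splits:

library(mlr)
library(MASS)

data(Boston)

# if your data arrives as a data.table, convert it first to avoid the
# warning quoted above: train <- as.data.frame(train)
boston <- Boston

# discretise the median house value into 3 arbitrary price ranges
boston$price <- cut(boston$medv, breaks = 3,
                    labels = c("cheap", "medium", "high"))
boston$medv <- NULL

# xgboost only accepts numeric features: one-hot encode any factor
# columns (the target itself may remain a factor for classification)
boston <- createDummyFeatures(boston, target = "price")

tsk <- makeClassifTask(data = boston, target = "price")
lrn <- makeLearner("classif.xgboost", nrounds = 10)
resample(lrn, tsk, cv5, acc)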

1 Answer


First, update mlr and the other required packages.
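
For example (a minimal sketch; the exact package set is an assumption):

install.packages(c("mlr", "mlrMBO", "mlbench", "xgboost"))

Then consider the quickstart example from the mlr cheatsheet: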

library(mlr)
#> Loading required package: ParamHelpers
library(mlbench)
data(Soybean)

set.seed(180715)

soy = createDummyFeatures(Soybean, target = "Class")
tsk = makeClassifTask(data = soy, target = "Class")
ho = makeResampleInstance("Holdout", tsk)
tsk.train = subsetTask(tsk, ho$train.inds[[1]])
tsk.test = subsetTask(tsk, ho$test.inds[[1]])

lrn = makeLearner("classif.xgboost", nrounds=10)
#> Warning in makeParam(id = id, type = "numeric", learner.param = TRUE, lower = lower, : NA used as a default value for learner parameter missing.
#> ParamHelpers uses NA as a special value for dependent parameters.

cv = makeResampleDesc("CV", iters=5)
res = resample(lrn, tsk.train, cv, acc)
#> Resampling: cross-validation
#> Measures:             acc
#> [Resample] iter 1:    0.9010989
#> [Resample] iter 2:    0.9230769
#> [Resample] iter 3:    0.9120879
#> [Resample] iter 4:    0.9230769
#> [Resample] iter 5:    0.9450549
#> 
#> Aggregated Result: acc.test.mean=0.9208791
#> 

# Tune hyperparameters
ps = makeParamSet(makeNumericParam("eta", 0, 1),
                  makeNumericParam("lambda", 0, 200),
                  makeIntegerParam("max_depth", 1, 20)
)
tc = makeTuneControlMBO(budget = 100)
tr = tuneParams(lrn, tsk.train, cv5, acc, ps, tc)
#> [Tune] Started tuning learner classif.xgboost for parameter set:
#>              Type len Def   Constr Req Tunable Trafo
#> eta       numeric   -   -   0 to 1   -    TRUE     -
#> lambda    numeric   -   - 0 to 200   -    TRUE     -
#> max_depth integer   -   -  1 to 20   -    TRUE     -
#> With control class: TuneControlMBO
#> Imputation value: -0
#> [Tune-x] 1: eta=0.529; lambda=194; max_depth=18
#> [Tune-y] 1: acc.test.mean=0.7846154; time: 0.0 min

# /... output truncated .../

#> [Tune-x] 100: eta=0.326; lambda=0.0144; max_depth=19
#> [Tune-y] 100: acc.test.mean=0.9340659; time: 0.0 min
#> [Tune] Result: eta=0.325; lambda=0.00346; max_depth=20 : acc.test.mean=0.9450549

lrn = setHyperPars(lrn, par.vals = tr$x)

# Evaluate performance
mdl = train(lrn, tsk.train)
prd = predict(mdl, tsk.test)
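
# (addition, not part of the original answer:) report the accuracy
# on the held-out test set at this point
performance(prd, acc)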

# Final model
mdl = train(lrn, tsk)

More explanations can be found in the cheatsheet (use the .pptx version if you want to copy not only the code but also the descriptions).

– GegznaV