
Following up on "Pass rows of a data frame as arguments to a function in R with column names specifying the arguments":

I want to train the following model with different combinations of parameters:

library(xgboost)
library(Matrix)

df <- data.frame(y = sample(0:1, 1000, replace = TRUE),
                 a = rnorm(1000),
                 b = rnorm(1000),
                 c = rnorm(1000),
                 d = rnorm(1000))

train <- sparse.model.matrix(object = y~.-1, data = df)

model <- xgboost(data = train,
                 label = df$y,
                 # parameters
                 nrounds = 10, 
                 subsample = 0.5,
                 colsample_bytree = 0.8)

I created a grid with the parameters and I want to pass the rows of the grid into the xgboost function, while keeping data and label arguments constant.

param <- expand.grid(nrounds = c(10, 50, 100),
                     subsample = c(0.5, 0.8, 0.9),
                     colsample_bytree = c(0.8))

I would like to pass the arguments by column name (if matching by column name is not an option, matching by column order would also do), since this would make the call scale to different functions.

D Pinto

2 Answers


I had a similar problem and searched in vain until I found a solution in Hadley's Advanced R. It lets you pass parameters exactly as they appear in a data frame, using the column names as argument names. Read here:

https://adv-r.hadley.nz/functionals.html#pmap

So, here it is. There is a solution via purrr::pmap. It maps parameters onto a function:

(Figure from Hadley's Advanced R, 8.4.5)

This is my own code which I recently used along with quanteda to mess around with the Kaggle SMS Spam dataset. These are the possibilities for my parameters:

library(tibble)  # data_frame() is deprecated in favour of tibble()
tolower <- tibble(tolower = c(TRUE, FALSE))
stem <- tibble(stem = c(TRUE, FALSE))
remove_punct <- tibble(remove_punct = c(TRUE, FALSE))

This is a bonus and not necessary, but I found I needed all of the combinations of my parameters to run a Naive Bayes model. Thanks to Y J via this SO post:

expand.grid.df <- function(...) Reduce(function(...) merge(..., by=NULL), list(...))
parameters <- expand.grid.df(tolower, stem, remove_punct)

So, now my parameters look like this:

> parameters
  tolower  stem remove_punct
1    TRUE  TRUE         TRUE
2   FALSE  TRUE         TRUE
3    TRUE FALSE         TRUE
4   FALSE FALSE         TRUE
5    TRUE  TRUE        FALSE
6   FALSE  TRUE        FALSE
7    TRUE FALSE        FALSE
8   FALSE FALSE        FALSE
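As an aside, when every input is a single named vector like this, base `expand.grid()` should build the same grid without the `merge()` helper. This is only a sketch of that alternative, not the code I actually ran; the values and row order are expected to match the printout above, with the first column varying fastest.

```r
# Build all combinations directly from named vectors (base R only).
parameters <- expand.grid(tolower = c(TRUE, FALSE),
                          stem = c(TRUE, FALSE),
                          remove_punct = c(TRUE, FALSE))

# One row per combination: 2 * 2 * 2 = 8 rows.
nrow(parameters)
```

The `merge(..., by = NULL)` helper remains useful when the pieces being crossed are multi-column data frames rather than plain vectors.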

And now for the magic, passing the parameters on to my function of choice (dfm) via pmap:

mymodels <- pmap(parameters, dfm, x = mycorpus)

(x = mycorpus was an extra parameter that is constant, that I want to pass on to dfm)
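To make the name matching concrete, here is a minimal sketch applied to the grid from the question. `echo_fit` is a hypothetical stand-in I made up so the example runs without xgboost; it only reports the arguments it receives. The assumption is that the real function's argument names match the grid's column names, in which case it could be dropped in instead.

```r
library(purrr)

# Hypothetical stand-in for a model-fitting function: it just echoes its arguments.
echo_fit <- function(data, nrounds, subsample, colsample_bytree) {
  sprintf("%s: nrounds=%d, subsample=%.1f, colsample_bytree=%.1f",
          data, as.integer(nrounds), subsample, colsample_bytree)
}

# The parameter grid from the question.
param <- expand.grid(nrounds = c(10, 50, 100),
                     subsample = c(0.5, 0.8, 0.9),
                     colsample_bytree = c(0.8))

# Each row becomes one call; column names are matched to argument names.
# `data = "train"` is a constant passed unchanged to every call.
results <- pmap(param, echo_fit, data = "train")

results[[1]]
```

Adding a column to `param` then only requires that the target function accepts an argument of that name; the `pmap()` call itself stays the same.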

Here's what I got:

> length(mymodels)
[1] 8
> mymodels[[1]]
Document-feature matrix of: 5,572 documents, 7,714 features (99.8% sparse).

Hope this helps you, or anyone else looking into this method!

Marian Minar

You can use mapply():

models_list <- mapply(function(x,y,z) xgboost(data = train,
                                              label = df$y,
                                              # parameters
                                              nrounds = x,
                                              subsample = y,
                                              colsample_bytree = z),
                      param$nrounds, param$subsample, param$colsample_bytree, SIMPLIFY = FALSE)

It will give you a list of all your models:

> models_list[[1]]
##### xgb.Booster
raw: 25.2 Kb 
call:
  xgb.train(params = params, data = dtrain, nrounds = nrounds, 
    watchlist = watchlist, verbose = verbose, print_every_n = print_every_n, 
    early_stopping_rounds = early_stopping_rounds, maximize = maximize, 
    save_period = save_period, save_name = save_name, xgb_model = xgb_model, 
    callbacks = callbacks, subsample = ..1, colsample_bytree = ..2)
params (as set within xgb.train):
  subsample = "0.5", colsample_bytree = "0.8", silent = "1"
xgb.attributes:
  niter
callbacks:
  cb.print.evaluation(period = print_every_n)
  cb.evaluation.log()
  cb.save.model(save_period = save_period, save_name = save_name)
niter: 10
evaluation_log:
    iter train_rmse
       1   0.487354
       2   0.473657
---                
       9   0.419176
      10   0.412587
LAP
  • This is a great answer, but it doesn't scale: if I add a new parameter to the grid, I need to change the function call in two different places. The ideal solution (although maybe not feasible) would pass the arguments by column name or order. – D Pinto Feb 08 '17 at 12:25
  • Off the top of my head, I can't think of a way to implement this without writing a custom function and predefining your parameters with a `for`-loop. Maybe someone else has an idea. – LAP Feb 08 '17 at 12:45
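One possible way around the scaling concern raised in the comments, staying in base R, is to build the `mapply()` call with `do.call()` so that the grid's columns are supplied as named arguments. The sketch below uses `fit_stub`, a hypothetical stand-in for `xgboost()` that simply returns what it was given, and assumes the column names of `param` match the target function's argument names.

```r
# Hypothetical stand-in for xgboost(); returns its tuning arguments for inspection.
fit_stub <- function(data, label, nrounds, subsample, colsample_bytree) {
  list(nrounds = nrounds, subsample = subsample, colsample_bytree = colsample_bytree)
}

param <- expand.grid(nrounds = c(10, 50, 100),
                     subsample = c(0.5, 0.8, 0.9),
                     colsample_bytree = c(0.8))

# The columns of `param` become named, vectorised arguments to mapply();
# MoreArgs holds the arguments that stay constant across calls.
models_list <- do.call(
  mapply,
  c(list(FUN = fit_stub,
         MoreArgs = list(data = "train", label = "y"),
         SIMPLIFY = FALSE),
    as.list(param))
)

length(models_list)
```

With this shape, a new tuning parameter is added in one place only (a new column in `param`), provided the function being called accepts it.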