How to save an ML model in sparklyr?

Question

Consider this simple example, which defines a pipeline for a decision tree classifier on some textual data.

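library(sparklyr)
library(dplyr)

# assumption: a local Spark installation; adjust master to your setup
sc <- spark_connect(master = "local")

dtrain <- tibble(text = c("Chinese Beijing Chinese",
                          "Chinese Chinese Shanghai",
                          "Chinese Macao",
                          "Tokyo Japan Chinese"),
                 doc_id = 1:4,
                 class = c(1, 1, 1, 0))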

dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)

pipeline <- ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  ft_count_vectorizer(sc, input_col = 'tokens', output_col = 'myvocab'),
  ml_decision_tree_classifier(sc, label_col = "class", 
                              features_col = "myvocab", 
                              prediction_col = "pcol",
                              probability_col = "prcol", 
                              raw_prediction_col = "rpcol")
)

The issue is that I fit several models in a loop and get some results, but I would like to save these models in a list (or anything else that lets me use them separately later on).

I tried the usual technique: set up an empty list and add each model to it as it is created. Unfortunately, this does not work, as illustrated below:

model_list <- list()

fitmodel <- function(sc, string){
  print(paste('this is iteration', string))
  model <- ml_fit(pipeline, dtrain_spark)
  model_list[[string]] <- model
  #do some other stuff with the model
}
purrr::map(c('stack', 'over', 'flow'), ~fitmodel(sc,.))
[1] "this is iteration stack"
[1] "this is iteration over"
[1] "this is iteration flow"

However, my list is empty! :(

> model_list
list()

What is wrong here? What can be done? I would like to avoid writing to disk if possible.

Thanks!

Possible duplicate of [Update data frame via function doesn't work](https://stackoverflow.com/questions/3969852/update-data-frame-via-function-doesnt-work) — Alper t. Turker, Jun 11 '18 at 22:15

score 2 · Answer 1 · answered Jun 11 '18 at 22:01

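Don't use map for side effects. The assignment to model_list inside fitmodel only modifies a copy local to that function call, so the global list is never updated. Have the function return the model and collect the results instead: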

strings <- c('stack', 'over', 'flow')

fitmodel <- function(sc, string){
  print(paste('this is iteration', string))
  ml_fit(pipeline, dtrain_spark)
}

model_list <- purrr::map(strings, ~fitmodel(sc,.)) %>% setNames(strings)
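
To see why the original version leaves the list empty, here is a minimal sketch that runs without Spark. It assumes a hypothetical fake_fit() as a stand-in for ml_fit(), and uses base lapply() in place of purrr::map(); the scoping behaviour it illustrates is the same.

```r
# Hypothetical stand-in for ml_fit(); returns a small list, not a Spark model.
fake_fit <- function(name) list(name = name, fitted = TRUE)

model_list <- list()

# Assigning inside the function only modifies a local copy of model_list.
fit_side_effect <- function(string) {
  model_list[[string]] <- fake_fit(string)
  invisible(NULL)
}
invisible(lapply(c("stack", "over", "flow"), fit_side_effect))
length(model_list)  # the global list is still empty

# Returning the model and collecting the results fills the list as intended.
strings <- c("stack", "over", "flow")
collected <- setNames(lapply(strings, fake_fit), strings)
names(collected)
```

The second half mirrors the rewritten fitmodel above: the function has no side effects, and the caller decides where the returned models are stored.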

user9927383


very nice! do you have any ideas why this fails in my setting? – ℕʘʘḆḽḘ Jun 11 '18 at 22:05
Actually, now that I think about it, the issue is that fitting the model is only an intermediate step here. That is, my function does other things with the model, which is why I need to store it somewhere in between. I cannot use your function as is. Is there any alternative? – ℕʘʘḆḽḘ Jun 11 '18 at 22:30
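
One possible way to handle the case raised in this comment, where the function does more than fit the model, is to return both the model and the other results in a list and separate them afterwards. This is only a sketch under assumptions: fake_fit() is a hypothetical stand-in for ml_fit(), and the "score" stands for whatever else the real function computes.

```r
# Hypothetical stand-in for ml_fit(); not part of sparklyr.
fake_fit <- function(name) list(name = name, n_chars = nchar(name))

fit_and_process <- function(string) {
  model <- fake_fit(string)
  score <- model$n_chars * 2          # placeholder for other work using the model
  list(model = model, score = score)  # return both instead of assigning globally
}

strings <- c("stack", "over", "flow")
results <- setNames(lapply(strings, fit_and_process), strings)

# Split the combined results into a named list of models and a named vector of scores.
model_list <- lapply(results, `[[`, "model")
scores <- vapply(results, `[[`, numeric(1), "score")
```

Assigning with <<- inside the function would also update the global list, but returning the values keeps the function free of side effects and keeps the models in memory without writing to disk.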
