I have developed a GraphLearner with the mlr3 package and I would like to publish it as a plumber service. However, when I receive the data to predict on (in JSON format), the GraphLearner fails because jsonlite's fromJSON function does not infer the column types that the graph was trained on. Is there a solution for this? Is there a mechanism in mlr3 to handle JSON data in the prediction phase?

Learning step

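library(mlr3)
library(mlr3pipelines)  # po(), %>>%, gunion(), GraphLearner
library(mlr3learners)   # provides the regr.ranger learner
library(jsonlite)       # toJSON(), fromJSON()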
imp_missind = po("missind")
imp_fct     = po("imputenewlvl", param_vals =list(affect_columns = selector_type("factor")))
imp_num     = po("imputehist", param_vals =list(affect_columns = selector_type("numeric")))
learner = lrn('regr.ranger')
graph = po("copy", 2) %>>% 
  gunion(list(imp_missind, imp_num %>>% imp_fct)) %>>%
  po("featureunion") %>>%
  po(learner)
t1 = tsk("boston_housing")
g1 = GraphLearner$new(graph)
g1$train(t1)
saveRDS(g1,'my-model')

Prediction step: this works (simulate data to predict on, drop the target column)

data=t1$data()[1:1,-1]
model = readRDS('my-model')
model$predict_newdata(newdata=data)

Prediction step: this does not work (simulate receiving JSON data for prediction)

model = readRDS('my-model')
data = t1$data()[1:1,-1]
json = fromJSON(toJSON(data, na="string"))
model$predict_newdata(newdata=json)

and the error:

Error: Cannot rbind task: Types do not match for column: cmedv (numeric != integer)

UPDATE: reproducible example

library(mlr3learners)
library(mlr3)
library(mlr3pipelines)
library(jsonlite)



imp_missind = po("missind")

imp_fct     = po("imputenewlvl", param_vals =list(affect_columns = selector_type("factor")))

imp_num     = po("imputehist", param_vals =list(affect_columns = selector_type("numeric")))

learner = lrn('regr.ranger')

graph = po("copy", 2) %>>% 
  gunion(list(imp_missind, imp_num %>>% imp_fct)) %>>%
  po("featureunion") %>>%
  po(learner)


task = tsk("boston_housing")


graphlearner = GraphLearner$new(graph)

# train the model
graphlearner$train(task)

# create data to predict on (just one observation)

data= task$data()
data[1:1, chas := NA]
data = data[1:1,-1]




# look at the column types
str(data)

# prediction, this works fine
predict(graphlearner, data)


# simulate the case when json data is received

json_data = toJSON(data, na="string")

print(json_data)

# get R data from json formatted data
data_from_json = fromJSON(json_data)

# look at the column types: some have changed (numeric -> integer, factor -> character)
str(data_from_json)

# try to predict: this fails with the error "cmedv (numeric != integer)"
predict(graphlearner,data_from_json)

ZchGarinch
  • Same comment as on your last (since deleted) question: use the correct tag, 'mlr3' instead of 'mlr'. And do not cross-post here AND on GitHub. I have deleted the latter. – pat-s Mar 03 '20 at 11:08
  • Provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and avoid `saveRDS()` calls. – pat-s Mar 03 '20 at 11:10
  • I have added a reproducible example. However, I don't see why I should avoid using `saveRDS()`. If I want to publish my GraphLearner as an API, I will have to save it with this function. Would you recommend another good practice? Thanks. – ZchGarinch Mar 03 '20 at 11:41
  • Why don't you just convert the column types before predicting? – missuse Mar 03 '20 at 12:32
  • For me, the advantage of using mlr3 is that it encapsulates all the preprocessing. This lets us have a generic plumber service. We don't want to introduce preprocessing logic in plumber services. All we should have to do is load the learner and call predict on the received JSON data. – ZchGarinch Mar 03 '20 at 12:39

1 Answer

I think we might want to fix this at some point, but until a fix is available I would suggest repairing the schema yourself, provided you saved task$feature_types:

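library(mlr3misc)
# Coerce each column of `data` to the type recorded in `feature_types`
# (a data.table with columns `id` and `type`, e.g. task$feature_types).
repair_schema = function(data, feature_types) {
  imap_dtc(data, function(v, k) {
    ft_type = feature_types[id == k, ][["type"]]
    # compare with class() rather than typeof(): typeof() reports "double" for
    # numeric columns and "integer" for factors, so it never matches those names
    if (class(v)[1L] != ft_type) {
      fn = switch(ft_type,
        "character" = as.character,
        "factor" = as.factor,
        "numeric" = as.numeric,
        "integer" = as.integer
      )
      v = fn(v)
    }
    return(v)
  })
}
data_from_json2 = repair_schema(data_from_json, task$feature_types)
predict(graphlearner, data_from_json2)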

This approach also gives you more flexibility, since you may run into a range of encoding problems that cannot always be anticipated.
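One such problem: when a single observation arrives as JSON, as.factor() only sees the values present in that row, so the restored factor may not carry the levels seen during training. The following is a minimal base-R sketch of a variant that also restores levels. The helper name `coerce_to_schema` and the `schema` list layout are hypothetical, not part of mlr3; the schema is assumed to have been saved next to the model at training time (it could probably be derived from the task's column metadata, such as task$col_info, but that mapping is an assumption here). The factor levels in the example are illustrative only.

```r
# Hypothetical helper (not part of mlr3): coerce incoming columns to a stored schema.
# `schema` is assumed to be a named list with one entry per feature, each entry
# holding `type` and, for factors, the training-time `levels`.
coerce_to_schema = function(data, schema) {
  data = as.data.frame(data, stringsAsFactors = FALSE)
  # only touch columns that appear in both the schema and the incoming data
  for (col in intersect(names(schema), names(data))) {
    spec = schema[[col]]
    v = data[[col]]
    data[[col]] = switch(spec$type,
      numeric   = as.numeric(v),
      integer   = as.integer(v),
      character = as.character(v),
      logical   = as.logical(v),
      # values outside the stored levels become NA
      factor    = factor(as.character(v), levels = spec$levels),
      stop("Unsupported feature type: ", spec$type)
    )
  }
  data
}

# Illustrative schema: types and levels are assumptions for this example
schema = list(
  cmedv = list(type = "numeric"),
  chas  = list(type = "factor", levels = c("0", "1"))
)

# Simulate what a JSON parser might hand back: integer instead of numeric,
# character (here missing) instead of factor
incoming = data.frame(cmedv = 24L, chas = NA_character_, stringsAsFactors = FALSE)
fixed = coerce_to_schema(incoming, schema)
str(fixed)
```

In a plumber service, the schema could be stored in the same RDS file as the trained learner, so the endpoint only needs to load both and call the helper before predicting. Note that unseen factor values are turned into NA by this sketch; whether that is acceptable depends on whether the graph imputes missing factor values.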

pfistfl