I want to persist a caret-trained random forest model to a file and reload it into another program. I know I can do this by writing/reading a binary file via `saveRDS`/`readRDS`, but I'd like to have an ASCII file instead of a binary file. I'd like the file to be minimal but still usable for predictions. Something similar to this but for `rf` instead of `lm`. Thanks
-
*"reload it in another program"* is a bit vague. Do you mean "read in another R session?" (Deeper question: on what is your preference for ASCII storage based? My preferences for such are based on either version-control or easily verifying components of it in a non-R process ... and using `dput` in the way that answer suggested is compatible with neither of those, so I do not see the advantage.) – r2evans Sep 05 '18 at 18:09
-
the other program is power bi. it could read the model as a text file and then rebuild it within an r visual. i did it successfully with an lm model. – elfersi Sep 05 '18 at 19:18
-
When you tried the similar technique (`dput(..., control=c(...), file=...)`) for `rf` models, did it give an error, known-incorrect results, or did something else happen? – r2evans Sep 05 '18 at 21:37
-
it worked but the file is very large. i'd like to trim it to its minimum attributes. see lm article: http://www.win-vector.com/blog/2014/05/trimming-the-fat-from-glm-models-in-r/ – elfersi Sep 06 '18 at 00:04
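For reference, the linked post's trick is to NULL out the fitted model's heavyweight components before serializing. A sketch of that idea for `lm`/`glm` (the component names below are `lm`/`glm`-specific and will not carry over to a random forest):

trim_lm <- function(m) {
  # large per-observation components that predict() does not need
  m$y <- NULL
  m$model <- NULL
  m$residuals <- NULL
  m$fitted.values <- NULL
  m$effects <- NULL
  m$qr$qr <- NULL              # keep m$qr itself: predict() uses $pivot
  m$linear.predictors <- NULL  # glm-only; a harmless no-op for lm
  m$weights <- NULL
  m$prior.weights <- NULL
  m$data <- NULL
  # captured environments drag in the enclosing workspace and also
  # deparse as '<environment>', which breaks a dput round-trip
  attr(m$terms, ".Environment") <- NULL
  if (!is.null(m$formula)) attr(m$formula, ".Environment") <- NULL
  m
}

mdl_small <- trim_lm(lm(mpg ~ disp + cyl, data = mtcars))
predict(mdl_small, newdata = head(mtcars))  # confirm it still predicts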
-
I don't know the code you're using the model with, but ... since I don't know exactly what components are required, I'd find the largest and remove it, and see if things still work, then repeat until you achieve an acceptable size. I expect this is not what you wanted or hoped to hear, but this isn't a common use-case, so I don't know that you'll find somebody with a quick answer for this. (Most people, if they need this, will use the binary storage.) – r2evans Sep 06 '18 at 00:23
-
What is the binary storage? – elfersi Sep 06 '18 at 01:18
-
`saveRDS` or just `save` provides binary (R-specific) storage. It's quite fast for reading and using, just not human-readable or (as far as I know) usable by anything not "R". – r2evans Sep 06 '18 at 03:39
-
That won't work. I need something that I can use with power bi – elfersi Sep 06 '18 at 12:22
-
Then how do you expect to get Power BI to read your model? The output from `dput` is about as proprietary R as you can get in a textual representation, so if you can't read in an `.rda` or `.rds` file, then how will you read in a textual one? We come back to the question of "the other program", where you need to know what format is required for it to absorb "a model". If you can use arbitrary R code, and `source` a dput-file to get the model, why can't you use `readRDS` or `load`? (I really don't know ... do you?) – r2evans Sep 06 '18 at 15:40
-
I use dput to generate a text file. I load this text file into power bi. I load the text file into a power bi r visual as a string. Within the visual I use eval(parse(text=... to rebuild the model from the string. I'm happy to use anything else that will allow me to rebuild a model within a power bi r visual – elfersi Sep 06 '18 at 16:05
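For concreteness, that pipeline presumably looks something like this (names here are illustrative; for `lm` the round-trip only works cleanly once the environments are stripped, as in the trimming sketch above):

# export side: write the model out as parseable R code
dput(mdl_small, file = "model.txt", control = "all")

# inside the Power BI r visual: the file contents arrive as a string
# (via the data source rather than readLines(), which stands in here),
# and the model is rebuilt by parsing and evaluating that string
model_text <- paste(readLines("model.txt"), collapse = "\n")
mdl_rebuilt <- eval(parse(text = model_text))
predict(mdl_rebuilt, newdata = head(mtcars))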
-
I'm trying to help, elfersi, but you're being a little vague/unclear. How do you load this text file into power bi? Is it via an R script, or is there something R-native about Power BI (despite the statement that [Power BI needs R installed externally](https://learn.microsoft.com/en-us/power-bi/desktop-r-in-query-editor)) that enables it to know how to read R's proprietary `dput` format? – r2evans Sep 06 '18 at 16:07
-
I load the text file like you'd load any text or csv file as a data source in power bi – elfersi Sep 06 '18 at 16:09
-
I don't know Power BI to know how it can parse the `dput` output correctly, since it does not at all resemble CSV, TSV, fixed-width, or anything else textual/structured. This question might be easier to answer if you include an example of what you need exported from the model and in exactly what text format; your [link](https://stackoverflow.com/q/41645120/3358272) only suggests `dput` which is clearly not CSV. If I completely miss Power BI's ability to grok R code without R then it must be me, but this is just not clear enough for me to provide any more assistance. – r2evans Sep 06 '18 at 16:16
-
`dput(mtcars)` starts with `structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, ...), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, ...), ...), row.names = c("Mazda RX4", "Mazda RX4 Wag", ...), class = "data.frame")`. It would be impressive if Power BI (without R installed) can correctly parse that data like any other text or csv file. But again, I'm not familiar enough with Power BI to know that it cannot. I apologize for wasting your time on a `[powerbi]` tag I'm not qualified to answer. – r2evans Sep 06 '18 at 16:18
-
In my pipeline, power bi reads it 'as-is'. That string is then loaded into a power bi r visual (which is embedded r). It gets parsed only within that visual – elfersi Sep 06 '18 at 16:22
1 Answer
Here's my one shot: if you can only pass text arguments and not binary structures (such as found in models saved as `.rda` or `.rds` files), I wonder if you can pass the base64-encoded representation of an object:
# fit a small example model and save it in R's binary (RDS) format
mdl <- lm(mpg ~ disp + cyl, data=mtcars)
saveRDS(mdl, file="model.rds")
That's the binary file I mentioned earlier. Since you are unable to read that into Power BI, let's textually encode it. I'm using `base64enc` here, but there are likely other ways that might be more efficient, more compact, etc ... I'm not making that claim here.
library(base64enc)
# encode the binary .rds file as base64 text
writeLines(base64encode("model.rds"), con="model.rds.b64")
# temp file to hold the decoded binary on the receiving side
tf <- tempfile()
This `tf` object will be cleaned up in the normal "temp file cleanup" method for Power BI and/or your OS. This next command uses `file=`, but it can just as easily be passed a `character` vector (of length 1, I believe), in the case that your R code is given this object via another method:
base64decode(file="model.rds.b64", output=tf)
mdl2 <- readRDS(tf)
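(If the base64 text arrives as a string rather than a file, as in the Power BI visual described above, `base64decode` accepts it via `what=` instead. A sketch, reading the string back from the file purely for illustration:)

b64 <- paste(readLines("model.rds.b64"), collapse = "")
tf2 <- tempfile()
base64decode(what = b64, output = tf2)
mdl3 <- readRDS(tf2)  # should be identical to mdl, as below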
mdl
# Call:
# lm(formula = mpg ~ disp + cyl, data = mtcars)
# Coefficients:
# (Intercept) disp cyl
# 34.66099 -0.02058 -1.58728
identical(mdl, mdl2)
# [1] TRUE
And though this is `lm` and not `rf`, it's fairly compact:
file.info("model.rds")$size # same as "tf"
# [1] 2637
file.info("model.rds.b64")$size
# [1] 3518
(Not surprisingly, the base64 encoding introduces about a 33% size increase: base64 represents every 3 bytes of input as 4 output characters, and 2637 × 4/3 ≈ 3516, right in line with the 3518 bytes observed.)

-
thanks @r2evans. any size below 80,000 is definitely fine. however, when it comes to rf, files are huge. in my particular instance, rds: 853,707 / b64: 1,138,278 / dput: 4,502,299. hence my initial question: is there a way to "trim the fat" from a rf model (before saving it) and still be able to reload it properly? see how it's done for lm: http://www.win-vector.com/blog/2014/05/trimming-the-fat-from-glm-models-in-r/ – elfersi Sep 06 '18 at 17:26
-
I see. Then I can only return to [my second comment above](https://stackoverflow.com/questions/52191025/how-to-correctly-dput-a-fitted-random-forest-regression-with-caret-to-an-asc/52208725?noredirect=1#comment91339401_52191025) where I suggest you need to try removing individual components. Something like `sort(sapply(mdl, object.size))` will give you an ordered list of component sizes; if you cannot remove the top 3 by-size without impacting its usability, then you may not have much wiggle room. – r2evans Sep 06 '18 at 17:43
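To make that concrete, here is a sketch of that trial-and-error approach for a caret `rf` fit. The components removed below are an educated guess based on how `randomForest` objects are structured; the final equality check is the part that matters:

library(caret)
library(randomForest)

fit <- train(mpg ~ ., data = mtcars, method = "rf")

# rank the components by size to see where the bytes actually are
sort(sapply(fit$finalModel, object.size), decreasing = TRUE)

slim <- fit
slim$trainingData <- NULL           # caret's copy of the training data
slim$control$index <- NULL          # resampling indices
slim$control$indexOut <- NULL
slim$finalModel$predicted <- NULL   # out-of-bag predictions
slim$finalModel$oob.times <- NULL
slim$finalModel$y <- NULL           # stored response values

# only trust the trimmed object if it predicts identically
stopifnot(isTRUE(all.equal(predict(fit, mtcars), predict(slim, mtcars))))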