I want to persist a caret-trained random forest model to a file and reload it into another program. I know I can do this by writing/reading a binary file via `saveRDS`/`readRDS`, but I'd like to have an ASCII file instead of a binary file. I'd like the file to be minimal but still usable for predictions. Something similar to this but for `rf` instead of `lm`. Thanks
-
*"reload it in another program"* is a bit vague. Do you mean "read in another R session?" (Deeper question: on what is your preference for ASCII storage based? My preferences for such are based on either version-control or easily verifying components of it in a non-R process ... and using `dput` in the way that answer suggested is compatible with neither of those, so I do not see the advantage.) – r2evans Sep 05 '18 at 18:09
-
the other program is power bi. it could read the model as a text file and then rebuild it within an r visual. i did it successfully with an lm model. – elfersi Sep 05 '18 at 19:18
-
When you tried the similar technique (`dput(..., control=c(...), file=...)`) for `rf` models, did it give an error, known-incorrect results, or did something else happen? – r2evans Sep 05 '18 at 21:37
-
it worked but the file is very large. i'd like to trim it to its minimum attributes. see lm article: http://www.win-vector.com/blog/2014/05/trimming-the-fat-from-glm-models-in-r/ – elfersi Sep 06 '18 at 00:04
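For reference, the linked post's trick is to NULL out the fitted model's heavyweight components before serializing. A sketch of that idea for `lm`/`glm` (the component names below are `lm`/`glm`-specific and will not carry over to a random forest):

trim_lm <- function(m) {
  # large per-observation components that predict() does not need
  m$y <- NULL
  m$model <- NULL
  m$residuals <- NULL
  m$fitted.values <- NULL
  m$effects <- NULL
  m$qr$qr <- NULL              # keep m$qr itself: predict() uses $pivot
  m$linear.predictors <- NULL  # glm-only; a harmless no-op for lm
  m$weights <- NULL
  m$prior.weights <- NULL
  m$data <- NULL
  # captured environments drag in the enclosing workspace and also
  # deparse as '<environment>', which breaks a dput round-trip
  attr(m$terms, ".Environment") <- NULL
  if (!is.null(m$formula)) attr(m$formula, ".Environment") <- NULL
  m
}

mdl_small <- trim_lm(lm(mpg ~ disp + cyl, data = mtcars))
predict(mdl_small, newdata = head(mtcars))  # confirm it still predicts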
-
I don't know the code you're using the model with, but ... since I don't know exactly what components are required, I'd find the largest and remove it, and see if things still work, then repeat until you achieve an acceptable size. I expect this is not what you wanted or hoped to hear, but this isn't a common use-case, so I don't know that you'll find somebody with a quick answer for this. (Most people, if they need this, will use the binary storage.) – r2evans Sep 06 '18 at 00:23
-
What is the binary storage? – elfersi Sep 06 '18 at 01:18
-
`saveRDS` or just `save` provides binary (R-specific) storage. It's quite fast for reading and using, just not human-readable or (as far as I know) usable by anything not "R". – r2evans Sep 06 '18 at 03:39
-
That won't work. I need something that I can use with power bi – elfersi Sep 06 '18 at 12:22
-
Then how do you expect to get Power BI to read your model? The output from `dput` is about as proprietary R as you can get in a textual representation, so if you can't read in an `.rda` or `.rds` file, then how will you read in a textual one? We come back to the question of "the other program", where you need to know what format is required for it to absorb "a model". If you can use arbitrary R code, and `source` a dput-file to get the model, why can't you use `readRDS` or `load`? (I really don't know ... do you?) – r2evans Sep 06 '18 at 15:40
-
I use dput to generate a text file. I load this text file into power bi. I load the text file into a power bi r visual as a string. Within the visual I use eval(parse(text=... to rebuild the model from the string. I'm happy to use anything else that will allow me to rebuild a model within a power bi r visual – elfersi Sep 06 '18 at 16:05
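For concreteness, that pipeline presumably looks something like this (names here are illustrative; for `lm` the round-trip only works cleanly once the environments are stripped, as in the trimming sketch above):

# export side: write the model out as parseable R code
dput(mdl_small, file = "model.txt", control = "all")

# inside the Power BI r visual: the file contents arrive as a string
# (via the data source rather than readLines(), which stands in here),
# and the model is rebuilt by parsing and evaluating that string
model_text <- paste(readLines("model.txt"), collapse = "\n")
mdl_rebuilt <- eval(parse(text = model_text))
predict(mdl_rebuilt, newdata = head(mtcars))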
-
I'm trying to help, elfersi, but you're being a little vague/unclear. How do you load this text file into power bi? Is it via an R script, or is there something R-native about Power BI (despite the statement that [Power BI needs R installed externally](https://learn.microsoft.com/en-us/power-bi/desktop-r-in-query-editor)) that enables it to know how to read R's proprietary `dput` format? – r2evans Sep 06 '18 at 16:07
-
I load the text file like you'd load any text or csv file as a data source in power bi – elfersi Sep 06 '18 at 16:09
-
I don't know Power BI to know how it can parse the `dput` output correctly, since it does not at all resemble CSV, TSV, fixed-width, or anything else textual/structured. This question might be easier to answer if you include an example of what you need exported from the model and in exactly what text format; your [link](https://stackoverflow.com/q/41645120/3358272) only suggests `dput` which is clearly not CSV. If I completely miss Power BI's ability to grok R code without R then it must be me, but this is just not clear enough for me to provide any more assistance. – r2evans Sep 06 '18 at 16:16
-
`dput(mtcars)` starts with `structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, ...), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, ...), ...), row.names = c("Mazda RX4", "Mazda RX4 Wag", ...), class = "data.frame")`. It would be impressive if Power BI (without R installed) can correctly parse that data like any other text or csv file. But again, I'm not familiar enough with Power BI to know that it cannot. I apologize for wasting your time on a `[powerbi]` tag I'm not qualified to answer. – r2evans Sep 06 '18 at 16:18
-
In my pipeline, power bi reads it 'as-is'. That string is then loaded into a power bi r visual (which is embedded r). It gets parsed only within that visual – elfersi Sep 06 '18 at 16:22
1 Answer
Here's my one shot: if you can only pass text arguments and not binary structures (such as found in models saved as `.rda` or `.rds` files), I wonder if you can pass the base64-encoded representation of an object:
# fit a small example model and save it in R's binary (RDS) format
mdl <- lm(mpg ~ disp + cyl, data=mtcars)
saveRDS(mdl, file="model.rds")
That's the binary file I mentioned earlier. Since you are unable to read that into Power BI, let's textually encode it. I'm using `base64enc` here, but there are likely other ways that might be more efficient, more compact, etc ... I'm not making that claim here.
library(base64enc)
# encode the binary .rds file as base64 text
writeLines(base64encode("model.rds"), con="model.rds.b64")
# temp file to hold the decoded binary on the receiving side
tf <- tempfile()
This `tf` object will be cleaned up in the normal "temp file cleanup" method for Power BI and/or your OS. This next command uses `file=`, but it can just as easily be passed a `character` vector (of length 1, I believe), in the case that your R code is given this object via another method:
base64decode(file="model.rds.b64", output=tf)
mdl2 <- readRDS(tf)
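(If the base64 text arrives as a string rather than a file, as in the Power BI visual described above, `base64decode` accepts it via `what=` instead. A sketch, reading the string back from the file purely for illustration:)

b64 <- paste(readLines("model.rds.b64"), collapse = "")
tf2 <- tempfile()
base64decode(what = b64, output = tf2)
mdl3 <- readRDS(tf2)  # should be identical to mdl, as below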
mdl
# Call:
# lm(formula = mpg ~ disp + cyl, data = mtcars)
# Coefficients:
# (Intercept) disp cyl
# 34.66099 -0.02058 -1.58728
identical(mdl, mdl2)
# [1] TRUE
And though this is `lm` and not `rf`, it's fairly compact:
file.info("model.rds")$size # same as "tf"
# [1] 2637
file.info("model.rds.b64")$size
# [1] 3518
(Not surprisingly, the base64 encoding introduces about a 33% size increase: base64 represents every 3 bytes of input as 4 output characters, and 2637 × 4/3 ≈ 3516, right in line with the 3518 bytes observed.)

-
thanks @r2evans. any size below 80,000 is definitely fine. however, when it comes to rf, files are huge. in my particular instance, rds: 853,707 / b64: 1,138,278 / dput: 4,502,299. hence my initial question: is there a way to "trim the fat" from a rf model (before saving it) and still be able to reload it properly? see how it's done for lm: http://www.win-vector.com/blog/2014/05/trimming-the-fat-from-glm-models-in-r/ – elfersi Sep 06 '18 at 17:26
-
I see. Then I can only return to [my second comment above](https://stackoverflow.com/questions/52191025/how-to-correctly-dput-a-fitted-random-forest-regression-with-caret-to-an-asc/52208725?noredirect=1#comment91339401_52191025) where I suggest you need to try removing individual components. Something like `sort(sapply(mdl, object.size))` will give you an ordered list of component sizes; if you cannot remove the top 3 by-size without impacting its usability, then you may not have much wiggle room. – r2evans Sep 06 '18 at 17:43
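To make that concrete, here is a sketch of that trial-and-error approach for a caret `rf` fit. The components removed below are an educated guess based on how `randomForest` objects are structured; the final equality check is the part that matters:

library(caret)
library(randomForest)

fit <- train(mpg ~ ., data = mtcars, method = "rf")

# rank the components by size to see where the bytes actually are
sort(sapply(fit$finalModel, object.size), decreasing = TRUE)

slim <- fit
slim$trainingData <- NULL           # caret's copy of the training data
slim$control$index <- NULL          # resampling indices
slim$control$indexOut <- NULL
slim$finalModel$predicted <- NULL   # out-of-bag predictions
slim$finalModel$oob.times <- NULL
slim$finalModel$y <- NULL           # stored response values

# only trust the trimmed object if it predicts identically
stopifnot(isTRUE(all.equal(predict(fit, mtcars), predict(slim, mtcars))))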