import rpy2.robjects as robjects
dffunc = sc.parallelize([(0, robjects.r.rnorm), (1, robjects.r.runif)])
dffunc.collect()
This outputs:
[(0, <rpy2.rinterface.SexpClosure - Python:0x7f2ecfc28618 / R:0x26abd18>), (1, <rpy2.rinterface.SexpClosure - Python:0x7f2ecfc283d8 / R:0x26aad28>)]
While the partitioned version results in an error:
dffuncpart = dffunc.partitionBy(2)
dffuncpart.collect()
RuntimeError: ('R cannot evaluate code before being initialized.', <built-in function unserialize>)
It seems that this error means R wasn't loaded on one of the partitions, which I assume implies that the first import step was never performed there. Is there any way around this?
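One way to check that assumption would be a quick diagnostic along these lines (an untested sketch; the partition count of 4 is arbitrary), which should report whether rpy2 is already in each worker's module cache:

def rpy2_loaded(_):
    import sys
    return 'rpy2.robjects' in sys.modules

sc.parallelize(range(4), 4).map(rpy2_loaded).collect()

If the workers really are fresh Python processes, I'd expect this to return False for every element.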
EDIT 1: This second example makes me think there's a timing bug in PySpark or rpy2.
dffunc = sc.parallelize([(0, robjects.r.rnorm), (1, robjects.r.runif)]).partitionBy(2)

def loadmodel(model):
    # import on the worker, hoping this initializes R there
    import rpy2.robjects as robjects
    return model[1](2)

dffunc.map(loadmodel).collect()
This produces the same error: R cannot evaluate code before being initialized.
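My suspicion is that the failure happens while Spark is unserializing the partition data itself, i.e. before loadmodel ever runs, so the worker-side import simply comes too late. If so, one way around it might be to never ship SexpClosure objects at all: send the R function names instead and resolve them after the import (another untested sketch; robjects.r['rnorm'] is the standard rpy2 name lookup):

dffuncname = sc.parallelize([(0, "rnorm"), (1, "runif")]).partitionBy(2)

def loadmodelbyname(model):
    import rpy2.robjects as robjects
    # look the function up by name once R is initialized on this worker
    return robjects.r[model[1]](2)

dffuncname.map(loadmodelbyname).collect()

What I did try, though, is manually pickling the closures, which changes the behavior: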
import pickle
dffuncpickle = sc.parallelize([(0, pickle.dumps(robjects.r.rnorm)), (1, pickle.dumps(robjects.r.runif))]).partitionBy(2)

def loadmodelpickle(model):
    import rpy2.robjects as robjects
    import pickle
    return pickle.loads(model[1])(2)

dffuncpickle.map(loadmodelpickle).collect()
This works just as expected.
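My guess at why: Spark has to unserialize the raw SexpClosures while reading the partition, before the import inside the mapped function has initialized R, whereas the manually pickled closures are just opaque bytes to Spark and only become R objects when pickle.loads runs after the import. If that's right, the ordering can be packaged into a pair of helpers (hypothetical names, just to make the pattern explicit):

import pickle
import rpy2.robjects as robjects

def ship_r_function(rfunc):
    # pickle on the driver, where R is already initialized
    return pickle.dumps(rfunc)

def call_shipped(blob, *args):
    # importing rpy2.robjects initializes the embedded R on this worker first...
    import rpy2.robjects
    import pickle
    # ...so R is live by the time the closure is unserialized here
    return pickle.loads(blob)(*args)

rdd = sc.parallelize([(0, ship_r_function(robjects.r.rnorm)),
                      (1, ship_r_function(robjects.r.runif))]).partitionBy(2)
rdd.map(lambda kv: call_shipped(kv[1], 2)).collect()

But it still isn't clear to me whether the raw-closure case is supposed to work, or whether this is a bug.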