I am developing a package for distributed computing in R (rmr, part of the RHadoop project on GitHub). I am trying to make things as transparent as possible to the user: the computation should simply continue in another interpreter on some other machine as if it were running on the same machine. Something like
lapply(my.list, my.function)
where each call to my.function can in principle happen on a different node in a cluster, hence in a separate interpreter. I am using the pair save and load with a certain degree of success, but I would like a solution that works under all possible circumstances, not just in a large set of use cases.
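For concreteness, here is a minimal sketch of the kind of case where save and load already do the job. The names make.adder and add.five and the temporary file are made up for illustration and are not rmr code; the fresh environment only stands in for a remote interpreter.

```r
# Hypothetical example: a closure that carries its own data (n) with it.
make.adder <- function(n) {
  force(n)              # evaluate n now so the closure holds the value
  function(x) x + n
}
add.five <- make.adder(5)

# "Local" side: write the function to a file.
path <- tempfile(fileext = ".RData")
save(add.five, file = path)

# "Remote" side, simulated here by loading into a fresh environment.
remote <- new.env()
load(path, envir = remote)
result <- lapply(list(1, 2, 3), remote$add.five)
```

This works because the closure's enclosing environment, which holds n, is written to the file along with the function. Objects in the global environment and attached packages are not carried along this way, since R saves the global environment and package namespaces only as references.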
No matter what my.function does, no matter where it is defined, and no matter what other objects and packages it refers to, I would like to be sure that if it works locally, it also works remotely, including loading the necessary packages. The functions save and load, respectively, save a list of objects from a specific environment to a file and load a file into a specific environment. I would like to find or write something that saves and loads all the necessary objects from and to the necessary environments, so that evaluating my.function on each of the elements of my.list has the same semantics locally and remotely.
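To show the direction I have in mind, here is a rough sketch that uses findGlobals from the codetools package to discover which variables a function refers to and ships them along with it. The helpers bundle.function and unbundle.function and the variable scale.factor are hypothetical names invented for this example. The sketch assumes my.function is defined at top level, and it is far from the general solution I am after: it does not follow globals used by other globals, and it does not record which packages must be loaded.

```r
library(codetools)  # recommended package that ships with R

# Hypothetical helper: bundle a function with the variables it references
# directly in its defining environment.
bundle.function <- function(fun, envir = environment(fun)) {
  globals <- findGlobals(fun, merge = FALSE)$variables
  # Keep only the names that actually exist in envir itself.
  present <- Filter(
    function(name) exists(name, envir = envir, inherits = FALSE),
    globals
  )
  list(fun = fun, globals = mget(present, envir = envir))
}

# Hypothetical counterpart for the remote side: rebuild an environment that
# holds the shipped variables and re-point the function at it.
unbundle.function <- function(bundle) {
  env <- list2env(bundle$globals, parent = globalenv())
  fun <- bundle$fun
  environment(fun) <- env
  fun
}

scale.factor <- 3
my.function <- function(x) x * scale.factor

# Serialize the bundle, as it would be sent to another node.
bytes <- serialize(bundle.function(my.function), connection = NULL)

# Simulate the remote side: the original variable is gone, only the shipped
# copy remains.
rm(scale.factor)
remote.function <- unbundle.function(unserialize(bytes))
result <- lapply(list(1, 2, 3), remote.function)
```

Re-pointing the function at a new environment is only safe here because of the top-level assumption; a closure with its own enclosing environment would lose it.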
Has this been done before? Are there any packages I should check out, or do you have any other suggestions? I think this is the single hardest technical issue in rmr, and you would be contributing your solution to an open-source project.