4

I am developing a package to perform distributed computing in R (rmr, under the RHadoop project on GitHub). I am trying to make things as transparent as possible to the user and simply have the computation continue in another interpreter on some other machine as if it were on the same machine. Something like

```r
lapply(my.list, my.function)
```

where each call to my.function can in principle happen on a different node in a cluster, hence in a separate interpreter. I am using the save and load pair with some success, but I would like a solution that works under all possible circumstances, not just in a large set of use cases.

No matter what my.function does, no matter where it is defined, no matter what other objects and packages it refers to, I would like to be sure that if it works locally, it also works remotely, including loading the necessary packages and everything. save writes a list of objects from a specific environment to a file, and load reads a file into a specific environment. I would like to find or write something that saves and loads all the necessary objects from and to the necessary environments, so that evaluating my.function on each of the elements of my.list has the same semantics locally and remotely.

Has this been done before? Are there any packages I should check out, or any other suggestions? I think this is the single hardest technical issue in rmr, and you would be contributing your solution to an OSS project.

Joshua Ulrich
piccolbo
  • My first pass at an answer was to assert that what you request is not feasible, but I think there is a feasible answer. What about `RHIPE`? It is satisfactory and effective. – Iterator Oct 06 '11 at 15:22
  • Glad RHIPE is working for you, but it doesn't deal with environments at all so it's not relevant to this question. With rmr we are trying to make mapreduce work like a lapply-tapply combination, with normal variable scoping in effect. We just think that's the most R-like way to do it and that programs written this way are simple and beautiful. Of course beauty is highly debatable and history will be the judge. But the question is about restoring environments, not whether RHIPE is better than rmr. If you have any arguments supporting the unfeasibility of my request I'd be interested. – piccolbo Oct 10 '11 at 19:09
  • Sorry, for some reason, I misread your question - I was assuming that you were using `rmr`, rather than developing it. In any case, this will be very challenging. Environments, as you note, are tricky. In my case, I address them via functions that identify and save environments. Much harder to deal with are any objects that interact with the OS (e.g. memory mapped files, connections), rather than exist solely within R. You might also check the loading order of packages, to be sure that the masking is reproduced. – Iterator Oct 11 '11 at 02:02
  • (Continued) Regarding infeasibility - having banged my head against the wall in trying to do reproducible statistics, especially on a grid, it's not provably infeasible (at least I can't prove it ;-)), but reproducing an entire R setup across a heterogeneous grid (as most grids eventually become unless VMs are used) is a serious effort. I'll be willing to guinea pig anything put out, since it will help me, but I can't say that there are any obvious solutions yet. The best I've found is to emphasize reproducibility in all primary and auxiliary code & scripts. – Iterator Oct 11 '11 at 02:07
  • Follow-up: can you provide a bit of code to demonstrate the problem to be solved? I am still mulling over this problem. While reproducing a local instance on many Hadoop nodes is hard, solving the issue of saving environments is a reasonably concise sub-problem. – Iterator Oct 11 '11 at 14:47

2 Answers

4

Typically save and load should work just as you want: when a function is saved (actually, it's a "closure" that gets saved), the environment where it was defined is also saved. If that function was defined as part of a package, a reference to that package is saved instead, and the package is loaded back in again when load sees the reference. (You get a warning when saving if the package does not have a namespace.)
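To make the package case concrete, here is a minimal sketch using `stats::sd` as a stand-in for "a function defined in a package" (that choice is mine, not part of the question). After a round trip through `serialize`, the function's environment is the `stats` namespace itself rather than a copy of it:

```r
# A function that lives in a package namespace
f <- stats::sd

# Round-trip through the serializer (the same machinery save/load rely on)
b <- serialize(f, NULL)
g <- unserialize(b)

# The namespace was stored as a reference and resolved again on load,
# so both functions share the very same environment
identical(environment(g), asNamespace("stats"))  # TRUE
g(c(1, 2, 3))                                    # 1
```

In a fresh interpreter, resolving that reference is what triggers loading the package namespace, which of course requires the package to be installed on the remote node.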

The only problem should be the global environment. There, a reference is also saved but this will not save all the variables in the global environment, so you'd have to save them explicitly.
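One way to "save them explicitly" might look like the sketch below. It assumes the `codetools` package (shipped with R as a recommended package) and its `findGlobals` function; the helper name `global_deps` and the variables `threshold` and `my.function` are made up for illustration. It only finds names the function uses directly, so dependencies of dependencies would need a recursive pass.

```r
library(codetools)

# Hypothetical helper: names used by `fun` that are defined directly in `env`
global_deps <- function(fun, env = environment(fun)) {
  used <- findGlobals(fun, merge = TRUE)
  Filter(function(nm) exists(nm, envir = env, inherits = FALSE), used)
}

threshold <- 10
my.function <- function(x) x > threshold

env  <- environment(my.function)   # the global environment at top level
deps <- global_deps(my.function)   # "threshold"

# Save the function together with the variables it needs
path <- tempfile(fileext = ".RData")
save(list = c("my.function", deps), file = path, envir = env)

# Stand-in for the remote side: load into a fresh environment and inspect it
remote <- new.env()
load(path, envir = remote)
ls(remote)
```

On an actual remote interpreter you would load into the global environment instead, so that the restored closure finds `threshold` where it expects it.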

Other environments are saved including their content, and then recursively the parent environment is also saved (unless it's a package or globalenv as described above).

Note that the saveRDS and serialize alternatives provide a little more control: you can supply a refhook function that is called whenever an environment is saved. You then store the environment however you want and return a string id. When loading, a corresponding refhook is called to recreate the environment from that string id. However, the hook is still not called for the global environment.

```r
e <- new.env()            # parent is the global env
e$foo <- 42
ee <- new.env(parent = e)
ee$bar <- 13
f <- local(function() foo + bar, ee)
f()                       # foo + bar = 55
b <- serialize(f, NULL)   # gives you the serialized bytes

g <- unserialize(b)       # loads from the bytes
g()                       # 55
# It created new environments...
!identical(environment(g), environment(f))  # TRUE
```
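The refhook mechanism mentioned above can be sketched roughly as follows. The in-memory `store` registry and the hook names are hypothetical; a distributed setup would need to ship the stored environments to the other node by some other means. The point is only to show the contract: the save-side hook returns a string id (or `NULL` to fall back to normal serialization), and the load-side hook maps the id back to an environment.

```r
store <- new.env()  # hypothetical registry: string id -> environment

# Save-side hook: called for reference objects such as environments
out_hook <- function(x) {
  if (!is.environment(x)) return(NULL)   # NULL means "serialize as usual"
  id <- paste0("env-", length(ls(store)) + 1L)
  assign(id, x, envir = store)
  id
}

# Load-side hook: receives the string id and returns the environment
in_hook <- function(id) get(id, envir = store)

e <- new.env()
e$foo <- 42
f <- local(function() foo, e)

b <- serialize(f, NULL, refhook = out_hook)
g <- unserialize(b, refhook = in_hook)

# This time the environment was looked up rather than copied
identical(environment(g), e)  # TRUE
g()                           # 42
```

Because the lookup happens in the hook, you decide what an id means on the remote side, for example "fetch this environment from a shared file".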

Hope this helps a bit.

Good luck with rmr!

Iterator
Tommy
  • Great answer, like your answer to Joshua's question. Do you have any other pointers on explanations on environments that you've posted on SO? Your answers have been quite enlightening to me. (I'm merely posting what info I can find, though you clearly have a better understanding of how to manage environments and closures.) – Iterator Nov 01 '11 at 02:00
  • @Iterator - Thank you! nice to hear that someone appreciates my ramblings :) I don't remember writing much more about environments elsewhere, but you could always post a question ;) – Tommy Nov 01 '11 at 16:06
  • That's the problem - I don't yet know what to ask. :) – Iterator Nov 01 '11 at 16:26
  • Hi Tommy, you may have pointed in an interesting direction. The problem is not, as I thought, that save is not recursive, but the different treatment of the global environment. Some Google guys who have worked on a project similar to rmr told me that treatment of the global environment was the biggest issue. You should come and help with rmr, all the more because of your interest in high performance R! – piccolbo Jan 14 '12 at 22:11
  • @piccolbo Thanks for the invite, but currently I'm busy with other projects :) – Tommy Jan 15 '12 at 04:29
1

After thinking about this question a bit further, it seems that the answers may be useful for your problem. If you are having some of the same problems in saving environments as the OP, then Gabor's answer is probably going to help you get on track. However, if basic serialization and saving of environments is the problem, my (admittedly less sophisticated) answer might help: convert to lists via as.list() and then serialize that in the usual way, or consider serialization via JSON; my favorite package for that is RJSONIO.
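A minimal sketch of the as.list() route, using only base R (the variable names are illustrative). Note that it flattens a single environment: the parent chain is not captured, so this suits environments used as plain containers rather than closure scopes.

```r
e <- new.env()
e$foo <- 42
e$bar <- "text"

# Convert to an ordinary list, then serialize that in the usual way
payload <- serialize(as.list(e), NULL)

# On the receiving side, rebuild an environment from the list
restored <- list2env(unserialize(payload), envir = new.env())
restored$foo   # 42
restored$bar   # "text"
```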

Tommy's answer, however, is much more informative about what's going on. Assuming you will be investigating these issues extensively, especially their serialization, I also recommend looking at Tommy's other excellent insights in this answer to a question on environments, closures, and frames.

Iterator