
The recent addition of direct support for parallel computing in R 2.14 sparked a question in my mind. There are numerous ways to create clusters in R. I use snow SOCK clusters on a regular basis, but I know there are other options, such as MPI. I use SOCK clusters because I do not need to install any additional software (I use Fedora 13).
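For reference, this is roughly how I set up such a cluster now (a minimal sketch; the two-worker spec is just an example):

```r
library(snow)

# SOCK cluster: no extra software needed beyond R and the snow package
cl <- makeCluster(c("localhost", "localhost"), type = "SOCK")

# an embarrassingly parallel call
res <- parLapply(cl, 1:100, function(i) i^2)

stopCluster(cl)
```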

So, my concrete questions:

  1. Is there a gain in performance when using non-SOCK clusters?
  2. Is it easier to create clusters on multiple computers using non-SOCK clusters?
Paul Hiemstra
  • In my experience, it's mostly the way you have to write code that makes the difference between packages. I'm not an expert on HPC (I'm sure others will chip in), but I think other cluster types (other than SOCK) are used on different computer architectures. If you have a cluster of computers, you need an interface to communicate between the nodes. This is where, for example, (Open)MPI comes in. The snowfall vignette has some additional info if you haven't read it yet. – Roman Luštrik Dec 07 '11 at 10:03
  • Thanks for the feedback. I was curious whether it is worth investing time in more advanced (?) parallel computing facilities other than snow and SOCK, which work great for me. – Paul Hiemstra Dec 07 '11 at 10:07
  • In its current form, this question does not really fit the SO format (there is no question with a potential 'correct' answer). This will attract close votes as it has done already. However, I would hate to see this topic disappear (it is indeed hard to find information on these topics), so may I invite you to edit your question before it gets closed? – Nick Sabbe Dec 07 '11 at 10:10
  • Hmm, this was actually not something that came out of a practical question. Do you have any suggestions on how I can make it more of an SO question? – Paul Hiemstra Dec 07 '11 at 10:19
  • I added some concrete questions which should be answerable. I hope this is enough. – Paul Hiemstra Dec 07 '11 at 10:23
  • Answer by Suminda Sirinath Salpitiko converted to a comment: *"Most of R Parallel information can be found searching for `foreach`. There are many more links than I can share here. Search for the phrase `R foreach` and this would pull up most information."* – Tomas Mar 24 '14 at 14:37

1 Answer


1) There are a limited number of benchmarks available which show that MPI is faster than SOCK clusters. But as an R user you probably will not care about these differences: they are in the range of milliseconds, and the number of communications is not that high in embarrassingly parallel problems.

2) Yes: you do not have to provide a list of machine names or IPs, which gets cumbersome for a computer cluster with 100 nodes. But everything depends on your cluster. In most cases MPI or PVM is already preinstalled and everything works out of the box using Rmpi, etc.
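A rough sketch of the difference (the host names are placeholders; the MPI variant assumes Rmpi and a working MPI installation):

```r
library(snow)

# SOCK: every node has to be listed explicitly
cl_sock <- makeCluster(c("node1", "node2", "node3"), type = "SOCK")

# MPI: you only request a number of workers; the MPI runtime
# (e.g. Open MPI via Rmpi) decides where they are started
cl_mpi <- makeCluster(8, type = "MPI")

# check where the MPI workers ended up
clusterCall(cl_mpi, function() Sys.info()["nodename"])

stopCluster(cl_sock)
stopCluster(cl_mpi)
```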

  • Thanks! In regard to question 2), MPI and PVM might be installed on a preconfigured cluster, but in my case I am interested in creating one ad hoc. When I need it, I ask some colleagues if I can borrow some of their cores. In these cases, MPI or the like is often not installed. – Paul Hiemstra Dec 07 '11 at 15:21
  • In this case SOCK is probably the simplest solution. You could check out Redis for a more elegant setup (http://cran.r-project.org/web/packages/doRedis/index.html); see the sketch below. It supports cloud resources, too! – Markus Schmidberger Dec 07 '11 at 15:40
  • Assuming Rmpi compiles successfully. :) – Roman Luštrik Dec 08 '11 at 13:29
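
A minimal doRedis sketch along the lines of the comment above, assuming a Redis server is running on localhost; the queue name `jobs` is arbitrary:

```r
library(doRedis)
library(foreach)

registerDoRedis("jobs")                   # use the Redis queue "jobs" as the foreach backend
startLocalWorkers(n = 2, queue = "jobs")  # workers can also be started on other machines

res <- foreach(i = 1:10, .combine = c) %dopar% sqrt(i)

removeQueue("jobs")
```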