
I have been trying to calculate something with RHadoop (the link between R and Hadoop).

When I benchmarked my cluster with the examples bundled with Hadoop-1.0.4, everything seemed to work well. (I mean all of the cores on the slave nodes were used, though CPU usage fluctuated between 50 and 100%.)

However, when I ran an RHadoop example, that was not the case: only one core on each slave node was active.

Is there any configuration I have to set up in RHadoop, similar to what I did in the Hadoop configuration files such as core-site.xml?

Thanks

  • Please make your situation reproducible, i.e. provide us with the data and the code needed to mimic your situation. See http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example for more tips on how to do this. – Paul Hiemstra Mar 11 '13 at 08:50
  • Sorry, I don't have permission to share what you asked for. I will ask my co-worker for his code and for permission to upload it. Thank you for your comment. – Hyunwoong Ji Mar 12 '13 at 04:41

1 Answer


You are probably talking about rmr2, which is part of RHadoop. rmr2 doesn't have a specific configuration option for this; help(rmr.options) will show you all of the configuration options it does have.

The number of map tasks and the number of map slots determine the degree of parallelism in the map phase. It sounds like you have enough slots, so the number of map tasks may be insufficient; that depends on the size and other properties of the input. You can pass an additional argument to mapreduce, backend.parameters = list(hadoop = list(D = 'mapred.map.tasks')), but Hadoop doesn't honor this setting verbatim; it only takes it as a hint. The backend.parameters argument is deprecated, but when it is removed an alternative mechanism will be provided for this specific purpose.

If the problem is in the reduce phase, the cardinality of the set of keys also matters: it sets an upper bound on the degree of parallelism.

I concur with Paul that if you had provided a reproducible example my answer would contain much less guesswork. RHadoop has a dedicated forum where devs and users are active: https://groups.google.com/forum/?fromgroups=#!forum/rhadoop
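As a rough illustration, a call carrying that hint might look like the sketch below. It is only a sketch based on the description above: the HDFS input path, the map and reduce functions, and the value 10 are placeholders, and appending "=10" to the property name is my assumption about how the -D-style property string is spelled.

```r
library(rmr2)

# Minimal sketch: word-count-style job that passes a map-task hint to Hadoop.
# "/user/me/input" and the value 10 are hypothetical placeholders.
out <- mapreduce(
  input  = "/user/me/input",
  map    = function(k, v) keyval(v, 1),
  reduce = function(k, vv) keyval(k, sum(vv)),
  # Hint (not a guarantee) about the number of map tasks:
  backend.parameters = list(
    hadoop = list(D = "mapred.map.tasks=10")
  )
)
```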

piccolbo