
I have an R script running locally: each record/row is fed into a function called func to perform some calculation, so the flow is as follows.

 new <- lapply(old, func)

Ideally, using SparkR, I would expect each worker to have the function func and perform the calculation on a subset of "old". In this case, func is very simple and can be computed locally (there is no need for a distributed version of it).

Does anyone know how to achieve this using SparkR? Basically, the question is whether SparkR has any support that works like doParallel, but across multiple workers. For reference, a minimal local doParallel sketch of the pattern I mean is below (assuming four local cores, with old and func as above).
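
    library(doParallel)

    cl <- makeCluster(4)        # four local worker processes (adjust to taste)
    registerDoParallel(cl)

    # Same per-record calculation as lapply(old, func), spread across cores;
    # foreach with %dopar% returns a list, matching the lapply result.
    new <- foreach(x = old) %dopar% func(x)

    stopCluster(cl)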


1 Answer


Parallel functions similar to doParallel are being developed for SparkR, but they aren't available yet as of 1.6.0:

https://issues.apache.org/jira/browse/SPARK-7264
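
For what it's worth, that JIRA tracks what later shipped as spark.lapply (Spark 2.0+). A minimal sketch, assuming a release where it is available, with old and func as toy stand-ins for the objects from the question:

    library(SparkR)
    sparkR.session()   # Spark 2.0+ entry point

    old  <- as.list(1:100)        # toy stand-in for the question's records
    func <- function(x) x^2       # toy stand-in for the per-record calculation

    # spark.lapply distributes the list elements across the workers, applies
    # func to each one, and collects the results back as a local list.
    new <- spark.lapply(old, func)

    sparkR.session.stop()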

Another option may be to use UDFs in SparkR, which are also under development and not available yet either:

https://issues.apache.org/jira/browse/SPARK-6817
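
That UDF work later landed as dapply/gapply on SparkDataFrames in Spark 2.0. A hedged sketch, assuming such a release, using dapplyCollect (which applies an R function to each partition and collects the combined result):

    library(SparkR)
    sparkR.session()

    df <- createDataFrame(data.frame(x = 1:100))

    # dapplyCollect runs the R function on each partition's data.frame and
    # returns the combined result as a local data.frame.
    result <- dapplyCollect(df, function(part) {
      part$y <- part$x^2    # toy per-row calculation
      part
    })

    sparkR.session.stop()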

xyzzy
  • Thanks for the reply. I'm surprised that SparkR doesn't have this available yet; looking forward to the new release. I'll keep this post open for a while in case someone happens to know an alternative that solves the problem. Thanks :) – HappyCoding Jan 27 '16 at 01:31
  • Check https://github.com/amplab-extras/SparkR-pkg and https://amplab-extras.github.io/SparkR-pkg/. It seems the original SparkR from amplab-extras can support RDDs as distributed collections. – HappyCoding Jan 27 '16 at 03:14
  • @HappyCoding The problem is not support itself but performance and robustness. That's why the RDD API has not been included in SparkR since the official release. – zero323 Jan 27 '16 at 13:51