Apache Spark Correlation only runs on driver

Question

I am new to Spark and have learned that transformations happen on the workers and actions on the driver, but that the intermediate steps of an action can also run on the workers (if the operation is commutative and associative), which is what gives the actual parallelism.

I looked into the correlation and covariance code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/PearsonCorrelation.scala

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala

How can I find out which part of the correlation computation happens on the driver and which on the executors?

Update 1: The setup I use to run the correlation is a cluster consisting of multiple VMs. Images from the Spark web UI are in this question: Distributed cross correlation matrix computation

Update 2

I set up my cluster in standalone mode. It is a 3-node cluster: 1 master/driver (a physical workstation) and 2 VM slaves/executors. I submit the job from the master node like this:

./bin/spark-submit --master spark://192.168.0.11:7077 examples/src/main/python/mllib/correlations_example.py

My correlation sample file is correlations_example.py:

data = sc.parallelize(np.array([range(10000000), range(10000000, 20000000),range(20000000, 30000000)]).transpose()) 
print(Statistics.corr(data, method="pearson")) 
sc.stop()

I always get a sequential timeline in the web UI.

Based on the event timeline, doesn't this mean that the computation is not happening in parallel? Am I doing something wrong with the job submission, or is the correlation computation in Spark not parallel?

Update 3: I even tried adding another executor, and I still get the same sequential treeAggregate. I set up the Spark cluster as described here: http://paxcel.net/blog/how-to-setup-apache-spark-standalone-cluster-on-multiple-machine/

I don't understand your update. So what is the question now ? — eliasah, Jun 29 '17 at 14:24
Have a look: https://stackoverflow.com/questions/42304059/distributed-cross-correlation-matrix-computation — Kumar Roshan Mehta, Jun 29 '17 at 14:25
No, this question is about the Spark implementation of correlation based on looking at the code and finding what happens at the driver and what on the executor. The question I linked is about my experiment. — Kumar Roshan Mehta, Jun 29 '17 at 14:31
Let's say that everything happens on the executors in this case, except returning the result, which is a Breeze matrix. — eliasah, Jun 29 '17 at 14:39
I have a feeling that it's the treeAggregate that is worrying you. — eliasah, Jun 29 '17 at 14:40
Actually yes. Does that mean the correlation is computed on the driver node? Correlation is a pairwise computation (https://en.wikipedia.org/wiki/Pearson_correlation_coefficient), but it could also be partially computed in a distributed manner (just an idea). — Kumar Roshan Mehta, Jun 29 '17 at 14:48
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/147951/discussion-between-eliasah-and-roshan-mehta). — eliasah, Jun 29 '17 at 14:49

score 0 · Answer 1 · answered Jun 29 '17 at 14:15

Your statement is not entirely accurate. The driver process is launched either on the client/edge node or inside the cluster, depending on the spark-submit deploy mode (client or cluster). Actions are executed by the workers, and the results are sent back to the driver (e.g. with collect).

This has been answered already. For more details, see the question "When does an action not run on the driver in Apache Spark?"
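To make the split between workers and driver concrete, here is a minimal pure-Python sketch of how a Pearson correlation can be assembled from per-partition partial sums. It is not Spark's actual implementation: the names seq_op, comb_op, pearson_from_sums and the sample partitions are hypothetical, and plain lists stand in for RDD partitions. The idea is only that the per-row work and the merging of partial results are associative and commutative, so they can run on executors, while the final arithmetic on a handful of numbers is cheap enough to run on the driver.

```python
import math
from functools import reduce

# Accumulator layout: (n, sum_x, sum_y, sum_xx, sum_yy, sum_xy)
ZERO = (0, 0.0, 0.0, 0.0, 0.0, 0.0)

def seq_op(acc, row):
    """Fold one (x, y) row into a partition-local accumulator (executor-side work)."""
    n, sx, sy, sxx, syy, sxy = acc
    x, y = row
    return (n + 1, sx + x, sy + y, sxx + x * x, syy + y * y, sxy + x * y)

def comb_op(a, b):
    """Merge two partial accumulators; element-wise addition is associative and commutative."""
    return tuple(u + v for u, v in zip(a, b))

def pearson_from_sums(acc):
    """Final step on the merged sums; small enough to run in one place, e.g. the driver."""
    n, sx, sy, sxx, syy, sxy = acc
    cov = n * sxy - sx * sy
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data split into three "partitions".
partitions = [
    [(1.0, 2.0), (2.0, 4.1)],
    [(3.0, 5.9), (4.0, 8.2)],
    [(5.0, 9.8)],
]

# Each partition is reduced independently (this is the parallelizable part).
partials = [reduce(seq_op, part, ZERO) for part in partitions]
# Partial results are merged, then the coefficient is computed once.
merged = reduce(comb_op, partials, ZERO)
r = pearson_from_sums(merged)
print(r)
```

In an aggregation of this shape, only the small accumulators travel over the network, so a stage that looks like a single step in the timeline can still have done its heavy work in parallel tasks. Whether that is the case for a given job is best checked in the per-stage task list of the web UI rather than inferred from this sketch.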

answered Jun 29 '17 at 14:15

ganeiy


I updated the question. I am talking about a cluster setup; see https://stackoverflow.com/questions/42304059/distributed-cross-correlation-matrix-computation – Kumar Roshan Mehta Jun 29 '17 at 14:24
