Suppose that I have the following Java code:
SparkConf sparkConf = new SparkConf().setAppName("myApp");
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
JavaRDD<A> firstRDD = sparkContext.parallelize(B, 2);
JavaRDD<A> secondRDD = firstRDD.map(runSomethingAndReturnSomething());
List<A> objectA = secondRDD.collect(); // collect() returns a List<A>, not a single A
doSomethingWithA(objectA);
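For reference, here is a minimal compilable version of the same snippet. A, B, runSomethingAndReturnSomething and doSomethingWithA are placeholders in my real code, so in this sketch I just assume A is String and B is a List<String>:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class MyApp {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("myApp");
        JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);

        // stand-in for the B object: some local data split into 2 partitions
        List<String> b = Arrays.asList("one", "two", "three", "four");
        JavaRDD<String> firstRDD = sparkContext.parallelize(b, 2);

        // stand-in for runSomethingAndReturnSomething(): a simple map function
        JavaRDD<String> secondRDD = firstRDD.map(s -> s.toUpperCase());

        // collect() returns a List with the elements of all partitions
        List<String> collected = secondRDD.collect();

        // stand-in for doSomethingWithA(): just print the results
        System.out.println(collected);

        sparkContext.stop();
    }
}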
I want to run this code in cluster mode, so I use spark-submit and start a master and a slave.
As I understand it (correct me if I'm wrong), this is what should happen:
- The Spark context is started in the driver (master).
- I tell the master that I want to use the B object, split into two partitions, in parallel.
- The master sends the command (map) to the workers, but they don't execute it yet.
- Finally, when I call collect, the workers actually run the map transformation, and when they finish they send the results back to the master (see the sketch after this list).
- I do something with the collected results on the master.
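If I understand the laziness in steps 3 and 4 correctly, it can be seen with a toy example like this (the Integer data and print statements are just for illustration):

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("lazyDemo"));
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4), 2);

        // step 3: the map is only recorded here, nothing runs yet
        JavaRDD<Integer> doubled = numbers.map(x -> {
            System.out.println("processing " + x); // appears in the executor logs, only after an action
            return x * 2;
        });
        System.out.println("map() returned, but no element has been processed yet");

        // step 4: collect() is an action, so the workers now run the map
        // and send their partitions' results back
        List<Integer> result = doubled.collect();

        // step 5: the collected results are a plain local List on the driver
        System.out.println("received: " + result);

        sc.stop();
    }
}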
The issue is that the collect is apparently being done on the slave node and not on the master node. Why is this happening?
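For what it's worth, this is how I would check where each step actually runs, by printing the hostname inside the map function and again after collect (the hostname lookups are only diagnostics I added, not part of my real code):

import java.net.InetAddress;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class WhereDoesItRun {
    public static void main(String[] args) throws Exception {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("whereDoesItRun"));
        JavaRDD<String> rdd = sc.parallelize(Arrays.asList("a", "b", "c", "d"), 2);

        JavaRDD<String> upper = rdd.map(s -> {
            // runs inside an executor, so this prints the worker's hostname
            // (visible in the executor's stdout log, not on the driver console)
            System.out.println("map ran on: " + InetAddress.getLocalHost().getHostName());
            return s.toUpperCase();
        });

        // collect() ships the results of all partitions back to the driver process
        List<String> collected = upper.collect();

        // runs in the driver, so this prints the driver's hostname
        System.out.println("collect returned on: " + InetAddress.getLocalHost().getHostName());
        System.out.println(collected);

        sc.stop();
    }
}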