
I have the code below. There is a solve method that I call on each element of a CmProblem RDD, and I pass a graph into solve, which actually modifies the graph.

Will each task receive a separate instance of the graph?

Will the graph be a shared copy across executors or across individual tasks?

Will each execution of solve be a separate task?

In short, will each call to the solve method receive a new copy of the graph (because of serialization at the driver and deserialization on a worker node)?

If not, how can I get a separate copy of the graph for every solve execution? I know I can use Gson to pass a serialized version of the graph and deserialize it in the solve method (a rough sketch of that is after the code below), but is there any other way?

    SparkConf conf = new SparkConf().setAppName("xyz").setMaster(sparkMaster);
    JavaSparkContext sc = new JavaSparkContext(conf);

    List<CmNode> inboundNodes = cmProblem.convertLoadsToNodes(cmProblem.getInboundLoads());

    CmGraph graph = new CmGraph(inboundNodes);

    List<CmNode> outboundNodes = cmProblem.convertLoadsToNodes(cmProblem.getOutboundLoads());
    Objects.requireNonNull(outboundNodes).sort(CmNode::compareTo);


    // divide problem
    List<CmProblem> cmProblems = getDividedProblems(cmProblem);
    JavaRDD<CmProblem> cmProblemJavaRDD = sc.parallelize(cmProblems);

    // call solve and merge solution
    List<CmSolution> cmSolutions = cmProblemJavaRDD.map(ea -> solve(ea, graph)).collect();


    //merge cmSolutions
    List<CmPath> paths = new LinkedList<>();

    for (CmSolution cmSolution : cmSolutions) {
        paths.addAll(CollectionUtils.isNotEmpty(cmSolution.getPaths()) ? cmSolution.getPaths() : new LinkedHashSet<>());
    }
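
Here is roughly what I mean by the Gson option (CmGraph, solve and cmProblemJavaRDD are my own class, method and RDD from the snippet above; whether Gson can round-trip the graph depends on its structure):

    Gson gson = new Gson();
    String graphJson = gson.toJson(graph);    // serialize once at the driver

    // graphJson (a String) is captured by the closure instead of the graph itself;
    // each call deserializes its own fresh CmGraph instance
    List<CmSolution> cmSolutions = cmProblemJavaRDD
            .map(ea -> solve(ea, new Gson().fromJson(graphJson, CmGraph.class)))
            .collect();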

1 Answer

Will each task receive a separate instance of the graph?

If you have a local variable and your job is executed in a distributed environment, each task gets its own copy of that local variable. Moreover, if the local variable is an instance of a custom class, that class must be serializable and, of course, be part of the JAR that you submit. In other words, your graph variable is sent to each executor and each task works on its own deserialized copy.
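
A minimal sketch of what that means in practice (CmGraph and solve come from your question; the visitCount field is made up just to show the effect):

    // The class captured in the closure must be serializable.
    public class CmGraph implements Serializable {
        private int visitCount = 0;                  // hypothetical field for illustration
        public void markVisited() { visitCount++; }
        public int getVisitCount() { return visitCount; }
        // ... rest of the graph ...
    }

    CmGraph graph = new CmGraph();
    cmProblemJavaRDD
            .map(ea -> { graph.markVisited(); return solve(ea, graph); })  // mutates the task's own copy
            .collect();

    // Still 0 on the driver: each task deserialized and mutated its own copy.
    System.out.println(graph.getVisitCount());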

Will each execution of solve be a separate task?

The number of tasks depends on the RDD's number of partitions: Spark launches one task per partition, so a single task typically handles several elements of the RDD. In other words, there will be multiple calls of your solve method within each task, and they all work on that task's copy of the graph.
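
A short sketch of both points (the partition count and the SerializationUtils call are just examples; SerializationUtils.clone is from Apache Commons Lang and needs CmGraph to be Serializable, which it already has to be for the closure anyway):

    // 8 partitions -> 8 tasks; each task runs solve for every element in its partition
    JavaRDD<CmProblem> cmProblemJavaRDD = sc.parallelize(cmProblems, 8);
    System.out.println(cmProblemJavaRDD.getNumPartitions());   // prints 8

    // If you want a fresh graph per solve call (not just per task),
    // deep-copy it inside the map; any deep-copy mechanism would do.
    List<CmSolution> cmSolutions = cmProblemJavaRDD
            .map(ea -> solve(ea, SerializationUtils.clone(graph)))
            .collect();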

For more information, see How are stages split into tasks in Spark?