I have the code below. It calls a solve method on every element of a JavaRDD<CmProblem>, and I pass a graph into solve, which modifies that graph.
Will each task receive a separate instance of the graph?
Or will a single copy of the graph be shared across executors or across individual tasks?
Will each execution of solve be a separate task?
In short, will each call to the solve method receive a new copy of the graph (because of serialization at the driver and deserialization on a worker node)?
If not, how can I get a separate copy of the graph for every execution of solve? I know I can use Gson to pass a serialized version of the graph and deserialize it inside solve, but is there any other way?
SparkConf conf = new SparkConf().setAppName("xyz").setMaster(sparkMaster);
JavaSparkContext sc = new JavaSparkContext(conf);
List<CmNode> inboundNodes = cmProblem.convertLoadsToNodes(cmProblem.getInboundLoads());
CmGraph graph = new CmGraph(inboundNodes);
List<CmNode> outboundNodes = cmProblem.convertLoadsToNodes(cmProblem.getOutboundLoads());
Objects.requireNonNull(outboundNodes).sort(CmNode::compareTo);
// divide problem
List<CmProblem> cmProblems = getDividedProblems(cmProblem);
JavaRDD<CmProblem> cmProblemJavaRDD = sc.parallelize(cmProblems);
// call solve and merge solution
List<CmSolution> cmSolutions = cmProblemJavaRDD.map(ea -> solve(ea, graph)).collect();
// merge cmSolutions
List<CmPath> paths = new LinkedList<>();
for (CmSolution cmSolution : cmSolutions) {
    paths.addAll(CollectionUtils.isNotEmpty(cmSolution.getPaths()) ? cmSolution.getPaths() : new LinkedHashSet<>());
}
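
To make the "separate copy per solve call" idea concrete, here is a minimal sketch of the serialize-and-deserialize copy mentioned above. It uses plain Java serialization rather than Gson and involves no Spark at all. DemoGraph and deepCopy are hypothetical names made up for this illustration; DemoGraph only stands in for CmGraph, and the sketch assumes the real graph (and everything it references) implements Serializable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class GraphCopySketch {

    // Hypothetical stand-in for CmGraph; the real class is not shown here.
    static class DemoGraph implements Serializable {
        private static final long serialVersionUID = 1L;
        final List<String> nodes = new ArrayList<>();
    }

    // Deep-copies a Serializable object by writing it to a byte array
    // and reading it back, so the result shares no state with the original.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T deepCopy(T original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Could not copy object", e);
        }
    }

    public static void main(String[] args) {
        DemoGraph shared = new DemoGraph();
        shared.nodes.add("a");

        // Mutating the copy leaves the original untouched.
        DemoGraph copy = deepCopy(shared);
        copy.nodes.add("b");

        System.out.println(shared.nodes.size() + " " + copy.nodes.size()); // prints "1 2"
    }
}
```

With a helper like this, the map call could in principle become something like solve(ea, deepCopy(graph)), so that each invocation edits its own copy regardless of how the graph reaches the worker. Whether that extra copy is actually needed is exactly what I am trying to find out.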