Let's say I have a large number of graph files, each with around 500K edges. I have been processing these files on Apache Spark and I am wondering how to parallelize the whole job efficiently. For now every graph file is independent of the others, so I am looking for parallelism at the file level: if I have 100 graph files and a 20-node cluster, can each node process 5 files? What is happening instead is that a single graph gets processed in a number of stages, which causes a lot of shuffling. This is what I currently do:
from pyspark.sql.functions import lit
from graphframes import GraphFrame

graphFile = "/mnt/bucket/edges"  # directory with 100 graph files, each with around 500K edges
nodeFile = "/mnt/bucket/nodes"   # directory with node files

# Edge list: one "src dst" pair per line
graphData = sc.textFile(graphFile).map(lambda line: line.split(" ")).map(lambda edge: (int(edge[0]), int(edge[1])))
graphDataFrame = sqlContext.createDataFrame(graphData, ['src', 'dst']).withColumn("relationship", lit('edges'))  # DataFrame so it can be used with GraphFrames

# Node list: one node id per line (str.split() with no arguments splits on any whitespace)
nodeData = sc.textFile(nodeFile).map(lambda line: line.split()).map(lambda node: (int(node[0]),))
nodeDataFrame = sqlContext.createDataFrame(nodeData, ['id'])

graphGraphFrame = GraphFrame(nodeDataFrame, graphDataFrame)
connectedComponent = graphGraphFrame.connectedComponents()
The thing is, it is taking a lot of time to process even a couple of files, and I have to process around 20K files, each with 800K edges. Maybe if a data-partitioning strategy could be figured out that ensures all the edges belonging to the same graph end up on a single node, there would be less shuffling; a rough sketch of what I mean is below.
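To make that idea concrete (just a sketch of the intent, not something I have verified): tag every edge with the file it came from using input_file_name() and repartition on that tag, so that one graph's edges are co-located. This assumes a Spark version whose DataFrame.repartition accepts column names, and I am not sure it helps on its own, since connectedComponents will still shuffle between its iterations.

from pyspark.sql.functions import input_file_name, split, col

# Tag each edge row with its source file, then repartition so that all
# edges of one graph land in the same partition.
raw = sqlContext.read.text(graphFile).withColumn("graph", input_file_name())
edges = (raw
         .withColumn("src", split(col("value"), " ").getItem(0).cast("int"))
         .withColumn("dst", split(col("value"), " ").getItem(1).cast("int"))
         .drop("value")
         .repartition("graph"))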
Or what is the best way to solve this efficiently?
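And just to show the file-level parallelism I was imagining, here is a rough sketch. It assumes each graph (500K-800K edges) fits comfortably in a single task's memory and that networkx is installed on the workers; networkx is only a stand-in for whatever per-graph algorithm actually runs locally.

import networkx as nx  # placeholder for whatever runs locally on one graph

def local_connected_components(path_and_content):
    # Runs entirely inside one Spark task: parse one file's edge list, build
    # the graph in local memory and compute its connected components, so no
    # cross-node shuffling is needed for that graph.
    path, content = path_and_content
    g = nx.Graph()
    for line in content.splitlines():
        src, dst = line.split()
        g.add_edge(int(src), int(dst))
    return path, [sorted(c) for c in nx.connected_components(g)]

# wholeTextFiles gives one (path, content) record per file, so each graph
# stays inside one task; minPartitions=100 aims at roughly one task per file.
files = sc.wholeTextFiles(graphFile, minPartitions=100)
results = files.map(local_connected_components).collect()

The collect() is only there to keep the sketch short; in practice I would write each result back out from the tasks themselves.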