I have a parent Graph that I want to filter into multiple subgraphs, so I can apply a function to each subgraph and extract some data. My code looks like this:
val myTerms = <RDD of terms I want to use to filter the graph>
val myVertices = ...
val myEdges = ...
val myGraph = Graph(myVertices, myEdges)
val myResults : RDD[(<Tuple>)] = myTerms.map { x => mySubgraphFunction(myGraph, x) }
Where mySubgraphFunction is a function that creates a subgraph, performs a calculation, and returns a tuple of result data.
When I run this, I get a Java null pointer exception at the point that mySubgraphFunction calls GraphX.subgraph. If I call collect on the RDD of terms, I can get this to work (also added persist on the RDDs for performance):
val myTerms = <RDD of terms I want to use to filter the graph>
val myVertices = <read RDD>.persist(StorageLevel.MEMORY_ONLY_SER)
val myEdges = <read RDD>.persist(StorageLevel.MEMORY_ONLY_SER)
val myGraph = Graph(myVertices, myEdges)
val myResults : Array[(<Tuple>)] = myTerms.collect().map { x =>
mySubgraphFunction(myGraph, x) }
Is there a way to get this to work where I don't have to call collect() (i.e. make this a distributed operation)? I'm creating ~1k subgraphs and the performance is slow.