Given: Large edge list, with about 90-100 million edges, 2 million nodes. These edges compose a giant graph with thousands of connected components.
I also have a list of nodes of interest. I want to extract the edges of only the CCs containing these nodes.
For example here's a tiny example: Consider a small graph with 3 CCs inside it.
edge_list = [(1,2), (3,2), (1,1), (4,5), (5,6), (4,6), (7,8), (8,9), (9,7)] #three CCs
## Below commands for visualization sake
G = nx.Graph()
G.add_edges_from(edges)
nx.draw(G, with_labels=True)
And I have nodes of interest: NOI = [1,5]
I want a function that does the following:
CC_list = find_ccs(edge_list, NOI)
print(CC_list)
#output: [[(1,2), (3,2), (1,1)],
# [(4,5), (5,6), (4,6)]]
Notice that only the edges of the two CCs that contain the NOIs are returned. I want to do this on a massive scale.
I'm open to any solution using Networkx, StellarGraph, or, most preferably PySpark. I can read the edge list using pyspark and once I have the filtered edge list, I can convert it to a CC using whatever library.
NOTE: The entire graph itself is too large to build fully using networkx or stellargraph. Thus, I have to take in an edge list and only extract the ccs I want. The above example with networkx is just for visualization purposes.