1

Given: Large edge list, with about 90-100 million edges, 2 million nodes. These edges compose a giant graph with thousands of connected components.

I also have a list of nodes of interest. I want to extract the edges of only the CCs containing these nodes.

For example here's a tiny example: Consider a small graph with 3 CCs inside it.

edge_list = [(1,2), (3,2), (1,1), (4,5), (5,6), (4,6), (7,8), (8,9), (9,7)] #three CCs
## Below commands for visualization sake
G = nx.Graph() 
G.add_edges_from(edges)
nx.draw(G, with_labels=True)

And I have nodes of interest: NOI = [1,5]

I want a function that does the following:

CC_list = find_ccs(edge_list, NOI)
print(CC_list)
#output: [[(1,2), (3,2), (1,1)],
#          [(4,5), (5,6), (4,6)]]

Notice that only the edges of the two CCs that contain the NOIs are returned. I want to do this on a massive scale.

I'm open to any solution using Networkx, StellarGraph, or, most preferably PySpark. I can read the edge list using pyspark and once I have the filtered edge list, I can convert it to a CC using whatever library.

NOTE: The entire graph itself is too large to build fully using networkx or stellargraph. Thus, I have to take in an edge list and only extract the ccs I want. The above example with networkx is just for visualization purposes.

Jonathan
  • 1,876
  • 2
  • 20
  • 56
  • If graph can be undirected you can try finding [component that node belong to](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.components.node_connected_component.html#networkx.algorithms.components.node_connected_component) and then get the edges – Abhi Jul 05 '21 at 08:10
  • @Abhi the graph isn't built yet, it is in the form of an edge list only. – Jonathan Jul 05 '21 at 18:10

1 Answers1

0

Pick a node of interest. Run BFS from it to find the connected component that contains it. Each time you add a node to the CC, check if it's a node-of-interest and remove it from the set to check if it is. Repeat until you've found all the CCs containing nodes of interest.

Running time is O(V+E), where V & E are the nodes and edges of CCs that contain nodes of interest, not the larger graph.

Dave
  • 7,460
  • 3
  • 26
  • 39
  • Quick clarification, how would I run BFS if the graph isn't built yet and I only have an edge list? – Jonathan Jul 05 '21 at 18:09
  • @Jonathan For BFS we need to be able to get the neighbors of each node. You can go through the edge array and build a map from each node to the nodes that touch it. I'm not a Python user, but believe that it will fit in memory as long as you use integers & pairs of ints for nodes and edges. – Dave Jul 05 '21 at 23:17
  • @Jonathan https://stackoverflow.com/questions/855191/how-big-can-a-python-list-get – Dave Jul 05 '21 at 23:21