Proper subgraphing of a PySpark GraphFrame

Question

graphframes is a network analysis tool based on PySpark DataFrames. The following code is a modified version of the tutorial subgraphing example:

from graphframes.examples import Graphs
import graphframes
g = Graphs(sqlContext).friends()  # Get example graph
# Select subgraph of users older than 30
v2 = g.vertices.filter("age > 30")
g2 = graphframes.GraphFrame(v2, g.edges)

One would expect the new graph, g2, to contain fewer nodes and fewer edges than the original graph, g. However, this is not the case:

print(g.vertices.count(), g.edges.count())
print(g2.vertices.count(), g2.edges.count())

Gives the output:

(6, 7)
(4, 7)

The resulting graph clearly contains edges that point to non-existent nodes. Even more disturbing, g.degrees and g2.degrees are identical, which means that at least some of the graph functionality ignores the node information. Is there a good way to make sure that GraphFrame creates a graph using only the intersection of the supplied nodes and edges arguments?

score 4 · Answer 1 · edited Aug 15 '16 at 23:46

One way I subgraph a GraphFrame is with motifs:

from graphframes import GraphFrame
# Each motif row holds the source vertex a, the edge e and the destination vertex b as structs
motifs = g.find("(a)-[e]->(b)").filter(<conditions for a,b or e>)
# Expand the structs back into ordinary vertex and edge columns
new_vertices = motifs.select("a.*").union(motifs.select("b.*")).distinct()
new_edges = motifs.select("e.*").distinct()
new_graph = GraphFrame(new_vertices, new_edges)

This looks more complicated and may take longer to run, but it works well for more complex graph queries, because you interact with the GraphFrame as a single entity instead of handling vertices and edges separately. As a result, filtering on vertices also affects which edges remain in the GraphFrame.
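To make the behaviour of the motif approach concrete, here is a minimal plain-Python model of it. This is not GraphFrames code: the toy graph, the ages and the variable names are all made up for illustration. It also shows one consequence worth knowing about: a vertex that passes the filter but takes part in no surviving edge does not appear in the result.

```python
# Plain-Python model of the motif approach (not GraphFrames code).
# Hypothetical toy graph: vertex id -> age, plus directed (src, dst) edges.
ages = {"a": 34, "b": 36, "c": 30, "d": 29, "e": 32}
edges = [("a", "b"), ("b", "c"), ("c", "b"), ("d", "a"), ("a", "d")]

# Model of find("(a)-[e]->(b)"): one row per edge, with both endpoints attached.
motifs = [(src, (src, dst), dst) for src, dst in edges]

# Model of filter("a.age > 30 AND b.age > 30").
kept = [(a, e, b) for a, e, b in motifs if ages[a] > 30 and ages[b] > 30]

# Vertices and edges are both rebuilt from the surviving rows.
new_vertices = sorted({a for a, _, _ in kept} | {b for _, _, b in kept})
new_edges = sorted({e for _, e, _ in kept})

print(new_vertices)
print(new_edges)
```

In this toy graph, vertex "e" is older than 30 but has no edges, so it is absent from new_vertices. If isolated vertices matter to you, they would need to be added back from a separately filtered vertex set.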

score 3 · Answer 2 · answered Jun 15 '16 at 00:51

Interesting. I can't reproduce that result:

>>> from graphframes.examples import Graphs
>>> import graphframes
>>> g = Graphs(sqlContext).friends()  # Get example graph
>>> # Select subgraph of users older than 30
... v2 = g.vertices.filter("age > 30")
>>> g2 = graphframes.GraphFrame(v2, g.edges)
>>> print(g.vertices.count(), g.edges.count())
(6, 7)
>>> print(g2.vertices.count(), g2.edges.count())
(4, 7)

As of now, GraphFrames does not check at graph construction time whether the graph is valid, i.e. whether every edge connects to existing vertices, and so on. But it seems the number of vertices is correct after the filter?

> But seems like the number of vertices is correct after the filter? It is, but not the number of edges. Removing vertices should have resulted in removing some edges too — Boris Gorelik, Jun 16 '16 at 07:55
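The identical g.degrees and g2.degrees in the question follow from the same lack of validation: as far as I can tell, GraphFrames derives degrees from the edge DataFrame alone. The plain-Python sketch below models that behaviour on a made-up toy graph (the data and the `degrees` helper are hypothetical, not the library's implementation).

```python
from collections import Counter


def degrees(edges):
    """Count how often each id appears as src or dst; no vertex list is consulted."""
    counts = Counter()
    for src, dst in edges:
        counts[src] += 1
        counts[dst] += 1
    return dict(counts)


# Hypothetical toy data: vertex id -> age, and directed (src, dst) edges.
ages = {"a": 34, "b": 36, "c": 30, "d": 29}
edges = [("a", "b"), ("b", "c"), ("c", "b"), ("d", "a")]

# degrees() only sees the edges, so filtering the vertices cannot change it.
kept = {v for v, age in ages.items() if age > 30}
degrees_before = degrees(edges)

# Restricting the edges to the kept vertices is what actually changes the degrees.
induced_edges = [(s, d) for s, d in edges if s in kept and d in kept]
degrees_induced = degrees(induced_edges)

print(degrees_before)
print(degrees_induced)
```

So any fix has to filter the edges as well as the vertices before the new GraphFrame is constructed.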

kostjaigin · Answer 3 · 2021-03-22T16:23:32.990

My work-arounds may not be the perfect ones, but they work for me.

Problem statement as I got it: having a filtered collection of nodes filtered_nodes, we only want to have the edges from the original graph that include nodes from filtered_nodes.

Method 1: Using joins (costly)

from graphframes import GraphFrame
edgesframe = graphframe.edges
node_ids = filtered_nodes.select("id")
# Keep edges whose src is in filtered_nodes, then those whose dst is as well
src_join = edgesframe.join(node_ids, edgesframe.src == node_ids.id, "leftsemi")
final_join = src_join.join(node_ids, src_join.dst == node_ids.id, "leftsemi")
g2 = GraphFrame(filtered_nodes, final_join)

Method 2: Collect the node ids into a list and filter with isin (I'd only use this when the set of filtered nodes is small)

edgesframe = graphframe.edges
# "columnWeUseForReference" stands for the id column of filtered_nodes
collected_nodes = filtered_nodes.select("columnWeUseForReference").rdd.map(lambda r: r[0]).collect()
edgs = edgesframe.filter(edgesframe.src.isin(collected_nodes) & edgesframe.dst.isin(collected_nodes))

Does someone have a better approach? I'd be really happy to see it.

score 0 · Answer 4 · edited Dec 27 '22 at 09:50

I recommend using dropIsolatedVertices().

passer-by · answered Dec 27 '22 at 02:50 · edited by Yunnosch Dec 27 '22 at 09:50
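A note on what this suggestion does and does not cover. As I understand the GraphFrames API, dropIsolatedVertices() removes vertices that no edge touches and leaves the edge list alone, so on its own it would not remove the dangling edges the question asks about. The plain-Python model below contrasts the two operations; the helper names and toy data are hypothetical and are not GraphFrames calls.

```python
def drop_isolated_vertices(vertices, edges):
    """Keep only vertices that appear in at least one edge; edges are unchanged."""
    touched = {v for edge in edges for v in edge}
    return [v for v in vertices if v in touched], edges


def drop_dangling_edges(vertices, edges):
    """Keep only edges whose src and dst are both in the vertex list."""
    present = set(vertices)
    return vertices, [(s, d) for s, d in edges if s in present and d in present]


# Hypothetical state after a vertex filter: "c" and "d" were removed,
# but the edge list still mentions them.
vertices = ["a", "b", "e"]
edges = [("a", "b"), ("b", "c"), ("d", "a")]

v1, e1 = drop_isolated_vertices(vertices, edges)
v2, e2 = drop_dangling_edges(vertices, edges)

print(v1, e1)
print(v2, e2)
```

Here the first helper drops the isolated vertex "e" and keeps all three edges, while the second keeps all three vertices and drops the two dangling edges. If your GraphFrames version provides filterVertices (I believe recent releases do), that may be the more direct tool, since it is documented as also removing edges that touch dropped vertices; check the API docs for your version before relying on it.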

