graphframes is a network analysis tool based on PySpark DataFrames. The following code is a modified version of the tutorial subgraphing example:
from graphframes.examples import Graphs
import graphframes
g = Graphs(sqlContext).friends() # Get example graph
# Select subgraph of users older than 30
v2 = g.vertices.filter("age > 30")
g2 = graphframes.GraphFrame(v2, g.edges)
One would expect that the new graph, g2
will contain fewer nodes and fewer edges, compared to the original one, g
.
However, this is not the case:
print(g.vertices.count(), g.edges.count())
print(g2.vertices.count(), g2.edges.count())
Gives the output:
(6, 7)
(7, 4)
It is obvious that the resulting graph contains edges for non-existing nodes.
Even more disturbing is the fact that g.degrees
and g2.degrees
are identical. This means that at least some of graph functionality ignores the nodes information. Is there a good way to make sure that GraphFrame
creates
a graph using only the intersection of the supplied nodes
and edges
arguments?