1

I have been trying to build a graph for a project and I have been trying to identify newly added edges after populating it with more information.

For instance below you can see its first and second iteration:

---------------------- General Info Graph H-----------------------------

Total number of Nodes in Graph:  2364
Total number of Edges:  3151

---------------------- General Info Graph G -----------------------------

Total number of Nodes in Graph:  6035
Total number of Edges:  11245

The problem I have been facing is when I try to identify newly added edges using the code:

counter = 0
edges_all = list(G.edges_iter(data=True)) 
edges_before = list(H.edges_iter(data=True)) 
print "How many edges in old graph: ", len(edges_before)
print "How many edges in new graph: ", len(edges_all)
edge_not_found = []
for edge in edges_all:
    if edge in edges_before:
        counter += 1
    else:
        edge_not_found.append(edge)
print "Edges found: ", counter
print "Not found: ", len(edge_not_found)

And I have been getting these results:

How many edges in old graph:  3151
How many edges in new graph:  11245
Edges found:  1601
Not found:  9644

I can't understand why I am getting 1601 found instead of 11245-3151 = 8094

Any ideas?

Thank you!

Swan87
  • 421
  • 6
  • 23

1 Answers1

1

TL/DR: There's a simple explanation for what you see, and if you get to the end, there is a much shorter way to write your code (with a lot of explanation along the way).


First note that it looks like Edges found is intended to be the number of edges that are in both H and G. So it should only have 3151, not 8094. 8094 should be Not found. Note that the number of edges found, 1601, is about half the number that you would expect. That makes sense because:

I believe the problem you are having is that when networkx lists out the edges an edge might appear as (a,b) in edges_before. However in edges_after, it might appear in the list as (b,a).

So (b,a) will not be in edges_before. It will fail your test. Assuming the edge orders aren't correlated between when they are listed for H and G, you'd expect to find about half of them pass. You can do a different test to see if (b,a) is an edge of H. This is H.has_edge(b,a)

A straightforward improvement:

for edge in edges_all:
    if H.has_edge(edge[0],edge[1]):
        counter += 1
    else:
        edge_not_found.append(edge)

This lets you avoid even defining edges_before.

You can also avoid defining edges_all through a better improvement:

for edge in G.edges_iter(data=True):
    if H.has_edge(edge[0],edge[1]):
        etc

Note: I've written it as H.has_edge(edge[0],edge[1]) to make clear what's happening. A more sophisticated way to write it is H.has_edge(*edge). The *edge notation unpacks the tuple.

Finally, using a list comprehension gives a better way to get edge_not_found:

edge_not_found = [edge for edge in G.edges_iter(data=True) if not H.has_edge(*edge)]

This creates a list made up of edges which are in G but not in H.

Putting this all together (and using the .size() command to count edges in a network), we arrive at a cleaner version:

print "How many edges in old graph: ", H.size()
print "How many edges in new graph: ", G.size()
edge_not_found = [edge for edge in G.edges_iter(data=True) if not H.has_edge(*edge)]
print "Not found: ", len(edge_not_found)
print "Edges found: ", G.size()-len(edge_not_found)
Community
  • 1
  • 1
Joel
  • 22,598
  • 6
  • 69
  • 93
  • Thanks! Yeah it worked like a charm. The only thing is for the edges_not_found line to work as expected if someone has populated the graph edges with data, it should be: edge_not_found = [edge for edge in G.edges_iter() if not H.has_edge(*edge)] or else it returns TypeError: has_edge() takes exactly 3 arguments (4 given). – Swan87 Apr 08 '15 at 09:05