
I am looking for a Python implementation of an algorithm which performs the following task:

Given two directed graphs, which may contain cycles, and their roots, produce a score for the two graphs' similarity.

(The way Python's difflib does for two sequences.)

Hopefully, such an implementation exists. Otherwise, I'll try to implement an algorithm myself, in which case: what is the preferable algorithm to implement (with regard to simplicity)?

The way the algorithm works is of no importance to me, though its complexity is. An algorithm that works with a different data structure is also acceptable, as long as a graph such as I described can be represented with that data structure.

I'll emphasize: an implementation would be much nicer.

Edit:
It seems an isomorphism algorithm is not relevant. It was suggested that graph edit distance is more to the point, which narrows down my search to a solution that either computes graph edit distance or reduces a graph to a tree and then performs tree edit distance.
The nodes themselves consist of a few lines of assembly code each.

YaronK
    "similarity" seems a bit vague to me. Is there a universal meaning of similarity for directed graphs that I'm unaware of? – John La Rooy Aug 25 '12 at 12:56
  • Have you looked over the various [Python graph libraries](http://stackoverflow.com/questions/606516/python-graph-library) to see if any of those implement something already? – Martijn Pieters Aug 25 '12 at 13:11
  • @gnibbler Thank you for your reply. By similarity I mean the following: graphs are equal if a bijection exists between the nodes of the two graphs (isomorphism). Graphs are similar if their edit distance is low. Graphs are somewhat similar if they both have a max common subgraph, and so on.. [link](http://lees-web.mit.edu/lees/presentations/LauraZager.pdf) – YaronK Aug 25 '12 at 13:17
  • @YaronK every two non-empty graphs have a maximum common subgraph, unless they are labeled, of course – Qnan Aug 25 '12 at 13:21
  • @MartijnPieters I have considered both NetworkX and igraph. They both allow checking whether an isomorphism exists, though neither of them quantifies the similarity. – YaronK Aug 25 '12 at 13:32

4 Answers


Another method is to use what is called Eigenvector Similarity. Basically, you calculate the Laplacian eigenvalues for the adjacency matrices of each of the graphs. For each graph, find the smallest k such that the sum of the k largest eigenvalues constitutes at least 90% of the sum of all of the eigenvalues. If the values of k are different between the two graphs, then use the smaller one. The similarity metric is then the sum of the squared differences between the largest k eigenvalues between the graphs. This will produce a similarity metric in the range [0, ∞), where values closer to zero are more similar.

For example, if using networkx:

import networkx as nx
import numpy as np

def select_k(spectrum, minimum_energy = 0.9):
    running_total = 0.0
    total = sum(spectrum)
    if total == 0.0:
        return len(spectrum)
    for i in range(len(spectrum)):
        running_total += spectrum[i]
        if running_total / total >= minimum_energy:
            return i + 1
    return len(spectrum)

# laplacian_spectrum returns eigenvalues in ascending order; sort them
# descending so the largest eigenvalues come first. Note that NetworkX
# only defines the Laplacian for undirected graphs, so a directed graph
# needs g.to_undirected() first.
laplacian1 = np.sort(nx.laplacian_spectrum(graph1))[::-1]
laplacian2 = np.sort(nx.laplacian_spectrum(graph2))[::-1]

k1 = select_k(laplacian1)
k2 = select_k(laplacian2)
k = min(k1, k2)

similarity = sum((laplacian1[:k] - laplacian2[:k])**2)
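A self-contained way to try this out is to wrap the steps above into one function (the function name and the demo graphs are my own, not part of the original answer):

```python
import networkx as nx
import numpy as np

def eigenvector_similarity(g1, g2, minimum_energy=0.9):
    """0.0 means identical Laplacian spectra; larger means less similar."""
    def select_k(spectrum):
        total = spectrum.sum()
        if total == 0.0:
            return len(spectrum)
        running_total = 0.0
        for i, value in enumerate(spectrum):
            running_total += value
            if running_total / total >= minimum_energy:
                return i + 1
        return len(spectrum)

    # largest eigenvalues first; use g.to_undirected() for directed graphs
    s1 = np.sort(nx.laplacian_spectrum(g1))[::-1]
    s2 = np.sort(nx.laplacian_spectrum(g2))[::-1]
    k = min(select_k(s1), select_k(s2))
    return float(((s1[:k] - s2[:k]) ** 2).sum())

print(eigenvector_similarity(nx.path_graph(5), nx.path_graph(5)))  # 0.0
print(eigenvector_similarity(nx.path_graph(5), nx.star_graph(4)))  # > 0
```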
ESultanik
    This looks very interesting @ESultanik, do you have reference papers for this method? Thanks! – Lucien S. Dec 08 '14 at 22:34
    @Rodolphe [This is a student report](https://www.cs.cmu.edu/~jingx/docs/DBreport.pdf), but it acts as a relatively good introduction/survey. – ESultanik Dec 08 '14 at 23:50
  • Can you think of a way to normalize this similarity method so that it can detect approximate subgraphs? For example, if G1 has 100 nodes and 200 edges, and G2 has 5 nodes and 7 edges, you'd want to adjust the similarity threshold by a function of the differences in the number of nodes and edges. – Andrew Apr 20 '15 at 21:21
  • Also, would it make sense to use the absolute value of eigenvalues? Otherwise negatives would cause the total to be met with a smaller k. Not sure if this is what's wanted as a large negative eigenvalue would mean a big difference in the roots of the characteristic polynomials of the laplacians. – Andrew May 05 '15 at 22:51
  • EDIT: I think what might be even better is first sorting the eigenvalues, then taking their absolute value so that the position of the negatives stay the same. – Andrew May 05 '15 at 22:57
  • Hello @ESultanik! I'm very interested in this and tried to implement it in R. See pastebin here: http://pastebin.com/W320mYBx. I don't understand why if running_total/total >=0.9, you return i+1. Why plus one? For my code, I get NA sometimes and 1.286 other times. I don't understand why it wouldn't give me the same answer each time? This is using igraph in the R programming language - so require(igraph) is needed. – lrthistlethwaite Aug 28 '15 at 22:13
  • @areyoujokingme It's because `i` is zero-based; `spectrum[i]` is actually the `i+1`th element, which is what we want for `k`. Remember that Python has zero-based array indexing while R has one-based. I suspect your problem may be related to that; you should carefully check lines 5, 7, 13, and 22. – ESultanik Aug 31 '15 at 12:56
    Yes, @ESultanik!!! Thanks for reminding me your implementation was in Python! You're awesome - thank you!! – lrthistlethwaite Sep 01 '15 at 15:21
  • If using this method, there are some edge cases (no pun intended) that one should be aware of. For instance, comparing the graph G: nodes = [1,2,3,4,5], edges = [[1,5], [1,4], [1,3], [1,2], [2,4], [2,3], [4,5]] with F: nodes = [3], edges = [] gives a similarity of 0 according to this algorithm. :0 Using this function along with other checks (difference in number of nodes and edges) is important. – Jaden Travnik Apr 13 '18 at 18:42

What we ended up doing is implementing an algorithm described in: "Heuristics for Chemical Compound Matching".

We're using NetworkX to represent the graph and to find the max clique.

Edit:

Basically, you create a new graph, each node (v) representing a possible pairing of a node from graph A (a) to a node from graph B (b).

If in your application the two nodes (a,b) are either similar or not, you remove nodes (v) from the new graph which correspond to dissimilar pairings (a,b). You connect two nodes with an edge if they don't contradict each other. For example, the pairings (a,b) and (a,c) contradict each other (see the article for a formal definition). You then find a clique in the new graph which has the maximum number of nodes.

If in your application the two nodes' similarity is not binary, you give the new nodes weights within a range (say (0,1)). You can, heuristically, remove new nodes with similarity grades lower than a predefined threshold. You then find a clique in the new graph which has the maximum weight (the sum of the nodes' assigned weights).

Either way, you finish by generating the similarity grade: the size / total weight of the clique divided by a function of the attributes of the original graphs (maximum/minimum/average of the sizes / weights of A and B).

A nice feature is that you can deduce the "source" of the similarity from the clique you found: the "stronger" pairings.

Further clarification: The constraints are application dependent. We used the approach to compare pairs of function control-flow graphs. Generally, the approach finds a matching of some nodes in the first graph to some nodes in the second graph (subgraph to subgraph). Each node in the association graph represents a possible matching of a single node from the first graph to a single node in the second graph. Since eventually a clique (a subset of the nodes) is selected, an edge means two matchings don't contradict each other. To apply this to a different application, you should ask what the criteria for possible pairings are (i.e., which nodes do I create) and how selecting one pairing influences the selection of another (i.e., how do I connect the nodes with edges).
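A rough sketch of the binary-similarity variant with NetworkX (the names are mine; the edge-compatibility rule here is the standard association-graph construction for maximum common subgraph, which may differ in detail from the article's formal definition):

```python
from itertools import combinations
import networkx as nx

def association_similarity(A, B, node_sim):
    """Match nodes of directed graphs A and B via a maximum clique in
    the association graph; returns a grade in [0, 1]."""
    M = nx.Graph()
    # one association node per similar pairing (a, b)
    M.add_nodes_from((a, b) for a in A for b in B if node_sim(a, b))
    # connect pairings that don't contradict each other: no node is
    # reused, and the edge structure between the paired nodes agrees
    for (a1, b1), (a2, b2) in combinations(M.nodes, 2):
        if (a1 != a2 and b1 != b2
                and A.has_edge(a1, a2) == B.has_edge(b1, b2)
                and A.has_edge(a2, a1) == B.has_edge(b2, b1)):
            M.add_edge((a1, b1), (a2, b2))
    # the largest clique is the largest consistent set of pairings
    clique = max(nx.find_cliques(M), key=len, default=[])
    return len(clique) / min(len(A), len(B))
```

For example, two directed 3-cycles with a node similarity that always holds get grade 1.0, while a 3-cycle against three isolated nodes gets 1/3 (only single-node pairings survive).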

YaronK
    Fantastic, I'm looking for such an algorithm. Would you care sharing it? – Lucien S. Jun 25 '14 at 20:19
  • I'm having a hard time understanding why (a,b) and (a,c) contradict each other. Why can't a be paired with more than one node? I've tried to transcribe the algorithm above in R, but I'm stuck at the contradiction stage. Any help @YaronK? Here's a pastebin: http://pastebin.com/ua5XrTG0 – lrthistlethwaite Aug 28 '15 at 21:40
  • @areyoujokingme If the 'a' node is matched with the 'b' node, and 'b' is a different node from 'c', then 'a' can't also be matched with 'c'. A node may only be matched with a single node from the other graph. – YaronK Aug 29 '15 at 08:13
  • This is a prerequisite of your approach, isn't it - because it's applied to chemical structures? I'm actually wanting to use this algorithm for protein-protein interaction graphs and in this circumstance, it's certainly allowed that protein A interact with both B and C, etc. Does this make sense? For PPIs, then, is this algorithm still useful? I'm not sure.. – lrthistlethwaite Aug 30 '15 at 02:06
  • @areyoujokingme I've edited the answer and tried to refer to your questions. I hope this clarifies things. – YaronK Aug 30 '15 at 09:44

This is an old question, but I would like to share my approach. I had a CVRP (Capacitated Vehicle Routing Problem) task. My heuristic algorithm produced several different graphs in order to find a solution. To avoid getting stuck at a local optimum, I used a relax-and-repair procedure.

At this point, I had to filter out solutions that were too similar. Since most heuristic approaches use a systematic change of neighborhoods within a local-search procedure, an edit distance (Levenshtein distance) was perfect for me. The Levenshtein algorithm has a complexity of O(n*m), where n and m are the lengths of the two strings. So with a string representation of the graph nodes and routes, I was able to figure out the similarity. The edit operations can be considered neighborhood operations, so the result can be viewed as a search-space distance (not a solution-space distance).
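That O(n*m) dynamic program can be sketched in plain Python (fast C/Cython implementations exist and are preferable for large inputs):

```python
def levenshtein(s, t):
    # prev holds distances from s[:i-1] to every prefix of t
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # delete from s
                           cur[j - 1] + 1,               # insert into s
                           prev[j - 1] + (cs != ct)))    # substitute
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Applied to route strings like '045601230' and '062401530', this gives the number of edit (neighborhood) operations between two solutions.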

A better / more general approach that sacrifices some speed is the Needleman-Wunsch algorithm. Needleman-Wunsch is a hybrid similarity measure that generalizes the Levenshtein distance and considers global alignment between two strings. Specifically, it is computed by assigning a score to each alignment between the two input strings and choosing the score of the best alignment, that is, the maximal score. An alignment between two strings is a set of correspondences between their characters, allowing for gaps.

For example:

import py_stringmatching as sm

# score 1 for matching characters and 0 otherwise, with a gap penalty of 0.5,
# so the algorithm rewards aligned characters
nw = sm.NeedlemanWunsch(gap_cost=0.5, sim_func=lambda s1, s2: (1.0 if s1 == s2 else 0.0))
print('Needleman-Wunsch:', nw.get_raw_score('045601230', '062401530'))

In the example, you can plug in a custom similarity function (e.g. Levenshtein-based) via sim_func.

Fast implementations of Levenshtein exist on GitHub (using Cython, NumPy, etc.).
A nice library is py_stringmatching, which contains the following similarity algorithms:

  • Affine Gap
  • Bag Distance
  • Cosine
  • Dice
  • Editex
  • Generalized Jaccard
  • Hamming Distance
  • Jaccard
  • Jaro
  • Jaro Winkler
  • Levenshtein
  • Monge Elkan
  • Needleman Wunsch
  • Overlap Coefficient
  • Partial Ratio
  • Partial Token Sort
  • Ratio
  • Smith-Waterman
  • Soft TF/IDF
  • Soundex
  • TF/IDF
  • Token Sort
  • Tversky Index
Panos Kalatzantonakis

I think the way you define similarity is not very standard, so it is probably easier to find implementations of the isomorphism check, graph edit distance, and maximum common subgraph, and then combine them yourself.

One should understand, however, that calculating such a similarity measure might be costly, so if there are a lot of graphs you might want to do some sort of screening first.

igraph can check for isomorphism, for example.
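With NetworkX, for instance, the isomorphism check is a single call (igraph has an equivalent; the graphs below are my own example):

```python
import networkx as nx

# two directed 3-cycles with different node labels: structurally identical
g1 = nx.DiGraph([(0, 1), (1, 2), (2, 0)])
g2 = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a")])
print(nx.is_isomorphic(g1, g2))  # True
```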

EDIT: actually you probably only need the edit distance, since the presence of an MCS is usually irrelevant, as I pointed out above, and two graphs will be isomorphic anyway if the edit distance is 0.
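(For later readers: newer versions of NetworkX ship an exact graph edit distance, nx.graph_edit_distance, though it is exponential in the worst case, consistent with the hardness discussed in the comments.)

```python
import networkx as nx

g1 = nx.path_graph(3)  # 0-1-2
g2 = nx.path_graph(2)  # 0-1
# exact, but worst-case exponential; fine for small graphs
print(nx.graph_edit_distance(g1, g1))  # 0.0 for identical graphs
print(nx.graph_edit_distance(g1, g2))  # 2.0: delete one node and one edge
```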

Qnan
  • Thank you for your reply. How about graph edit distance or maximum common subgraph – are you familiar with any Python implementations of algorithms trying to solve these problems (preferably not costly)? – YaronK Aug 25 '12 at 13:53
  • @YaronK any *exact* graph edit distance algorithm available is going to be costly, i.e. non-polynomial, since the problem is at least as hard as that of the graph isomorphism http://en.wikipedia.org/wiki/Graph_isomorphism_problem. I don't know of any implementation of such an algorithm, but would be interested to see one. – Qnan Aug 25 '12 at 13:58
  • @YaronK it also seems that the literature describes a number of efficient approximate solutions that work well for certain kinds of graphs. Where do those you're looking at come from? – Qnan Aug 25 '12 at 14:02
  • The graphs I'm referring to are actually functions' control flows represented as graphs. – YaronK Aug 25 '12 at 14:08
  • @YaronK that means the nodes are labelled with statements and/or function names, right? – Qnan Aug 25 '12 at 14:21
  • Each node is in fact a few consecutive assembly instructions. Does that help? – YaronK Aug 25 '12 at 14:32
  • @YaronK I'm not sure. On one hand, that actually makes the calculation of the edit distance even more complex (computationally), since more moves are allowed. On the other hand, it sometimes allows defining a simple lower bound for the edit distance based on the set of nodes present, which can be helpful, as in most applications one does not care about items that are further apart than some fixed number of edits. – Qnan Aug 25 '12 at 18:39