
I am looking for a Python implementation of an algorithm which performs the following task:

Given two directed graphs, which may contain cycles, and their roots, produce a score for the two graphs' similarity.

(The way Python's difflib does for two sequences.)

Hopefully, such an implementation exists. Otherwise, I'll try to implement an algorithm myself, in which case: what is the preferable algorithm to implement (with regard to simplicity)?

The way the algorithm works is of no importance to me, though its complexity is. An algorithm that works with a different data structure is also acceptable, as long as a graph such as I described can be represented with that data structure.

I'll emphasize: an implementation would be much nicer.

Edit:
It seems an isomorphism algorithm is not relevant. It was suggested that graph edit distance is more to the point, which narrows down my search to a solution that either computes graph edit distance or reduces a graph to a tree and then performs tree edit distance.
The nodes themselves consist of a few lines of assembly code each.

YaronK
    "similarity" seems a bit vague to me. Is there a universal meaning of similarity for directed graphs that I'm unaware of? – John La Rooy Aug 25 '12 at 12:56
  • Have you looked over the various [Python graph libraries](http://stackoverflow.com/questions/606516/python-graph-library) to see if any of those implement something already? – Martijn Pieters Aug 25 '12 at 13:11
  • @gnibbler Thank you for your reply. By similarity I mean the following: graphs are equal if a bijection exists between the nodes of the two graphs (isomorphism). Graphs are similar if their edit distance is low. Graphs are somewhat similar if they both have a max common subgraph, and so on.. [link](http://lees-web.mit.edu/lees/presentations/LauraZager.pdf) – YaronK Aug 25 '12 at 13:17
  • @YaronK every two non-empty graphs have a maximum common subgraph, unless they are labeled, of course – Qnan Aug 25 '12 at 13:21
  • @MartijnPieters I have considered both NetworkX and igraph. They both allow checking whether an isomorphism exists, though neither of them quantifies the similarity. – YaronK Aug 25 '12 at 13:32

4 Answers


Another method is to use what is called Eigenvector Similarity. Basically, you calculate the Laplacian eigenvalues for the adjacency matrices of each of the graphs. For each graph, find the smallest k such that the sum of the k largest eigenvalues constitutes at least 90% of the sum of all of the eigenvalues. If the values of k are different between the two graphs, then use the smaller one. The similarity metric is then the sum of the squared differences between the largest k eigenvalues between the graphs. This will produce a similarity metric in the range [0, ∞), where values closer to zero are more similar.

For example, if using networkx:

import networkx as nx
import numpy as np

def select_k(spectrum, minimum_energy = 0.9):
    running_total = 0.0
    total = sum(spectrum)
    if total == 0.0:
        return len(spectrum)
    for i in range(len(spectrum)):
        running_total += spectrum[i]
        if running_total / total >= minimum_energy:
            return i + 1
    return len(spectrum)

# laplacian_spectrum returns eigenvalues in ascending order; sort them
# descending so the largest eigenvalues come first. Note that NetworkX
# only defines the Laplacian for undirected graphs, so a directed graph
# needs g.to_undirected() first.
laplacian1 = np.sort(nx.laplacian_spectrum(graph1))[::-1]
laplacian2 = np.sort(nx.laplacian_spectrum(graph2))[::-1]

k1 = select_k(laplacian1)
k2 = select_k(laplacian2)
k = min(k1, k2)

similarity = sum((laplacian1[:k] - laplacian2[:k])**2)
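A self-contained way to try this out is to wrap the steps above into one function (the function name and the demo graphs are my own, not part of the original answer):

```python
import networkx as nx
import numpy as np

def eigenvector_similarity(g1, g2, minimum_energy=0.9):
    """0.0 means identical Laplacian spectra; larger means less similar."""
    def select_k(spectrum):
        total = spectrum.sum()
        if total == 0.0:
            return len(spectrum)
        running_total = 0.0
        for i, value in enumerate(spectrum):
            running_total += value
            if running_total / total >= minimum_energy:
                return i + 1
        return len(spectrum)

    # largest eigenvalues first; use g.to_undirected() for directed graphs
    s1 = np.sort(nx.laplacian_spectrum(g1))[::-1]
    s2 = np.sort(nx.laplacian_spectrum(g2))[::-1]
    k = min(select_k(s1), select_k(s2))
    return float(((s1[:k] - s2[:k]) ** 2).sum())

print(eigenvector_similarity(nx.path_graph(5), nx.path_graph(5)))  # 0.0
print(eigenvector_similarity(nx.path_graph(5), nx.star_graph(4)))  # > 0
```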
ESultanik
    This looks very interesting @ESultanik, do you have reference papers for this method? Thanks! – Lucien S. Dec 08 '14 at 22:34
    @Rodolphe [This is a student report](https://www.cs.cmu.edu/~jingx/docs/DBreport.pdf), but it acts as a relatively good introduction/survey. – ESultanik Dec 08 '14 at 23:50
  • Can you think of a way to normalize this similarity method so that it can detect approximate subgraphs? For example, if G1 has 100 nodes and 200 edges, and G2 has 5 nodes and 7 edges, you'd want to adjust the similarity threshold by a function of the differences in the number of nodes and edges. – Andrew Apr 20 '15 at 21:21
  • Also, would it make sense to use the absolute value of eigenvalues? Otherwise negatives would cause the total to be met with a smaller k. Not sure if this is what's wanted as a large negative eigenvalue would mean a big difference in the roots of the characteristic polynomials of the laplacians. – Andrew May 05 '15 at 22:51
  • EDIT: I think what might be even better is first sorting the eigenvalues, then taking their absolute value so that the position of the negatives stay the same. – Andrew May 05 '15 at 22:57
  • Hello @ESultanik! I'm very interested in this and tried to implement it in R. See pastebin here: http://pastebin.com/W320mYBx. I don't understand why if running_total/total >=0.9, you return i+1. Why plus one? For my code, I get NA sometimes and 1.286 other times. I don't understand why it wouldn't give me the same answer each time? This is using igraph in the R programming language - so require(igraph) is needed. – lrthistlethwaite Aug 28 '15 at 22:13
  • @areyoujokingme It's because `i` is zero-based; `spectrum[i]` is actually the `i+1`th element, which is what we want for `k`. Remember that Python has zero-based array indexing while R has one-based. I suspect your problem may be related to that; you should carefully check lines 5, 7, 13, and 22. – ESultanik Aug 31 '15 at 12:56
    Yes, @ESultanik!!! Thanks for reminding me your implementation was in Python! You're awesome - thank you!! – lrthistlethwaite Sep 01 '15 at 15:21
  • If using this method, there are some edge cases (no pun intended) that one should be aware of. For instance, comparing the graph G: nodes = [1,2,3,4,5], edges = [[1,5], [1,4], [1,3], [1,2], [2,4], [2,3], [4,5]] with F: nodes = [3], edges = [] gives a similarity of 0 according to this algorithm. :0 Using this function along with other checks (difference in number of nodes and edges) is important. – Jaden Travnik Apr 13 '18 at 18:42

What we ended up doing is implementing an algorithm described in: "Heuristics for Chemical Compound Matching".

We're using NetworkX to represent the graph and to find the max clique.

Edit:

Basically, you create a new graph, each node (v) representing a possible pairing of a node from graph A (a) to a node from graph B (b).

If in your application the two nodes (a,b) are either similar or not, you remove nodes (v) from the new graph which correspond to dissimilar pairings (a,b). You connect two nodes with an edge if they don't contradict each other. For example, the pairings (a,b) and (a,c) contradict each other (see the article for a formal definition). You then find a clique in the new graph which has the maximum number of nodes.

If in your application the two nodes' similarity is not binary, you give the new nodes weights within a range (say (0,1)). You can, heuristically, remove new nodes with similarity grades lower than a predefined threshold. You then find a clique in the new graph which has the maximum weight (the sum of the nodes' assigned weights).

Either way, you finish by generating the similarity grade: the size / total weight of the clique divided by a function of the attributes of the original graphs (maximum/minimum/average of the sizes / weights of A and B).

A nice feature is that you can deduce the "source" of the similarity from the clique you found: the "stronger" pairings.

Further clarification: The constraints are application dependent. We used the approach to compare pairs of function control-flow graphs. Generally, the approach finds a matching of some nodes in the first graph to some nodes in the second graph (subgraph to subgraph). Each node in the association graph represents a possible matching of a single node from the first graph to a single node in the second graph. Since eventually a clique (a subset of the nodes) is selected, an edge means two matchings don't contradict each other. To apply this to a different application, you should ask what the criteria for possible pairings are (i.e., which nodes do I create) and how selecting one pairing influences the selection of another (i.e., how do I connect the nodes with edges).
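A rough sketch of the binary-similarity variant with NetworkX (the names are mine; the edge-compatibility rule here is the standard association-graph construction for maximum common subgraph, which may differ in detail from the article's formal definition):

```python
from itertools import combinations
import networkx as nx

def association_similarity(A, B, node_sim):
    """Match nodes of directed graphs A and B via a maximum clique in
    the association graph; returns a grade in [0, 1]."""
    M = nx.Graph()
    # one association node per similar pairing (a, b)
    M.add_nodes_from((a, b) for a in A for b in B if node_sim(a, b))
    # connect pairings that don't contradict each other: no node is
    # reused, and the edge structure between the paired nodes agrees
    for (a1, b1), (a2, b2) in combinations(M.nodes, 2):
        if (a1 != a2 and b1 != b2
                and A.has_edge(a1, a2) == B.has_edge(b1, b2)
                and A.has_edge(a2, a1) == B.has_edge(b2, b1)):
            M.add_edge((a1, b1), (a2, b2))
    # the largest clique is the largest consistent set of pairings
    clique = max(nx.find_cliques(M), key=len, default=[])
    return len(clique) / min(len(A), len(B))
```

For example, two directed 3-cycles with a node similarity that always holds get grade 1.0, while a 3-cycle against three isolated nodes gets 1/3 (only single-node pairings survive).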

YaronK
    Fantastic, I'm looking for such an algorithm. Would you care sharing it? – Lucien S. Jun 25 '14 at 20:19
  • I'm having a hard time understanding why (a,b) and (a,c) contradict each other. Why can't a be paired with more than one node? I've tried to transcribe the algorithm above in R, but I'm stuck at the contradiction stage. Any help @YaronK? Here's a pastebin: http://pastebin.com/ua5XrTG0 – lrthistlethwaite Aug 28 '15 at 21:40
  • @areyoujokingme If the 'a' node is matched with the 'b' node, and 'b' is a different node from 'c', then 'a' can't also be matched with 'c'. A node may only be matched with a single node from the other graph. – YaronK Aug 29 '15 at 08:13
  • This is a prerequisite of your approach, isn't it - because it's applied to chemical structures? I'm actually wanting to use this algorithm for protein-protein interaction graphs and in this circumstance, it's certainly allowed that protein A interact with both B and C, etc. Does this make sense? For PPIs, then, is this algorithm still useful? I'm not sure.. – lrthistlethwaite Aug 30 '15 at 02:06
  • @areyoujokingme I've edited the answer and tried to refer to your questions. I hope this clarifies things. – YaronK Aug 30 '15 at 09:44

This is an old question, but I would like to share my approach. I had a CVRP (Capacitated Vehicle Routing Problem) task. My heuristic algorithm produced several different graphs in order to find a solution. To avoid getting stuck at a local optimum, I used a relax-and-repair procedure.

At this point, I had to filter out solutions that were too similar. Since most heuristic approaches use a systematic change of neighborhoods within a local-search procedure, an edit distance (Levenshtein distance) was perfect for me. The Levenshtein algorithm has a complexity of O(n*m), where n and m are the lengths of the two strings. So with a string representation of the graph nodes and routes, I was able to figure out the similarity. The edit operations can be considered neighborhood operations, so the result can be viewed as a search-space distance (not a solution-space distance).
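That O(n*m) dynamic program can be sketched in plain Python (fast C/Cython implementations exist and are preferable for large inputs):

```python
def levenshtein(s, t):
    # prev holds distances from s[:i-1] to every prefix of t
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # delete from s
                           cur[j - 1] + 1,               # insert into s
                           prev[j - 1] + (cs != ct)))    # substitute
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Applied to route strings like '045601230' and '062401530', this gives the number of edit (neighborhood) operations between two solutions.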

A better / more general approach that sacrifices some speed is the Needleman-Wunsch algorithm. Needleman-Wunsch is a hybrid similarity measure that generalizes the Levenshtein distance and considers global alignment between two strings. Specifically, it is computed by assigning a score to each alignment between the two input strings and choosing the score of the best alignment, that is, the maximal score. An alignment between two strings is a set of correspondences between their characters, allowing for gaps.

For example:

import py_stringmatching as sm

# score 1 for matching characters and 0 otherwise, with a gap penalty of 0.5,
# so the algorithm rewards aligned characters
nw = sm.NeedlemanWunsch(gap_cost=0.5, sim_func=lambda s1, s2: (1.0 if s1 == s2 else 0.0))
print('Needleman-Wunsch:', nw.get_raw_score('045601230', '062401530'))

In the example, you can plug in a custom similarity function (e.g. Levenshtein-based) via sim_func.

Fast implementations of Levenshtein exist on GitHub (using Cython, NumPy, etc.).
A nice library is py_stringmatching, which contains the following similarity algorithms:

  • Affine Gap
  • Bag Distance
  • Cosine
  • Dice
  • Editex
  • Generalized Jaccard
  • Hamming Distance
  • Jaccard
  • Jaro
  • Jaro Winkler
  • Levenshtein
  • Monge Elkan
  • Needleman Wunsch
  • Overlap Coefficient
  • Partial Ratio
  • Partial Token Sort
  • Ratio
  • Smith-Waterman
  • Soft TF/IDF
  • Soundex
  • TF/IDF
  • Token Sort
  • Tversky Index
Panos Kalatzantonakis

I think the way you define similarity is not very standard, so it is probably easier to find implementations of the isomorphism check, graph edit distance, and maximum common subgraph, and then combine them yourself.

One should understand, however, that calculating such a similarity measure might be costly, so if there are a lot of graphs you might want to do some sort of screening first.

igraph can check for isomorphism, for example.
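With NetworkX, for instance, the isomorphism check is a single call (igraph has an equivalent; the graphs below are my own example):

```python
import networkx as nx

# two directed 3-cycles with different node labels: structurally identical
g1 = nx.DiGraph([(0, 1), (1, 2), (2, 0)])
g2 = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a")])
print(nx.is_isomorphic(g1, g2))  # True
```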

EDIT: actually you probably only need the edit distance, since the presence of an MCS is usually irrelevant, as I pointed out above, and two graphs will be isomorphic anyway if the edit distance is 0.
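(For later readers: newer versions of NetworkX ship an exact graph edit distance, nx.graph_edit_distance, though it is exponential in the worst case, consistent with the hardness discussed in the comments.)

```python
import networkx as nx

g1 = nx.path_graph(3)  # 0-1-2
g2 = nx.path_graph(2)  # 0-1
# exact, but worst-case exponential; fine for small graphs
print(nx.graph_edit_distance(g1, g1))  # 0.0 for identical graphs
print(nx.graph_edit_distance(g1, g2))  # 2.0: delete one node and one edge
```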

Qnan
  • Thank you for your reply. How about graph edit distance or maximum common subgraph – are you familiar with any Python implementations of algorithms trying to solve these problems (preferably not costly)? – YaronK Aug 25 '12 at 13:53
  • @YaronK any *exact* graph edit distance algorithm available is going to be costly, i.e. non-polynomial, since the problem is at least as hard as that of the graph isomorphism http://en.wikipedia.org/wiki/Graph_isomorphism_problem. I don't know of any implementation of such an algorithm, but would be interested to see one. – Qnan Aug 25 '12 at 13:58
  • @YaronK it also seems that the literature describes a number of efficient approximate solutions that work well for certain kinds of graphs. Where do those you're looking at come from? – Qnan Aug 25 '12 at 14:02
  • The graphs I'm referring to are actually functions' control flows represented as graphs. – YaronK Aug 25 '12 at 14:08
  • @YaronK that means the nodes are labelled with statements and/or function names, right? – Qnan Aug 25 '12 at 14:21
  • Each node is in fact a few consecutive assembly instructions. Does that help? – YaronK Aug 25 '12 at 14:32
  • @YaronK I'm not sure. On one hand, that actually makes the calculation of the edit distance even more complex (computationally), since more moves are allowed. On the other hand, it sometimes allows defining a simple lower bound for the edit distance based on the set of nodes present, which can be helpful, as in most applications one does not care about items that are further apart than some fixed number of edits. – Qnan Aug 25 '12 at 18:39