
How do I transform a directed acyclic graph into a hash value such that any two isomorphic graphs hash to the same value? It is acceptable, but undesirable for two isomorphic graphs to hash to different values, which is what I have done in the code below. We can assume that the number of vertices in the graph is at most 11.

I am particularly interested in Python code.

Here is what I did. If self.lt is a mapping from node to descendants (not children!), then I relabel the nodes according to a modified topological sort (that prefers to order elements with more descendants first if it can). Then, I hash the sorted dictionary. Some isomorphic graphs will hash to different values, especially as the number of nodes grows.

I have included all the code to motivate my use case. I am calculating the number of comparisons required to find the median of 7 numbers. The more often isomorphic graphs hash to the same value, the less work has to be redone. I considered putting larger connected components first, but didn't see how to do that quickly.

from tools.decorator import memoized  # A standard memoization decorator


class Graph:
    def __init__(self, n):
        self.lt = {i: set() for i in range(n)}

    def compared(self, i, j):
        return j in self.lt[i] or i in self.lt[j]

    def withedge(self, i, j):
        retval = Graph(len(self.lt))
        implied_lt = self.lt[j] | set([j])
        for (s, lt_s), (k, lt_k) in zip(self.lt.items(),
                                        retval.lt.items()):
            lt_k |= lt_s
            if i in lt_k or k == i:
                lt_k |= implied_lt
        return retval.toposort()

    def toposort(self):
        mapping = {}
        while len(mapping) < len(self.lt):
            for i, lt_i in self.lt.items():
                if i in mapping:
                    continue
                if any(i in lt_j or len(lt_i) < len(lt_j)
                       for j, lt_j in self.lt.items()
                       if j not in mapping):
                    continue
                mapping[i] = len(mapping)
        retval = Graph(0)
        for i, lt_i in self.lt.items():
            retval.lt[mapping[i]] = {mapping[j]
                                     for j in lt_i}
        return retval

    def median_known(self):
        n = len(self.lt)
        for i, lt_i in self.lt.items():
            if len(lt_i) != n // 2:
                continue
            if sum(1
                   for j, lt_j in self.lt.items()
                   if i in lt_j) == n // 2:
                return True
        return False

    def __repr__(self):
        return("[{}]".format(", ".join("{}: {{{}}}".format(
            i,
            ", ".join(str(x) for x in lt_i))
                                       for i, lt_i in self.lt.items())))

    def hashkey(self):
        return tuple(sorted({k: tuple(sorted(v))
                             for k, v in self.lt.items()}.items()))

    def __hash__(self):
        return hash(self.hashkey())

    def __eq__(self, other):
        return self.hashkey() == other.hashkey()


@memoized
def mincomps(g):
    print("Calculating:", g)
    if g.median_known():
        return 0
    nodes = g.lt.keys()
    return 1 + min(max(mincomps(g.withedge(i, j)),
                       mincomps(g.withedge(j, i)))
                   for i in nodes
                   for j in nodes
                   if j > i and not g.compared(i, j))


g = Graph(7)
print(mincomps(g))
Neil G
  • Googling "Hash value for a graph" has led me to this interesting paper by Ashish Kundu and Elisa Bertino called "[On Hashing Graph](http://webcache.googleusercontent.com/search?q=cache:RIrkVHevAlkJ:eprint.iacr.org/2012/352.ps+&cd=1&hl=en&ct=clnk&gl=us&client=firefox-a)" about a solution to hashing DAGs (using 2 O(1) operations). I'm not at a level where I could distill this paper into an answer to your question, but I'm having fun reading about it :) – Jason Sperske Jan 25 '13 at 23:59
  • Also there is something called "Merkle Signature Schemes" which the Kundu & Bertino paper cites as a starting solution, if that helps – Jason Sperske Jan 26 '13 at 00:00
  • Are there labels at the vertices or edges? If not, should isomorphic graphs hash to the same value? – Henry Jan 28 '13 at 07:21
  • Need the hash be **unique**, or only usually unique? (The latter is all that is required for a Python object hashing function.) – Jan 28 '13 at 08:10
  • Are we talking about a hash function or a cryptographic hash function? – Daniel Brückner Jan 28 '13 at 13:34
  • @Henry: Yes, they should hash to the same value! Otherwise, this is just hashing a dictionary. – Neil G Jan 28 '13 at 15:44
  • @duskwuff: It should be as unique as possible since I'm using the dag as a key for dynamic programming. – Neil G Jan 28 '13 at 15:45
  • @DanielBrückner: Just a regular hash function. – Neil G Jan 28 '13 at 15:45
  • @NeilG If you have your own solution, you should post it as an answer, don't put it inside the question. – svick Jan 28 '13 at 16:57
  • @svick: You're right. I wanted to show the whole problem I'm trying to solve to motivate my solution. But, I am still looking for improvements to the `toposort` method so that isomorphic graphs will hash the same. – Neil G Jan 28 '13 at 17:07
  • @duskwuff: what did you mean when you said that a hash function only has to be usually unique for a Python object hashing function? – Neil G Jan 28 '13 at 17:09
  • @NeilG: For a Python `__hash__` function, the only hard requirement is that the hash value for two identical objects always be identical. Collisions are not ideal, but they aren't the end of the world, especially if they're rare. –  Jan 28 '13 at 17:15
  • @duskwuff: Okay, I think I worded my question in a confusing way, and I've tried to clear up what I meant by unique. – Neil G Jan 28 '13 at 19:43
  • @NeilG are you implying that you are after an algorithm that will effectively detect whether two graphs are isomorphic (GI)? You are aware that it is not known whether GI is in `P` or is `NP`-complete (supposing `NP != P`), right? I'm not aware of anything correct that beats nauty (http://cs.anu.edu.au/~bdm/nauty/). I remember something from some years ago proving that GI was in `P` (the author also included an `O(n^5)` algorithm), but the proof is flawed and I'm not sure if it ended up being published or not. – mmgp Jan 28 '13 at 21:40
  • @mmgp: My algorithm is approximate in that for two isomorphic graphs it usually, but not always returns the same canonical representation using the `toposort` method. Ideally, this would be improved. Note that there are very few vertices and so exponential time is probably fine. – Neil G Jan 28 '13 at 22:01
  • @mmgp: nauty is exactly what I'm looking for. Too bad it's not available in Python or as part of `networkx`. – Neil G Jan 28 '13 at 22:29
  • @NeilG I remember something called `pynauty` that was a wrapper for nauty. – mmgp Jan 28 '13 at 22:57
  • @mmgp: Please add an answer. That looks correct to me! – Neil G Jan 28 '13 at 23:04

10 Answers


To test effectively for graph isomorphism you will want to use nauty. Specifically for Python there is the wrapper pynauty, but I can't attest to its quality (to compile it correctly I had to do some simple patching of its setup.py). If this wrapper is doing everything correctly, then it simplifies nauty a lot for the uses you are interested in, and it is then only a matter of hashing pynauty.certificate(somegraph) -- which will be the same value for isomorphic graphs.

Some quick tests showed that pynauty was giving the same certificate for every graph (with the same number of vertices). But that was only because of a minor issue in the wrapper when converting the graph to nauty's format. After fixing this, it works for me (I also used the graphs at http://funkybee.narod.ru/graphs.htm for comparison). Here is the short patch, which also includes the modifications needed in setup.py:

diff -ur pynauty-0.5-orig/setup.py pynauty-0.5/setup.py
--- pynauty-0.5-orig/setup.py   2011-06-18 20:53:17.000000000 -0300
+++ pynauty-0.5/setup.py        2013-01-28 22:09:07.000000000 -0200
@@ -31,7 +31,9 @@

 ext_pynauty = Extension(
         name = MODULE + '._pynauty',
-        sources = [ pynauty_dir + '/' + 'pynauty.c', ],
+        sources = [ pynauty_dir + '/' + 'pynauty.c',
+            os.path.join(nauty_dir, 'schreier.c'),
+            os.path.join(nauty_dir, 'naurng.c')],
         depends = [ pynauty_dir + '/' + 'pynauty.h', ],
         extra_compile_args = [ '-O4' ],
         extra_objects = [ nauty_dir + '/' + 'nauty.o',
diff -ur pynauty-0.5-orig/src/pynauty.c pynauty-0.5/src/pynauty.c
--- pynauty-0.5-orig/src/pynauty.c      2011-03-03 23:34:15.000000000 -0300
+++ pynauty-0.5/src/pynauty.c   2013-01-29 00:38:36.000000000 -0200
@@ -320,7 +320,7 @@
     PyObject *adjlist;
     PyObject *p;

-    int i,j;
+    Py_ssize_t i, j;
     int adjlist_length;
     int x, y;
mmgp
  • I guess I have the responsibility of selecting the winner of the bounty, so if this answer works for @NeilG then it works for me. However, the overall response to this question has been a great learning experience for me as I take on more advanced computer science topics (I've been working on graphs lately). I feel like SO can act like an ad hoc classroom discussion for any topic, and all it cost me was a week of correct answers to other questions. I love this site :) – Jason Sperske Jan 30 '13 at 21:54

Graph isomorphism for directed acyclic graphs is still GI-complete. Therefore there is currently no known (worst-case sub-exponential) solution that guarantees two isomorphic directed acyclic graphs will yield the same hash. Only if the mapping between different graphs is known - for example, if all vertices have unique labels - could one efficiently guarantee matching hashes.

Okay, let's brute force this for a small number of vertices. We have to find a representation of the graph that is independent of the ordering of the vertices in the input and therefore guarantees that isomorphic graphs yield the same representation. Further this representation must ensure that no two non-isomorphic graphs yield the same representation.

The simplest solution is to construct the adjacency matrix for all n! permutations of the vertices and just interpret each adjacency matrix as an n²-bit integer. Then we can pick the smallest (or largest) of these numbers as the canonical representation. This number completely encodes the graph and therefore ensures that no two non-isomorphic graphs yield the same number - one could consider this function a perfect hash function. And because we choose the smallest (or largest) number encoding the graph over all possible permutations of the vertices, we further ensure that isomorphic graphs yield the same representation.
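This brute-force canonicalization is easy to sketch in Python. A sketch only; it assumes the graph is given as a dict mapping each vertex to the set of its direct successors (which differs from the question's `self.lt`, whose sets hold all descendants):

```python
from itertools import permutations

def canonical_code(adj):
    """Smallest integer encoding of the adjacency matrix over all
    vertex relabelings; identical exactly for isomorphic graphs."""
    n = len(adj)
    best = None
    for perm in permutations(range(n)):
        # perm[v] is the new label given to vertex v
        bits = 0
        for u in range(n):
            for v in adj[u]:
                bits |= 1 << (perm[u] * n + perm[v])
        if best is None or bits < best:
            best = bits
    return best
```

For 11 vertices this loops over 11! ≈ 4·10⁷ permutations, which is why the pruning ideas discussed in this answer matter in practice.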

How good or bad is this in the case of 11 vertices? Well, the representation will have 121 bits. We can reduce this by 11 bits because the diagonal, representing loops, will be all zeros in an acyclic graph, and we are left with 110 bits. This number could in theory be decreased further; not all 2^110 remaining graphs are acyclic, and for each graph there may be up to 11! - roughly 2^25 - isomorphic representations, but in practice this might be quite hard to exploit. Does anybody know how to compute the number of distinct directed acyclic graphs with n vertices?
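On the counting question: the number of labeled DAGs on n vertices satisfies a classical inclusion-exclusion recurrence over the k vertices with no incoming edge (the number of non-isomorphic DAGs is smaller still). A sketch, assuming Python 3.8+ for `math.comb`:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def labeled_dags(n):
    """Count labeled DAGs on n vertices by inclusion-exclusion
    over the k source vertices (those with in-degree zero)."""
    if n == 0:
        return 1
    return sum((-1) ** (k - 1) * comb(n, k) * 2 ** (k * (n - k)) * labeled_dags(n - k)
               for k in range(1, n + 1))
```

For example, `labeled_dags(3)` is 25 and `labeled_dags(4)` is 543.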

How long will it take to find this representation? Naively 11!, or 39,916,800 iterations. This is not nothing and probably already impractical, but I did not implement and test it. We can probably speed this up a bit, though. If we interpret the adjacency matrix as an integer by concatenating the rows from top to bottom, left to right, we want many ones (zeros) at the left of the first row to obtain a large (small) number. Therefore we pick as the first vertex the one (or one of the vertices) with the largest (smallest) degree (indegree or outdegree depending on the representation), and then put vertices connected (not connected) to this vertex in subsequent positions to bring the ones (zeros) to the left.

There are likely more possibilities to prune the search space, but I am not sure if there are enough to make this a practical solution. Maybe there are, or maybe somebody else can at least build something upon this idea.

Daniel Brückner
  • Thank you, this is a very useful answer. However if the number of vertices is at most, say, 11, exponential time would be fine for my purpose :) – Neil G Jan 28 '13 at 19:33
  • `hash(input) = 1` admittedly has many collisions, but it is sub-exponential and any two isomorphic directed acyclic graphs will yield the same hash. – example Jan 28 '13 at 20:39
  • Since a directed acyclic graph admits a topological sort, you could assume that the vertices are topologically sorted and thus the adjacency matrix has only zeros below the diagonal. +1 for your work so far. – Neil G Jan 28 '13 at 20:53
  • Performing a topological sort interferes with sorting the vertices to maximize (minimize) the representation. Maybe one could try to find the maximal (minimal) representation considering only topological orderings, but ad hoc I am unable to tell if this constraint makes the problem easier or harder. – Daniel Brückner Jan 28 '13 at 21:07
  • It's easier since you can reduce the number of free bits to 55. – Neil G Jan 28 '13 at 21:11
  • That's true but you are no longer able to swap arbitrary vertices because you might destroy the topological ordering. I tend to believe that this is not a big issue and is far outweighed by the reduced search space but I am really not sure. – Daniel Brückner Jan 28 '13 at 21:15

How good does the hash have to be? I assume that you do not want a full serialization of the graph. A hash rarely guarantees that there is no second (but different) element (graph) that evaluates to the same hash. If it is very important to you that isomorphic graphs (in different representations) have the same hash, then use only values that are invariant under a change of representation. E.g.:

  • the total number of nodes
  • the total number of (directed) connections
  • the total number of nodes with (indegree, outdegree) = (i,j) for any tuple (i,j) up to (max(indegree), max(outdegree)) (or limited for tuples up to some fixed value (m,n))

All this information can be gathered in O(#nodes) [assuming the graph is stored properly]. Concatenate it and you have a hash. If you prefer, you can use some well-known hash algorithm like SHA on the concatenated information. Without additional hashing it is a continuous hash (it allows you to find similar graphs); with additional hashing it is uniform and fixed in size if the chosen hash algorithm has those properties.
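A sketch of such an invariant-based hash in Python, assuming (unlike the question's `self.lt`) a dict mapping each node to its set of direct successors:

```python
from collections import Counter
import hashlib

def invariant_hash(adj):
    """Hash built only from representation-invariant quantities:
    node count, edge count, and the (indegree, outdegree) histogram."""
    n = len(adj)
    edges = sum(len(s) for s in adj.values())
    indeg = Counter()
    for s in adj.values():
        indeg.update(s)
    # Multiset of (indegree, outdegree) pairs, in canonical order
    histogram = sorted(Counter((indeg[v], len(adj[v])) for v in adj).items())
    data = repr((n, edges, histogram)).encode()
    return hashlib.sha256(data).hexdigest()
```

Relabeling the vertices leaves all three quantities unchanged, so isomorphic graphs always hash identically; distinct graphs with the same invariants still collide, as the answer says.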

As it is, it is already good enough to register any added or removed connection. It might miss connections that were changed, though (a -> c instead of a -> b).


This approach is modular and can be extended as far as you like. Any additional property that is being included will reduce the number of collisions but increase the effort necessary to get the hash value. Some more ideas:

  • same as above, but with second-order in- and outdegrees, i.e. the number of nodes that can be reached by a node->child->child chain (= second-order outdegree), or respectively the number of nodes that lead to the given node in two steps.
  • or, more generally, n-th order in- and outdegrees (computable in O((average-number-of-connections) ^ (n-1) * #nodes))
  • number of nodes with eccentricity = x (again for any x)
  • if the nodes store any information (other than their neighbours), use an xor of some hash of all the node contents. Due to the xor, the specific order in which the nodes were added to the hash does not matter.

You requested "a unique hash value" and clearly I cannot offer you one. But I see the terms "hash" and "unique to every graph" as mutually exclusive (not entirely true, of course) and decided to answer the "hash" part and not the "unique" part. A "unique hash" (perfect hash) basically needs to be a full serialization of the graph (because the amount of information stored in the hash has to reflect the total amount of information in the graph). If that is really what you want, just define some unique order of nodes (e.g. sorted by own outdegree, then indegree, then outdegree of children, and so on until the order is unambiguous) and serialize the graph in any way (using the position in the aforementioned ordering as the index of the nodes).

Of course this is much more complex though.

example
  • I clearly did not word my question well, and I have tried to reword it to clear up any confusion. Please let me know if it makes sense now… – Neil G Jan 28 '13 at 19:40
  • @NeilG "It is acceptable, but undesirable for two isomorphic graphs to hash to different values" - but is it acceptable at all for two different graphs to have the same hash? – example Jan 28 '13 at 20:41
  • @NeilG ok. Then I will stick with my answer =) – example Jan 28 '13 at 20:50

Years ago, I created a simple and flexible algorithm for exactly this problem (finding duplicate structures in a database of chemical structures by hashing them).

I named it "Powerhash", and creating the algorithm required two insights. The first is the power iteration graph algorithm, also used in PageRank. The second is the ability to replace the step function inside power iteration with anything we want. I replaced it with a function that does the following on each step, and for each node:

  • Sort the hashes of the node's neighbors
  • Hash the concatenated sorted hashes

On the first step, a node's hash is affected by its direct neighbors. On the second step, a node's hash is affected by the neighborhood 2-hops away from it. On the Nth step a node's hash will be affected by the neighborhood N-hops around it. So you only need to continue running the Powerhash for N = graph_radius steps. In the end, the graph center node's hash will have been affected by the whole graph.

To produce the final hash, sort the final step's node hashes and concatenate them together. After that, you can compare the final hashes to find if two graphs are isomorphic. If you have labels, then add them in the internal hashes that you calculate for each node (and at each step).
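A minimal sketch of this scheme in Python, assuming a dict of successor sets and using only out-neighbours; a fuller implementation might also mix in predecessors and per-node labels as described above:

```python
import hashlib

def powerhash(adj, steps=None):
    """Repeatedly rehash each node from its neighbours' sorted hashes,
    then combine the sorted final per-node hashes."""
    if steps is None:
        steps = len(adj)  # enough iterations to cover any graph radius
    # Every node starts from the same seed hash
    h = {v: hashlib.sha256(b"seed").hexdigest() for v in adj}
    for _ in range(steps):
        h = {v: hashlib.sha256(
                 ",".join(sorted(h[w] for w in adj[v])).encode()).hexdigest()
             for v in adj}
    return hashlib.sha256(",".join(sorted(h.values())).encode()).hexdigest()
```

Because every operation (sorting, hashing) is independent of the vertex labels, isomorphic graphs are guaranteed the same value; like any such iterative refinement, some non-isomorphic graphs can still collide.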

For more on this you can look at my post here:

https://plus.google.com/114866592715069940152/posts/fmBFhjhQcZF

The algorithm above was implemented inside the "madIS" functional relational database. You can find the source code of the algorithm here:

https://github.com/madgik/madis/blob/master/src/functions/aggregate/graph.py

estama

IMHO, if the graph can be topologically sorted, a very straightforward solution exists.

  1. For each vertex with index i, you can build a unique hash (for example, using a string-hashing technique) of its (sorted) direct neighbours (e.g. if vertex 1 has direct neighbours {43, 23, 2, 7, 12, 19, 334}, the hash function should hash the array {2, 7, 12, 19, 23, 43, 334}).
  2. For the whole DAG you can create a hash as a hash of the string of hashes for each node: Hash(DAG) = Hash(vertex_1) U Hash(vertex_2) U ... U Hash(vertex_N). I think the complexity of this procedure is around O(N*N) in the worst case. If the graph cannot be topologically sorted, the approach is still applicable, but you need to order the vertices in a unique way (and this is the hard part).
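A sketch of the two steps in Python, assuming a dict of successor sets. Note that, as the comment below points out, this depends on the vertex labelling, so isomorphic graphs with different labels can hash differently:

```python
import hashlib

def dag_hash(adj):
    """Hash each vertex's sorted successor list, then hash the
    per-vertex hashes in vertex order (labels are significant)."""
    parts = []
    for v in sorted(adj):
        s = "{}:{}".format(v, ",".join(map(str, sorted(adj[v]))))
        parts.append(hashlib.sha256(s.encode()).hexdigest())
    return hashlib.sha256("|".join(parts).encode()).hexdigest()
```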
  • This is what I did; however, a topological sort is not unique. For example, if there are two disconnected components, their nodes can be interleaved in any way. – Neil G Jan 28 '13 at 15:47

I will describe an algorithm to hash an arbitrary directed graph, not taking into account that the graph is acyclic. In fact, even counting the acyclic graphs of a given order is a very complicated task, and I believe here it would only make the hashing significantly more complicated and thus slower.

A unique representation of the graph is given by its neighbourhood list. For each vertex, create a list of all its neighbours. Write all the lists one after the other, appending the number of neighbours of each list to the front. Also keep the neighbours sorted in ascending order to make the representation unique for each graph. So for example, assume you have the graph:

1->2, 1->5
2->1, 2->4
3->4
5->3

What I propose is that you transform this to ({2,2,5}, {2,1,4}, {1,4}, {0}, {1,3}), where the curly brackets are only to visualize the representation, not part of Python's syntax. So the list is in fact: (2,2,5, 2,1,4, 1,4, 0, 1,3).
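Building that flattened representation is short in Python (a sketch, assuming a dict from 1-indexed vertex to its set of neighbours):

```python
def neighbourhood_code(adj):
    """Flatten an adjacency dict into the proposed tuple: for each
    vertex in order, its neighbour count, then its sorted neighbours."""
    out = []
    for v in sorted(adj):
        ns = sorted(adj[v])
        out.append(len(ns))
        out.extend(ns)
    return tuple(out)
```

For the example graph, `neighbourhood_code({1: {2, 5}, 2: {1, 4}, 3: {4}, 4: set(), 5: {3}})` gives `(2, 2, 5, 2, 1, 4, 1, 4, 0, 1, 3)`, matching the list above.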

Now to compute the unique hash, you need to order these representations somehow and assign a unique number to them. I suggest something like a lexicographical sort. Let's assume you have two sequences (a_1, b_1_1, b_1_2, ... b_1_a1, a_2, b_2_1, ... b_2_a2, ..., a_n, b_n_1, ... b_n_an) and (c_1, d_1_1, ... d_1_c1, c_2, d_2_1, ... d_2_c2, ..., c_n, d_n_1, ... d_n_cn). Here the a_i and c_i are the numbers of neighbours of each vertex, and b_i_j and d_k_l are the corresponding neighbours. For the ordering, first compare the sequences (a_1, a_2, ..., a_n) and (c_1, c_2, ..., c_n), and if they are different, use this to compare the sequences. If these sequences are equal, compare the lists from left to right, first comparing lexicographically (b_1_1, b_1_2, ... b_1_a1) to (d_1_1, d_1_2, ... d_1_c1), and so on until the first mismatch.

In fact, what I propose is to use as the hash the lexicographical number of a word of size N over the alphabet formed by all possible subsets of {1,2,3,...,N}. The neighbourhood list of a given vertex is a letter over this alphabet; e.g. {2,2,5} is the subset consisting of two elements of the set, namely 2 and 5.

The alphabet(set of possible letters) for the set {1,2,3} would be(ordered lexicographically):

{0}, {1,1}, {1,2}, {1,3}, {2, 1, 2}, {2, 1, 3}, {2, 2, 3}, {3, 1, 2, 3}

The first number, as above, is the number of elements in the given subset, and the remaining numbers are the subset itself. So form all the 3-letter words from this alphabet and you will get all the possible directed graphs with 3 vertices.

Now the number of subsets of the set {1,2,3,...,N} is 2^N, and thus the number of letters of this alphabet is 2^N. We code each directed graph of N nodes as a word of exactly N letters from this alphabet, and thus the number of possible hash codes is precisely (2^N)^N. This shows that the hash code grows really fast as N increases. It is also the number of possible different directed graphs with N nodes, so what I suggest is optimal hashing in the sense that it is a bijection and no smaller hash can be unique.

There is a linear algorithm to get a given subset's number in the lexicographical ordering of all subsets of a given set, in this case {1,2,...,N}. Here is the code I have written for coding/decoding a subset to and from a number. It is written in C++ but is quite easy to understand, I hope. For the hashing you will need only the code function, but as the hash I propose is reversible, I add the decode function too - you will be able to reconstruct the graph from the hash, which is quite cool I think:

#include <algorithm>
#include <vector>
using namespace std;

typedef long long ll;

// Returns the number in the lexicographical order of all combinations of n numbers
// of the provided combination. 
ll code(vector<int> a,int n)
{
    sort(a.begin(),a.end());  // not needed if the set you pass is already sorted.
    int cur = 0;
    int m = a.size();

    ll res =0;
    for(int i=0;i<a.size();i++)
    {
        if(a[i] == cur+1)
        {
            res++;
            cur = a[i];
            continue;
        }
        else
        {
            res++;
            int number_of_greater_nums = n - a[i];
            for(int j = a[i]-1,increment=1;j>cur;j--,increment++)
                res += 1LL << (number_of_greater_nums+increment);
            cur = a[i];
        }
    }
    return res;
}
// Takes the lexicographical code of a combination of n numbers and returns the 
// combination
vector<int> decode(ll kod, int n)
{
    vector<int> res;
    int cur = 0;

    int left = n; // Out of how many numbers are we left to choose.
    while(kod)
    {
        ll all = 1LL << left;// how many are the total combinations
        for(int i=n;i>=0;i--)
        {
            if(all - (1LL << (n-i+1)) +1 <= kod)
            {
                res.push_back(i);
                left = n-i;
                kod -= all - (1LL << (n-i+1)) +1;
                break;
            }
        }
    }
    return res;
}

Also, this code stores the result in a long long variable, which is only enough for graphs with fewer than 64 elements. The number of possible hashes of graphs with 64 nodes is (2^64)^64, a number with over 1200 digits. Still, the algorithm I describe will work really fast and I believe you should be able to hash and 'unhash' graphs with a lot of vertices.

Also have a look at this question.

Ivaylo Strandjev
  • Two isomorphic graphs will have different hashes. As long as the neighbourhood lists for at least one node differ, the hash will differ. – Ivaylo Strandjev Jan 28 '13 at 15:55
  • Isn't that just hashing the adjacency dictionary? – Neil G Jan 28 '13 at 15:59
  • A list in fact, but yes. So the vertices are numbered and ordered according to their number, and the lists for each vertex is printed in sequence. – Ivaylo Strandjev Jan 28 '13 at 16:00
  • Okay. Take a look at my edits to the question. I could have just omitted the topological sort and hashed the adjacency dictionary. The problem with that is that I am using the dag as a key for dynamic programming. The more that isomorphic graphs hash to the same value the less work is redone. – Neil G Jan 28 '13 at 16:02

I'm not sure that it's 100% working, but here is an idea:

Let's code a graph into a string and then take its hash.

  1. hash of an empty graph is ""
  2. hash of a vertex with no outgoing edges is "."
  3. hash of a vertex with outgoing edges is concatenation of every child hash with some delimiter (e.g. ",")

To produce the same hash for isomorphic graphs before concatenation in step3 just sort the hashes (e.g. in lexicographical order).

For hash of a graph just take hash of its root (or sorted concatenation, if there are several roots).

edit While I hoped that the resulting string would describe the graph without collisions, hynekcer found that sometimes non-isomorphic graphs get the same hash. That happens when a vertex has several parents - it is then "duplicated" for every parent. For example, the algorithm does not differentiate the "diamond" {A->B->C, A->D->C} from the case {A->B->C, A->D->E}.

I'm not familiar with Python and it's hard for me to understand how the graph is stored in the example, but here is some code in C++ which should be easy to convert to Python:

THash GetHash(const TGraph &graph)
{
    return ComputeHash(GetVertexStringCode(graph,FindRoot(graph)));
}
std::string GetVertexStringCode(const TGraph &graph,TVertexIndex vertex)
{
    // Collect the children's codes and sort them so that the
    // order in which children are stored does not matter.
    std::vector<std::string> childHashes;
    for(auto c:graph.GetChildren(vertex))
        childHashes.push_back(GetVertexStringCode(graph,c));
    std::sort(childHashes.begin(),childHashes.end());
    std::string result=".";
    for(const auto &h:childHashes)
        result+=h+",";
    return result;
}
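Since the question asked for Python, here is a rough translation (a sketch, assuming the graph is a dict mapping each vertex to a set of children; as noted in the edit above, a "diamond" collides with the corresponding tree):

```python
import hashlib

def vertex_code(adj, v):
    """Vertex code = "." plus the sorted codes of its children;
    a shared child is re-expanded for every parent, which is
    exactly the source of the diamond collision."""
    return "." + ",".join(sorted(vertex_code(adj, c) for c in adj[v]))

def graph_hash(adj, roots):
    """Hash the sorted codes of the roots."""
    joined = "|".join(sorted(vertex_code(adj, r) for r in roots))
    return hashlib.sha256(joined.encode()).hexdigest()
```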
maxim1000
  • Interesting. You could try it and see if it beats my solution by changing the hash function, and then see how many lines your version prints. – Neil G Jan 28 '13 at 17:51
  • Unfortunately I'm not fluent enough in Python to insert relevant code in your solution. The best I can is to propose an illustrating C++ code. – maxim1000 Jan 28 '13 at 20:19
  • I'm afraid that it does not work for graphs that are not trees, e.g. the "diamond" (A->B->D, A->C->D) will have the same hash as the tree (A->B->D, A->C->E). You can also have a reversed tree, where every vertex can have more parents but only one child, or a mix. image http://en.wikipedia.org/wiki/Directed_acyclic_graph – hynekcer Jan 29 '13 at 23:48
  • @hynekcer, thanks. Indeed this is a collision of the hash (much more probable than for MD5 :) ). I've added a note into my answer, unfortunately I saw no quick way to fix things (without complicating the algorithm too much). – maxim1000 Jan 30 '13 at 07:02
  • I cannot test your algorithm. You know it better, and you can confirm or reject the problem that I suspect. I do not mean collisions of MD5, SHA... but the same input into them. Collisions of good hashing algorithms are acceptable because _every_ small change of the object modifies its hash value and it should not be trivial to find a collision. Joining two branches into one is a small change. – hynekcer Jan 30 '13 at 19:02
  • @hynekcer, I used the term collision for the entire hashing process, and you are right - they are not only more probable, but also happen for small changes, so this hash has much worse properties than e.g. MD5 – maxim1000 Jan 31 '13 at 12:49

When I saw the question, I had essentially the same idea as @example. I wrote a function providing a graph tag such that the tag coincides for two isomorphic graphs.

This tag consists of the sequence of out-degrees in ascending order. You can hash this tag with the string hash function of your choice to obtain a hash of the graph.

Edit: I expressed my proposal in the context of @NeilG's original question. The only modification to make to his code is to redefine the hashkey function as:

def hashkey(self):
    return tuple(sorted(map(len, self.lt.values())))
naitoon
  • Nice, did you try it with my code to see if it performs better? – Neil G Jan 30 '13 at 02:36
  • To be honest, I didn't understand your code. Can you try this function, or comment your code a bit more? Thanks. – naitoon Jan 30 '13 at 03:08
  • `self.lt` is a map from node to descendants in `Graph`. By the way, something is wrong with your component count since a single component can have many nodes without any in-edges. – Neil G Jan 30 '13 at 03:51
  • @NeilG Thanks, the component count was absolutely wrong. I edited the answer and now I only use the out-degrees. I will be more careful next time. – naitoon Jan 30 '13 at 12:33
  • @NeilG I've included this hash strategy inside your code and I made timing tests, but I see no improvement (I see almost the same execution time, just under the minute). Are you sure there are lots of isomorphic graphs along the execution? BTW, the only change I did was `def hashkey(self): return tuple(sorted(map(len,self.lt.values()))) `. – naitoon Jan 30 '13 at 13:27
  • `hashkey` actually returns a unique "certificate" for each family of isomorphic graphs. Did you count how many lines were printed to see if it was different? The line count measures the number of graphs considered. – Neil G Feb 01 '13 at 20:57
  • @NeilG, I obtain 15594 lines both for your code and for my code. Are you sure that along your experiment you encounter at least two isomorphic graphs? The results I get imply that this is not the case. Please reproduce the results on your side for cross-checking. It just consists of changing the one line defining the `hashkey` function. – naitoon Feb 04 '13 at 01:25
  • I get the same number of lines. It may be that for seven nodes and the other graph constraints, your tag is sufficient. You could try it with 9, but I'm afraid mine runs too long to try it on my end. However, I clearly prevent many isomorphic graphs, which you can check by simply altering your memoize decorator to spit out a symbol when it can use its cache. – Neil G Feb 04 '13 at 01:32

I am assuming there are no common labels on vertices or edges, for then you could put the graph in a canonical form, which itself would be a perfect hash. This proposal is therefore based on isomorphism only.

For this, combine hashes for as many simple aggregate characteristics of a DAG as you can imagine, picking those that are quick to compute. Here is a starter list:

  1. 2d histogram of nodes' in and out degrees.
  2. 4d histogram of edges a->b where a and b are both characterized by in/out degree.

Addition: Let me be more explicit. For 1, we'd compute a set of triples <I,O;N> (where no two triples have the same I,O values), signifying that there are N nodes with in-degree I and out-degree O. You'd hash this set of triples, or better yet, use the whole set arranged in some canonical order, e.g. lexicographically sorted. For 2, we compute a set of quintuples <aI,aO,bI,bO;N> signifying that there are N edges from nodes with in-degree aI and out-degree aO to nodes with in-degree bI and out-degree bO. Again, hash these quintuples, or else use them in canonical order as-is for another part of the final hash.
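Both histograms can be sketched as follows (assuming a dict of successor sets; hashing the repr of the returned canonical tuples would then give the combined hash):

```python
from collections import Counter

def degree_histograms(adj):
    """Canonically ordered (in, out) -> count node triples, plus
    (aI, aO, bI, bO) -> count edge quintuples."""
    indeg = Counter()
    for s in adj.values():
        indeg.update(s)
    deg = {v: (indeg[v], len(adj[v])) for v in adj}
    nodes = sorted(Counter(deg.values()).items())
    edges = sorted(Counter(deg[a] + deg[b] for a in adj for b in adj[a]).items())
    return nodes, edges
```

Both sets are invariant under relabeling, so isomorphic graphs always agree; collisions between non-isomorphic graphs remain possible, which is the point of the last paragraph.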

Starting with this and then looking at collisions that still occur will probably provide insights on how to get better.

Gene

With suitable ordering of your descendants (and if you have a single root node - not a given, but achievable with suitable ordering, maybe by including a virtual root node), the method for hashing a tree ought to work with a slight modification.

Example code is in this StackOverflow answer; the modification would be to sort children in some deterministic order (increasing hash?) before hashing the parent.

Even if you have multiple possible roots, you can create a synthetic single root, with all roots as children.

Vatine
  • As you imply, the algorithm in the linked question needs a uniquely identifiable root node. Otherwise it can hash isomorphic graphs differently. – alexis Jan 28 '13 at 19:02
  • @alexis If you don't have a uniquely identifiable root, you can synthetically make one, with all roots as children, though. – Vatine Jan 29 '13 at 09:23