7

I have a list of custom objects from which I want to remove the duplicates. Normally, you would do this by defining both __eq__ and __hash__ for your objects and then taking the set of the list of objects. I have defined __eq__, but I can't figure out a good way to implement __hash__ such that it returns the same value for objects that are equal.

More specifically, I have a class that is derived from the Tree class from the ete3 toolkit. I have defined two objects to be equal if their Robinson-Foulds distance is zero.

from ete3 import Tree

class MyTree(Tree):

    def __init__(self, *args, **kwargs):
        super(MyTree, self).__init__(*args, **kwargs)

    def __eq__(self, other):
        rf = self.robinson_foulds(other, unrooted_trees=True)
        return not bool(rf[0])

newicks = ['((D, C), (A, B),(E));',
           '((D, B), (A, C),(E));',
           '((D, A), (B, C),(E));',
           '((C, D), (A, B),(E));',
           '((C, B), (A, D),(E));',
           '((C, A), (B, D),(E));',
           '((B, D), (A, C),(E));',
           '((B, C), (A, D),(E));',
           '((B, A), (C, D),(E));',
           '((A, D), (B, C),(E));',
           '((A, C), (B, D),(E));',
           '((A, B), (C, D),(E));']

trees = [MyTree(newick) for newick in newicks]

print len(trees)       # 12
print len(set(trees))  # also 12, not what I want!

Both print len(trees) and print len(set(trees)) return 12, but that is not what I want, because several of the objects are equal to each other:

from itertools import product
for t1, t2 in product(newicks, repeat=2):
    if t1 != t2:
        mt1 = MyTree(t1)
        mt2 = MyTree(t2)
        if mt1 == mt2:
            print t1, '==', t2

which returns:

((D, C), (A, B),(E)); == ((C, D), (A, B),(E));
((D, C), (A, B),(E)); == ((B, A), (C, D),(E));
((D, C), (A, B),(E)); == ((A, B), (C, D),(E));
((D, B), (A, C),(E)); == ((C, A), (B, D),(E));
((D, B), (A, C),(E)); == ((B, D), (A, C),(E));
((D, B), (A, C),(E)); == ((A, C), (B, D),(E));
((D, A), (B, C),(E)); == ((C, B), (A, D),(E));
((D, A), (B, C),(E)); == ((B, C), (A, D),(E));
((D, A), (B, C),(E)); == ((A, D), (B, C),(E));
((C, D), (A, B),(E)); == ((D, C), (A, B),(E));
((C, D), (A, B),(E)); == ((B, A), (C, D),(E));
((C, D), (A, B),(E)); == ((A, B), (C, D),(E));
((C, B), (A, D),(E)); == ((D, A), (B, C),(E));
((C, B), (A, D),(E)); == ((B, C), (A, D),(E));
((C, B), (A, D),(E)); == ((A, D), (B, C),(E));
((C, A), (B, D),(E)); == ((D, B), (A, C),(E));
((C, A), (B, D),(E)); == ((B, D), (A, C),(E));
((C, A), (B, D),(E)); == ((A, C), (B, D),(E));
((B, D), (A, C),(E)); == ((D, B), (A, C),(E));
((B, D), (A, C),(E)); == ((C, A), (B, D),(E));
((B, D), (A, C),(E)); == ((A, C), (B, D),(E));
((B, C), (A, D),(E)); == ((D, A), (B, C),(E));
((B, C), (A, D),(E)); == ((C, B), (A, D),(E));
((B, C), (A, D),(E)); == ((A, D), (B, C),(E));
((B, A), (C, D),(E)); == ((D, C), (A, B),(E));
((B, A), (C, D),(E)); == ((C, D), (A, B),(E));
((B, A), (C, D),(E)); == ((A, B), (C, D),(E));
((A, D), (B, C),(E)); == ((D, A), (B, C),(E));
((A, D), (B, C),(E)); == ((C, B), (A, D),(E));
((A, D), (B, C),(E)); == ((B, C), (A, D),(E));
((A, C), (B, D),(E)); == ((D, B), (A, C),(E));
((A, C), (B, D),(E)); == ((C, A), (B, D),(E));
((A, C), (B, D),(E)); == ((B, D), (A, C),(E));
((A, B), (C, D),(E)); == ((D, C), (A, B),(E));
((A, B), (C, D),(E)); == ((C, D), (A, B),(E));
((A, B), (C, D),(E)); == ((B, A), (C, D),(E));

So my question is either:

  • What would be a good __hash__ implementation for my case so that set(trees) works?
  • Or how do I remove objects that are equal from my list without having __hash__ defined?
BioGeek
  • 21,897
  • 23
  • 83
  • 145
  • 4
    IMO, the safest approach to solving this and similar problems is to define a canonical representation for your data structure and then make the problematic operations use the canonical representation. – Leon Oct 10 '17 at 10:42
  • If you can guarantee that every `MyTree` instance is initialised by a string, why not save the string and return the hash of that in you own `__hash__()` method. – quamrana Oct 10 '17 at 10:44
  • 2
    @quamrana 2 `MyTree` objects initialized from different strings (with different hash values) may compare equal (as illustrated in the example) – Leon Oct 10 '17 at 10:46
  • @quamrana as i explained in my question, I am aware that `__hash__()` is used in sets in Python. My question is exactly *how* to implement a valid `__hash__()` function for my use case OR how I could work around this. – BioGeek Oct 10 '17 at 10:57
  • 2
    Have you tried simply looping through and removing duplicates? It's not the fastest algorithm but depending on how many trees you have it may not matter. – Alex Hall Oct 10 '17 at 11:38
  • Ok, I can see my understanding of the requirements of objects imposed by `set`s was wrong. From experiments, it seems that for trees, you will need to be able to transform a given tree into a standard form and take the hash of the standard form. But what that form is I can't tell you yet. Ok, that's what @Leon said. Might it be possible to sort the branches? – quamrana Oct 10 '17 at 11:48
  • @Leon using a canonical representation was a good idea, but eventually a dead end. See [this answer](https://stackoverflow.com/a/46670077/50065) for details. – BioGeek Oct 10 '17 at 15:12
  • @AlexHall I ended up using your suggestion, see [this answer](https://stackoverflow.com/a/46670077/50065) for details. – BioGeek Oct 10 '17 at 15:13
  • Not sure this is a possible approach, but maybe by representing your trees as a recursive frozenset of frozensets of leaf names. I had started an answer to https://stackoverflow.com/q/46626414/1878788 based on such a way to describe trees, but did not finish it (yet?). – bli Oct 11 '17 at 09:11
  • 2
    Finally I did it (directly using frozensets, so no need to define a custom hash): https://stackoverflow.com/a/46707857/1878788 – bli Oct 12 '17 at 11:08

0 Answers0