1

I have this class. The goal of this class is to encapsulate the logic for custom set operations:

class Reach(object):

    def __init__(self, device_id=None, user_id=None):
        self.device_id = device_id
        self.user_id = user_id

    def __eq__(self, other):
        if isinstance(other, Reach):
            # print(self.device_id, self.user_id)
            # print(other.device_id, other.user_id)
            return self.device_id == other.device_id or \
                    self.user_id == other.user_id
        return False

    def __hash__(self):
        return hash(f"device_id={self.device_id}&user_id={self.user_id}")

This is my test case:

if __name__ == "__main__":

    # Basic equality test
    p1 = Reach(device_id="did1")
    p2 = Reach(device_id="did1", user_id="tauid1")
    assert p1 == p2
    s1 = set([p1])
    s2 = set([p2])
    assert len(s1 & s2) == 1                # Failing here 

    # Transmutivity test
    p1 = Reach(device_id="did1")
    p2 = Reach(device_id="did1", user_id="tauid1")
    p3 = Reach(user_id="tauid1")
    assert p1 != p3                                   
    assert p1 == p2 == p3
    s1, s2, s3 = set([p1]), set([p2]), set([p3])
    assert len(s1 & s3) == 0
    assert len(s1 | s3) == 2
    assert len(s1 & s2) == 1
    assert len(s2 & s3) == 1
    assert len(s1 & s2 & s3) == 1
    assert len(s1 | s2 | s3) == 1

I realized that since python set operations uses hash, assert len(s1 & s2) == 1 fails because it's comparing the hash of device_id=did1&user_id=None with device_id=did1&user_id=tauid1 which is not the same

What do I need to change to achieve my desired effect?

Tinker
  • 4,165
  • 6
  • 33
  • 72
  • 3
    This seems smelly to me. What if you have a set containing `[(a, b), (c, b), (c, d)]`? If `(a, b)` hashes to the same as `(c, b)`, and `(c, b)` hashes to the same as `(c, d)`... I think you can use induction to prove that *all* items should have the same hash. – 0x5453 Jan 19 '22 at 21:27
  • 1
    Expanding my example, removing `(c, b)` would have to effectively change the hash values of `(a, b)` and/or `(c, d)`. And having a hash function that can change as a side-effect of some other operation is certainly a bad idea. – 0x5453 Jan 19 '22 at 21:33
  • Minor note regarding your question's title: `__hash__` is not going to return True (it returns an integer). It's the `__eq__` method that would return True if two objects are equal. – jarmod Jan 19 '22 at 21:46
  • Regarding the title, the hash function *never* returns True; it returns an integer. Computing a hash for an object depends on that object only, not its relationship to any other object. – chepner Jan 19 '22 at 21:47
  • Related (dupe?) https://stackoverflow.com/questions/69396676/how-to-implement-hash-for-an-object-with-multiple-comparable-properties – JL Peyret Jan 19 '22 at 23:01

1 Answers1

2

This is easily accomplished, but the result is pretty much useless. Suppose we have four strings a, b, x, and y.

You want Reach(a, b) to have the same hash as Reach(a, y), because the objects compare equal via your __eq__, and for a hash to be of any use at all object equality has to imply hash equality.

But for the same reason, you also need the hashes of Reach(a, y) and Reach(x, y) to be equal.

So for any four strings, you need

    hash(Reach(a, b)) == hash(Reach(x, y))

The only way to achieve that is with a hash function that's a constant, ignoring its argument. Like

def __hash__(self):
    return 12345

That will "work", but set operations will degenerate into extra-slow versions of linear search (all objects will map to the same hash bucket, and the same collision chain).

Tim Peters
  • 67,464
  • 13
  • 126
  • 132
  • Actually this is exactly what i need. I am now failing the very last test`assert len(s1 | s2 | s3) == 1` probably because the end product of `s1 & s2` is `p1` rather than `p2`. Is there a way to force the set class to pick `p2` over `p1`? – Tinker Jan 19 '22 at 22:30
  • 1
    Nope, there's no way to force any internal decision here. If I were you ;-), I'd rethink the base approach. As is, your definition of equality is nearly impossible to reason about. For example, you have `p1 == p2` and `p2 == p3`, but _not_ `p1 == p3`, a basic failure of transitivity. And, of course, you also want `len(s1 | s2 | s3) == 1` despite that you _also_ want `len(s1 | s3) == 2`. It's extremely surprising that folding another (`s2`) set into a union would _decrease_ the union's size. But, if that's what you think you want, you'll be best off implementing your own set type. – Tim Peters Jan 19 '22 at 22:49
  • Thank you @Tim Peters for your help. This is actually being used in advertising to count the number of overlaps between two sets of users. Some users will have device_ids, some with user_ids and some with both. – Tinker Jan 19 '22 at 23:01