2

When I re-insert the same float value into my set a few times, the x in s check that should take constant time becomes very slow. Why?

Output of timing x in s:

   0.06 microseconds
   0.09 microseconds
   0.16 microseconds
   0.56 microseconds
   1.00 microseconds
   1.58 microseconds
   2.55 microseconds
   5.98 microseconds
  10.50 microseconds
  24.54 microseconds
  40.39 microseconds
  96.60 microseconds
 160.24 microseconds
 419.08 microseconds
 732.27 microseconds

Code (Try it online!):

from timeit import timeit

s = {float('nan')}
for _ in range(15):
    for _ in [*s]:
        x = float('nan')
        s.add(x)
    time = timeit('x in s', number=1000, globals=globals()) * 1e3
    print('%7.2f microseconds' % time)
no comment
  • 6,381
  • 4
  • 12
  • 30
  • Related: [Causes for inconsistent behavior when adding NaNs to a set](https://stackoverflow.com/q/51622065/674039) – wim Sep 03 '21 at 02:05
  • 1
    You may be interested to know that this is "fixed" in Python 3.10: with your code, I get 433.57µs from the final iteration when running under Python 3.9.9, and 0.04µs under Python 3.10.1. See https://bugs.python.org/issue43475 for more. – Mark Dickinson Dec 25 '21 at 17:39

1 Answers1

7

Because you are using nan, which is notorious for breaking naive expectations regarding __hash__/__eq__ contract... i.e.:

>>> myset = set()
>>> myset.add(float('nan'))
>>> myset
{nan}
>>> myset.add(float('nan'))
>>> myset
{nan, nan}

This happens because:

>>> float('nan') == float('nan')
False

But:

>>> hash(float('nan')) == hash(float('nan'))
True

So you are guaranteed to collide every time, and you are seeing hash set behavior degrade to O(N), which is the worst-case behavior, not O(1). Fundamentally, you are not re-inserting the same float value.

Moreover, watch out for this behavior:

>>> nan = float('nan')
>>> myset = set()
>>> myset.add(nan)
>>> myset.add(nan)
>>> myset
{nan} 

Despite:

>>> nan == nan
False

The above is due to an optimization, that for containers, Python actually checks identity first to avoid potentially costly __eq__ operations. Since I re-used the same object, now it is being considered "the same value".

juanpa.arrivillaga
  • 88,713
  • 10
  • 131
  • 172
  • Exactly, you've turned the (degenerate) hash set into a linked list – Alexander Sep 01 '21 at 16:21
  • @Alexander yes, although to be clear I'm pretty sure Python hash-based container use open addressing for collision resolution – juanpa.arrivillaga Sep 01 '21 at 16:22
  • Ah, then it's a linear array scan (roughly). Better, but still O(n), like you mentioned – Alexander Sep 01 '21 at 16:23
  • What's the contract that is broken? – no comment Sep 01 '21 at 16:24
  • I see *"The only required property is that objects which compare equal have the same hash value"* [here](https://docs.python.org/3/reference/datamodel.html#object.__hash__). And nan doesn't break that. – no comment Sep 01 '21 at 16:28
  • @don'ttalkjustcode that objects with the "same value" (i.e. "equal each other"). have the same hash value. – juanpa.arrivillaga Sep 01 '21 at 16:28
  • @don'ttalkjustcode alternatively, you can also just realize that NaN is not the same value as NaN, by *definition*, NaN is not the same value *as any float*. So it isn't breaking the contract exactly, but the common expectation that NaN is the same value as NaN. – juanpa.arrivillaga Sep 01 '21 at 16:29
  • But they *do* have the same hash value, as you showed. And they don't even have to, as they *don't* compare equal. So that's rather doubly not breaking the contract. – no comment Sep 01 '21 at 16:30
  • Hmm, ok. I still don't think it's about that contract, but rather just about the expectation that nan equals itself. Btw, this was inspired by a comment saying it [involves no looping](https://stackoverflow.com/questions/68996444/in-operator-functionality-in-python/68997344#comment121940473_68996444). – no comment Sep 01 '21 at 16:36
  • @don'ttalkjustcode yeah, that's what I reworded it to. – juanpa.arrivillaga Sep 01 '21 at 16:36
  • 1
    One more thing: Of course I was sure you knew it can degrade, just took your comment as challenge since I hadn't actually seen it degrade before. And then thought it's interesting enough to share, and maybe useful to reference ([didn't take long](https://stackoverflow.com/questions/69017839/69020401#comment121979563_69017911)). I'm not happy having resorted to the nan hammer, though. Might try to reverse-engineer int hashing to get a set of let's say 96-bit ints with equal hashes. Or do you have another idea to create equal-hash sets? (Other than custom class, of course.) – no comment Sep 01 '21 at 21:04
  • @don'ttalkjustcode I'm not sure what this is supposed to demonstrate about my comment. – juanpa.arrivillaga Sep 01 '21 at 21:07
  • I mean they talked about a "loop behind the scene" and you responded that it doesn't involve looping, but it does. Just not the kind of loop that they and you meant. – no comment Sep 01 '21 at 21:11
  • 1
    Finding equal-hash ints was easier than I expected: `all(hash(i*(2**61-1)) == 0 for i in range(10**6))`. – no comment Sep 01 '21 at 23:59