0

I'm trying to do a huge amount of simple "intersection" operations with integers. Unfortunately, I do not have numpy/scipy available in the setup, and I'm not able to change that.

I noticed on stackoverflow that the Python set operation nicely sorts the data, which not only speeds up loads of cases, but in my case, I'd actually like to sort the data as well, thus it would be an awesome bonus.

I'm now just afraid it does not always work, so I went to test:

import random 

one = range(100)
two = range(50)
three = range(50)

for i in xrange(1000000):
    # shuffle the lists
    random.shuffle(one)
    random.shuffle(two)    

    # do set operation  
    res = [v for v in set(one) & set(two)]
    if res != three:
        print res

The result is that all the samples are sorted (no wrong cases are printed).

While this is quite convincing, I'd like to know if there would be a case where the integers are not completely sorted when using a set intersection?

Community
  • 1
  • 1
PascalVKooten
  • 20,643
  • 17
  • 103
  • 160

3 Answers3

3

No, it's not.

CPython's set intersection implementation works by parallel iteration over the two sets, in hash order. Matching hashes are further tested for equality.

If you have a set of small contiguous ints, they'll all hash to themselves, so everything will work out fine. But if the sets are anything else (widely spaced ints, strings, whatever), this same effect won't appear.

Sneftel
  • 40,271
  • 12
  • 71
  • 104
  • This is really nicely put. I'm still wondering on the boundaries though. So far, simulating many likely cases (e.g. `set(range(100)) & set([41,42,43,44]))` do work. – PascalVKooten Jan 19 '15 at 23:41
  • Indeed, the case where the numbers are not following themselves, that's where it goes wrong (e.g. `set(range(100)) & set([41,42,43,46])`). – PascalVKooten Jan 19 '15 at 23:42
  • 1
    @PascalvKooten Whether it works or not depends on whether information is lost by the hash modulus. Basically, if the numbers are sparse enough that the number of buckets is smaller than the maximum integer, things will break. – Sneftel Jan 19 '15 at 23:45
2

A set doesn't have order so any ordering is accidental. Or, to be precise, it does have some ordering but you can't make any assumptions regarding it. If you want the result to be sorted you'll need to sort it yourself using sorted().

Simeon Visser
  • 118,920
  • 18
  • 185
  • 180
  • My assumption is that in 1 million out of 1 million random cases the set is indeed sorted (I'll mention it again: it's the design of the Python set). I'm just wondering if I can safely use it; basically I'm looking for a reproducible example where it fails to work. – PascalVKooten Jan 19 '15 at 23:35
  • 1
    @PascalvKooten: the ordering in a set is not random, it doesn't mean it'll be sorted one in a million times. It depends on the hashing function whether you'll ever or never get a sorted result. – Simeon Visser Jan 19 '15 at 23:41
1

Counterexamples are very easy to find if you know where to look

>>> [v for v in set(range(-10,0)) & set(range(-5,10))]
[-2, -5, -4, -3, -1]
John La Rooy
  • 295,403
  • 53
  • 369
  • 502