12

From the python docs, "set.pop() remove and return an arbitrary element from s". While generating some random data to test a program, I noticed strange behavior of this pop() function. Here is my code (python 2.7.3):

testCases = 10
numberRange = 500

poppedValues = []
greaterPercentages = []

for i in range (testCases):
    s = Set()

    """ inserting 100 random values in the set, in the range [0, numberRange) """
    for j in range (100):
        s.add(random.randrange(numberRange)) 

    poppedValue = s.pop()
    greaterCount = 0

    """ counting how many numbers in the set are smaller then the popped value """
    for number in s:
        if poppedValue > number:
            greaterCount += 1

    poppedValues.append(poppedValue)
    greaterPercentages.append(float(greaterCount) / len(s) * 100)

for poppedValue in poppedValues:
    print poppedValue, '\t',

print

for percentage in greaterPercentages:
    print "{:2.2f}".format(percentage), '\t',

What I'm doing here is,

  1. Inserting some random values in the set s where each element is in the range [0, numberRange)
  2. Pop an element from the set (according to the docs, it should be a random one)
  3. Counting how many elements in the set are smaller then the popped value

I expected that the popped value should be a random one and about 50% of the numbers in the set will be greater then the popped value. But seems that pop() almost always returns the lowest number in the set. Here are the result for numberRange = 500. First row denotes the values of the popped element. Second row is the percentage of elements which are smaller then the popped value.

9   0   3   1   409     0   1   2   4   0   
0 % 0 % 0 % 0 % 87 %    0 % 0 % 0 % 0 % 0 %

I've conducted this test with different values of numberRange. It seems that for lower values of the set elements, pop() almost always returns the lowest element. But for higher values it returns a random element. For numberRange = 1000, the result is:

518     3586    3594    4103    2560    3087    4095    3079    3076    1622    
7 %     72 %    73 %    84 %    54 %    51 %    79 %    63 %    67 %    32 %

which I think is pretty random. Why this strange behavior? Am I doing something wrong?

EDIT: Thanks for everyone's answer and comment, seems that by "arbitrarily", it isn't guaranteed that it will be "random".

Rafi Kamal
  • 4,522
  • 8
  • 36
  • 50
  • 5
    It's not random, it's unordered. – Matthias Jan 09 '14 at 10:13
  • 6
    By "arbitrary" the docs don't mean "random", they mean "don't rely on any particular value, implementation details could change without warning" – wim Jan 09 '14 at 10:14
  • Random does not mean that it is very well distributed or even unpredictable. It means that you cannot rely on any observation to be true in the future. – Alfe Jan 09 '14 at 10:14
  • http://stackoverflow.com/questions/2860339/can-pythons-set-absence-of-ordering-be-considered-random-order – Ignacio Vazquez-Abrams Jan 09 '14 at 10:16
  • Try `random.choice(list(s))` if you want that – wim Jan 09 '14 at 10:18
  • @wim thanks, actually I tried `random.choice(s)` first, but as set doesn't allow indexing it raised an error. I'll try your method now. – Rafi Kamal Jan 09 '14 at 10:24
  • "by "arbitrarily", it isn't guaranteed that it will be "random"" -- it isn't even *suggested* that it would be random. You should disconnect the concepts in your mind. "Arbitrary" permits any means of choosing an element to pop, including doing so randomly, but it would be most bizarre for an implementation to do work to select at random. – Steve Jessop Jan 09 '14 at 10:56
  • 3
    Yes and IMHO it sucks that `random.choice(s)` doesn't work for sets. Furthermore `random.choice` is downright BROKEN for dicts, [see here for more details about that](http://stackoverflow.com/q/16431703/674039). – wim Jan 09 '14 at 11:04
  • @wim: the latter is just duck-typing. Treating a `dict` as a sequence just because it has `len` and `getitem` is downright broken. Of course it's unfortunate that the brokenness manifests as sometimes working when the `dict` has integer keys. If ABCs had been in Python from the start then standard functions could check whether or not an alleged duck really is a duck, or a chicken in a bathing suit. But LBYL and explicit indicators of interface implementation just aren't the Python way. – Steve Jessop Jan 10 '14 at 18:24
  • You are right, but to someone who is just using python (and doesn't necessarily know or need to know about the implementation details of these data structures) the behaviour of `random.choice` is just a bit odd. Someone shouldn't have to know about ABCs and `collections.Sequence` etc to be able to use it without accidentally writing a bug, right? It's not exactly obvious from the docstring that it will do weird things with dict. – wim Jan 10 '14 at 22:04
  • 2
    @wim: agreed, if someone has just read the `random` documentation and doesn't understand that Python/C++ jargon word "sequence", it is not obvious that a `set` or `dict` isn't one. I don't feel strongly about the fact that terms like "sequence", "iterable", "iterator" are assumed vocabulary in various parts of the Python documentation, but no doubt it catches most people out at least once. Separately it would be nice if `random.choice` worked on an arbitrary iterable (in `O(n)` time and `O(1)` memory) and on sets and dicts in whatever time complexity is feasible. – Steve Jessop Jan 10 '14 at 22:19
  • Yeah, I thought it would be more pythonic for them to "just work" in whatever complexity is feasible, rather than break weirdly, and that's what I suggested [here](http://bugs.python.org/issue1551113). But python devs didn't agree :) – wim Jan 10 '14 at 23:16

2 Answers2

12

It's an implementation detail - set is implemented as a HashMap (similar to dict but without a slot for a value), set.pop removes the first entry in the HashMap, and an ints hash value is the same int.

Combined, this means that your set, which is ordered by the hash values, is actually ordered by the entries modulo hashtable size as well; this should be close to natural ordering in your case as you are only inserting numbers from a small range - if you take random numbers from randrange(10**10) instead of randrange(500) you should see a different behaviour. Also, depending on your insertion order, you can get some values out of their original hashing order due to hash collisions.

l4mpi
  • 5,103
  • 3
  • 34
  • 54
  • 2
    Your second paragraph is incorrect. It's only true when the inserted values are small relative to the set size. All bets are off with larger values. – interjay Jan 09 '14 at 10:26
  • @interjay yeah you're right, editing atm - it's of course ordered by modulo of the hashtable size; but as OP is inserting numbers from a small range it should be mostly true for his case. – l4mpi Jan 09 '14 at 10:38
5

When the doc says:

remove and return an arbitrary element from s; raises KeyError if empty

that means the behaviour isn't defined and the implementation can do whatever it's possible. In this case, it seems that the implemented behaviour is to remove the smallest value. That's all.
In fact, set.pop() is based on a HashMap and remove the first element of this (which is the smaller hashcode). In the case of set of ints, it's the smallest int.

On other implementation of Python could return a random value or the first pushed... You can't know.

Maxime Lorant
  • 34,607
  • 19
  • 87
  • 97
  • It's an implementation detail caused by the implementation of `set` as a HashMap and the hash value for ints – l4mpi Jan 09 '14 at 10:16