1

I have 40.000 IDs which are the keys in a dictionary. I need to shuffle them, with random.shuffle for example. But can I skip that step?

Dictionary doesn't store the keys with the order they come, so if I do keys = dict.keys(), then keys contains the keys in a non-ascending order. My program is only going to run once, so I do not care if the "result of the permutation" is the same among executions.

So, can I "cheat" and skip the shuffle step?


I understand that the order of keys is a bit predictable. What I am asking though is this:

What is the chance (roughly speaking) of a permutation generated by random.shuffle() to be (much) identical to the order of the keys?

Community
  • 1
  • 1
gsamaras
  • 71,951
  • 46
  • 188
  • 305
  • 1
    The dictionary order is hardly random - it's just undefined. You will get much better results by doing a true shuffle on it. The speed of the shuffle should be linear, so performance shouldn't be an issue. – Tom Karzes May 06 '16 at 01:48
  • `The dictionary order is hardly random - it's just undefined.`; damn an explanation on that would be nice, maybe in an answer, if it can't fit? – gsamaras May 06 '16 at 01:49
  • I suggest reading up on hash tables and hashing functions. You're probably getting the keys in hash bucket order. – Tom Karzes May 06 '16 at 01:55
  • Hmm that says much for an answer @TomKarzes, thanks! – gsamaras May 06 '16 at 01:56

2 Answers2

2

No you can't.

If you need randomness, you cannot skip the shuffling, either before entering the data in the dictionary, or after.

The reason is that although the order of keys in a dictionary is not guaranteed, there is a strong predictability as to the order they will assume based on the sequence of entry.

Entries in a dictionary are done according to the value of the hash of the key which is some very large number, modulo another large number, creating a bounded range of values. When two keys hash to the same value, a collision occurs; the key is then placed in the next available location (whichever way that is determined)

[edit]:
The chance to randomly get the keys in a roughly (much) identical order than a hash bucket is ... indeterminate.

Reblochon Masque
  • 35,405
  • 10
  • 55
  • 80
2

To elaborate on what others are saying and why you actually need to shuffle the keys. If you initialize your dictionary the same way repeatedly, it will have the same order each time. That is obviously not random. As Masque said, it's based on the hash (see this SO question Why is the order in dictionaries and sets arbitrary?).

To answer "What is the chance (roughly speaking) of a permutation generated by random.shuffle() to be (much) identical to the order of the keys?" directly: the odds that it is exactly identical to a shuffle is 1/factorial(len(yourDict)); that is because one of the permutations will result in the same ordering as your dict does on initialization. All other orderings will be different though, and there are factorial(len(yourDict)) different permutations (orderings) that could result from shuffling.

Hope that helps!

Community
  • 1
  • 1
rofls
  • 4,993
  • 3
  • 27
  • 37
  • 1
    A quibble: if the dictionary is at all large there are probably less than `factorial(len(yourDict))` possible ways to seed the random number generator, so that formula overstates what is actually possible directly using `random.shuffle()`. See this question: http://stackoverflow.com/q/34139259/4996248 Also, for fun, see this: https://www.youtube.com/watch?v=T69cguFzZ_w – John Coleman May 06 '16 at 02:15
  • Very cool. However OP says they're dealing with only 40,000 ids, and pseudorandom number generator period is stated as: `2**19937-1` for Python. So they should be safe by a long shot. – rofls May 06 '16 at 02:31
  • 1
    sum(math.log(n,2) for n in range(1,40001)) evaluates to 553809, so 40000! is more like 2**553809. Also, if it is seeded by the system clock in such a way that the shuffle is a function of the state of the system clock at the time of seeding, then the number of possible states of the system clock is miniscule compared to 40,000! (or even 52!, for that matter), which seems to suggest that `random.shuffle()` can never do more than scratch the surface of all mathematically possible shuffles. – John Coleman May 06 '16 at 02:43
  • Ahh whoops, I didn't think that through properly :) Good point – rofls May 06 '16 at 02:44
  • I don't follow the second half though; how do you conclude that there are less seeds than 52!? EDIT: Using all of these factorials makes me feel like we're yelling at each other. – rofls May 06 '16 at 02:45
  • 1
    Even if the system clock had nanosecond resolution (which is doesn't), the number of elapsed nanoseconds since big bang is essentially 0 compared to 52!. Thus, if every nanosecond since big bang was used to seed a Python random number generator and then produce a shuffle, most shuffles still wouldn't appear. – John Coleman May 06 '16 at 02:53