0

I am using the below to generate a random set of characters and numbers:

tag = ''.join([random.choice(string.ascii_letters + string.digits) for n in xrange(36)])

I thought that this was a decent method. 36 character length, with each character being one of 36 unique options. Should be a good amount of randomness, right?

Then, I was running a query off an instance with what I thought was a unique tag. Turns out, there were SEVEN (7) records with the same "random" tag. So, I opened the DB, and ran a query to see the repeatability of my tags.

Turns out that not only does mine show up 7 times, but there are a number of tags that repeatedly appear over and over again. With approximately 2000 rows, it clearly should not be happening.

Two questions:

(1) What is wrong with my approach, and why would it be repeating the same tag so often?

(2) What would be a better approach to get unique tags for each record?

Here is the code I am using to save this to the DB. While it is written in Django, clearly this is not a django related question.

class Note(models.Model):
    ...
    def save(self, *args, **kwargs):
        import random
        import string
        self.tag = ''.join([random.choice(string.ascii_letters + string.digits) for n in xrange(36)])
        super(Note, self).save(*args, **kwargs)
Adam Hopkins
  • 6,837
  • 6
  • 32
  • 52
  • 4
    Your approach looks fine. I suspect you are inadvertently creating duplicate records with identical tags when writing to the DB. An alternative is to check that the tag doesn't exist in the DB before creating the record. – gtlambert Mar 01 '16 at 23:00
  • try using `uuid` instead of `random` – Jay T. Mar 01 '16 at 23:01
  • Thanks for the thoughts. I will edit the question to show my DB save. Really should not be making multiple records. And, when I see them in the DB, the records seem to have no connection and are not connected by time or anything else. – Adam Hopkins Mar 01 '16 at 23:01
  • Going forward, I will make it a `UniqueField`, but this still doesn't explain why this would happen. Hmmm..... – Adam Hopkins Mar 01 '16 at 23:04
  • 1
    I just simulated creating 2000 tags 100 different times, and I never got any repeats within a trial. I didn't mean that I didn't believe your problem, but rather that the problem isn't with the way you are generating tags. I think @gtlambert's theory is correct. Are you saving to the db anywhere else? – Garrett R Mar 01 '16 at 23:10
  • @GarrettR: A query of my database (`SELECT tag, COUNT(tag) AS occurrences FROM client_note GROUP BY tag ORDER BY occurrences DESC LIMIT 10;`) yielded the top 10 tags with occurrences of: 11, 10, 10, 9, 8, 8, 7, 7, 7, 7. – Adam Hopkins Mar 01 '16 at 23:12
  • Short answer - random is not realy random. Explanation of this behaviour you can find here: http://stackoverflow.com/a/2145551/4992248 – TitanFighter Mar 01 '16 at 23:25
  • @TitanFighter: The birthday paradox wouldn't cause this many duplicates in only 2000 rows, considering the space of possible outputs (as well as the space of possible RNG states) is many orders of magnitude greater than 2000. – user2357112 Mar 01 '16 at 23:29

2 Answers2

1

The problem with your approach:

  1. true randomness/crypto is hard, you should try to use tested existing solutions instead of implementing your own.
  2. Randomness isn't guaranteed - while 'unlikely', there's nothing preventing the same string to be generated more than once.

A better solution would be to not reinvent the wheel, and use the uuid module, a common solution to generating unique identifiers:

import uuid

tag = uuid.uuid1()
yurib
  • 8,043
  • 3
  • 30
  • 55
  • There's nothing preventing the same random uuid4 from being generated twice, either. We rely on the fact that it's very unlikely. `random.choice` uses a much weaker RNG seeded with a much weaker entropy source, but it still shouldn't be producing this many duplicates. – user2357112 Mar 01 '16 at 23:06
  • @user2357112 while nothing is actively preventing it, uuid incorporates MAC address and current time. generating the same uuid4 twice in a home machine, without really trying to, is practically impossible. – yurib Mar 01 '16 at 23:08
  • I just came across this. Had no idea this was even a field type in Django: https://docs.djangoproject.com/en/1.9/ref/models/fields/#uuidfield – Adam Hopkins Mar 01 '16 at 23:10
  • @yurib: That's UUID1. UUID4 relies solely on randomness. It has 4 bits for the version number, 2 bits reserved, and 122 random bits. No bits are used for MAC address or timestamp. Also, with modern processor speeds, UUID1 implementations need to take special countermeasures to avoid generating two UUIDs with the same timestamp on the same machine. – user2357112 Mar 01 '16 at 23:13
  • @user2357112 I stand corrected regarding the version. Still, I argue that, regardless of version, this is a much better approach for generating unique IDs than a makeshift implementation. – yurib Mar 01 '16 at 23:16
  • I've marked this as my answer, and this is what I am going forward with. Thanks @yurib. But, I am still curious as to why the other problem. I guess I will chalk it up to a bug somewhere in my code. – Adam Hopkins Mar 01 '16 at 23:17
  • @yurib: While it's certainly *better* than using `random.choice`, `random.choice` shouldn't have caused the described behavior. Blaming `random.choice` for duplicates on this scale is missing the fact that there has to be some other problem going on, and switching to UUIDs is likely to do nothing to address the other problem. – user2357112 Mar 01 '16 at 23:28
0

Use a cryptographically secure PRNG with random.SystemRandom(). It will use the PRNG of whatever system you are on.

tag = ''.join(random.SystemRandom().choice(string.ascii_letters + string.digits) for n in xrange(36))

Note that there is no need to pass this as a list comprehension to join().

There are 6236 possible combinations, a number with 65 digits, so duplicates should be extremely rare, even if you take the birthday paradox into consideration.

pp_
  • 3,435
  • 4
  • 19
  • 27