16

I'm using Python 2.5 on Linux, in multiple parallel FCGI processes. I use

    chars = string.ascii_letters + string.digits
    cookie = ''.join([random.choice(chars) for x in range(32)])

to generate distinct cookies. Assuming that the RNG is seeded from /dev/urandom, and that the sequence of random numbers comes from the Mersenne twister, I would expect that there is practically zero chance of collision.

However, I do see regular collisions, even though only a few (<100) users are logged in at any time.

Why are the random numbers not more random?

Martin v. Löwis
  • 124,830
  • 17
  • 198
  • 235
  • 4
    What is chars? If you have a single character in there you'll always have collisions (to illustrate the point) – Vinko Vrsalovic Sep 02 '09 at 06:08
  • what is the length of chars list? – Nick Dandoulakis Sep 02 '09 at 06:25
  • I've added my definition of chars now - it's not a single character, but has 62 choices. – Martin v. Löwis Sep 02 '09 at 07:03
  • If you're out to solve the problem, then why not use the UUID module? Calling uuid4() for a random ID would be sufficient. – Evan Fosmark Sep 02 '09 at 07:05
  • I have solved the problem (hopefully), by using os.random - which is also one of the option that uuid4 uses; another option is to use random.randrange, in which case I wonder whether uuid4 would generate unique IDs. My question really is why my code doesn't work. – Martin v. Löwis Sep 02 '09 at 07:10
  • You should always post code demonstrating the problem when possible. – Glenn Maynard Sep 02 '09 at 07:13
  • Perhaps (pure guesswork) if you have multiple processes, and process A gets seeded with `124` and process B gets seeded with `130`, then after 6 or so entries process A will start overwriting entries created by process B. That is, if I understand your program's structure at all. – Chris Lutz Sep 02 '09 at 07:18
  • @Chris: that could be an explanation. However, (IIUC) the Mersenne Twister is seeded with reading 16 bytes from /dev/urandom, which should give every process a different seed. – Martin v. Löwis Sep 02 '09 at 07:32
  • 1
    Generating cookies with os.random is probably a bad idea--if you generate a bunch of cookies at once, it'll block until it receives more entropy; even if your traffic is so low that this isn't a problem, it's an easy DOS attack. In almost all applications, os.urandom is fine for this. – Glenn Maynard Sep 02 '09 at 07:57

5 Answers5

13

It shouldn't be generating duplicates.

import random
chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
def gen():
    return ''.join([random.choice(chars) for x in range(32)])

test = [gen() for i in range(100000)]
print len(test), len(set(test)) # 100000 100000

The chances of duplicates is significant with chars = "ab"; 126 duplicates in 1000000 iterations. It's nonexistant with 62.

That said, this isn't a good way to generate cookies, because session cookies need to be unpredictable, to avoid attacks involving stealing other people's session cookies. The Mersenne Twister is not designed for generating secure random numbers. This is what I do:

import os, hashlib
def gen():
    return hashlib.sha1(os.urandom(512)).hexdigest()

test = [gen() for i in range(100000)]
print len(test), len(set(test))

... which should be very secure (which is to say, difficult to take a string of session cookies and guess other existing session cookies from them).

Glenn Maynard
  • 55,829
  • 10
  • 121
  • 131
  • Why is the Mersenne Twister unsuitable for generating secure cookies? It has a period of 2**19937, so it shouldn't be possible to predict the next value even if you know a few subsequent ones. – Martin v. Löwis Sep 02 '09 at 07:17
  • 3
    From Wikipedia: "The algorithm in its native form is not suitable for cryptography (unlike Blum Blum Shub). Observing a sufficient number of iterates (624 in the case of MT19937) allows one to predict all future iterates." (http://en.wikipedia.org/wiki/Mersenne_twister) – Chris Lutz Sep 02 '09 at 07:19
  • 10
    Just because a random number generator has a long cycle doesn't imply that it's difficult to take a sequence within the cycle and figure out where it is. If I give you the sequence 0, 1, 2, 3..., it has a very long cycle (infinite), yet it's trivial to figure out what the next value is. You need a sequence that's cryptographically secure--where it's difficult to determine the state of the engine from its output. That's what secure hashes are for. I prefer hashing urandom through SHA-1, but hashing MT through SHA-1 is probably fine, too. – Glenn Maynard Sep 02 '09 at 07:22
  • @Chris: thanks, that clarifies it. @Glenn: why do you hash urandom? Isn't *that* unpredictable already? – Martin v. Löwis Sep 02 '09 at 07:29
  • @Martin, urandom isn't secure. Hashing it with SHA-1 helps, but isn't bullet proof. – orip Sep 02 '09 at 07:40
  • @orip: why is urandom not secure? – Martin v. Löwis Sep 02 '09 at 07:45
  • It's not secure if the entropy pool is exhausted. – ilya n. Sep 02 '09 at 07:48
  • 1
    In practice, I think urandom internally uses SHA-1 to generate data, at least on Linux. The output of urandom *should* be cryptographically strong on its own. It's just very simple and cheap to ensure this explicitly by hashing data directly. – Glenn Maynard Sep 02 '09 at 07:53
  • It's secure as far as you trust SHA-1, which for most people is very far. – Glenn Maynard Sep 02 '09 at 07:54
  • 2
    I just checked: on Linux, /dev/urandom indeed hashes its output. – Martin v. Löwis Sep 02 '09 at 08:14
  • Apparently, it's not really possible to resolve the mystery. Perhaps I'm missing a detail that I haven't communicated. So I'm accepting this answer for being most helpful (along with the comments). – Martin v. Löwis Sep 03 '09 at 04:10
  • If you want to dig deeper: log session creation; find out what time the clashing sessions are created and by which processes; verify that the state of the PRNG at the start (after first call and seeding) is distinct in each process; see if it goes away if you create a separate PRNG instance just for session creation (should not be needed but would be a data point); see if you can isolate the problem reproducibly. – Glenn Maynard Sep 03 '09 at 05:11
  • 1
    Considering the amount of progress that has been made against SHA-1 (and MD5), I'd start looking to other hashes. Neither MD5 nor SHA-1 should be used in new designs. – derobert Sep 06 '09 at 05:44
4

This is definitely not a normal collision scenario:

  • 32 characters with 62 options per character is equivalent to 190 bits (log2(62) * 32)
  • According to the birthday paradox, you should be receiving a collision naturally once every 2**95 cookies, which means never

Could this be a concurrency issue?

  • If so, use different random.Random instances for each thread
  • Can save these instances in thread-local storage (threading.local())
  • On linux, Python should seed them using os.urandom() - not system time - so you should get different streams for each thread.
orip
  • 73,323
  • 21
  • 116
  • 148
1
  1. I don't know how your FCGI processes are being spawned, but is it possible that it's using fork() after the Python interpreter has started (and the random module has been imported by something), hence effectively seeding two processes' random._insts from the same source?

  2. Maybe put some debugging in to check that it is correctly seeding from urandom, and not falling back to the less rigorous time-based seed?

eta re comment: man! That's me stumped then; if the RNG always has different state at startup I can't see how you could possibly get collisions. Weird. Would have to put in a lot of state logging to investigate the particular cases which result in collisions, I guess, which sounds like a lot of work trawling through logs. Could it be (1a) the FCGI server usually doesn't fork, but occasionally does (maybe under load, or something)?

Or (3) some higher-level problem such as a broken HTTP proxy passing the same Set-Cookie to multiple clients?

bobince
  • 528,062
  • 107
  • 651
  • 834
  • Thanks for the ideas: 1. I have dumped the state of the RNG at startup, and they are all different. 2. I had it create files good (uses urandom) and bad (uses time); the "good" file was created; the bad file was not. – Martin v. Löwis Sep 03 '09 at 04:09
0

I had to erase my original answer, which suggested that generator is not seeded from /dev/urandom, since its source (for Python 3.x) clearly says that it is:

def seed(self, a=None):
    """Initialize internal state from hashable object.

    None or no argument seeds from current time or from an operating
    system specific randomness source if available.

    If a is not None or an int or long, hash(a) is used instead.
    """

    if a is None:
        try:
            a = int(_hexlify(_urandom(16)), 16)
        except NotImplementedError:
            import time
            a = int(time.time() * 256) # use fractional seconds

    super().seed(a)
    self.gauss_next = None

I therefore humbly accept that there are mysteries in the world that I may not be able to decipher.

ilya n.
  • 18,398
  • 15
  • 71
  • 89
  • 1
    Where do you see it seeded from some hash()? In random.py, around line 108, it's seeded from long(_hexlify(_urandom(16)), 16). – Martin v. Löwis Sep 02 '09 at 07:35
  • 1
    If you're really looking into that -- perhaps the next step would be checking that the line `a = int(_hexlify(_urandom(16)), 16)` doesn't raise `NotImplementedError` for some strange reason? – ilya n. Sep 02 '09 at 07:51
  • 1
    Thanks for the idea. Alas, I tried, and it all works as it should (i.e. it does get seeded from urandom). – Martin v. Löwis Sep 02 '09 at 18:28
-4

To avoid the problem, you can use a sequence of cookies, that are guaranteed to be different (you can e.g. use a set). Each time you give a cookie to someone, you take it from the sequence and you add another to it. Another option is to generate a UUID and use that as a cookie.

Another way to avoid the problem could be to hold a private key, and use a (e.g. MD5) checksum of the private key, with a counter value joined to it. The probability for collisions will then be very low. To be safer, add a few more variables to the checksum, like the current time, the ip address of the user, ...

Libraries to generate cookies exist. Any WSGI implementation probably contains a cookie generator.

If you're only interested in how random your strings are, you could generate a file with, say, one million cookies and perform randomness checks on that file. This, however, is not what I would recommend.

pvoosten
  • 3,247
  • 27
  • 43
  • This isn't really my question - I don't want a work-around; I want to understand what's happening. My work-around is to use os.urandom. As for using sequences - that would be bad, since the cookie can be guessed. Using uuids: if the UUID generator uses the random module, they might not be unique. – Martin v. Löwis Sep 02 '09 at 07:08
  • A UUID *can't* be guaranteed to be unique. For theoretical reasons, because there are only 2**128 of them, and for practical reasons, because perhaps the code generating them is flawed - in particular if it is very similar to the code I posted, which should also generate unique values, but doesn't. – Martin v. Löwis Sep 02 '09 at 07:26
  • Better use someone else's "flawed" code that might get better in the future than trying your own stuff, of which you don't know what it actually does. – pvoosten Sep 02 '09 at 07:31
  • 1
    With only 2**128 UUIDs, you can assign a unique UUID to every single sand grain on the planet. So I guess for at most a thousand users you will not be in trouble with UUIDs. – pvoosten Sep 02 '09 at 07:40
  • Better to use something that's flawed which you don't understand, because it might get better? Righto... – Glenn Maynard Sep 02 '09 at 07:41