Is this password generator biased?

Question

Is there a flaw in this command to generate passwords?

head -c 8 /dev/random | uuencode -m - | sed -n '2s/=*$//;2p'

After generating a few passwords with it, I started to suspect that it tends to favor certain characters. Of course people are good at seeing patterns where there aren't any, so I decided to test the command on a larger sample. The results are below.

From a sample of 12,000 generated passwords (11 characters each once the trailing "=" is stripped), here are the most and least common characters and how many times they appear.

  TOP 10          BOTTOM 10

Freq | Char      Freq | Char
-----|-----      -----|-----
2751 | I         1833 | p
2748 | Q         1831 | V
2714 | w         1825 | 1
2690 | Y         1821 | r
2673 | k         1817 | 7
2642 | o         1815 | R
2628 | g         1815 | 2
2609 | 4         1809 | u
2605 | 8         1791 | P
2592 | c         1787 | +

So for instance 'I' appears more than 1.5 times as often as '+'.

Is this statistically significant? If so, how can the command be improved?
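A tally like the one above could be produced with standard tools. This is only a sketch: `char_freq` is a hypothetical helper name, and it assumes one password per line on standard input.

```shell
# Hypothetical helper: count how often each character occurs on stdin.
# Assumes one password per line; prints "count char", most frequent first.
char_freq() {
    fold -w 1 |                  # one character per line
        sort | uniq -c |         # count identical characters
        sort -k1,1nr -k2,2 |     # highest count first, ties by character
        awk '{ print $1, $2 }'   # normalise spacing to "count char"
}
```

Piping the output of a loop that runs the generator through `char_freq`, then looking at the first and last ten lines, gives the two columns of the table.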

apparently bash isn't programming?! wtf? how can something like http://stackoverflow.com/questions/55556/password-generation-best-practice (or almost anything else in the column to the right) remain open while this is closed? — andrew cooke, Aug 23 '11 at 11:07

andrew cooke · Accepted Answer · 2011-08-23T03:48:02.713

7

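yes, i think it is going to be biased. uuencode requires 3 bytes for each 4 output characters. since you are giving it 8 bytes, the last group is one byte short and gets padded with fixed (non-random) zero bits: the 12th character is always the "=" pad that your sed strips, and the 11th character, the last one you keep, carries only 4 random bits instead of 6, so it can take only 16 of the 64 possible values.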
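The padding effect can be seen with a fixed input. The sketch below assumes the coreutils `base64` command as a stand-in for `uuencode -m` (both emit standard base64), and `encode_ff` is a hypothetical helper that encodes N bytes of 0xFF.

```shell
# Hypothetical helper: base64-encode N bytes that are all 0xFF.
# base64 is used here as a stand-in for `uuencode -m`.
encode_ff() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        printf '\377'            # one 0xFF byte (octal escape)
        i=$((i + 1))
    done | base64
}

encode_ff 8   # //////////8=
encode_ff 9   # ////////////
```

With 8 bytes the last data character is `8` rather than `/`, because its two low bits are zero padding, and a `=` follows. With 9 bytes every character carries six input bits and there is no padding.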

can you try

head -c 9 /dev/random | uuencode -m -

(with 9 instead of 8) and post the results? that should not have the same problem.

ps also, you will no longer need to drop the "=" padding, since 9 is a multiple of 3 and base64 only pads when the input length is not.
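A variant of the suggested command, as a sketch. It assumes `base64` in place of `uuencode -m` and `/dev/urandom` in place of `/dev/random` (so it does not block when drawing many samples); `gen_pw` is a hypothetical name.

```shell
# Hypothetical helper: print one password built from N random bytes
# (default 9). base64 and /dev/urandom are stand-ins for the original
# `uuencode -m` and /dev/random.
gen_pw() {
    head -c "${1:-9}" /dev/urandom | base64 | tr -d '=\n'
    echo                         # terminate the password with a newline
}

gen_pw      # 12 characters, nothing to strip
gen_pw 8    # 11 characters, the last one limited to 16 values
```

Any byte count that is a multiple of 3 (9, 12, 15, ...) avoids the padded final group.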

http://en.wikipedia.org/wiki/Uuencoding

pps it certainly appears statistically significant. you expect a natural variation of about sqrt(mean), which is (guessing) sqrt(2000) or about 45. so three deviations from that, +/-135, or roughly 1865-2135, should contain over 99% of letters - you are seeing something much more systematic.
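The square-root rule of thumb can be worked through with awk. This is a sketch only: `count_band` is a hypothetical helper, and the mean of 2000 is the guess made in the answer, not a measured value.

```shell
# Hypothetical helper: given the expected count per character, print the
# sqrt(mean) deviation and the band of +/- 3 deviations around the mean.
count_band() {
    awk -v mean="$1" 'BEGIN {
        sd = sqrt(mean)
        printf "sd=%.0f low=%.0f high=%.0f\n", sd, mean - 3 * sd, mean + 3 * sd
    }'
}

count_band 2000   # sd=45 low=1866 high=2134
```

The counts in the question's table, from 1787 up to 2751, fall outside that band on both sides.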

ppps neat idea.

ooops i just realised -m for uuencode forces base64 rather than the traditional uuencode algorithm, but the same idea applies.

edited Aug 23 '11 at 03:48

answered Aug 23 '11 at 03:28

andrew cooke


Interesting, I'll test that and see how it compares. – Joe Nelson Aug 23 '11 at 03:33
I was in the process of testing the first set of values when you posted this answer; I just tested your command and it appears to be uniform (p=2.2e-16 for the output of the first command, and p=0.7911 for the second, both using chi-square tests). – bnaul Aug 23 '11 at 03:36
Thanks so much Andrew, great analysis. – Joe Nelson Aug 23 '11 at 03:44
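The chi-square check mentioned in the comments could start from a statistic computed like this. It is a minimal sketch, not the commenter's actual method: `chi_square` is a hypothetical helper, it assumes "count char" lines on standard input with every possible character present, and it tests against a uniform distribution. Turning the statistic into a p-value still needs a chi-square table or a statistics package.

```shell
# Hypothetical helper: chi-square statistic against a uniform distribution.
# Input: one "count char" line per character (characters that never
# occurred must still be listed with a count of 0).
chi_square() {
    awk '{ obs[NR] = $1; total += $1 }
         END {
             if (NR == 0) exit 1
             expected = total / NR            # uniform expectation
             for (i = 1; i <= NR; i++) {
                 d = obs[i] - expected
                 chi2 += d * d / expected
             }
             printf "chi2=%.2f df=%d\n", chi2, NR - 1
         }'
}

printf '10 a\n20 b\n30 c\n' | chi_square   # chi2=10.00 df=2
```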
