45

I am trying to generate random base32 numbers that are 6 characters or less. This should give approximately 1 billion different combinations.

I have created a program to generate these “random” numbers. However, it appears that it generates a duplicate on average every 40,000 generations.

Why are these “random” numbers duplicating so often when there are over a billion different combinations?

Here is my code:

static void Main(string[] args)
{
    int seed = Environment.TickCount;
    Random r = new Random(seed);

    Dictionary<int, int> resultDictionary = new Dictionary<int, int>();
    for (int x = 1; x <= 1000; x++)
    {
        Dictionary<string, int> dictionary = new Dictionary<string, int>();
        try
        {
            while (true)
            {
                int rand = r.Next(0, 1073741823);
                CrockfordBase32Encoding encoding = new CrockfordBase32Encoding();
                string encodedRand = encoding.Encode((ulong)rand, false);
                dictionary.Add(encodedRand, rand);
            }
        }
        catch (Exception)
        {
        }
        Console.WriteLine(string.Format("{0} - {1}", x, dictionary.Count));
        resultDictionary.Add(x, dictionary.Count);
        x++;
    }

    Console.WriteLine();
    Console.WriteLine("Average Number Before Duplicate: " + resultDictionary.Average(x => x.Value));
    Console.ReadLine();
}
TRiG
  • 10,148
  • 7
  • 57
  • 107
jkruer01
  • 2,175
  • 4
  • 32
  • 57

3 Answers3

118

This is similar to the Birthday Problem. Given a group of n people, What is the probability that two share the same birthday1? It's higher than you'd think.

In your case, what are the odds that randomly picking a number between 0 and 1,073,741,823 n times will give you a duplicate?

One approximation from the link above is 1-exp(-(n*n)/(2*d)). If n=40,000 that equates to about a 52.5% probability that a duplicate is chosen, so seeing duplicates after 40,000 picks on average seems reasonable.


1assuming that birthdays are uniformly distributed universally, which is not the case in reality but is "close enough" and makes the math easier

D Stanley
  • 149,601
  • 11
  • 178
  • 240
  • 10
    The first sentence should be: "What are the chances that `n` people chosen at random all have different birthdays?" – user253751 Jul 22 '14 at 07:34
  • 9
    Actually a funny psychological problem, because people expect randomness to almost never yield the same result. A variation of the birthday paradox: If you let people guess many dice-rolls, they will almost never guess the same number twice in a row, although it happens quite often. -> If you program a shuffle+repeat-mode for music players, you have to fake your results and make the same song less likely to play again, or it won't feel random enough for most people ;-) – Falco Jul 22 '14 at 08:01
  • @immibis: Actually, no. It should be: "What are the chances that, assuming `n` people have random, equidistributed birthdays...". It's not the people that need to be randomly equidistributed, it's their birthdays (and birthdays are _very_ non-random, _very_ non-equidistributed!). It is a fallacy that "random people" have "random, equidistributed birthdays". It's a big irony calling this "birthday paradox" in the first place when it would never actually work with birthdays (interestingly enough, it works perfectly e.g. with hash functions, just not _birthdays_) :-) – Damon Jul 22 '14 at 09:34
  • 11
    @Falco My old MP3 player would "shuffle+repeat" by building a list of all songs, shuffling the list, then playing through it in a loop. In this way, once I heard a song, I would absolutely not hear it again until all other songs had played ;) Still random! – Niet the Dark Absol Jul 22 '14 at 09:46
  • 1
    @Damon that too. But my correction was to the second half - "share the same birthday" should be "have different birthdays" – user253751 Jul 22 '14 at 09:54
  • @NiettheDarkAbsol That should be an answer, because it provides a solution to the OP's problem rather than just answering the question. Make a list of the possible numbers, shuffle it. – Mr Lister Jul 22 '14 at 11:24
  • @MrLister Making a list of length 32^6 = 1,073,741,824 would be a *little* bit of a memory-hog. – Niet the Dark Absol Jul 22 '14 at 11:53
  • @MrLister: Then the question would be "How do I shuffle a 4GB list of numbers with less than 4GB of memory?" – Gabe Jul 22 '14 at 13:26
  • @MrLister The OP did not state that he did not _want_ duplicates, but was _surprised_ by having duplicates after only 40,000 selections. – D Stanley Jul 22 '14 at 13:30
  • 3
    @Damon: It depends what you mean by "very non-equidistributed". Nunnikhoven (http://www.jstor.org/stable/2685309) has used USA birthday distribution data and shown that the true probabilities differ in the 3rd decimal place from those calculated by assuming uniform distribution of birthdays. Many would find that error on the order of 0.1% is acceptable. – bdemarest Jul 22 '14 at 19:26
  • @DStanley not corrected. note my correction is to the second half. I'm not going to edit it myself as it would substantially change the meaning of your answer. – user253751 Jul 23 '14 at 11:01
  • @bdemarest: That is, with all due respect, utter nonsense. Not only does the number of births per day significantly vary according to season, but there is also a very noticeable clustering preceding certain public holidays. For example, it is very common practice to initiate birth a day or two early when new year's eve is nigh, so only the absolutely necessary births happen that day. Which you can immediately see from looking at the numbers, too (even without using any statistical means). If certain days have 3x the births that certain others have, you cannot speak of equidistribution. – Damon Jul 23 '14 at 11:04
  • 2
    I remember when people would post screenshots of 123456 or 888888 from their Battle.net authenticators to "prove" that the results were not random. Of course, if it had been impossible to produce such numbers, THEN the generator wouldn't have been random. Humans just aren't wired to understand numbers –  Jul 28 '14 at 20:04
41

This is known as the Birthday Problem and is just basic probability-theory.

The probability that N random numbers in the range 1 through K does not give a duplicate is:

enter image description here
enter image description here

To calculate the chance of getting at least one duplicate subtract the value from 1.

In your case it evaluates to

P(40000, 1073741823) = 1 - p(40000, 1073741823)

By using Wolfram Alpha to do the calculation the result is

0.5252888122305790

which means it's slightly more than 50% chance you'll get a duplicate. As you produce more numbers, you'll get duplicates more and more often.

Here are some more evaluations of N:

   N      Result
 40000    0.5253
100000    0.9905
200000    0.9999
participant
  • 2,923
  • 2
  • 23
  • 40
Lasse V. Karlsen
  • 380,855
  • 102
  • 628
  • 825
  • 2
    @user11153 It would've been nice if you edited the text to describe the new formula, since the images is the formula for *not* having a duplicate. Remember when you edit questions/answers to make sure they mean the same afterwards. – Lasse V. Karlsen Jul 22 '14 at 16:05
  • Just as stimulation for a new question:) What if the question was to get exactly one duplicate? (3 times, 4 times ... would also not count) – grenix Sep 07 '16 at 07:18
6

The random number generator included in the Framework is pseudo-random without any guarantee of random number distribution. If you are concerned about distribution patterns, consider this article: http://www.codeproject.com/Articles/15102/NET-random-number-generators-and-distributions

Nevertheless, my statistics professors (not one) used to say, "There is a small lie, a big lie, and there is Statistics".

First the full code, so people don't have to scour the internet looking for class implementations to test:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

namespace ConsoleApplication2
{
    class Program
    {
        static void Main(string[] args)
        {
            var r = RandomProvider.GetThreadRandom();

            Dictionary<int, int> resultDictionary = new Dictionary<int, int>();
            for (int x = 1; x <= 1000; x++)
            {
                Dictionary<string, int> dictionary = new Dictionary<string, int>();
                try
                {
                    while (true)
                    {
                        int rand = r.Next(0, 1073741823);
                        CrockfordBase32Encoding encoding = new CrockfordBase32Encoding();
                        string encodedRand = encoding.Encode((ulong)rand, false);
                        dictionary.Add(encodedRand, rand);
                    }
                }
                catch (Exception)
                {
                }
                Console.WriteLine("{0} - {1}", x, dictionary.Count);
                resultDictionary.Add(x, dictionary.Count);
                x++;
            }

            Console.WriteLine();
            Console.WriteLine("Average Number Before Duplicate: " + resultDictionary.Average(x => x.Value));
            Console.WriteLine("Minimum Number Before Duplicate: " + resultDictionary.Min(x => x.Value));
            Console.WriteLine("Maximum Number Before Duplicate: " + resultDictionary.Max(x => x.Value));
            Console.WriteLine(" Median Number Before Duplicate: " + resultDictionary.Select(x=>x.Value).Median());
            Console.ReadLine();
        }


    }

    public static class Extensions
    {
        public static double Median<T>(this IEnumerable<T> list)
        {
            List<double> orderedList = list.Select(s=>Convert.ToDouble(s))
                .OrderBy(numbers => numbers)
                .ToList();

            int listSize = orderedList.Count;
            double result;

            if (listSize % 2 == 0) // even
            {
                int midIndex = listSize / 2;
                result = ((orderedList.ElementAt(midIndex - 1) +
                           orderedList.ElementAt(midIndex)) / 2);
            }
            else // odd
            {
                double element = (double)listSize / 2;
                element = Math.Round(element, MidpointRounding.AwayFromZero);

                result = orderedList.ElementAt((int)(element - 1));
            }

            return result;
        } 
    }


    public static class RandomProvider
    {
        private static int seed = Environment.TickCount;

        private static ThreadLocal<Random> randomWrapper = new ThreadLocal<Random>(() =>
            new Random(Interlocked.Increment(ref seed))
            );

        public static Random GetThreadRandom()
        {
            return randomWrapper.Value;
        }
    }

    public class CrockfordBase32Encoding
    {
        const int Base = 32;
        const int CheckDigitBase = 37;

        static readonly IDictionary<int, char> valueEncodings;
        static readonly IDictionary<int, char> checkDigitEncodings;
        static readonly IDictionary<char, int> valueDecodings;
        static readonly IDictionary<char, int> checkDigitDecodings;
        static CrockfordBase32Encoding()
        {
            var symbols = new SymbolDefinitions();
            valueEncodings = symbols.ValueEncodings;
            checkDigitEncodings = symbols.CheckDigitEncodings;
            valueDecodings = symbols.ValueDecodings;
            checkDigitDecodings = symbols.CheckDigitDecodings;
        }

        public string Encode(ulong input, bool includeCheckDigit)
        {
            var chunks = SplitInto5BitChunks(input);
            var characters = chunks.Select(chunk => valueEncodings[chunk]);

            if (includeCheckDigit)
            {
                var checkValue = (int)(input % CheckDigitBase);
                characters = characters.Concat(new[] { checkDigitEncodings[checkValue] });
            }

            return new string(characters.ToArray());
        }

        internal static IEnumerable<byte> SplitInto5BitChunks(ulong input)
        {
            const int bitsPerChunk = 5;
            const int shift = (sizeof(ulong) * 8) - bitsPerChunk;
            var chunks = new List<byte>();
            do
            {
                var lastChunk = input << shift >> shift;
                chunks.Insert(0, (byte)lastChunk);
                input = input >> bitsPerChunk;
            } while (input > 0);
            return chunks;
        }

        public ulong? Decode(string encodedString, bool treatLastCharacterAsCheckDigit)
        {
            if (encodedString == null)
                throw new ArgumentNullException("encodedString");

            if (encodedString.Length == 0)
                return null;

            IEnumerable<char> charactersInReverse = encodedString.Reverse().ToArray();

            int? expectedCheckValue = null;
            if (treatLastCharacterAsCheckDigit)
            {
                var checkDigit = charactersInReverse.First();
                if (!checkDigitDecodings.ContainsKey(checkDigit)) return null;
                expectedCheckValue = checkDigitDecodings[checkDigit];

                charactersInReverse = charactersInReverse.Skip(1);
            }

            ulong number = 0;
            ulong currentBase = 1;
            foreach (var character in charactersInReverse)
            {
                if (!valueDecodings.ContainsKey(character)) return null;

                var value = valueDecodings[character];
                number += (ulong)value * currentBase;

                currentBase *= Base;
            }

            if (expectedCheckValue.HasValue &&
                (int)(number % CheckDigitBase) != expectedCheckValue)
                return null;

            return number;
        }
    }

    internal class SymbolDefinitions : List<SymbolDefinition>
    {
        readonly List<SymbolDefinition> extraCheckDigits = new List<SymbolDefinition>();

        public SymbolDefinitions()
        {
            AddRange(new[]
            {
                new SymbolDefinition { Value = 0, EncodeSymbol = '0', DecodeSymbols = new[] { '0', 'O', 'o' } },
                new SymbolDefinition { Value = 1, EncodeSymbol = '1', DecodeSymbols = new[] { '1', 'I', 'i', 'L', 'l' } },
                new SymbolDefinition { Value = 2, EncodeSymbol = '2', DecodeSymbols = new[] { '2' } },
                new SymbolDefinition { Value = 3, EncodeSymbol = '3', DecodeSymbols = new[] { '3' } },
                new SymbolDefinition { Value = 4, EncodeSymbol = '4', DecodeSymbols = new[] { '4' } },
                new SymbolDefinition { Value = 5, EncodeSymbol = '5', DecodeSymbols = new[] { '5' } },
                new SymbolDefinition { Value = 6, EncodeSymbol = '6', DecodeSymbols = new[] { '6' } },
                new SymbolDefinition { Value = 7, EncodeSymbol = '7', DecodeSymbols = new[] { '7' } },
                new SymbolDefinition { Value = 8, EncodeSymbol = '8', DecodeSymbols = new[] { '8' } },
                new SymbolDefinition { Value = 9, EncodeSymbol = '9', DecodeSymbols = new[] { '9' } },
                new SymbolDefinition { Value = 10, EncodeSymbol = 'A', DecodeSymbols = new[] { 'A', 'a' } },
                new SymbolDefinition { Value = 11, EncodeSymbol = 'B', DecodeSymbols = new[] { 'B', 'b' } },
                new SymbolDefinition { Value = 12, EncodeSymbol = 'C', DecodeSymbols = new[] { 'C', 'c' } },
                new SymbolDefinition { Value = 13, EncodeSymbol = 'D', DecodeSymbols = new[] { 'D', 'd' } },
                new SymbolDefinition { Value = 14, EncodeSymbol = 'E', DecodeSymbols = new[] { 'E', 'e' } },
                new SymbolDefinition { Value = 15, EncodeSymbol = 'F', DecodeSymbols = new[] { 'F', 'f' } },
                new SymbolDefinition { Value = 16, EncodeSymbol = 'G', DecodeSymbols = new[] { 'G', 'g' } },
                new SymbolDefinition { Value = 17, EncodeSymbol = 'H', DecodeSymbols = new[] { 'H', 'h' } },
                new SymbolDefinition { Value = 18, EncodeSymbol = 'J', DecodeSymbols = new[] { 'J', 'j' } },
                new SymbolDefinition { Value = 19, EncodeSymbol = 'K', DecodeSymbols = new[] { 'K', 'k' } },
                new SymbolDefinition { Value = 20, EncodeSymbol = 'M', DecodeSymbols = new[] { 'M', 'm' } },
                new SymbolDefinition { Value = 21, EncodeSymbol = 'N', DecodeSymbols = new[] { 'N', 'n' } },
                new SymbolDefinition { Value = 22, EncodeSymbol = 'P', DecodeSymbols = new[] { 'P', 'p' } },
                new SymbolDefinition { Value = 23, EncodeSymbol = 'Q', DecodeSymbols = new[] { 'Q', 'q' } },
                new SymbolDefinition { Value = 24, EncodeSymbol = 'R', DecodeSymbols = new[] { 'R', 'r' } },
                new SymbolDefinition { Value = 25, EncodeSymbol = 'S', DecodeSymbols = new[] { 'S', 's' } },
                new SymbolDefinition { Value = 26, EncodeSymbol = 'T', DecodeSymbols = new[] { 'T', 't' } },
                new SymbolDefinition { Value = 27, EncodeSymbol = 'V', DecodeSymbols = new[] { 'V', 'v' } },
                new SymbolDefinition { Value = 28, EncodeSymbol = 'W', DecodeSymbols = new[] { 'W', 'w' } },
                new SymbolDefinition { Value = 29, EncodeSymbol = 'X', DecodeSymbols = new[] { 'X', 'x' } },
                new SymbolDefinition { Value = 30, EncodeSymbol = 'Y', DecodeSymbols = new[] { 'Y', 'y' } },
                new SymbolDefinition { Value = 31, EncodeSymbol = 'Z', DecodeSymbols = new[] { 'Z', 'z' } },
            });

            extraCheckDigits.AddRange(new[]
            {
                new SymbolDefinition { Value = 32, EncodeSymbol = '*', DecodeSymbols = new[] { '*' } },
                new SymbolDefinition { Value = 33, EncodeSymbol = '~', DecodeSymbols = new[] { '~' } },
                new SymbolDefinition { Value = 34, EncodeSymbol = '$', DecodeSymbols = new[] { '$' } },
                new SymbolDefinition { Value = 35, EncodeSymbol = '=', DecodeSymbols = new[] { '=' } },
                new SymbolDefinition { Value = 36, EncodeSymbol = 'U', DecodeSymbols = new[] { 'U', 'u' } },
            });
        }

        public IDictionary<int, char> ValueEncodings
        {
            get
            {
                return this.ToDictionary(s => s.Value, s => s.EncodeSymbol);
            }
        }

        public IDictionary<int, char> CheckDigitEncodings
        {
            get
            {
                return this
                    .Union(extraCheckDigits)
                    .ToDictionary(s => s.Value, s => s.EncodeSymbol);
            }
        }

        public IDictionary<char, int> ValueDecodings
        {
            get
            {
                return this
                    .SelectMany(s => s.DecodeSymbols.Select(d => new { s.Value, DecodeSymbol = d }))
                    .ToDictionary(s => s.DecodeSymbol, s => s.Value);
            }
        }

        public IDictionary<char, int> CheckDigitDecodings
        {
            get
            {
                return this
                    .Union(extraCheckDigits)
                    .SelectMany(s => s.DecodeSymbols.Select(d => new { s.Value, DecodeSymbol = d }))
                    .ToDictionary(s => s.DecodeSymbol, s => s.Value);
            }
        }
    }

    internal class SymbolDefinition
    {
        public int Value { get; set; }
        public IEnumerable<char> DecodeSymbols { get; set; }
        public char EncodeSymbol { get; set; }
    }
}

I have added couple of additional output lines:

Average Number Before Duplicate: 41043.954
Minimum Number Before Duplicate: 2498
Maximum Number Before Duplicate: 127683
 Median Number Before Duplicate: 37860

Isn't that interesting, while the average is about 40k, look at the min and max, two orders of magnitude apart.

Randomness does not guarantee uniform distribution. In two consecutive throws of a dice, getting the number 4 on both throws is still random. Winning the lottery big prize twice or more in one lifetime has been done before.

If you need a more unique distribution per thread, I have included sample of RandomProvider from Jon Skeet's most excellent book (yes, I am a fanboy).

UPDATE

A small rewrite for parallel execution, because it is fun to torture silicon based life forms:

    static void Main(string[] args)
    {

        ConcurrentDictionary<int, int> resultDictionary = new ConcurrentDictionary<int, int>();
        Parallel.For(0, 1000, x =>
        {
            var r = RandomProvider.GetThreadRandom();
            ConcurrentDictionary<string, int> dictionary = new ConcurrentDictionary<string, int>();
                while (true)
                {
                    int rand = r.Next(0, 1073741823);
                    CrockfordBase32Encoding encoding = new CrockfordBase32Encoding();
                    string encodedRand = encoding.Encode((ulong) rand, false);
                    if (!dictionary.TryAdd(encodedRand, rand)) break;
                }
            Console.WriteLine("{0} - {1}", x, dictionary.Count);
            resultDictionary.TryAdd(x, dictionary.Count);
        });

        Console.WriteLine();
        Console.WriteLine("Average Number Before Duplicate: " + resultDictionary.Average(x => x.Value));
        Console.WriteLine("Minimum Number Before Duplicate: " + resultDictionary.Min(x => x.Value));
        Console.WriteLine("Maximum Number Before Duplicate: " + resultDictionary.Max(x => x.Value));
        Console.WriteLine(" Median Number Before Duplicate: " + resultDictionary.Select(x=>x.Value).Median());
        Console.ReadLine();
    }

and the results:

Average Number Before Duplicate: 41826.375
Minimum Number Before Duplicate: 1655
Maximum Number Before Duplicate: 134671
 Median Number Before Duplicate: 39119

UPDATE 2

So the author of the CodeProject article has published his work as a NuGet package:

Install-Package Troschuetz.Random

I've used the same sample code to test different generators:

StandardGenerator

Average Number Before Duplicate: 40434.148
Minimum Number Before Duplicate: 978
Maximum Number Before Duplicate: 136248
 Median Number Before Duplicate: 38845

ALFGenerator

Average Number Before Duplicate: 40395.845
Minimum Number Before Duplicate: 828
Maximum Number Before Duplicate: 125705
 Median Number Before Duplicate: 38042

MT19937Generator

Average Number Before Duplicate: 40478.174
Minimum Number Before Duplicate: 2723
Maximum Number Before Duplicate: 121367
 Median Number Before Duplicate: 38279

XorShift128Generator

Average Number Before Duplicate: 41463.732
Minimum Number Before Duplicate: 878
Maximum Number Before Duplicate: 111206
 Median Number Before Duplicate: 39013.5

So, there you have it. Enjoy for what it is worth to ya ..

Darek
  • 4,687
  • 31
  • 47
  • 3
    There is in fact a [bug in the built-in Random class](https://connect.microsoft.com/VisualStudio/feedback/details/634761/system-random-serious-bug), where it can have a shorter periodicity than was originally expected based on the algorithm being used. This is not the cause of the OP's issue, but is interesting information, nonetheless. – Dan Bryant Jul 21 '14 at 21:46
  • @Sayse what, an answer that has a link shouldn't be an answer, because the linked site may go offline sometime in the future? You're being silly. – Mr Lister Jul 22 '14 at 11:21
  • 1
    @Darek "look at the min and max, two orders of magnitude apart" - I would expect your `min` and `max` to approach `2` and `1073741823` with more trials. – D Stanley Jul 22 '14 at 13:32
  • @DStanley you are absolutely right, given enough iterations that would be a realistic expectation. – Darek Jul 22 '14 at 13:37
  • @MrLister, rather, an answer that is incomplete without a link shouldn't be an answer. Answers should be self-contained enough to stand on their own, whether or not links remain valid... as this one now is. This has been well-discussed on meta, and is a matter on which standing consensus has existed for years. – Charles Duffy Jul 22 '14 at 15:08
  • [Are link only answers really good answers?](http://meta.stackexchange.com/questions/8231/are-answers-that-just-contain-links-elsewhere-really-good-answers) – Sayse Jul 22 '14 at 16:34
  • @DStanley Perhaps not with the built-in random generator, or we would need a lot more iterations. With 100k iterations, I have arrived at Average 40948.41929, Minimum 67, Maximum 160093 and Median 38492 – Darek Jul 22 '14 at 17:02
  • @DStanley Actually ... we might not arrive at `2` and `1073741823`, well maybe ... a two would be an immediate duplicate.. possible? Sure ... likely? Not really. Same with the other side ... given the usual distributions we would have to wait a very long time for that event to occur IMHO. – Darek Jul 22 '14 at 18:07
  • @Derek yes that's why I said _approaches_. You'd get a duplicate on the second match every 1 trillion tires on average; I'd have to do the math on a set with _no_ duplicates. My point was the "order of magnitude" statement wasn't really accurate - it's not _wrong_ but the magnitude difference could be much _higher_. – D Stanley Jul 22 '14 at 18:11