0

I know that random is not truly random, but my test has a massively different result than the calculated probability.

Please find the error if there is one.

Below code is generating a string of length 3 out of 62 choices and at what point it repeates. The probability I calculated of repeating is 238.328‬ or 62^3.

Wen I run my code it repeats on average after 615 times.

    import random
    import string
    
    def get_random_string(length):
        choices = string.ascii_letters + string.digits
        result_str = ''.join(random.choice(choices) for i in range(length))
        return result_str
    
    def main():
        global repeated
        results = []
        for i in range(100000):
            tmpstr = get_random_string(3)
            if tmpstr in results:
                print(f"it took {i} times to repeate.")
                repeated.append(i)
                break
            else:
                results.append(tmpstr)
    
    if __name__ == "__main__":
        print(len(string.ascii_letters + string.digits))
        repeated = []
        for i in range(10):
            main()
        print(sum(repeated)/len(repeated))
        print("done")
Armin
  • 127
  • 5
  • 4
    https://en.wikipedia.org/wiki/Birthday_problem – tevemadar Sep 05 '22 at 14:06
  • I would expect it to repeat only every 238328 times‬ on average. I want to add a random string to a filename to make it unique and this script should help me figure out whats the minimum number of characters that I can get away with. – Armin Sep 05 '22 at 14:09
  • 3
    *The probability I calculated of repeating is 238.328‬ or 62^3* - nope, a random number generator is not expected to generate every single number before generating a number again. That's called shuffling. – tevemadar Sep 05 '22 at 14:13
  • 2
    Thanks @tevemadar you are right, my math is wrong. – Armin Sep 05 '22 at 14:16
  • 1
    *I want to add a random string to a filename to make it unique* - add a timestamp. If your code may run on more machines at the same time, add a timestamp and a machine id. – tevemadar Sep 05 '22 at 14:16
  • 4
    Using Ramanujan's approximation from https://en.wikipedia.org/wiki/Birthday_problem#Average_number_of_people_to_get_at_least_one_shared_birthday , the average is 612.5 – Paul Hankin Sep 05 '22 at 14:27
  • Probabilities are always values between 0 and 1. I’m guessing you tried calculating the inverse? – pjs Sep 05 '22 at 14:42
  • The timestamp method could potentially fail if you were creating multiple files at the same time, if that's needed then `uuid4().hex` will basically guarantee a unique string. – Peter Sep 05 '22 at 14:56

1 Answers1

0

As tevemadar suggested my math is wrong. en.wikipedia.org/wiki/Birthday_problem is a hint to why.

Paul Hankin calculated the average to be 612.5 which lines up with my tests.

Armin
  • 127
  • 5