1

This is my function:

private def generateOneThousandRandomNumbers(listOfNumbers: List[String] = List.empty): List[String] = {
  if (listOfNumbers.size == 1000) {
    listOfNumbers
  } else {
    val nextNumber: String = Random.nextInt(10000000).toString
    if (listOfNumbers.contains(nextNumber)) {
      println("DUPLICATE NUMBER GENERATED: " + nextNumber)
    }
    generateOneThousandRandomNumbers(listOfNumbers ++ List(nextNumber))
  }
}

And I have ten tests exactly like this:

"areUnique1" in {
  val x = generateOneThousandRandomNumbers()
  x.size shouldBe x.distinct.size
}

So by my calculations, with one test, it should only create a duplicate 1/10,000 runs, and with 10 tests it should only create a duplicate 1/1,000 runs. However, it is creating duplicates on about 50% of runs and I'm not sure why.

Edward
  • 13
  • 2
  • Not sure how the `Random` object works in Scala but usually in Java you want to create a unique instance of random and reuse it to enforce more randomness. I fear here it reinstantiate a Random each time with less randomness. – Gaël J Nov 11 '22 at 16:17
  • 1
    Does this answer your question? [Java Random Class not truly random?](https://stackoverflow.com/questions/13535952/java-random-class-not-truly-random) – Gaël J Nov 11 '22 at 16:19
  • I just tried making a new Random and even making the thread sleep for a ms each time too, but no resolution. – Edward Nov 11 '22 at 16:25
  • 1
    Can you please provide a [minimal reproducible program](https://stackoverflow.com/help/minimal-reproducible-example) that generates the 50% duplicate rate? How are you doing the replications, and are you doing anything that plays with `Random` between trials? When you just give fragments of the code, we don’t know what you’ve done outside those fragments. – pjs Nov 11 '22 at 20:56

1 Answers1

6

According to the Birthday Paradox, you only need ~23 people in a room before there is a 50% chance of 2 of them sharing a birthday, despite the fact there are 365 different possible birthdays.

It's the same with your code: you have 10,000,000 different possible values, but if you put more than ~sqrt(10,000,000) ~= 3162 of them in a container, there will be a >50% chance of two of them being the same.

You're only putting 1000 in your container, so the chance of there being a collision isn't quite 50%, but it's still going to be pretty high.

Andy Turner
  • 137,514
  • 11
  • 162
  • 243
  • Hmm interesting although I can't say I get it yet. I am ultimately trying to create unique references, so does the above mean that even if we generate 1 at a time for each user, having 10,000,000 possible variations (per day) is not enough? – Edward Nov 11 '22 at 16:26
  • I did the exact "birthday" calculation (using rationals to avoid loss of precision), and for a draw of 1_000 from a pool of 10_000_000 the probability of one or more duplicates is 0.048724596001051834. That's certainly higher than Edward's belief of 1/1000 = 0.001, but an order of magnitude less than his reported 50%. Either Edward has misreported the result, or something other than the birthday paradox is going on. @Edward, are you manually seeding the PRNG? – pjs Nov 11 '22 at 20:13
  • P.S. - the actual break-even quantity where you switch from p < 0.5 to p >= 0.5 is n = 3724. – pjs Nov 11 '22 at 20:18