I've encountered some code that generates a number of UUIDs via UUID.randomUUID()
, takes the last 7 digits of each (recent versions of UUID are uniformly distributed in terms of entropy), and uses that as a key to insert rows into a database.
I wondered what the probability of collision was. I remembered the Birthday Problem. This is an instance of that problem, is it not? Instead of 365 days in a year, there are 16^7 possible strings. Then according to that Wikipedia page, the probability of collision given n strings is
where d is 16^7.
A colleague asserted that they were able to insert 50,000 rows without an issue. I doubted this because the probably of collision with n=50,000 and d=16^7 is
1-((16^7-1)/16^7)^(50000*(50000-1)/2) = 99.05%
But then I checked our database. Indeed the 50,000 inserts succeeded.
How was that possible? (Besides that they got really lucky.) Am I misapplying the Birthday Problem?
Edit
Here's a minimal test that shows there should have been collisions.
import java.util.UUID;
import java.util.HashSet;
public class MyClass {
public static void main(String args[]) {
final int N = 50000;
for (int trial = 0; trial < 10; trial++) {
HashSet<String> strings = new HashSet<>();
Boolean success = true;
for (int i = 0; i < N; i++) {
String uuid = UUID.randomUUID().toString();
String id = uuid.substring(uuid.length() - 7);
if (strings.contains(id)) {
success = false;
// System.out.println(id);
break;
}
strings.add(id);
}
System.out.println(success);
}
}
}
For N=50,000, 10 out of 10 attempts had a collision—which would be expected of a 99% collision rate. For N=10,000, I see 0, 1, 2, or 3 out of 10 attempts with collisions, which all fall within expectations for a 17% collision rate.