
I'm generating random IDs in JavaScript which serve as unique message identifiers for an analytics suite.

When checking the data (more than 10MM records), there are some minor collisions for some IDs for various reasons (network retries, robots faking data etc), but there is one in particular which has an intriguing number of collisions: akizow-dsrmr3-wicjw1-3jseuy.

The collision rate for the above ID is around 0.0037%, while the rate for each of the other colliding IDs is under 0.00035% (10 times less), out of a sample of 111MM records from the same day. While the other IDs vary from day to day, this one remains the same, so over a longer period the difference is likely larger than 10x.

This is what the distribution of the top ID collisions looks like: [image: distribution of the top ID collisions]

This is the algorithm used to generate the random IDs:

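// Builds an ID from four independently generated base-36 chunks joined by hyphens.
function generateUUID() {
    return [
        generateUUID4(), generateUUID4(), generateUUID4(), generateUUID4()
    ].join("-");
}

// Scales Math.random() by 0xFFFFFFFF, truncates to a signed 32-bit integer (`| 0`),
// takes the absolute value, and encodes the result in base 36.
function generateUUID4() {
    return Math.abs(Math.random() * 0xFFFFFFFF | 0).toString(36);
}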
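
One property of `generateUUID4()` worth making explicit: `| 0` converts the product to a signed 32-bit integer, so products of 2^31 or more wrap to negative values, and `Math.abs` then mirrors them back onto the positive range. Below is a minimal sketch of that folding. `chunkFromRandom` is a hypothetical helper (not part of the library) that takes the random value as a parameter instead of calling `Math.random()`.

```javascript
// Hypothetical helper: same transform as generateUUID4(), but the random
// value is passed in so the mapping can be inspected deterministically.
function chunkFromRandom(r) {
    return Math.abs(r * 0xFFFFFFFF | 0).toString(36);
}

var t = Math.pow(2, 30);                               // an arbitrary target integer
var lowR = (t + 0.5) / 0xFFFFFFFF;                     // product stays below 2^31
var highR = (Math.pow(2, 32) - t + 0.5) / 0xFFFFFFFF;  // product wraps negative under `| 0`

// Two different random values produce the same chunk.
console.log(chunkFromRandom(lowR), chunkFromRandom(highR));
```

If this reading is right, each chunk carries roughly 31 bits rather than 32, so a full ID has about 124 bits, assuming the engine's `Math.random()` has at least that much resolution.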

I reversed the algorithm and it seems that for akizow-dsrmr3-wicjw1-3jseuy the browser's Math.random() is returning the following four numbers in this order: 0.1488114111471948, 0.19426893796638328, 0.45768366415465334, 0.0499740378116197, but I don't see anything special about them. Also, from the other data I collected, this ID seems to appear especially after a redirect/preload (e.g. Google results, ad clicks, etc.).
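
The reversal can be sketched as follows. `candidateRandomRanges` is a hypothetical helper, not part of the library, and it assumes each chunk was produced by `generateUUID4()` exactly as listed. Because `Math.abs` discards the sign, most chunks have two candidate `Math.random()` intervals, so the four values quoted here are one possibility per chunk and not necessarily the only one.

```javascript
// Hypothetical helper: returns the approximate half-open [min, max) intervals of
// Math.random() output that generateUUID4() would turn into the given base-36 chunk.
function candidateRandomRanges(chunk) {
    var v = parseInt(chunk, 36);
    var truncated = [];
    // Non-negative branch: the product truncated directly to v.
    if (v < Math.pow(2, 31)) truncated.push(v);
    // Negative branch: the product truncated to 2^32 - v, which `| 0` wrapped to -v.
    if (v >= 2 && v <= Math.pow(2, 31)) truncated.push(Math.pow(2, 32) - v);
    return truncated.map(function (t) {
        return [t / 0xFFFFFFFF, (t + 1) / 0xFFFFFFFF];
    });
}

"akizow-dsrmr3-wicjw1-3jseuy".split("-").forEach(function (chunk) {
    console.log(chunk, parseInt(chunk, 36), candidateRandomRanges(chunk));
});
```

For `akizow` the first interval starts at roughly 0.1488, and the mirrored interval lies near 0.8512.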

So I have 3 hypotheses:

  1. There's a statistical problem with the algorithm that causes this specific collision
  2. Redirects/preloads are somehow messing with the seed of the pseudo-random generator
  3. A robot is smart enough that it fakes all the other data but for some reason is keeping the random id the same. The data comes from different user agents, IPs, countries etc.
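
As a rough check of the first hypothesis, the birthday approximation gives the number of colliding pairs to expect from chance alone. This is a minimal sketch under two assumptions that the data may not satisfy: `Math.random()` behaves like an ideal uniform source, and each chunk takes about 2^31 distinct values (the `| 0` and `Math.abs` steps fold the 32-bit range in half). `expectedCollidingPairs` is a hypothetical helper name.

```javascript
// Birthday approximation: expected number of colliding pairs among n uniformly
// random IDs drawn from a space of spaceSize values is n * (n - 1) / (2 * spaceSize).
function expectedCollidingPairs(n, spaceSize) {
    return n * (n - 1) / (2 * spaceSize);
}

var chunkValues = Math.pow(2, 31);       // assumption: ~2^31 distinct values per chunk
var idSpace = Math.pow(chunkValues, 4);  // four independent chunks, ~2^124 possible IDs
var sampleSize = 111e6;                  // daily sample size quoted in the question

console.log(expectedCollidingPairs(sampleSize, idSpace));
```

Under these assumptions the expected count is many orders of magnitude below one, whereas 0.0037% of 111MM records is roughly 4,100 occurrences of the same ID. That says nothing about a non-ideal generator, such as one that starts from a repeated seed or state, so it does not distinguish between the second and third hypotheses.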

Any idea what could cause this collision?

icenac
  • You should allocate the numbers on your server side. Never trust the client. – Daniel A. White Aug 03 '20 at 14:24
  • Wait, why wouldn't you just generate a seeded, full UUID? Generating chunks and concatenating them is actually going to create collisions. Take a look at this: [How to create GUID / UUID?](https://stackoverflow.com/questions/105034/how-to-create-guid-uuid) – Mr. Polywhirl Aug 03 '20 at 14:24
  • Isn't it better to use [`Crypto.getRandomValues()`](https://developer.mozilla.org/en-US/docs/Web/API/Crypto/getRandomValues) instead of `Math.random()`? – Sebastian Brosch Aug 03 '20 at 14:28
  • @DanielA.White I know, but in this particular situation, this is a client-side library that runs in various environments, and having a server-side ID generator is either not possible or too expensive – icenac Aug 03 '20 at 14:29
  • @Mr.Polywhirl that's a good point, but I don't understand why it creates this collision in particular more than others – icenac Aug 03 '20 at 14:31
  • @SebastianBrosch Unfortunately Crypto doesn't have the necessary coverage for the library I'm working on – icenac Aug 03 '20 at 14:31
  • You could start by writing code that isolates your 3 hypotheses, i.e. try to write code that reproduces the problem in each separate scenario. – Kokodoko Aug 03 '20 at 14:34
  • As far as I know JavaScript does not specify an implementation for Math.random, so any results should be JavaScript-engine dependent. *...has an intriguing number of collisions: akizow-dsrmr3-wicjw1-3jseuy...* Well, how many collisions out of how many samples? Without that information your question is very difficult to answer unless we try to reproduce it ourselves. – President James K. Polk Aug 03 '20 at 15:10
  • @PresidentJamesK.Polk added some concrete numbers for the collision rate. The collision for that specific ID is at least 10x more frequent than for any of the other IDs – icenac Aug 04 '20 at 07:02
  • Which JavaScript engine? – President James K. Polk Aug 04 '20 at 11:49
  • @PresidentJamesK.Polk already mentioned in the 3rd hypothesis that the data comes from different user agents, IPs, countries etc. I've seen data from at least Chrome, Safari and Edge – icenac Aug 04 '20 at 18:26
