Unique identifier with smallest base possible, without collisions on old/new set of data

Question

I receive one or more trees of nodes.

Nodes may, or may not have ID properties on them.

Currently I am iterating through the tree and adding random 8 digit numbers on nodes which do not have ID property. As I do not expect more than 10k nodes in the trees chance of having collisions is very small.

Still I am considering how best to reduce the length of the IDs to maybe 4 digits while making sure there are no collisions within one tree. What comes to my mind is to iterate once through the tree gathering existing IDs into a Set and than again adding new IDs while checking against the Set that there are no collisions. Set would have to be reset for each tree.

I would appreciate your opinion on this matter and advice if there are more performant ways of achieving this.

Appendix A:

I am considering following (simplified 0-9) issue. If I have a Set of existing IDs [0, 1, 2, 5, 8, 9] I would have to generate random numbers until I get e.g. 4 (no collision) which I am concerned would be a bit slow on a larger Set and surely not the optimal route.

It's a tree. Wouldn't you know at the time of insertion whether you're duplicating an existing node or not? — Robert Harvey, May 22 '20 at 20:52
@RobertHarvey I receive a tree where some nodes have while others do not have IDs. I'd like for all to have UniqueIDs. — radulle, May 22 '20 at 20:53
I'll rephrase my question. How did your existing tree know where to put each node? — Robert Harvey, May 22 '20 at 20:54
@RobertHarvey I do not have information on how these trees were originally built. By using IDs I can easily add children/siblings. — radulle, May 22 '20 at 20:56
Why don't you just walk your existing tree and build a new one? — Robert Harvey, May 22 '20 at 20:57
People spend years polishing the solution of unique id generation problem. I would look at https://github.com/ai/nanoid#js Even if you prefer crafted solution, I guess you might find some inspiration looking how they did it. — rumata28, May 22 '20 at 20:58
@RobertHarvey but they I would lose IDs of the old one and there might be some modules that I am not aware of that use them. — radulle, May 22 '20 at 20:58
Can't you just make some simple counter variable? And when you need it, then just increase the counter by one and use it as an ID? — Mateusz, May 22 '20 at 21:06
I don't propose that you throw out your old IDs, only that you replace the missing ones. — Robert Harvey, May 22 '20 at 21:10
@MateuszSamsel I am considering something like that but that would mean overwriting existing IDs might be connected to some history/logs so I am not sure it is the best approach. — radulle, May 22 '20 at 21:15
@radulle if you need to preserve old ID then simplest approach would be store already used ID in some Array or Set. Then for every node which doesn't have ID you take the number form the counter. In case that number is in used one just increment until you find first free one. You could wrap it in some abstraction like generators or iterators, to hide those counters and arrays or alternatively wrap it inside some function. But it depends on your project if you need a fast and simple solution or pretty one. — Mateusz, May 22 '20 at 21:25
@MateuszSamsel I will probably go down this road and due to performance have to go with a counter which would be determened by the max value I receive from existing Tree. — radulle, May 23 '20 at 08:46

score 1 · Answer 1 · answered May 22 '20 at 21:38

1

Here you have a really simple approach which will generate for you an array of not used numbers with a given MAX range.

const MAX = 30;
const usedNumbers = [3, 4, 12, 13, 14, 16, 23, 27];

// https://stackoverflow.com/questions/3746725/how-to-create-an-array-containing-1-n
const notUsedNumbers = Array.from(Array(MAX), (_, i) => i+1).filter(i => !usedNumbers.includes(i));

console.log(notUsedNumbers);

And link to fiddle: https://jsfiddle.net/L9r6anq1/

answered May 22 '20 at 21:38

Mateusz

683
4
17

This is an interesting, albeit hard to read and on larger ranges extremely slow, almost unusable solution. – radulle May 23 '20 at 08:42
@radulle what do you mean by extremely slow? 4 digit numbers are max `10000` and from the fast test, in my browser, it takes like 25 - 40 milliseconds to execute it. https://jsfiddle.net/x39L0utm/1/ What performance would you like to have? – Mateusz May 23 '20 at 18:34
When I tried to run it it took much more and Chrome got hung, I will have to recheck, still if I were to go down this path I would have to do it once per tree and than keep that array in memory and shifting it when needed. – radulle May 23 '20 at 20:03
Maybe it's not the ID generation a real problem but some additional operation which you perform on the nodes? I encourage you to play with [performance API](https://developer.mozilla.org/en-US/docs/Web/API/Performance) to precisely measure the execution time of part of your code. Anyway if you don't want to duplicate IDs, then you need to store somewhere already used ones and compare if ID which you already take is taken or not.Or you need to overwrite all ID according to your needs so you will know there is no duplicates. – Mateusz May 24 '20 at 09:58

score 1 · Answer 2 · answered May 25 '20 at 20:09

10^8 possibilities with random selection means you have a 50% chance of collision with only 10^4 objects (see Birthday Paradox); that is not "very small" odds. Reducing that to only 10^4 possibilities with 10^4 objects means collisions will approach 100% as you get toward the end and, if you ever have 10k+1 objects one day, will never terminate.

In general, if you want to use a relatively short ID space, you're going to need a very efficient conflict detection system, e.g. keeping all assigned (or not assigned) values in a scratchpad, or just give up on randomly assigning values and go sequentially.

Unique identifier with smallest base possible, without collisions on old/new set of data

2 Answers2