3

Say you want to have a set of 1- to 2-digit hexadecimal numbers, so 256 numbers. Just using a small set to get at the problem, but it would work with any sized string.

So you have a potential N or 256 numbers in this case. You are going to "generate" a new ID for every new data record that comes your way. So it starts of and randomly gives you af, then 1d, then 8a, etc.

The straightforward naïve way to do this is to just simply generate all the numbers in order, then shuffle them, and just pop from the set. This works fine when you only have 256 numbers. But if you have millions or billions of numbers it is impractical as you might have a lot of waisted generated IDs not being used for long periods of time. I would like to avoid this.

So my question is what is the or a fastest way to create a unique key string like this, without generating all of them in advance, and without going in order just incrementing by 1 or whatnot. That is, the key should be seemingly random.

One way I can imagine is using a trie to store the already used/generated values. Then when you are to get a new value, you generate a random one, then check the trie to see if it's already used. I have no idea how to tell how efficient this is though, but it seems like it would be very bad performing once you start running out of IDs and are down to the last few ones in the set. You would generate lots of already generated IDs, and traverse the trie for each, so it would be slow.

I am wondering if there is a more efficient way of doing this, without generating them all in advance. Also, the data records won't be used in figuring out the ID, as the records might be extremely large and complex.

Maybe there is a way to sort of randomly traverse (and generate) a trie at once, and in that way generate the ID since you end up at a unique random place in the trie. Something along those lines perhaps, I don't know.

Also, I am not sophisticated with hashing so I don't know if there would be any good methods with that.

Lance
  • 75,200
  • 93
  • 289
  • 503
  • If each input data is unique, you can use a hash function on the input, and transform it into a number. I wonder if there's some not-too-complicated mathematical function which can generate something unique from 0 to N on each call of `fn(0)`, `fn(1)`, etc – CertainPerformance Feb 06 '19 at 08:41
  • The input data wouldn't be used to generate the ID, as they could be extremely large and complicated objects. – Lance Feb 06 '19 at 08:43
  • Please remove subjective wording like _"fastest way"_. Worry about solving your problem before you worry about making it faster – Mulan Feb 06 '19 at 08:43
  • 2
    I know how to solve the problem in a slow way as already described, hence the fastest way. – Lance Feb 06 '19 at 08:44
  • @user633183 Well, he *does* have a way to solve the problem currently, it just gets really slow the closer he gets to the cap. – CertainPerformance Feb 06 '19 at 08:44
  • Would you consider using a timestamp having been converted to hex as a viable option? I mean it's simple and would work to a certain extent to say the very least. – JO3-W3B-D3V Feb 06 '19 at 08:44
  • No no timestamp, I want to have complete control over the structure of the ID. – Lance Feb 06 '19 at 08:45
  • @JO3-W3B-D3V using a timestamp is [not recommended](https://stackoverflow.com/a/14869745/633183) – Mulan Feb 06 '19 at 08:45
  • @LancePollard Fair enough, in that case I have a number of potential ideas. – JO3-W3B-D3V Feb 06 '19 at 08:47
  • 2
    What if you really did use just a plain counter, like `1`, `2`, `3`, etc, but then hashed the result as the ID? That way the IDs will be unique, and only generated on demand, but they won't have any apparent meaning or order outside of the script (I guess the hashes can be transformed back into a number if you need a number rather than a string) – CertainPerformance Feb 06 '19 at 08:48
  • I think you should use a hash table with [`Open addressing`](https://en.wikipedia.org/wiki/Hash_table#Open_addressing). Just take the beginning of your data, it will insure no hash collisions. – Yonlif Feb 06 '19 at 08:49

5 Answers5

2

I assume that you could generate sequential IDs; that is, that you have a reliable way of knowing exactly how many IDs have been generated to date. Then it is sufficient to encrypt this count with any reasonably fast encryption algorithm.

The encryption would be done on the count as a binary number, and the encrypted result with most algorithms would be the same size, also binary. If desired, you could base-64 or hex encode the result to make it easier to use as a character string.

Since encryption must be a bijection (that is, a one-to-one mapping) in order for decryption to be possible, this is guaranteed to produce a different result each time until the total ID count overflows. If it is a reasonable encryption function, then the result will appear random (otherwise the cipher would be vulnerable).

rici
  • 234,347
  • 28
  • 237
  • 341
  • That's exactly the same process I outlined in my answer a day before but using "encryption" as the mixing function... https://stackoverflow.com/a/54549632/1030527 – bernie Feb 08 '19 at 10:15
  • 1
    @bernie: I guess that's true; I didn't really read your answer that closely. These things happen; I've often had my answers repeated, sometimes years later. I guess the advantage of suggesting the use of encryption is that most programmers will know how to find a library to do that, while "mixing function" sounds like something which would have to be researched and implemented. Apparently neither answer was what OP was looking for, though, so I suppose there is some other unspecified constraint. – rici Feb 09 '19 at 01:39
1

I am not sure how performant it will be but my idea is use a object or Map and Math.random()

let obj = {}

function generateRandomId(){
  let id = Math.abs( 0.5 - Math.random()) * 1000
  if(obj[id]){
   generateRandomId() 
  } else {
    obj[id] = true
  }
  return id
}

console.log(generateRandomId())
console.log(generateRandomId())
console.log(generateRandomId())
console.log(generateRandomId())

But if you are ok with using a modules i find this one is most useful

uuid this generates RFC4122 UUIDS.

Code Maniac
  • 37,143
  • 5
  • 39
  • 60
0

I think a mixing function is what you want. It will move bits around in your input to produce an output of the same length. It's reversible so each input corresponds to a unique output.

Since you want the input data to not take part in the id generation, you'll need a surrogate id. You can assign an incrementing id to each record and use the mix function to scramble the id.

You will get something like:

  • Record A => id == 1 => mixed id == 0x7ed55d16
  • Record B => id == 2 => mixed id == 0xc761c23c
  • etc.

See here for some inspiration:

bernie
  • 9,820
  • 5
  • 62
  • 92
  • OP said in comments: `The input data wouldn't be used to generate the ID, as they could be extremely large and complicated objects` – CertainPerformance Feb 06 '19 at 08:49
  • 1
    This question is a moving target! I changed the answer a bit to fit that requirement. – bernie Feb 06 '19 at 08:55
  • Wondering if you could recommend an implementation, these all seem rather complicated and I don't know really how to evaluate them for usage. – Lance Feb 06 '19 at 17:10
  • In my example I used 2-length hex values like `8a`, but you have `0x7ed55d16` very large values. It looks like the mixing functions linked all produce very large values, so this is incorrect :/ The values should be the size of the input, so 256 or less. – Lance Feb 07 '19 at 05:45
  • @LancePollard well which one is it? You say you want to handle "millions or billions of numbers". Please clarify what you want before I spend more time on this. – bernie Feb 07 '19 at 06:52
  • I want it to be proportional to the sample size. So if it is 256 possible values, then in hex it should only be 2-characters max, but if it's 1 million values, then hey a 32-bit integer or hex value would be fine. – Lance Feb 07 '19 at 07:47
  • A mixing function can be created for any desired bit length. It's just an algorithm that moves bits around in a reversible manner. Don't know why I got that downvote when this is clearly helpful. – bernie Feb 07 '19 at 08:21
0

I think that there should be some tradeoff between speed, flexibility and efficiency.

On one had pseudo random generators will give you that even distribution of keys and will be reasonably fast to generate. However checking for an existing id would be slow. You can use bloom filters (saving memory) or tries but then as you said at some point you should increase the space.

Another option is to use Gray code which will produce every key (but not at random order). You need to keep track of the last issued code.

Mariy
  • 5,746
  • 4
  • 40
  • 57
0

I am considering something like this:

var trie = buildTrie()
var id1 = genId(trie)
var id2 = genId(trie)

console.log(id1,id2)

function buildTrie() {
  var trie = buildNode(0)
  return trie

  function buildNode(level) {
    if (level == 7) { // 8 bits
      var node = {
        available: true,
        leaf: true
      }
      return node
    } else {
      var a = buildNode(level + 1)
      var b = buildNode(level + 1)
      var node = {
        availableLeft: true,
        availableRight: true,
        left: a,
        right: b
      }

      a.parent = node
      b.parent = node

      return node
    }
  }
}

function genId(node) {
  var bytes = []
  step(node, bytes)
  var id = parseInt(bytes.join(''), 2).toString(16)
  return id

  function step(node, bytes) {
    if (node.leaf) {
      node.available = false
      var c = node
      var p = c.parent
      while (p) {
        if (p.left == c) {
          p.availableLeft = false
        } else if (p.right == c) {
          p.availableRight = false
        }

        if (!p.availableLeft && !p.availableRight) {
          c = p
          p = p.parent
        } else {
          p = false
        }
      }
    }

    var randomDirection = Math.random() >= 0.5
    if (randomDirection) {
      if (node.availableLeft) {
        bytes.push(0)
        step(node.left, bytes)
      } else if (node.availableRight) {
        bytes.push(1)
        step(node.right, bytes)
      }
    } else {
      if (node.availableRight) {
        bytes.push(1)
        step(node.right, bytes)
      } else if (node.availableLeft) {
        bytes.push(0)
        step(node.left, bytes)
      }
    }
  }
}

Maybe there is a better way.

Lance
  • 75,200
  • 93
  • 289
  • 503
  • This is a binary tree so you have `2^{nbLevels + 1} - 1` nodes containing a couple of fields. Each node is also built up-front in `buildTrie()`. This will quickly use a lot of memory for "millions or billions of numbers". – bernie Feb 06 '19 at 10:02
  • @bernie wondering if you can recommend something that is more space efficient then (https://stackoverflow.com/questions/245878/how-do-i-choose-between-a-hash-table-and-a-trie-prefix-tree). – Lance Feb 06 '19 at 11:10
  • 1
    I don't see anything in your question that *requires* ids to be truly randomly assigned. Is that a requirement or do you simply want a non-obvious identifier for each data record? If true randomness is not required, it doesn't get more compact than my answer that stores a single number per record. Otherwise, the used random ids need to be stored somewhere (hash table, trie, array, etc.) – bernie Feb 06 '19 at 11:32
  • Thank you that makes sense. No I just don't want incremented IDs. – Lance Feb 06 '19 at 11:44