1

I have a set of randomly generated ints, and an evolve(set) function that takes a set, removes two ints and inserts 4 new ints. I want the set never to have repeated ints. That is, no matter how many times I apply evolve(evolve(...evolve(set))), it should never have duplicated values.

What is a simple choice of combine(int x, int y) -> int that guarantees that property with high probability?

Example:

// Helpers.
var range = length => Array.from({length}, i => i);
var random = () => (Math.random() * Math.pow(2,31) | 0);
var removeRandom = set => set.splice(random() % set.length, 1)[0];
var hasRepeated = set => !set.sort((a,b) => a - b).reduce((a,b) => a && a !== b && b || 0, -1);

// Receives two parent ints, returns 4 derived ints.
// Can't use random() here; it must generate 4 ints
// from simple functions (multiplication, bitwise
// operations, etc.) of `a` and `b`.
function combine(a, b) {
  return [a + b + 0, a + b + 1, a + b + 2, a + b + 3];
};

// Removes two ints from set, inserts 4 ints.
function evolve(set) {
  var a = removeRandom(set);
  var b = removeRandom(set);
  set.push(...combine(a,b));
};

// Random initial set of ints.
var set = range(64).map(random);

// No matter how many times `evolve` is applied, 
// the set should never have repeated ints.
for (var i = 0; i < 50000; ++i) {
  evolve(set);
}

// Prints `true` if algorithm is correct
// (no repeated number was generated).
console.log(!hasRepeated(set));

Notice that, here, combine(a,b) just adds them together. This is a bad choice; with as few as 50000 calls to evolve, it already fails the test. I could use a linear congruential generator, but I wonder if it can be simpler than that under those conditions. Perhaps some use of XOR?

MaiaVictor
  • 51,090
  • 44
  • 144
  • 286
  • 2
    `const range = length => Array.from({length}, i => i);` and `const hasRepeated = set => (new Set(set)).size !== set.length` – Jonas Wilms Dec 08 '17 at 18:19
  • 1
    @JonasW. woops, I just asked that on #javascript, but forgot to use it. Edited! – MaiaVictor Dec 08 '17 at 18:20
  • 1
    I mean you are using ints -- the best you can hope to do is log base 4 maxint iterations that is how generations it will take to fill up the set space if you had perfect distribution – Hogan Dec 08 '17 at 18:24
  • Maybe add up the two biggest numbers of the set? Or just generate random numbers and check if they exist already – Jonas Wilms Dec 08 '17 at 18:25
  • The best way to have perfect distribution is just have a counter and assign the next value for each new element. – Hogan Dec 08 '17 at 18:25
  • 1
    Looks like a good case for using a **[`Set`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Set)** which only allows unique values – charlietfl Dec 08 '17 at 18:26
  • @JonasW. -- why would that be better, the problem is the results roll over to zero – Hogan Dec 08 '17 at 18:26
  • @Hogan by using 32-bit ints, I hope that I can perform something around `2^16` calls to `evolve` without generating duplicated ints (with high probability), which is what a pseudorandom would give me if my math is correct. (Edit: oh, which is what you said.) – MaiaVictor Dec 08 '17 at 18:26
  • @Hogan I can't have a global counter because locality (imagine a GPU kernel - `combine` could be called in parallel in different processing units), so I need to somehow carry the initial randomness to children of two previously random values! – MaiaVictor Dec 08 '17 at 18:27
  • 2
    exactly what I said 2^16 = log base 4 2^32 – Hogan Dec 08 '17 at 18:27
  • I don't know what you mean by "carry" the randomness -- do you want to be able to identify children? – Hogan Dec 08 '17 at 18:29
  • if you are "caring information" then you can't have perfect distribution -- you need at least log base 2 current iteration bits for that information best case. – Hogan Dec 08 '17 at 18:31
  • @Hogan pseudorandom number generators rely on global counters; on this case, the only information I have is the value of `a` and `b`. That's what I mean. I don't want a perfect distribution, I just want to avoid that two parallel calls to `combine(a,b)` return the same int. – MaiaVictor Dec 08 '17 at 18:31
  • How will `evolve` choose the two ints to remove, can we make any assumptions about that? – Bergi Dec 08 '17 at 19:04
  • What do you mean by "int", is it specifically an int32 or int64 or an int representable with js numbers, or are you looking at a generic pure algorithm that can work with arbitrarily large integers? – Bergi Dec 08 '17 at 19:05
  • @Bergi it will be completely unpredictable (two arbitrary particles could interact), so assume it just picks two random ints. – MaiaVictor Dec 08 '17 at 19:06
  • Is this helpful? [mapping-two-integers-to-one-in-a-unique-and-deterministic-way](https://stackoverflow.com/questions/919612/mapping-two-integers-to-one-in-a-unique-and-deterministic-way) – ppovoski Dec 08 '17 at 19:06
  • @Bergi `combine :: (int64,int64) -> int64` is fine to me. This question could perhaps be rephrased as "what is a `(int64,int64) -> int64` hash-like function which can have awful overall properties in exchange of simplicity, the only property I care about is that successive calls are unlikely to collide"? – MaiaVictor Dec 08 '17 at 19:06
  • @ppovoski not really, pairing functions quickly get out of the `int64` range. I'll take a look on the other answers, though. – MaiaVictor Dec 08 '17 at 19:08
  • I do not think there is a way to do this randomly -- If I understand correctly I believe the best way to do this is the same way you map a b-tree to an array. Are you familiar with that math? In this case the b-tree would have 4 nodes per leaf. – Hogan Dec 08 '17 at 19:15
  • @Hogan why it wouldn't work? For example, `combine = (a,b) => { var r = uint64(sha256(a|b)); return [r,r+1,r+2,r+3] }` would work fine, as it uses `a` and `b` to generate a random number on the 64-bit space (`uint64(sha256(a | b))`), then get its neighbors... I was just looking for something simpler than a very complex hash function. – MaiaVictor Dec 08 '17 at 19:25
  • That makes no sense -- the whole point of a hash function is that it has no relationship to the source data. But you imply that multiple calls on results don't repeat. In a hash function nothing from the input is decernable from the result -- so an expectation that a prior "generation" would have any impact does not make sense. I think you just want a hash function and stop thinking about prior items "effects." – Hogan Dec 08 '17 at 19:29
  • @Hogan I think in hash functions you'd expect that multiple calls on results don't repeat because hashes are collision resistant. If they did, you'd find a collision (i.e., `H(a,H(b,c)) = H(b,c)`, for example). I'm asking something weaker than full collision resistance, though, I believe. – MaiaVictor Dec 08 '17 at 19:46
  • exactly my point, even talking about hashes in that way does not make sense -- so it does not make sense to ask for something "weaker". That is not the way hashing functions work. If you want to "make sure" that kind of thing does not happen then you have to use a non-hasing function, which in most cases is going to be much worse than a hashing function. SO WHY? Hashing functions work well and are well understood. Why make something weaker that won't work. Do you need some functionality here or not? if not just use a hash. – Hogan Dec 08 '17 at 19:55
  • @Hogan I can't afford a hash function inside a compute unit in a CUDA kernel, that's why I specifically asked for something simpler. Hash functions do much more than just avoiding collisions. For one, they have complex mechanisms to allow arbitrary length inputs, and they ensure no collision is ever found by an attacker. I just need no collision to happen accidentally on consecutive calls. – MaiaVictor Dec 08 '17 at 20:12
  • exactly so what is the problem with the b-tree mapping function , that is simple - 3 operations. – Hogan Dec 08 '17 at 20:13
  • @Hogan I'm not familiar with that, I assumed you meant it wouldn't work. Looks like a good idea perhaps. Could you write those 3 operations? (But wouldn't it have a limited depth on which the result would overflow the 64-bit range? Hash functions keep the numbers well within the 64-bit range... also, isn't it an `Int64 -> (Int64,Int64,Int64,Int64)` function? That has only one input rather than two... I'd just discard the other?) – MaiaVictor Dec 08 '17 at 20:14
  • Ah, I could just use the mapping function of a binary tree on each input separately, `branch : Int64 -> (Int64,Int64)`, thus `combine(a,b) = (branch(a), branch(b))`... only problem would be the Int64 overflow when deep enough. Taking the modulus would cause collisions I guess...? – MaiaVictor Dec 08 '17 at 20:20
  • exactly -- and as you said you can only have 2^16 depth. also I would do (branch(a), branch(a+1)) since the right leaf should always be one more than the left leaf. or you could have ( a, a+1, a+2, a+3) if you want to do a 4 leaf b-tree. – Hogan Dec 08 '17 at 21:17

1 Answers1

0

If you have the space to keep an array with the universe of possible entries, you can just randomize it, then keep track of the indices that bound your set, incrementing the starting index to remove elements and the ending index to add them.

You can either randomize the whole array up front, or as the ending index gets incremented.

Dave
  • 7,460
  • 3
  • 26
  • 39