
I have a discontinuous list of numbers N (e.g. { 1, 2, 3, 6, 8, 10 }) and I need to progressively create random pairs of numbers from N and store them in a list in which no pair appears twice.

For example, for a list of 3 different numbers there are 6 possible pairs, since order matters (not counting pairs of a number with itself):

For the list { 4, 8, 9 }, the possible pairs are:

(4,8) (4,9) (8,4) (8,9) (9,4) (9,8)

When the number list reaches a size of 30, for example, there are 870 possible pairs, and with my current method I get less and less efficient the more possible pairs there are.

For now, my strategy with a number list of size 30, for example, is:

N = { 3, 8, 10, 15, 16, ... } // size = 30

// Let's say I already have a list with 200 different pairs
my_pairs = { (8,16), (23, 32), (16,10), ... }

// Get two random numbers in the list
rn1 = random(N)
rn2 = random(N)

Loop through my_pairs to see if the pair (rn1, rn2) has already been generated.

If it has, we pick two new random numbers rn1 and rn2 and retry.

If not, we add the pair to my_pairs.

The issue is that the more pairs my_pairs contains, the less likely a freshly drawn pair is to be new. So I have to try multiple random pairs, and each try scans the whole list.
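
In code, my current approach looks roughly like this (a Python sketch of the strategy above; random.choice plays the role of my random(N)):

import random

N = [3, 8, 10, 15, 16]   # my numbers list (size 30 in practice)
my_pairs = []            # pairs generated so far

def add_random_pair():
    while True:
        rn1, rn2 = random.choice(N), random.choice(N)
        # linear scan: costs O(len(my_pairs)) on every attempt
        if rn1 != rn2 and (rn1, rn2) not in my_pairs:
            my_pairs.append((rn1, rn2))
            return (rn1, rn2)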

I could try to generate all possible pairs at the start, shuffle the list, and pop one element each time I need to add a random pair to my list. But it takes a lot of space to store all possible pairs as my numbers list grows (e.g. 9900 possible pairs for 100 different numbers). And I add numbers to N during the process, so I can't afford to recalculate all possible pairs every time.
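
A sketch of that pre-generation idea, for reference (Python):

import itertools
import random

pool = list(itertools.permutations(N, 2))  # all ordered pairs of distinct elements
random.shuffle(pool)

pair = pool.pop()   # every pop yields a fresh, unique random pair in O(1)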

Is there an algorithm for generating random unique pairs?

Maybe it would be faster using matrices, or storing my pairs in some sort of tree or graph structure?

subski

  • Store the generated pairs in a hash table, not a linear list. – rici Aug 26 '21 at 05:34
  • Is it possible that you add a number to N that was previously removed? Are there min and max values for the possible numbers? What percentage of all possible pairs will you use? A small one? A big one? Is the size of N constant? You tagged math and graph, but seem focused on practical implementation; any reason why? – Alois Christen Aug 26 '21 at 06:09
  • @rici Even with a hash table, I will still generate pairs that are already in it, so I will have to make multiple tries and check against the entire hash table every time. – subski Aug 26 '21 at 06:09
  • @AloisChristen Actually, I corrected myself: I don't remove numbers from N, only add. There are no min and max, they just have to be >0. I use between ~10% and ~70% of all possible pairs; it varies a lot. Usually I start with a small N and add numbers gradually. The size of N can reach up to 10k, sometimes more. And I mentioned the math and graph tags because I was thinking about using matrices or tree/graph data structures. – subski Aug 26 '21 at 06:18
  • You write *"I use between ~10% and ~70% of all possible pairs"*, but also write *"it will take a lot of space to store all possible pairs"*: if the number of possible pairs takes a lot of space, then 50% of those is also a lot of space. So I don't see why you should worry about that. – trincot Aug 26 '21 at 06:25
  • @subski: the point of the hash table is that the time to "check with the entire hash table" doesn't actually increase as the hash table gets bigger (on average, anyway). Also, if you have used up 2/3 of the possibilities, it will on average take three tries to find an unused element, and that's roughly the worst case, according to your comment. Over the entire problem, expect the rejection strategy to be less than twice as slow as not checking for duplicates. – rici Aug 26 '21 at 06:31
  • Both full generation of pairs and a hash table seem fine. Depending on how much you need to optimize, you could even go for a hybrid solution (start with a hash table, switch to all pairs beyond a fixed %). If you generate all pairs, [this structure](https://www.geeksforgeeks.org/design-a-data-structure-that-supports-insert-delete-search-and-getrandom-in-constant-time/) could help. – Alois Christen Aug 26 '21 at 07:29
  • @trincot 50% of all possible pairs is a lot, but with all possible pairs I just triple the allocated size. Also, I would have to recompute the list of all possible pairs every time I add a number to N (my numbers list). – subski Aug 26 '21 at 07:46
  • @rici Yeah, you are right; it only starts to become an issue if I use 80%+ of all possible pairs, and I can even handle the generation differently past that threshold, like Alois Christen mentioned. – subski Aug 26 '21 at 07:51
  • Just to be precise: if you go with all possible pairs, when adding a new element you only need to generate the possible pairs involving the added element. As you never remove elements, the existing pairs stay valid until selected. – Alois Christen Aug 26 '21 at 07:56
  • @AloisChristen Yeah, I thought about that as well. I'm actually implementing the different methods (the one I described in the post, the all-possible-pairs list, your hybrid version and a method with a 2D matrix) to see which one performs best in my case. – subski Aug 26 '21 at 08:24
  • FWIW, here's a variation on the hybrid solution which I proposed some time ago: https://stackoverflow.com/a/44318794/1566221. It assumes that you can easily enumerate the possible values, which is trickier if the set of values can increase mid-sample. However, there is a straightforward enumeration of pairs of indices in order by the maximum value in the pair. – rici Aug 26 '21 at 21:00
  • @rici Thanks! This is actually pretty useful. I'm going for a hybrid solution with a way to get a pair from a single index, which should perform well; I'll post my solution here when I've done some testing. – subski Aug 27 '21 at 01:16

1 Answer


It depends a lot on what you want to optimize for.

If you want to keep things simple and easy to maintain, having a hash set of all generated pairs sounds reasonable. The assumption here is that both checking membership and adding a new element are O(1) on average.
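
A minimal sketch of that approach (Python; the names are illustrative):

import random

picked = set()   # all pairs generated so far

def random_pair(N):
    while True:
        a, b = random.choice(N), random.choice(N)
        # membership test and insertion are O(1) on average
        if a != b and (a, b) not in picked:
            picked.add((a, b))
            return (a, b)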

If you worry about space requirements, because you regularly use up to 70% of the possible pairs, then you could optimize for space. To do that, I'd first establish a mapping between each possible pair and a single integer. I'd do so in a way that allows for easy addition of more numbers to N.

      +0     +1
 0  (0,1)  (1,0)
 2  (0,2)  (2,0)
 4  (1,2)  (2,1)
 6  (0,3)  (3,0)
 8  (1,3)  (3,1)
10  (2,3)  (3,2)

Something like this would map a single integer i to a pair (a,b) of indices into your sequence N, which you could then look up in N to turn them into an actual pair of elements. You can come up with formulas for this mapping, although the conversion from i to (a,b) will entail a square root somewhere.
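
As a sketch, formulas matching the table above could look like this (Python): slots 2k and 2k+1 hold the two orderings of the k-th unordered pair, and unordered pairs are enumerated in order of their larger index.

import math

def index_to_pair(i):
    k, flipped = divmod(i, 2)
    # invert k = b*(b-1)/2 + a with 0 <= a < b; this is where the square root appears
    b = (1 + math.isqrt(1 + 8 * k)) // 2
    a = k - b * (b - 1) // 2
    return (b, a) if flipped else (a, b)

def pair_to_index(a, b):
    lo, hi = min(a, b), max(a, b)
    k = hi * (hi - 1) // 2 + lo
    return 2 * k + (1 if a > b else 0)

A convenient property of this ordering is that the first n·(n−1) indices are exactly the pairs over the first n elements of N, so appending a number to N merely extends the valid index range without invalidating existing indices.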

When you have this, the task of picking a pair from a set of arbitrary numbers becomes the task of picking an integer from a continuous range of integers. Now you could use a bit map to very efficiently store for each index whether you have already picked that one in the past. For low percentages of picked pairs that bitmap may be more memory-consuming than a hash map of only picked values would be, but as you approach 70% of all pairs getting picked, it will be way more efficient. I would expect a typical hash map entry to consume at least 3×64 = 192 bits of storage, so the bitmap will start saving memory once 1/192 ≈ 0.52% of all values are getting picked. Growing the bit map might still be expensive, so estimating the maximal size of N might help allocating enough memory up front.
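
A sketch of the bitmap variant (Python; it builds on the hypothetical index_to_pair above):

import random

class BitmapPicker:
    def __init__(self, n):
        self.total = n * (n - 1)                      # number of ordered pairs
        self.bits = bytearray((self.total + 7) // 8)  # one bit per possible index
        self.picked = 0

    def grow(self, n):
        # N only ever grows, so existing indices stay valid; just widen the range
        self.total = n * (n - 1)
        self.bits.extend(bytearray((self.total + 7) // 8 - len(self.bits)))

    def pick_index(self):
        if self.picked >= self.total:
            raise ValueError("all pairs have been used")
        while True:                                   # rejection sampling on indices
            i = random.randrange(self.total)
            byte, bit = divmod(i, 8)
            if not (self.bits[byte] >> bit) & 1:
                self.bits[byte] |= 1 << bit
                self.picked += 1
                return i                              # convert with index_to_pair(i)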

If you have a costly random number generator, or worry about the worst-case time complexity of the whole thing, then you might want to avoid multiple attempts that can hit already picked pairs. To achieve that, you would probably store the set of all picked pairs in some kind of search tree where each node also keeps track of how many leaves its subtree contains. That way you could generate a random number in a range that corresponds to the number of pairs that haven't been picked yet, and then use the information in that tree to add to the chosen value the number of already picked indices smaller than it. I haven't worked out all the details, but I believe this can be turned into O(log n) worst-case time complexity, as opposed to the O(1) average case but O(n) or even O(∞) worst case we had before.
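
One concrete way to realize this idea, sketched with a Fenwick (binary indexed) tree over the index range instead of an explicit search tree; the tree counts picked indices, which lets us locate the r-th unpicked index directly:

import random

class TreeSampler:
    def __init__(self, total):
        self.total = total             # number of possible pair indices
        self.tree = [0] * (total + 1)  # Fenwick tree counting picked indices
        self.picked = 0

    def _mark(self, i):                # record index i as picked
        i += 1
        while i <= self.total:
            self.tree[i] += 1
            i += i & -i

    def pick(self):
        if self.picked >= self.total:
            raise ValueError("all pairs have been used")
        r = random.randrange(self.total - self.picked)  # rank among unpicked indices
        pos, rem = 0, r + 1
        # descend the implicit tree to find the (r+1)-th unpicked index
        for k in range(self.total.bit_length(), -1, -1):
            nxt = pos + (1 << k)
            if nxt <= self.total:
                free = (1 << k) - self.tree[nxt]        # unpicked slots in this chunk
                if free < rem:
                    rem -= free
                    pos = nxt
        self._mark(pos)
        self.picked += 1
        return pos                                      # 0-based index, no retries

Combined with the pair/index mapping above, this gives O(log n) worst case per generated pair, with no rejected attempts.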

MvG
  • Thanks for the guideline. I made the function to generate pairs from a single integer, and I'm just finishing implementing the whole thing. I think my solution is pretty efficient and can handle extreme cases very well. I'll post the full code here very soon. – subski Aug 27 '21 at 19:59