3

I've a an array in which I take out random items

[a, b, c, d ,...]

function getRandomItem(){
   // return random item from array
}

I also have a SQL table like so :

category_id
random_item

Then I want to add that item to the table. For every category_id, I want multiple rows of random item such as:

  • There is no duplicate items in each category (item a cannot figure twice with the category_id 1, but item a can be in category_id 1 and category_id 2)
  • The number of items will be smaller than length of the array. (That's not a requirement that'll just be always the case).

Here is some imaginary code that do just that:

function persist(){
    var a = giveRandomItem();
    // $1 = a
    return execute("INSERT INTO mytable (random_item) values ($1) ON CONFLICT DO NOTHING RETURNING *", a)
}

 // usage
 var persisted;
 while(persisted === undefined){
    persisted = persist();
 }

The problem with this is that it's not constant time. There is a probability that I hit the DB 5 times in a row because the item has already been persisted.

For each category I expect max 5k items and my array length is 400 000. So the probability is quite low though.

I'd like to find a way that is constant time nevertheless, or at least have a sql command that would try multiple values, so as to lower the probability further.


Use case

A simple use case I can think of is this (it is useless but simple):

Users are presented with an interface where they can select a category. Then they can press a button that adds a random item to it. There are multiple users, each acting individually. So user 1 can add a random item to category 1 while user 2 simultaneously adds a random item to category 2

EDIT

I ended up doing something like this:

At the application level:

shuffle(array);

function getRandomItem(seed, inc){
   let index = (seed + inc) % array.length;
   return array[index]
}

// usage:

let seed = item.category_id
let inc = category.item_count

This way I've no duplicates since I said the count of items was lower than the length of the array. Also the items are seemingly random because the id of the category is used as a seed for the start of the increment. However that's only for the starting point and it's therefore not really random but that works for my use case.

Ced
  • 15,847
  • 14
  • 87
  • 146

1 Answers1

2

To guarantee that you don't experience conflicts (unique constraint violations) you should change your approach. Instead of generating one random item at a time you should generate all 5K items at once (and then insert them in bulk). Inserting in bulk would also speed things up considerably.

How to generate 5K random items out of array of 400K items?

One way is to shuffle the array and take first 5K elements. Then the next 5K elements, and so on. This would also guarantee that separate batches don't have repeating elements (until all 400K are exhausted and you start from the beginning of the array again).

If you want elements to have a chance to be repeated between batches, then re-shuffle array between the batches.


After discussion in the comments, it looks like you need an algorithm that generates Cyclic permutations. For each category store in the DB the starting seed/internal state of this algorithm to know how to continue picking elements of 400K array in such a way that they look random, but don't repeat until all 400K elements for the category are picked.

Community
  • 1
  • 1
Vladimir Baranov
  • 31,799
  • 5
  • 53
  • 90
  • I believe this won't work. I added an use case to my question to better reflect the problem. Your answer won't work because it assumes I can generate everything at once. I cannot, items are generated through users input. If that makes sens. – Ced Apr 18 '17 at 10:46
  • If you insist on inserting elements one-by-one, then you should change the implementation of your `getRandomItem()` function, so that it returns non-repeating elements. One way to do it is to again shuffle the 400K array and each time you call `getRandomItem()` it would return the next item from the array. So, the main idea is to control the randomness generation. Pre-generate the 400K array in a random order and then read from it sequentially when you need a random item. It will guarantee no conflicts until you loop through all 400K elements. – Vladimir Baranov Apr 18 '17 at 10:51
  • It already does that. The issue is that the number of categories is unknown. User 1 can post an item in category 1. Then the index in the randomized array is incremented. Some time passes, while other users post on other categories (and the index in the randomized array is incremented). Then after some time another user post an item to category 1. The thing is that there has been 400 000 items posted in other categories in the meantime and we are back at square 1. Do you see the problem ? The index is not guaranteed to be at a place that is not a duplicate for said category. – Ced Apr 18 '17 at 10:57
  • @Ced, yes, if there were 400K inserts, then elements would begin to repeat. If you have only a handful of categories, then you can pre-generate a shuffled array for each category. If you know that each category would have, say, at most 5K elements, then you can pre-generate only 5K items for each category (not all 400K), save them somewhere in a helper table and retrieve as needed. The main idea still holds - don't leave randomness to a chance, control it. – Vladimir Baranov Apr 18 '17 at 11:04
  • I don't have only a handful of categories actually. Controlling the randomness is why I made this post. – Ced Apr 18 '17 at 11:08
  • @Ced, essentially, you need an algorithm that would walk through all 400K elements in a random order without repetition (the period should be 400K). The starting point is determined by some seed. The algorithm would have some internal state, like a standard [Linear congruential generator](https://en.wikipedia.org/wiki/Linear_congruential_generator). So, for each category you need to store not 400K, not 5K elements, but a starting seed and/or the internal state of the algorithm, so that you could continue picking random values of the sequence where you stopped. – Vladimir Baranov Apr 18 '17 at 11:23
  • This is what I needed, thanks ! Could you put that in your post as I feel like the real answer is in the comments. – Ced Apr 18 '17 at 11:29