
I have a web/mobile application that should display an infinite scroll view (the continuation of the list of items is loaded periodically and dynamically) of items, where each item has a weight. The bigger an item's weight in comparison to the weights of the other items, the higher the chance/probability that the item is loaded and displayed in the list. The items should be loaded randomly; only the chances of the items appearing in the list should differ.

I am searching for an efficient algorithm/solution, or at least hints, that would help me achieve this.

Some points worth mentioning:

  • the weight has these bounds: 0 <= w < infinity.
  • the weight is not a static value; it can change over time based on some item properties.
  • every item with a weight greater than 0 should have a chance to be displayed to the user, even if its weight is significantly lower than the weights of other items.
  • when a user scrolls and performs multiple requests to the API, they should not see duplicate items, or at least the chance of that should be low.
  • I use a SQL database (PostgreSQL) for storing the items, so the solution should be efficient for this type of database. (It doesn't have to be a purely SQL solution.)

Hope I didn't miss anything important. Let me know if I did.

Adrian
  • What do you mean by "chance to be displayed"? How do you imagine this in SQL? – Anton Sep 11 '20 at 16:10
  • @Anton, I mean the probability that an item will be displayed to the user should depend on the value of its weight in comparison with the weights of the other items. It doesn't have to be a purely SQL solution – Adrian Sep 11 '20 at 16:14
  • But you need random rows, right? Not just the top N rows with the maximum weight? – Anton Sep 11 '20 at 16:17
  • @Anton yes, I need them to be random; just the probability of displaying them should be different – Adrian Sep 11 '20 at 16:19
  • Doesn't sound like an easy task... Anyway, I recommend splitting your task into smaller ones: 1) how to select random rows, 2) how to select random rows using the `weight` of the rows, 3) how to do all that while not selecting rows that have already been shown to the user. For point 1, I googled this: https://stackoverflow.com/questions/8674718/best-way-to-select-random-rows-postgresql – Anton Sep 11 '20 at 16:28
  • @Anton, Thanks, makes sense – Adrian Sep 11 '20 at 16:34

2 Answers


The following are some ideas to implement the solution:

The database table should have a column where each entry is a number generated as follows:

  • log(R) / W,

where—

  • W is the record's weight, which is greater than 0 (stored in its own column), and
  • R is a per-record uniform random number in (0, 1)

(see also Arratia, R., "On the amount of dependence in the prime factorization of a uniform random integer", 2002). Then take the records with the highest values of that column as the need arises.
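For instance, in PostgreSQL this could look like the following minimal sketch (the `items` table and the `weight` and `sort_key` column names are illustrative; `ln(1.0 - random())` is used rather than `ln(random())` because PostgreSQL's `random()` can return exactly 0):

```sql
-- Illustrative sketch: give each record a key of log(R) / W,
-- then page through the records with the highest keys.
ALTER TABLE items ADD COLUMN sort_key double precision;

-- random() returns a value in [0, 1), so 1.0 - random() lies in (0, 1],
-- which keeps ln() away from ln(0).
UPDATE items
SET sort_key = ln(1.0 - random()) / weight
WHERE weight > 0;

-- Take the records with the highest values of that column.
SELECT *
FROM items
WHERE weight > 0
ORDER BY sort_key DESC
LIMIT 20;
```

A plain b-tree index on `sort_key` keeps the `ORDER BY ... LIMIT` query cheap, and re-running the `UPDATE` reshuffles the sample.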

However, note that SQL has no standard way to generate random numbers; DBMSs that implement SQL have their own ways to do so (such as RANDOM() for PostgreSQL), but how they work depends on the DBMS (for example, compare MySQL's RAND() with T-SQL's NEWID()).

EDIT (Mar. 5, 2023): Note that for this solution, the weight expresses how likely a given item is to appear first in the sample. This weight is not necessarily the chance that a given sample of n items will include that item (that is, an inclusion probability). The solution given above will not necessarily ensure that a given item appears in a random sample with probability proportional to its weight; for that, see "Algorithms of sampling with equal or unequal probabilities".

Peter O.
  • @Peter O. But how would that work in case the weight changes? Can I store only log(R) and calculate log(R) / W at query time somehow, or should I update the log(R) / W column periodically (however, I don't know if that option would work in my case)? – Adrian Sep 11 '20 at 16:49
  • Yes, you update the weight and the `log(R) / W` column. And indeed you can choose to store `log(R)` instead of `R` if it's more efficient (and perhaps calculate `log(R) / W` at query time). – Peter O. Sep 11 '20 at 17:12
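A minimal sketch of the variant discussed in this comment thread, assuming the same illustrative `items` table as above and a hypothetical `log_r` column:

```sql
-- Persist only ln(R), generated once per row.
ALTER TABLE items ADD COLUMN log_r double precision;
UPDATE items SET log_r = ln(1.0 - random());

-- Compute log(R) / W at query time, so a changed weight takes
-- effect immediately without touching the stored values.
SELECT *
FROM items
WHERE weight > 0
ORDER BY log_r / weight DESC
LIMIT 20;
```

Note that `ORDER BY log_r / weight` cannot use an index on `log_r` alone; an expression index on `(log_r / weight)` would help, and PostgreSQL keeps expression indexes up to date when `weight` changes.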

Peter O. had a good idea, but it has some issues. I would expand it a bit in favor of being able to shuffle a little better on a per-user basis, at a higher database space cost:

  1. Use a single column, but store multiple fields in it. I recommend the Postgres JSONB type (which stores the data as JSON that can be indexed and queried). Use several fields, each holding its own log(R) / W value; roughly log(U) + log(P) of them, where U is the number of users and P is the number of items, with a minimum of probably 5 fields. Add an index for each of the fields within the JSONB, and add more fields as the number of users/items gets high enough. (A sketch follows this list.)
  2. Have a background process that regularly rotates the numbers in #1. This can cause duplication, but if you only rotate a small subset of the items at a time (such as O(sqrt(P)) of them), the odds of the user noticing are low, especially if you query for data both backwards and forwards and stitch/dedupe it together before displaying the next row(s). Careful use of manual pagination adjustments helps a lot here if it becomes an issue.
  3. Before displaying items, randomly pick one of the indexed fields and sort the data by it. This means you have a 1 in log(P) + log(U) chance of showing the user the same ordering twice. Ideally, each user would pick a random subset of those indexed fields (to avoid seeing the same order twice) and use that as the order, but I can't think of a way to make that work and still be practical. Though a random shuffle of the indexed fields, sorting by them in that order, might be practical if the randomized weights are normalized so that the sort order matters.
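A hypothetical sketch of points 1 and 3 in PostgreSQL, assuming the same illustrative `items` table as in the other answer, a `sort_keys` JSONB column, and five fields `k0`..`k4` (only one of the per-field expression indexes is shown):

```sql
-- Point 1: one JSONB column holding several independent log(R) / W keys.
ALTER TABLE items ADD COLUMN sort_keys jsonb;

UPDATE items
SET sort_keys = jsonb_build_object(
    'k0', ln(1.0 - random()) / weight,
    'k1', ln(1.0 - random()) / weight,
    'k2', ln(1.0 - random()) / weight,
    'k3', ln(1.0 - random()) / weight,
    'k4', ln(1.0 - random()) / weight)
WHERE weight > 0;

-- One expression index per field keeps each ordering cheap.
CREATE INDEX items_k0_idx
    ON items (((sort_keys->>'k0')::double precision) DESC);

-- Point 3: at request time the application picks one field at random
-- (say k0) and sorts by it.
SELECT *
FROM items
WHERE weight > 0
ORDER BY (sort_keys->>'k0')::double precision DESC
LIMIT 20;
```

The background rotation in point 2 would then amount to re-running the `UPDATE` for a small random subset of the rows (such as O(sqrt(P)) of them) on a schedule.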
Nuclearman