
There are many typical questions about this, like https://softwareengineering.stackexchange.com/questions/150616/return-random-list-item-by-its-weight

Imagine a more advanced problem.

You have N sources of (item_id, weight) pairs. Let's call them shards. Each shard contains a list of (item_id, weight) pairs.

And you have a central node; let's call it Central.

The problem is: on Central, choose a random item from The Big List (the list virtually merged from all lists on all shards), with probability proportional to its weight relative to the total weight of all items.

For example, we have two shards:

+-------+---------+--------+
| shard | item_id | weight |
+-------+---------+--------+
|     1 |       1 |      7 |
|     1 |       2 |      4 |
|     1 |       3 |      2 |
|     2 |       4 |      5 |
|     2 |       5 |      1 |
+-------+---------+--------+

(Assume item_id is unique across all shards.)

First problem:

How to choose an item_id randomly, but weighted across all shards? I.e. total_weight == 7+4+2+5+1 == 19, so item 1 will be chosen with probability 7/19, item 2 with 4/19, item 3 with 2/19, and so on.

Second problem:

How to order (rank) all items from all shards randomly, but weighted across all shards?

I.e. the ideal ordering is 1, 4, 2, 3, 5 (by descending weight),

but another ordering like 1, 2, 4, 3, 5 may also appear, slightly less often than the previous one,

...

and the worst case, 5, 3, 2, 4, 1, can also appear, but with very low probability.

Is there a well-known problem in computer science that this corresponds to?

vladon
    The shards have no relevance here. What you describe as the problem is shuffling a set of weighted items so that heavier ones are more likely to be placed at the beginning. Some languages may provide tools for that, others don't. If you are in the latter case, you can search for weighted shuffling ([example](http://programmers.stackexchange.com/questions/233541/how-to-implement-a-weighted-shuffle)). – Nelfeal Sep 06 '16 at 10:10

2 Answers


For your first problem, you can do something like

import java.util.ArrayList;
import java.util.Random;

// Your pairs (item_id, weight); Pair is assumed to have
// int fields itemId and weight.
ArrayList<Pair> theBigList;
// The total weight of your data. Keep it updated.
int totalWeight;

public int weightedSearch() {
    Random r = new Random();
    // Random integer in the interval [1, totalWeight].
    int random = r.nextInt(totalWeight) + 1;
    // Subtract weights until random falls inside the current item's slice.
    for (int i = 0; true; i++) {
        if (theBigList.get(i).weight < random)
            random -= theBigList.get(i).weight;
        else
            return theBigList.get(i).itemId;
    }
}

The weighting of the random search comes from the random integer that is generated. In your example, if random is between 1 and 7 (probability 7/19), the first element is returned; if it is between 8 and 11 (probability 4/19), the second element is returned, and so on.

Tip: put your heaviest items at the beginning of the list, so the weighted search is more likely to return quickly (the loop terminates earlier).

Santiago Gil

I think your two questions are independent and should have been separate questions. Also I'm not sure that I have understood them correctly. But here we go:

Sharded weighted random query

If your reference to sharding has to do with distributing the item store over multiple network hosts and attempting to do some sort of network parallel random select, then you can use the modified reservoir sample algorithm which I outline at the end of this answer.

That algorithm was originally developed for use in a redundant network where the various storage hosts are not necessarily reachable directly from the central host and the connectivity is a graph, not a tree. In that case, you need to be able to deal with hosts which do not respond (which will bias an individual query, but if the network faults are infrequent and random will hopefully not bias a long series of queries). It's also necessary to deal with the possibility that a host is queried twice; in the outlined algorithm, I simply ignore the second and subsequent queries on the assumption that if a query reaches a host, then the response is likely to be returned to the querying host. That could be completely wrong, but it makes the problem much easier and again it's probably unbiased over a sufficiently large number of queries.

Without the complications, if the central host can reliably connect to every storage host, the algorithm is straightforward. The central host queries all storage hosts in parallel, and each storage host returns a tuple of the total weight of the objects it stores and one object randomly selected according to those weights. (It uses some standard weighted random selection algorithm to do that. Which algorithm is used will depend on how often the objects and weights change.)

The central host maintains two variables: total, the sum of weights from servers which have responded (initially 0), and candidate, a random object which might be returned (initially some sentinel indicating "no object").

It handles the responses one at a time, in any order (such as the order it receives them). For each response <weight, object>, it does the following:

  • total ← total + weight
  • r ← a random integer in the range [0, total)
  • if r < weight: candidate ← object

When it decides that all remote servers have responded, it returns candidate.
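
Here is a minimal sketch of that merge step on the central host, in the same Java setting as the other answer (int item ids and weights, positive weights assumed); the class and method names are illustrative, not part of the original description:

import java.util.Random;

// Accumulates shard responses <weight, itemId> and keeps one weighted
// random candidate, as described above.
class CentralSelector {
    private final Random rng = new Random();
    private int total = 0;       // sum of weights from shards seen so far
    private int candidate = -1;  // sentinel meaning "no object yet"

    // Handle one shard response <weight, itemId>; order does not matter.
    void onShardResponse(int weight, int itemId) {
        total += weight;
        int r = rng.nextInt(total);   // uniform in [0, total)
        if (r < weight) {
            candidate = itemId;       // kept with probability weight / total
        }
    }

    // Call once all reachable shards have responded.
    int result() {
        return candidate;
    }
}

Because each shard already picked its representative in proportion to the weights inside that shard, replacing the candidate with probability weight / total makes the final item come out in proportion to its weight over the whole Big List.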

Weighted random shuffle

(At least, I think you're asking for a weighted random shuffle).

I was going to suggest using the standard Fisher-Yates algorithm modified with weights, which I think will yield the sampling behaviour you expect. To do that, you start with the objects in any arbitrary order, and then for each value of i from 1 to n:

  • select j, the index of a (weighted) random element from the objects starting at i, and swap objects i and j.

To do that, you need to maintain the CDF of the successively smaller suffixes, which you can do in O(log N) per selection by keeping the objects in a binary tree. (Or you can do it much more simply in O(N) per selection, if N is smallish.)
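
For concreteness, here is a sketch of the simpler linear-scan variant (O(N) per selection, O(N²) overall), assuming the same Pair class with itemId and weight fields as the other answer; the helper name is made up for illustration:

import java.util.List;
import java.util.Random;

// Weighted Fisher-Yates shuffle: at each position i, pick a weighted
// random element from the suffix [i, n) and swap it into place.
static void weightedShuffle(List<Pair> items, Random rng) {
    int remainingWeight = 0;
    for (Pair p : items)
        remainingWeight += p.weight;

    for (int i = 0; i < items.size() - 1; i++) {
        // Weighted random index j in [i, n): linear scan over the suffix.
        int r = rng.nextInt(remainingWeight) + 1;
        int j = i;
        while (items.get(j).weight < r) {
            r -= items.get(j).weight;
            j++;
        }
        // Swap items i and j, then drop the chosen weight from the suffix total.
        Pair tmp = items.get(i);
        items.set(i, items.get(j));
        items.set(j, tmp);
        remainingWeight -= items.get(i).weight;
    }
}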

However, I did a quick search on SO before hitting the Post button, and concluded that this brilliant answer is actually better, because it achieves O(N log N) without any complicated data structure: For each object, you generate a random number from an exponential distribution whose rate is the corresponding weight. Then you sort the objects according to those random numbers, and the result is the random shuffle.
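
A sketch of that approach, again assuming the Pair class from the other answer (the method name is made up for illustration):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Weighted shuffle via sorting: each item gets an exponentially
// distributed key with rate equal to its weight; sorting by key
// ascending yields the weighted random order.
static List<Pair> weightedShuffleByKeys(List<Pair> items, Random rng) {
    // Precompute one key per item: Exponential(rate = w) sampled as -ln(U) / w.
    double[] keys = new double[items.size()];
    List<Integer> order = new ArrayList<>();
    for (int i = 0; i < items.size(); i++) {
        keys[i] = -Math.log(1.0 - rng.nextDouble()) / items.get(i).weight;
        order.add(i);
    }
    order.sort(Comparator.comparingDouble(idx -> keys[idx]));

    List<Pair> result = new ArrayList<>();
    for (int idx : order)
        result.add(items.get(idx));
    return result;
}

Note that the keys must be generated once and stored before sorting; drawing a fresh random number inside the comparator would make the ordering inconsistent.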

rici