I think your two questions are independent and should have been separate questions. Also I'm not sure that I have understood them correctly. But here we go:
Sharded weighted random query
If your reference to sharding has to do with distributing the item store over multiple network hosts and attempting to do some sort of network parallel random select, then you can use the modified reservoir sample algorithm which I outline at the end of this answer.
That algorithm was originally developed for use in a redundant network where the various storage hosts are not necessarily reachable directly from the central host and the connectivity is a graph, not a tree. In that case, you need to be able to deal with hosts which do not respond (which will bias an individual query, but if the network faults are infrequent and random will hopefully not bias a long series of queries). It's also necessary to deal with the possibility that a host is queried twice; in the outlined algorithm, I simply ignore the second and subsequent queries on the assumption that if a query reaches a host, then the response is likely to be returned to the querying host. That could be completely wrong, but it makes the problem much easier and again it's probably unbiased over a sufficiently large number of queries.
Without the complications, if the central host can reliably connect to every storage host, the algorithm is straight-forward. The central host queries all storage hosts in parallel, and each storage host returns a tuple of the total weight of objects it stores, and one object randomly selected according to those weights. (It uses some standard weighted random selection algorithm to do that. Which algorithm is used will depend on how often the objects and weights change.)
The central host maintains two variables: total
, the sum of weights from servers which have responded (initially 0), and candidate
, a random object which might be returned (initially some sentinel indicating "no object").
It handles the responses one at a time, in any order (such as the order it receives them). For each response <weight, object>
, it does the following:
total
← total
+ weight
r
← a random integer in the range [0, total)
- if
r
< weight
: candidate
← object
When it decides that all remote servers have responded, it returns candidate
.
Weighted random shuffle
(At least, I think you're asking for a weighted random shuffle).
I was going to suggest using the standard Fisher-Yates algorithm modified with weights, which I think will yield the sampling behaviour you expect. To do that, you start with the objects in any arbitrary order, and then for each value of i
from 1
to n
:
- select
j
, the index of a (weighted) random element from the objects starting at i
, and swap objects i
and j
.
To do that, you need to maintain the CDF of the successively smaller suffixes, which you can do in O(log N) by keeping the objects in a binary tree. (Or you can do it much more simply in O(N), if N is smallish.)
However, I did a quick search on SO before hitting the Post button, and concluded that this brilliant answer is actually better, because it achieves O(N log N) without any complicated data structure: For each object, you generate a random number from an exponential distribution whose rate is the corresponding weight. Then you sort the objects according to those random numbers, and the result is the random shuffle.