
I have a weighted choice algorithm that works, but I'd like to improve it in two aspects (in order of importance):

  1. Guarantee that a minimum number from each possible choice is chosen.
  2. Reduce computational complexity from O(nm) to O(n) or O(m), where n is the requested number of randomly chosen items and m is types of items available.

Edit: The number of requested items is typically small (less than 100) for my purposes. As such, algorithms with complexity O(t) or O(t+n), where t is the total number of items, generally perform worse than O(nm), since t is always greater than m.

Simplified code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Security.Cryptography;

public class Program
{
    static void Main(string[] args)
    {
        // List of items with discrete availability
        // In this example there is a total of 244 discrete items and 3 types, 
        // but there could be millions of items and hundreds of types. 
        List<Stock<string>> list = new List<Stock<string>>();
        list.Add(new Stock<string>("Apple", 200));
        list.Add(new Stock<string>("Coconut", 2));
        list.Add(new Stock<string>("Banana", 42));

        // Pick 10 random items
        // Chosen with equal weight across all types of items
        foreach (var item in Picker<string>.PickRandom(10, list))
        {
            // Do stuff with item
            Console.WriteLine(item);
        }
    }
}

// Can be thought of as a weighted choice
// where (Item Available) / (Sum of all Available) is the weight.
public class Stock<T>
{
    public Stock(T item, int available)
    {
        Item = item;
        Available = available;
    }
    public T Item { get; set; }
    public int Available { get; set; }
}

public static class Picker<T>
{
    // Randomly choose requested number of items from across all stock types
    // Currently O(nm), where n is requested number of items and m is types of stock
    // O(n) or O(m) would be nice, which I believe is possible but not sure how
    // O(1) would be awesome, but I don't believe it is possible
    public static IEnumerable<T> PickRandom(int requested, IEnumerable<Stock<T>> list)
    {
        // O(m) : to calculate total items,
        // thus implicitly have per item weight -> (Item Available) / (Total Items)
        int sumAll = list.Sum(x => x.Available);

        // O(1)
        if (sumAll < requested)
            throw new ArgumentException("Requested amount must not exceed total available");

        // O(1)
        Random rng = new Random(Seed());

        // O(n) for the loop alone : O(nm) total
        for (int n = 0; n < requested; n++)
        {
            // O(1) to choose an item : uses implicit ordering
            // (Random.Next's upper bound is exclusive, hence sumAll + 1)
            int choice = rng.Next(1, sumAll + 1);
            int current = 0;

            // O(m) to find the chosen item
            foreach (Stock<T> stock in list)
            {
                current += stock.Available;

                if (current >= choice)
                {
                    yield return stock.Item;

                    // O(1) to re-calculate weight once item is found
                    stock.Available -= 1;
                    sumAll--;

                    break;
                }
            }
        }
    }

    // Sufficiently random seed
    private static int Seed()
    {
        byte[] bytes = new byte[4];
        using (var rngCsp = new RNGCryptoServiceProvider())
            rngCsp.GetBytes(bytes);
        return bytes[0] << 24 | bytes[1] << 16 | bytes[2] << 8 | bytes[3];
    }
}

The function PickRandom() uses yield return and IEnumerable, but that is not required. I was just trying to be clever when I first wrote the function, so that it could iterate through anything (even, say, an enumerable from a LINQ to SQL query). Afterwards, I discovered that while the flexibility is nice, I never truly needed it.

My first thought on resolving point #1 (guarantee that a minimum number from each possible choice is chosen) would be to choose the required minimum from each type in an entirely non-random way, use my existing algorithm to pick the remaining unconstrained part, then shuffle the results together. This seems the most natural approach and mimics how I would do something like this in real life, but I suspect it's not the most efficient way.
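
Roughly, this first idea would look something like the following untested sketch, sitting next to PickRandom in the Picker<T> class. The minPerType parameter, the argument check, and the Fisher-Yates shuffle at the end are only for illustration, and it assumes every type has at least minPerType items available.

// Sketch of idea #1: take the guaranteed minimum of each type up front,
// pick the unconstrained remainder with the existing PickRandom, then shuffle.
public static IList<T> PickRandomWithMinimum(int requested, int minPerType, List<Stock<T>> list)
{
    if (minPerType * list.Count > requested)
        throw new ArgumentException("Requested amount must cover the minimum of each type");

    var result = new List<T>(requested);

    // Non-random part: O(m) to take minPerType of each stock
    // (assumes every stock has at least minPerType available)
    foreach (Stock<T> stock in list)
    {
        for (int i = 0; i < minPerType; i++)
            result.Add(stock.Item);
        stock.Available -= minPerType;
    }

    // Unconstrained part: existing weighted pick for the remainder
    result.AddRange(PickRandom(requested - result.Count, list));

    // Fisher-Yates shuffle so the guaranteed picks aren't clustered at the front: O(n)
    Random rng = new Random(Seed());
    for (int i = result.Count - 1; i > 0; i--)
    {
        int j = rng.Next(i + 1);
        T temp = result[i];
        result[i] = result[j];
        result[j] = temp;
    }

    return result;
}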

My second idea was to create a result array first, randomly choose indexes to fill in the required minimums, then fill in the rest using my existing algorithm; but in all my attempts this either increased the big-O complexity or turned into a mess of indexes being recorded everywhere. I still think this approach is possible, I just haven't been able to work it out yet.

Then I decided to come here, since this problem seems like it could be abstracted to a fairly generic algorithm. However, all the keywords I use to search usually point me to basic weighted random number generation (as opposed to choosing discrete items grouped by type with specific availability), and I have not been able to find anything that constrains the problem with a minimum choice per item type while still remaining randomized. So I'm hoping either someone bright can see a simple, efficient solution, or someone who has heard this problem before knows better keywords than I do and can point me in the right direction.

Sybeus
  • Have a look at http://eli.thegreenplace.net/2010/01/22/weighted-random-generation-in-python/ for a comparison of different approaches. My answer is similar to the *wrg* method, which seems to perform best. – dtb Aug 24 '11 at 08:12
  • Appreciate the link, as it nicely shows the different approaches of weighted number generation. Unfortunately, it doesn't show how to ensure a minimum choice, or how to consider shifting weights after each successive choice (the key difference between weighted choice of discrete items and simple weighted random numbers). – Sybeus Aug 25 '11 at 03:09
  • This is very similar to http://stackoverflow.com/questions/2140787/select-random-k-elements-from-a-list-whose-elements-have-weights – Jason Orendorff Dec 20 '11 at 16:03

2 Answers


Here's a rough idea; I'm sure it can be further improved:

  1. Assume that each available item has a unique ID in the range [0, sumAll), where sumAll is the number of items available. So the first apple has ID 0, the last apple has ID 199, the first coconut has ID 200, and so on. Determining sumAll and the subrange of each type is O(m) where m is the number of types.

  2. Pick a random ID (with all IDs having the same weight). Repeat this until you have a set of 10 different IDs. This is O(n) where n is the number of items to pick.

  3. Determine the type of each picked ID using binary search. This is O(n log m).

  4. Remove the picked items from the available items. This is O(m).

To guarantee a minimum number of items picked for each type, picking these and removing them from the available items before step 1 sounds like a good idea. This is O(m).
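
For example, something like this rough, untested sketch (it could sit next to PickRandom in the Picker<T> class; it assumes every type has at least one item available, simply re-draws duplicate IDs, and folds step 4 into the lookup loop):

// Sketch only: cumulative ID ranges + binary search per picked ID.
public static IEnumerable<T> PickByIdRange(int requested, IList<Stock<T>> list)
{
    // Step 1: O(m) prefix sums; cumulative[i] is the first ID *after* type i's range
    int[] cumulative = new int[list.Count];
    int sumAll = 0;
    for (int i = 0; i < list.Count; i++)
    {
        sumAll += list[i].Available;
        cumulative[i] = sumAll;
    }

    if (sumAll < requested)
        throw new ArgumentException("Requested amount must not exceed total available");

    // Step 2: draw distinct IDs in [0, sumAll); the HashSet silently rejects duplicates
    Random rng = new Random();
    var ids = new HashSet<int>();
    while (ids.Count < requested)
        ids.Add(rng.Next(sumAll));

    // Step 3: map each ID to its type with a binary search, O(log m) per ID
    foreach (int id in ids)
    {
        int index = Array.BinarySearch(cumulative, id);
        index = index < 0
            ? ~index       // not found: index of the first range whose upper bound exceeds id
            : index + 1;   // found: id equals a range's upper bound, so it belongs to the next type

        // Step 4, deferred: adjust availability once per picked item
        list[index].Available -= 1;

        yield return list[index].Item;
    }
}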

dtb
  • I think you missed a subtle issue in your solution. After I pick a random ID in step 2 there is one less to choose from of that type, and I would have to recalculate the ranges from step 1 to perform a binary search on the next iteration. Which means I'm back at _O(nm)_ with no gain. – Sybeus Aug 25 '11 at 02:55
  • This is covered by picking 10 different IDs. There is no need to subtract each picked item individually from the items available; you can do that at the end for all picked items. – dtb Aug 25 '11 at 11:07
  • @dtb avoiding the removal is a good way to eliminate the additional complexity; the concern would be that in a general case, given the data in the question where *sumAll* is 244, it will take a *very long time* to 'choose', say, 240 items from this list due to the diminishing size of the 'available items' as a proportion of the total range. An option could be to recalculate the weightings after a certain number of duplicates is encountered... but this won't change the big O of the algorithm. – Kirk Broadhurst Aug 25 '11 at 23:29
  • Right, but the number of available items is much larger than the number of items picked, so the probability of collisions should be very low. – dtb Aug 25 '11 at 23:39
  • May have merit, given that for my purposes the number of requested items is relatively small. Do I keep a list of IDs I've already chosen and just repick if there is a duplicate? Just trying to think how it would work if, say, both Coconuts were chosen, then one of them was chosen a second time. I wouldn't want to just shift to the next type of Stock when I hit a duplicate, as that would be adding extra weight that it didn't begin with. – Sybeus Aug 26 '11 at 01:05
  • Yes, just repick if there is a duplicate. All items have exactly the same chance of being picked, except those that have already been picked, which have a zero chance. – dtb Aug 26 '11 at 01:49
  • Honestly, I would have liked a more elegant solution that didn't have to repick sometimes. But after running a few performance tests, doing the binary search, even if it has to repick a couple times, is still faster than my current method. It gets the job done. – Sybeus Aug 28 '11 at 06:31

Great question. I think that O(mn) is the basic case, because for each of the n items picked you need to re-evaluate the weighting (which is O(m)).

The Picker class seems to always return the same type - you're not mixing types of Stock here; in your example, Stock<string>. So perhaps the Picker class can flatten all your stock into a single list - less memory efficient, but more computationally efficient.

public static IEnumerable<T> PickRandom(int requested, IEnumerable<Stock<T>> list)
{
    var allStock = list.SelectMany(item => 
        Enumerable.Range(0, item.Available).Select(r => item.Item)).ToList();

    Random rng = new Random(); 

    for (int n = 0; n < requested; n++) 
    { 
        int choice = rng.Next(allStock.Count); // upper bound is exclusive, so no -1

        var result = allStock[choice];
        allStock.RemoveAt(choice);

        yield return result;
    }  
}

The loss here is that you aren't altering the original Stock objects, but that's an implementation that you can make (your sample shows the Stock objects being created as anonymous parameters to the Picker).

edit

Here's another sample that works very similarly to your existing code. It will create a dictionary in which you can look up your choice. But again, the dictionary (which controls the weighting) needs to be reevaluated after each selection, leading to O(mn).

public static IEnumerable<T> PickRandom(int requested, IEnumerable<Stock<T>> list)
{
    Random rng = new Random();
    for (int n = 0; n < requested; n++)
    {
        int cumulative = 0;
        var items = list.ToDictionary(item => 
            new { Start = cumulative, End = cumulative += item.Available });

        int choice = rng.Next(cumulative); // upper bound is exclusive, so no -1
        var foundItem = items.Single(i => 
            i.Key.Start <= choice && choice < i.Key.End).Value;

        foundItem.Available--;
        yield return foundItem.Item;
    }
}

Logically, is it possible to re-evaluate the weighting without considering all your categories?

Kirk Broadhurst
  • I agree that the weighting likely needs to be re-evaluated for each _n_ choice; however, I do not agree that each re-evaluation needs to be _O(m)_. Case in point, my existing solution does better than that. The first evaluation is _O(m)_, then it re-evaluates each iteration with `stock.Available -= 1; sumAll--;`, which is an _O(1)_ set of operations. If a clever enough improvement to pick the stock is found, instead of having to loop _m_ times to find it, then there is a chance to retain this property and lower the complexity (which dtb's answer nearly did). – Sybeus Aug 25 '11 at 05:54
  • I'd need much larger data sets before this approach yields benefits computationally, but by then it would likely slow down from the memory allocation being performed. Let's call the number of items _t_. _t_ will always be greater than _m_, since _m_ is a grouping of _t_. Thus O(t) > O(m). Something I should add to my original question is that _n_ is typically small (less than 100) for my purposes; thus even with 1,000,000 items, 1,000 types, and 100 requested objects, _O(mn)_ would still beat _O(t+n)_ computationally. If I can find an _O(m+n)_ or _O(n log m)_ solution, then it would be no contest. – Sybeus Aug 25 '11 at 05:58
  • @Sybeus - Your existing solution is O(mn), as you've noted: n selections, each of which is O(m) to find the appropriate group. I fail to see how your solution does better than O(m) to re-evaluate the weighting after each choice. I am happy to be shown to be wrong! – Kirk Broadhurst Aug 25 '11 at 06:52
  • I understand your point re: size of data sets and number of items being picked; I didn't know this when answering ;) – Kirk Broadhurst Aug 25 '11 at 06:54
  • Sorry for not including the typical usage of number of items picked, you couldn't have known otherwise. It's been added to the question now. As for how I came to the conclusion that I have recalculated weight in _O(1)_ per _n_ selection, I've edited my question to include complexity notes. – Sybeus Aug 25 '11 at 14:47
  • Yes, obviously the arithmetic around adjusting the weighting is O(1). Sorry to confuse you - perhaps I wasn't clear enough. I was considering the bigger component of *finding category & adjusting weighting*, which you note in your question: `// O(m) to find the chosen item foreach (Stock stock in list)`. Your algorithm is, as you've noted, O(mn). – Kirk Broadhurst Aug 25 '11 at 23:26
  • I believe I see the confusion. In my solution, finding the chosen category and re-evaluating category weights are distinct steps, thus I don't see them as a single component. The fact that the re-evaluation occurs within the loop is purely structural. If you unwind the loop logic, you'll see that re-evaluation occurs at most once after I choose an item. In other words, a subsequent step that is _O(1)_. Thus your first paragraph confused me when you said re-evaluation is _O(m)_. – Sybeus Aug 26 '11 at 00:50