13

I am trying to write a program to select a random name from the US Census last name list. The list format is:

Name           Weight Cumulative line
-----          -----  -----      -
SMITH          1.006  1.006      1
JOHNSON        0.810  1.816      2
WILLIAMS       0.699  2.515      3
JONES          0.621  3.136      4
BROWN          0.621  3.757      5
DAVIS          0.480  4.237      6

Assuming I load the data into a structure like:

class Name
{
    public string Name {get; set;}
    public decimal Weight {get; set;}
    public decimal Cumulative {get; set;}
}

What data structure would be best to hold the list of names, and what would be the best way to select a random name from the list so that the distribution of names matches the real world?

I will only be working with the first 10,000 rows if it makes a difference in the data structure.

I have tried looking at some of the other questions about weighted randomness, but I am having a bit of trouble turning theory into code. I do not know much math theory, so I do not know whether this is a "with replacement" or "without replacement" random selection. I want the same name to be able to show up more than once, whichever one that means.

Scott Chamberlain

4 Answers

10

The "easiest" way to handle this would be to keep this in a list.

You could then just use:

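Name GetRandomName(Random random, List<Name> names)
{
    // Scale a random value in [0, 1) up to the total weight (the last cumulative value).
    decimal value = (decimal)random.NextDouble() * names[names.Count - 1].Cumulative;
    // The first entry whose cumulative value reaches the random value is the pick (requires System.Linq).
    return names.First(name => name.Cumulative >= value);
}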
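As a rough, self-contained sketch of the linear scan above, using the six sample rows from the question: `NameEntry`, `LinearPicker`, `SampleNames` and `PickByValue` are hypothetical names introduced here for illustration. The entry type is renamed because C# does not allow a property to share the name of its enclosing type.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the question's class.
public class NameEntry
{
    public string Name { get; set; }
    public decimal Weight { get; set; }
    public decimal Cumulative { get; set; }
}

public static class LinearPicker
{
    // Sample rows copied from the question's table.
    public static List<NameEntry> SampleNames() => new List<NameEntry>
    {
        new NameEntry { Name = "SMITH",    Weight = 1.006m, Cumulative = 1.006m },
        new NameEntry { Name = "JOHNSON",  Weight = 0.810m, Cumulative = 1.816m },
        new NameEntry { Name = "WILLIAMS", Weight = 0.699m, Cumulative = 2.515m },
        new NameEntry { Name = "JONES",    Weight = 0.621m, Cumulative = 3.136m },
        new NameEntry { Name = "BROWN",    Weight = 0.621m, Cumulative = 3.757m },
        new NameEntry { Name = "DAVIS",    Weight = 0.480m, Cumulative = 4.237m },
    };

    // Deterministic core: the first entry whose cumulative value reaches `value`.
    public static NameEntry PickByValue(List<NameEntry> names, decimal value) =>
        names.First(n => n.Cumulative >= value);

    // Random wrapper: scales a value in [0, 1) up to the total weight.
    public static NameEntry PickRandom(Random random, List<NameEntry> names)
    {
        decimal total = names[names.Count - 1].Cumulative;
        return PickByValue(names, (decimal)random.NextDouble() * total);
    }
}
```

Splitting the lookup from the random draw makes it easy to check by hand: a value of 2.0 falls between JOHNSON's 1.816 and WILLIAMS's 2.515, so `PickByValue` returns WILLIAMS.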

If speed is a concern, you could store a separate array of just the cumulative values. With this, you could use Array.BinarySearch to quickly find the appropriate index:

Name GetRandomName(Random random, List<Name> names, double[] cumulativeValues)
{
    double value = random.NextDouble() * cumulativeValues[cumulativeValues.Length - 1];
    int index = Array.BinarySearch(cumulativeValues, value);
    // A negative result is the bitwise complement of the index of the first larger element.
    if (index < 0)
        index = ~index;

    return names[index];
}
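A minimal sketch of how the separate cumulative array might be built and searched, assuming the weights are available as a plain `double[]`. The names `BinarySearchPicker`, `BuildCumulative` and `IndexFor` are made up for this example, and the final clamp is an extra safety measure that is not part of the answer above.

```csharp
using System;

public static class BinarySearchPicker
{
    // Running totals of the weights; the last element is the overall total.
    public static double[] BuildCumulative(double[] weights)
    {
        var cumulative = new double[weights.Length];
        double sum = 0;
        for (int i = 0; i < weights.Length; i++)
        {
            sum += weights[i];
            cumulative[i] = sum;
        }
        return cumulative;
    }

    // Index of the first cumulative value that is >= value.
    public static int IndexFor(double[] cumulative, double value)
    {
        int index = Array.BinarySearch(cumulative, value);
        // A negative result is the bitwise complement of the index
        // of the first element larger than the search value.
        if (index < 0)
            index = ~index;
        // Clamp in case the value lands past the final total.
        return Math.Min(index, cumulative.Length - 1);
    }

    // names[i] must correspond to cumulative[i].
    public static string PickRandom(Random random, string[] names, double[] cumulative)
    {
        double value = random.NextDouble() * cumulative[cumulative.Length - 1];
        return names[IndexFor(cumulative, value)];
    }
}
```

Each lookup is O(log n), so with 10,000 rows a pick takes roughly 14 comparisons instead of a scan over the whole list.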

Another option, which is probably the most efficient, would be to use something like one of the C5 Generic Collection Library's tree classes. You could then use RangeFrom to find the appropriate name. This has the advantage of not requiring a separate collection.
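For readers without C5, the same idea can be approximated with the built-in `SortedSet<T>`, which is also a balanced tree. This sketch swaps in `SortedSet<T>.GetViewBetween` for C5's `RangeFrom`, so it is an analogue rather than the C5 API. `NameEntry`, `ByCumulative` and `TreePicker` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the question's class.
public class NameEntry
{
    public string Name { get; set; }
    public decimal Cumulative { get; set; }
}

// Orders entries by cumulative value only.
public sealed class ByCumulative : IComparer<NameEntry>
{
    public int Compare(NameEntry a, NameEntry b) =>
        a.Cumulative.CompareTo(b.Cumulative);
}

public static class TreePicker
{
    // Entries with identical cumulative values would collapse into one,
    // so this assumes every weight is greater than zero.
    public static SortedSet<NameEntry> Build(IEnumerable<NameEntry> entries) =>
        new SortedSet<NameEntry>(entries, new ByCumulative());

    // Smallest entry whose cumulative value is >= value,
    // or null when value exceeds the total.
    public static NameEntry PickByValue(SortedSet<NameEntry> tree, decimal value)
    {
        var lower = new NameEntry { Cumulative = value };
        var upper = new NameEntry { Cumulative = decimal.MaxValue };
        return tree.GetViewBetween(lower, upper).Min;
    }

    public static NameEntry PickRandom(Random random, SortedSet<NameEntry> tree) =>
        PickByValue(tree, (decimal)random.NextDouble() * tree.Max.Cumulative);
}
```

The entries themselves live in the tree, so no parallel array of cumulative values is needed.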

Reed Copsey
  • Your first implementation will be fast enough for what I need to do, thanks! – Scott Chamberlain Sep 09 '11 at 20:28
  • We arrived at this same solution. Furthermore, we implemented an efficiency wrapper around the NextDouble to spread the information across several picks of GetRandomName (don't need 32 bits information to pick from 6 choices). – gap Nov 19 '13 at 17:10
  • Looking at this, I feel like the Binary Search answer needs a different sign on the if statement. If the index is at or above zero, use that answer. If it is below zero do the bitwise complement (~) to get the first element (if any) that is larger than the given search value (according to the Array.BinarySearch docs). – jbarz Jan 29 '20 at 04:13
4

I've created a C# library for randomly selecting weighted items.

  • It implements both the tree-selection and Walker alias method algorithms, to give the best performance for all use-cases.
  • It is unit-tested and optimized.
  • It has LINQ support.
  • It's free and open-source, licensed under the MIT license.

Some example code:

IWeightedRandomizer<string> randomizer = new DynamicWeightedRandomizer<string>();
randomizer["Joe"] = 1;
randomizer["Ryan"] = 2;
randomizer["Jason"] = 2;

string name1 = randomizer.RandomWithReplacement();
//name1 has a 20% chance of being "Joe", 40% of "Ryan", 40% of "Jason"

string name2 = randomizer.RandomWithRemoval();
//Same as above, except whichever one was chosen has been removed from the list.
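For readers curious about the alias method mentioned above, here is a minimal sketch of one common formulation (Vose's variant). It is not taken from the library. `AliasSampler` and `ImpliedProbability` are names invented for this example, and it assumes all weights are non-negative with a positive sum.

```csharp
using System;
using System.Collections.Generic;

public sealed class AliasSampler
{
    private readonly double[] _probability; // chance of keeping column i
    private readonly int[] _alias;          // fallback index for column i

    public AliasSampler(IReadOnlyList<double> weights)
    {
        int n = weights.Count;
        double total = 0;
        foreach (double w in weights) total += w;

        _probability = new double[n];
        _alias = new int[n];

        // Scale the weights so that their average is 1.
        var scaled = new double[n];
        var small = new Stack<int>();
        var large = new Stack<int>();
        for (int i = 0; i < n; i++)
        {
            scaled[i] = weights[i] * n / total;
            (scaled[i] < 1.0 ? small : large).Push(i);
        }

        // Top up each under-full column from an over-full donor.
        while (small.Count > 0 && large.Count > 0)
        {
            int s = small.Pop();
            int l = large.Pop();
            _probability[s] = scaled[s];
            _alias[s] = l;
            scaled[l] = scaled[l] + scaled[s] - 1.0;
            (scaled[l] < 1.0 ? small : large).Push(l);
        }

        // Whatever is left is a full column, up to rounding error.
        while (large.Count > 0) _probability[large.Pop()] = 1.0;
        while (small.Count > 0) _probability[small.Pop()] = 1.0;
    }

    // Probability of index i implied by the tables, for checking the build.
    public double ImpliedProbability(int i)
    {
        double p = _probability[i];
        for (int j = 0; j < _alias.Length; j++)
            if (j != i && _alias[j] == i) p += 1.0 - _probability[j];
        return p / _probability.Length;
    }

    // O(1) per draw: pick a column uniformly, then keep it or take its alias.
    public int Next(Random random)
    {
        int column = random.Next(_probability.Length);
        return random.NextDouble() < _probability[column] ? column : _alias[column];
    }
}
```

Building the tables is O(n) and each draw is O(1), which is why this approach suits fixed weight lists that are sampled many times with replacement.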
BlueRaja - Danny Pflughoeft
0

I'd say an array (or a vector, if you prefer) would be best to hold them. As for the weighted selection, find the sum, pick a random number between zero and the sum, and pick the first name whose cumulative value is greater than or equal to that number (e.g. here, below 1.006 = SMITH, 1.006 to 1.816 = JOHNSON, etc.).
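One way to convince yourself that this procedure reproduces the weights is to draw many names and compare the observed shares with weight / sum. Nothing is removed between draws, so this is sampling with replacement. The sketch below assumes the weights are in a `double[]`. `DistributionCheck` and `ObservedShares` are hypothetical names.

```csharp
using System;

public static class DistributionCheck
{
    // Draws `draws` indices with replacement and returns the observed share of each.
    public static double[] ObservedShares(double[] weights, int draws, Random random)
    {
        double total = 0;
        foreach (double w in weights) total += w;

        var counts = new int[weights.Length];
        for (int d = 0; d < draws; d++)
        {
            double value = random.NextDouble() * total;

            // Walk the running total until it reaches the random value.
            int i = 0;
            double running = weights[0];
            while (running < value && i < weights.Length - 1)
            {
                i++;
                running += weights[i];
            }
            counts[i]++;
        }

        var shares = new double[weights.Length];
        for (int i = 0; i < weights.Length; i++)
            shares[i] = (double)counts[i] / draws;
        return shares;
    }
}
```

With the six sample rows, SMITH's expected share is 1.006 / 4.237, or about 23.7%, and the observed share should settle near that figure as the number of draws grows.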

P.S. it's Cumulative.

Kevin
0

Just for fun, and in no way optimal

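List<Name> Names = new List<Name>(); // Load your structure into this

List<String> NameBank = new List<String>();
foreach (Name name in Names)
    // Add each name once per 0.001 of weight, so heavier names fill more slots.
    for (int i = 0; i < (int)(name.Weight * 1000); i++)
        NameBank.Add(name.Name);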

then:

Random random = new Random();
String output = NameBank[random.Next(NameBank.Count)];
normanthesquid