1

I have an array of boolean values and need to randomly select a specific quantity of indices for values which are true.

What is the most efficient way to generate the array of indices?

For instance,

BitArray mask = GenerateSomeMask(length: 100000);
int[] randomIndices = RandomIndicesForTrue(mask, quantity: 10);

In this case the length of randomIndices would be 10.

Lea Hayes
  • 62,536
  • 16
  • 62
  • 111

3 Answers3

3

There's a faster way to do this that requires only a single scan of the list.

Consider picking a line at random from a text file when you don't know how many lines are in the file, and the file is too large to fit in memory. The obvious solution is to read the file once to count the lines, pick a random number in the range of 0 to Count-1, and then read the file again up to the chosen line number. That works, but requires you to read the file twice.

A faster solution is to read the first line and save it as the selected line. You replace the selected line with the next line with probability 1/2. When you read the third line, you replace with probability 1/3, etc. When you've read the entire file, you have selected a line at random, and every line had equal probability of being selected. The code looks something like this:

string selectedLine = null;
int numLines = 0;
Random rnd = new Random();
foreach (var line in File.ReadLines(filename))
{
    ++numLines;
    double prob = 1.0/numLines;
    if (rnd.Next() >= prob)
        selectedLine = line;
}

Now, what if you want to select 2 lines? You select the first two. Then, as each line is read the probability that it will replace one of the two lines is 2/n, where n is the number of lines already read. If you determine that you need to replace a line, you randomly select the line to be replaced. You can follow that same basic idea to select any number of lines at random. For example:

string[] selectedLines = new int[M];
int numLines = 0;
Random rnd = new Random();
foreach (var line in File.ReadLines(filename))
{
    ++numLines;
    if (numLines <= M)
    {
        selectedLines[numLines-1] = line;
    }
    else
    {
        double prob = (double)M/numLines;
        if (rnd.Next() >= prob)
        {
            int ix = rnd.Next(M);
            selectedLines[ix] = line;
        }
    }
}

You can apply that to your BitArray quite easily:

int[] selected = new int[quantity];
int num = 0;  // number of True items seen
Random rnd = new Random();
for (int i = 0; i < items.Length; ++i)
{
    if (items[i])
    {
        ++num;
        if (num <= quantity)
        {
            selected[num-1] = i;
        }
        else
        {
            double prob = (double)quantity/num;
            if (rnd.Next() > prob)
            {
                int ix = rnd.Next(quantity);
                selected[ix] = i;
            }
        }
    }
}

You'll need some special case code at the end to handle the case where there aren't quantity set bits in the array, but you'll need that with any solution.

This makes a single pass over the BitArray, and the only extra memory it uses is for the list of selected indexes. I'd be surprised if it wasn't significantly faster than the LINQ version.

Note that I used the probability calculation to illustrate the math. You can change the inner loop code in the first example to:

if (rnd.Next(numLines+1) == numLines)
{
    selectedLine = line;
}
++numLines;

You can make a similar change to the other examples. That does the same thing as the probability calculation, and should execute a little faster because it eliminates a floating point divide for each item.

Jim Mischel
  • 131,090
  • 20
  • 188
  • 351
2

There are two families of approaches you can use: deterministic and non-deterministic. The first one involves finding all the eligible elements in the collection and then picking N at random; the second involves randomly reaching into the collection until you have found N eligible items.

Since the size of your collection is not negligible at 100K and you only want to pick a few out of those, at first sight non-deterministic sounds like it should be considered because it can give very good results in practice. However, since there is no guarantee that N true values even exist in the collection, going non-deterministic could put your program into an infinite loop (less catastrophically, it could just take a very long time to produce results).

Therefore I am going to suggest going for a deterministic approach, even though you are going to pay for the guarantees you need through the nose with resource usage. In particular, the operation will involve in-place sorting of an auxiliary collection; this will practically undo the nice space savings you got by using BitArray.

Theory aside, let's get to work. The standard way to handle this is:

  1. Filter all eligible indices into an auxiliary collection.
  2. Randomly shuffle the collection with Fisher-Yates (there's a convenient implementation on StackOverflow).
  3. Pick the N first items of the shuffled collection. If there are less than N then your input cannot satisfy your requirements.

Translated into LINQ:

var results = mask
    .Select((i, f) => Tuple.Create)  // project into index/bool pairs
    .Where(t => t.Item2)             // keep only those where bool == true
    .Select(t => t.Item1)            // extract indices
    .ToList()                        // prerequisite for next step
    .Shuffle()                       // Fisher-Yates
    .Take(quantity)                  // pick N
    .ToArray();                      // into an int[]

if (results.Length < quantity)
{
    // not enough true values in input
}
Community
  • 1
  • 1
Jon
  • 428,835
  • 81
  • 738
  • 806
  • Thanks for the detailed answer - this makes a lot of sense :) – Lea Hayes Mar 20 '14 at 22:51
  • There's another approach that you didn't consider. See my answer. – Jim Mischel Mar 23 '14 at 03:29
  • 1
    @JimMischel: I 've come across that approach in the past, but evidently didn't take sufficient notice because it's certainly the very best you can do in this situation. You have both my upvote and my thanks for the reminder. – Jon Mar 24 '14 at 12:11
0

If you have 10 indices to choose from, you could generate a random number from 0 to 2^10 - 1, and use that as you mask.

pkr
  • 1,723
  • 5
  • 25
  • 43
  • I should clarify, the `mask` contains 100,000's of elements and I am only interested in the indices of 10 elements which are `true`. – Lea Hayes Mar 20 '14 at 22:11
  • @LeaHayes: How many of those are expected to be selectable? Are you always going to want just 10? The go-to answer in these cases involves filtering, shuffling the results with Fisher-Yates and picking the first 10 -- but filter/shuffle on 100K items just to pick 10 at random doesn't sound very good. – Jon Mar 20 '14 at 22:14
  • @Jon: The number of required indices from a particular mask is user configurable and ranges between 1 and 361. – Lea Hayes Mar 20 '14 at 22:16
  • @LeaHayes: Expected/minimum number of `true` values in the mask? This is necessarily going to involve a tradeoff, so you want to pick the correct one. – Jon Mar 20 '14 at 22:17
  • @Jon: The mask is essentially a user-drawn bitmap and could contain any value `true` or `false`. – Lea Hayes Mar 20 '14 at 22:18
  • @LeaHayes: Maybe not the most efficient, but for less than 361 values, I would just generate 361 unique random values from 0 to n-1 and just take care to remove duplicates. – pkr Mar 20 '14 at 22:19