1

I have a method that yields an IEnumerable, but it uses the yield keyword to return elements when executed. I don't always know how big the total collection is. It's kind of similar to the standard Fibonacci example when you go to Try .NET except that it will yield a finite number of elements. That being said, because there's no way of knowing how many elements it will return beforehand, it could keep yielding pretty much forever if there are too many.

When I looked for other questions about this topic here, one of the answers provided a clean LINQ query to randomly sample N elements from a collection. However, the assumption here was that the collection was static. If you go to the Try .NET website and modify the code to use the random sampling implementation from that answer, you will get an infinite loop.

public static void Main()
{
    foreach (var i in Fibonacci().OrderBy(f => Guid.NewGuid()).Take(20))
    {
        Console.WriteLine(i);
    }
}

The query tries to order all the elements in the returned IEnumerable, but to order all the elements it must first calculate all the elements, of which there are an infinite number, which means it will keep going on and on and never return an ordered collection.

So what would be a good strategyfor randomly sampling an IEnumerable with an unknown number of contained elements? Is it even possible?

JansthcirlU
  • 688
  • 5
  • 21
  • So the question is *"How to select N random records from sequence having infinite length"*? Are there any constraints regarding time or memory? – Theodor Zoulias Jan 13 '22 at 08:54
  • @TheodorZoulias Hmm, I hadn't really considered time or memory, I was just wondering if it could be done theoretically or practically. Matthew's answer concerning reservoir sampling seems promising, but with a huge collection it might take too long regardless. – JansthcirlU Jan 13 '22 at 08:57
  • 2
    Theoretically it's easy. You don't have to be Georg Cantor to deduce that sampling an infinite sequence will take infinite time to complete (in other words it will never complete). – Theodor Zoulias Jan 13 '22 at 09:03
  • @TheodorZoulias would you prefer if I rephrased my question to *pseudorandom* sample? – JansthcirlU Jan 13 '22 at 10:14
  • JansthcirlU I don't think that it would make a difference. When we say "random" in a programming context, it is implied that we talk about pseudorandom generators. – Theodor Zoulias Jan 13 '22 at 10:25

1 Answers1

2

If a sequence is infinite in length, you can't select N elements from it in less than infinite time where the chance of each element in the sequence being selected is the same.

However it IS possible to select N items from a sequence of unknown, but finite, length. You can do that using reservoir sampling.

Here's an example implementation:

/// <summary>
/// This uses Reservoir Sampling to select <paramref name="n"/> items from a sequence of items of unknown length.
/// The sequence must contain at least <paramref name="n"/> items.
/// </summary>
/// <typeparam name="T">The type of items in the sequence from which to randomly choose.</typeparam>
/// <param name="items">The sequence of items from which to randomly choose.</param>
/// <param name="n">The number of items to randomly choose from <paramref name="items"/>.</param>
/// <param name="rng">A random number generator.</param>
/// <returns>The randomly chosen items.</returns>

public static T[] RandomlySelectedItems<T>(IEnumerable<T> items, int n, System.Random rng)
{
    // See http://en.wikipedia.org/wiki/Reservoir_sampling for details.

    var result = new T[n];
    int index = 0;
    int count = 0;

    foreach (var item in items)
    {
        if (index < n)
        {
            result[count++] = item;
        }
        else
        {
            int r = rng.Next(0, index + 1);

            if (r < n)
                result[r] = item;
        }

        ++index;
    }

    if (index < n)
        throw new ArgumentException("Input sequence too short");

    return result;
}

This must still iterate over the entire sequence, however, so it does NOT work for an infinitely long sequence.


If you want to support input sequences longer than 2^31, you can use longs in the implementation like so:

public static T[] RandomlySelectedItems<T>(IEnumerable<T> items, int n, System.Random rng)
{
    // See http://en.wikipedia.org/wiki/Reservoir_sampling for details.

    var  result = new T[n];
    long index  = 0;
    int  count  = 0;

    foreach (var item in items)
    {
        if (index < n)
        {
            result[count++] = item;
        }
        else
        {
            long r = rng.NextInt64(0, index + 1);

            if (r < n)
                result[r] = item;
        }

        ++index;
    }

    if (index < n)
        throw new ArgumentException("Input sequence too short");

    return result;
}

Note that this implementation requires .Net 6.0 or higher because of the rng.NextInt64().

Also note that there's no point in making n long because you can't have an array that exceeds ~2^31 elements, so it wouldn't be possible to fill it. You could in theory fix that by returning multiple arrays, but I'll leave that as an exercise for the reader. ;)

Matthew Watson
  • 104,400
  • 10
  • 158
  • 276
  • Awesome stuff, I'll check it out and report back if I've found any success! – JansthcirlU Jan 13 '22 at 08:47
  • Probably better is `checked { ++index; }`, in order to avoid an unintended overflow after the source has emitted `Int32.MaxValue` elements. – Theodor Zoulias Jan 13 '22 at 08:59
  • @TheodorZoulias You could do that, but if you want to handle very long sequences it would be better to make `n`, `r` and `index` all `long` and use `rng.NextInt64()` instead of `rng.Next()`, then it would actually work rather than thow an overflow exception. (It would work as long as `n` is less than the maximum number of elements you can store in an array.) – Matthew Watson Jan 13 '22 at 09:54
  • Matthew if we want to support sequences with arbitrary length, even the `long` type may not cut it eventually. In order to avoid a nonsensical `ArgumentOutOfRangeException` when the `index` overflows (at the `rng.Next` line) we must either do a `checked` increment and get a meaningful runtime error, or switch to a [`BigInteger`](https://stackoverflow.com/questions/17357760/how-can-i-generate-a-random-biginteger-within-a-certain-range) for the `index`. – Theodor Zoulias Jan 13 '22 at 10:18
  • @TheodorZoulias If we exceed 2^63 items, it would take years to process, so I don't think we need to worry about that. :) – Matthew Watson Jan 13 '22 at 11:02
  • If you could process one billion items per second, it would take nearly 300 years to process 2^63 items... – Matthew Watson Jan 13 '22 at 11:09
  • @MatthewWatson regardless of whether or not my method ends up generating 10^2 or 10^19 elements, would be nice to know if the calculation would still work if the time and memory was available to carry it out (which of course it won't be, realistically). – JansthcirlU Jan 13 '22 at 11:20
  • *"it would take nearly 300 years to process 2^63 items"* -- I am patient guy, I can wait. But I will be very upset if in the year 2322 the program crashes with an `ArgumentOutOfRangeException`. – Theodor Zoulias Jan 13 '22 at 11:28
  • 1
    @JansthcirlU The code in my answer will work for more than 10^19 elements if you use `long` for some of the values as I noted above. Completely pointless since you'd never run it to completion, but it wouldn't crash. :) – Matthew Watson Jan 13 '22 at 12:22