1

I have a list with over 5,000,000 items in it. I want to take a random 2,000 items from the list. Is there any way to do this in an efficient manner without shuffling the entire thing first?

I have tried using he PickRandom per below but it shuffles the entire list first, which is taking way too much time and I feel is unnecessary.

public static class EnumerableExtension
{
    public static T PickRandom<T>(this IEnumerable<T> source)
    {
        return source.PickRandom(1).Single();
    }

    public static IEnumerable<T> PickRandom<T>(this IEnumerable<T> source, int count)
    {
        return source.Shuffle().Take(count);
    }

    public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source)
    {
        return source.OrderBy(x => System.Guid.NewGuid());
    }
}

Any help would be appreciated.

seemedo
  • 37
  • 2
  • 1
    That is not a proper shuffle nor an ideal way to randomize. An actual Fisher-Yates shuffle is likely to be faster, but the 5 million items is likely the issue. – Ňɏssa Pøngjǣrdenlarp Sep 15 '19 at 20:53
  • Duplicates allowed in returned result or not? – dropoutcoder Sep 15 '19 at 20:57
  • 1
    Back to basics. Shuffle() works on the entire collection but you only need to shuffle the first 2000 items. So re-implement Fisher-Yates and quit early. – Hans Passant Sep 15 '19 at 21:31
  • Apparently the "correct" way to do this is to use Vitter's "Method D", which samples in `O(k)` time where `k` is the number of items to be sampled. The actual algorithm is quite complex, but there is an implementation of it presented here: https://getkerf.wordpress.com/2016/03/30/the-best-algorithm-no-one-knows-about/ – Iridium Sep 15 '19 at 21:37

5 Answers5

1

Generate 2,000 random integers between 0 and 499,999 inclusive, and use them as indices into your large array.

You will need to add a bit of deduplication logic to ensure you didn't accidentally choose the same number twice. In this example I use LINQ-fu to create a distinct list and take the first 2,000.

var random = new Random();
int[] indices;
while (true)
{
    indices = Enumerable.Range(0,3000)
        .Select
        (
            i => random.Next(0, myLargeArray.GetUpperBound(0)) 
        )
        .Distinct()
        .Take(2000)
        .ToArray();
    if (indices.Length == 2000) break;
}
var randomItems = indices
    .Select
    (
        i => myLargeArray[i]
    )
    .ToList();
John Wu
  • 50,556
  • 8
  • 44
  • 80
  • 1
    this can fail by chance..... you are relying that if you generate 3000, you will get 2000 distinct items.... – Keith Nicholas Sep 15 '19 at 21:55
  • @KeithNicholas no it can not. The fact he does the `.Distinct()` before he does the `.Take()` makes sure of that. To John, I updated your answer to match the description in your post in how many numbers to generate. Feel free to roll back if you wish. – Scott Chamberlain Sep 15 '19 at 22:17
  • You can solve this be writing a method which yields a new random number from inside an infinite loop, and using that as the soure. – canton7 Sep 15 '19 at 22:18
  • @ScottChamberlain he changed it – Keith Nicholas Sep 15 '19 at 22:22
  • @ScottChamberlain however, if he randoms 0 every single time (however unlikely), it still won't work – Keith Nicholas Sep 15 '19 at 22:23
  • Yes, that is why you need to put a check that the chance of you constantly getting the same option over and over is low (See my answer, I have that check in there as a ArgumentOutOfRangeException. – Scott Chamberlain Sep 15 '19 at 22:31
  • In my original answer, 3,000 non-unique integers were chosen in order to create a list of 2,000 unique integers. With a domain of 500,000 possible values, the probability of this coming up short are about (1/250)^1000, or so small you can't compute it in Excel. But to satisfy everyone's concerns, I've added logic to my example to deal with the edge case. Cheers all. – John Wu Sep 16 '19 at 02:58
  • Another simple solution would just be `Enumerable.Range(0, int.MaxValue)` – canton7 Sep 16 '19 at 06:31
0

You can use Random class to generate indexes for range random pick and skip value for single random element.

    public static T PickRandom<T>(this IEnumerable<T> source) {
        var indexLessThan = source.Count() - 1;

        var skip = _random.Next(indexLessThan);

        return skip == 0 ? source.First() : source.Skip(skip).First();
    }

    public static IEnumerable<T> PickRandom<T>(this IEnumerable<T> source, int count) {
        var indexLessThan = source.Count();

        var usedIndexes = new HashSet<int>(count);


        do {
            if(usedIndexes.Add(_random.Next(indexLessThan))) {
                count--;
            }
        } while (count > 0);

        return source.Where((x, i) => usedIndexes.Contains(i));
    }

Full working console app with execution times.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace ConsoleApp6 {
    class Program {
        static void Main(string[] args) {
            var enumerable = Enumerable.Range(0, 5_000_000);

            var sw = new Stopwatch();
            sw.Start();

            var oldSingle = enumerable.PickRandomOld();

            Console.WriteLine($"Old single took: {sw.Elapsed.ToString()}");
            sw.Restart();

            var oldMultiple = enumerable.PickRandomOld(2000).ToList();

            Console.WriteLine($"Old multiple took: {sw.Elapsed.ToString()}");
            sw.Restart();

            var orderBySingle = enumerable.PickRandomOrderBy(2000).ToList();

            Console.WriteLine($"OrderBy took: {sw.Elapsed.ToString()}");
            sw.Restart();

            var newSingle = enumerable.PickRandom();

            Console.WriteLine($"New single took: {sw.Elapsed.ToString()}");
            sw.Restart();

            var newMultiple = enumerable.PickRandom(2000).ToList();

            Console.WriteLine($"New multiple took: {sw.Elapsed.ToString()}");
            sw.Stop();

            Console.ReadLine();
        }
    }

    public static class EnumerableExtension {

        private static Random _random = new Random(DateTime.UtcNow.Millisecond);

        public static T PickRandomOld<T>(this IEnumerable<T> source) {
            return source.PickRandomOld(1).Single();
        }

        public static IEnumerable<T> PickRandomOld<T>(this IEnumerable<T> source, int count) {
            return source.Shuffle().Take(count);
        }

        public static IEnumerable<T> PickRandomOrderBy<T>(this IEnumerable<T> source, int count) {
             return source.OrderBy(n => _random.Next()).Take(count);
        }

        private static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source) {
            return source.OrderBy(x => System.Guid.NewGuid());
        }

        public static T PickRandom<T>(this IEnumerable<T> source) {
            var indexLessThan = source.Count() - 1;

            var skip = _random.Next(indexLessThan);

            return skip == 0 ? source.First() : source.Skip(skip).First();
        }

        public static IEnumerable<T> PickRandom<T>(this IEnumerable<T> source, int count) {
            var indexLessThan = source.Count();

            var usedIndexes = new HashSet<int>(count);


            do {
                if(usedIndexes.Add(_random.Next(indexLessThan))) {
                    count--;
                }
            } while (count > 0);

            return source.Where((x, i) => usedIndexes.Contains(i));
        }
    }
}

Execution times:

Old single took: 00:00:01.5138662
Old multiple took: 00:00:01.0036289
OrderBy took: 00:00:00.2382231
New single took: 00:00:00.0022536
New multiple took: 00:00:00.1563212

Hope it helps.

dropoutcoder
  • 2,627
  • 2
  • 14
  • 32
0

Shuffling with GUID isn't a good shuffle and is slower than using random.

Also you are shuffling every single time rather than just shuffling once and taking as many as you need.

    sw.Start();
    var r = new Random();
    var shuffled = items.OrderBy(n => r.Next()).Take(2000).ToList();
    sw.Stop();
    Console.WriteLine($"{sw.Elapsed}");
Keith Nicholas
  • 43,549
  • 15
  • 93
  • 156
0

UPDATE: I fixed the function so if you do a foreach on the same enumerable multiple times it gives you the same output each time.

Because you are only taking a small percentage of the total amount (0.04%) of the total you could get away taking randomly from the list then checking if the one you received was already selected, if it was then retry selecting randomly again.

public static class CollectionExtension
{
    public const double DEFAULT_SAFETY_MARGIN = 0.01;
    private static readonly Random _defaultRandomSrc = new Random();

    /// <summary>
    /// Picks random items from a list, will only allow you to take up to the safety margin of items.
    /// </summary>
    /// <param name="source">The source list.</param>
    /// <param name="count">The number of items to take. This must be less than or equal to
    /// <see cref="DEFAULT_SAFETY_MARGIN"/> * <paramref name="count"/>.</param>
    /// <returns>A <see cref="IEnumerable{T}"/> that will contain <paramref name="count"/> items.</returns>
    /// <exception cref="ArgumentOutOfRangeException">Thrown when <paramref name="count"/> is more
    /// than <see cref="DEFAULT_SAFETY_MARGIN"/> * <paramref name="count"/>.</exception>
    public static IEnumerable<T> PickFewRandom<T>(this IList<T> source, int count)
    {
        if ((double)count / source.Count > DEFAULT_SAFETY_MARGIN)
            throw new ArgumentOutOfRangeException(nameof(count), count,
                $"The number of items you are taking must be less than {source.Count * DEFAULT_SAFETY_MARGIN}");

        //Random is not thread safe so when we generate the seed from the
        //static random instance we must do a lock.
        int randomSeed;
        lock (_defaultRandomSrc)
        {
            randomSeed = _defaultRandomSrc.Next();
        }


        // We use a separate method here because if the "yield return" was in this method the
        // ArgumentOutOfRangeException would not get thrown till the user tried to enumerate
        // the result instead of when calling the function.
        return PickFewRandomInternal(source, count, randomSeed);
    }

    private static IEnumerable<T> PickFewRandomInternal<T>(IList<T> source, int count, int randomSeed)
    {
        //We pass in randomSeed so if you run foreach on the returned IEnumerable multiple times it will return the same order each time.
        var random = new Random(randomSeed);
        var pickedItems = new HashSet<int>(count);
        for (int i = 0; i < count; i++)
        {
            int index;
            // This looping is only safe to do because there is a low probability of duplicates,
            // if we where taking more items then this function could take a long time to run.

            do
            {
                index = random.Next(count);
            } while (!pickedItems.Add(index));

            yield return source[index];
        }
    }
}
Scott Chamberlain
  • 124,994
  • 33
  • 282
  • 431
-1

You could run through the array once and code something like

// This is C++ style I am guessing it should be similar in C#

for(int i = 0; i<baseListSize; ++i){
   if(random_in_range(0,1)<=0.0004){
     derivedList.append(baseList[i]);
   }
}

Which would run through the base array once and randomly pick 1/2500 of the elements on average to include in the derived array. This would only give you approximately 1/2500, though.

Luis Sosa
  • 140
  • 8