3

I found this post:

Efficiently selecting a set of random elements from a linked list

But this means that in order to approach true randomness in the sample I have to iterate over all elements, throw them in memory with a random number, and then sort. I have a very large set of items here (millions) - is there a more efficient approach to this problem?

Community
  • 1
  • 1
RobVious
  • 12,685
  • 25
  • 99
  • 181
  • 1
    Using a c# list is different than using a linked list as described in the link. – Dan Teesdale Jul 20 '13 at 15:19
  • 3
    We need more information. Do you already have the elements in memory, in a mutable collection? Do you also need to preserve the original order? – Jon Skeet Jul 20 '13 at 15:19
  • Depending on the already requested information, your best approach may be to do a partial [Fisher-Yates Shuffle](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle), stopping after `k` iterations. – Patricia Shanahan Jul 20 '13 at 15:27
  • Are you looking for elements at unique indexes? Are you looking for unique element values? Do you need elements sorted by value or sorted by their indexes? – Olivier Jacot-Descombes Jul 20 '13 at 15:28
  • @JonSkeet - I'm working with LINQ so I think that means they can be in memory if needed. They are in a Queryable DbSet. I do not need to preserve the original order - I would prefer not to, in the name of randomness. – RobVious Jul 20 '13 at 15:29
  • @OlivierJacot-Descombes - I need any random unique selection of k elements - no duplicates, and no sorting. – RobVious Jul 20 '13 at 15:30
  • @DanTeesdale - right - I'm dealing with a C# DbSet currently in List form but can be any LINQ-friendly collection type – RobVious Jul 20 '13 at 15:32
  • You don't have to load them all into a collection. See [Random selection from large groups](http://www.informit.com/guides/content.aspx?g=dotnet&seqNum=796). – Jim Mischel Jul 20 '13 at 16:25
  • @PatriciaShanahan: I read your comment here after writing my answer. Great mind think alike ;) – Jon Skeet Jul 20 '13 at 16:53

3 Answers3

11

I would suggest simply shuffling elements as if you were writing a modified Fisher-Yates shuffle, but only bother shuffling the first k elements. For example:

public static void PartialShuffle<T>(IList<T> source, int count, Random random)
{
    for (int i = 0; i < count; i++)
    {
        // Pick a random element out of the remaining elements,
        // and swap it into place.
        int index = i + random.Next(source.Count - i);
        T tmp = source[index];
        source[index] = source[i];
        source[i] = tmp;
    }
}

After calling this method, the first count elements will be randomly picked elements from the original list.

Note that I've specified the Random as a parameter, so that you can use the same one repeatedly. Be careful about threading though - see my article on randomness for more information.

Dave
  • 10,748
  • 3
  • 43
  • 54
Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
  • Is there a way to adapt this algorithm for a list with weighted probability? – Iravanchi Nov 01 '14 at 07:32
  • @Iravanchi: You'd want to sum all the weights, so that when you pick a random number you can work out which value that corresponds to... but then when swapping the values, you'd need to swap the weights too. I think it would be hard to do very efficiently (you couldn't easily just use a cumulative list of weights, which would make it easy to find the corresponding value - updating the cumulative list would be a pain) but the rest would be okay. – Jon Skeet Nov 01 '14 at 08:10
3

Take a look at this extension method http://extensionmethod.net/csharp/ienumerable-t/shuffle. You could add Skip() Take() type to page the values out the final list.

Code Uniquely
  • 6,356
  • 4
  • 30
  • 40
3

If the elements can be in memory, put them in memory first

List<Element> elements = dbContext.Select<Element>();

Now you know the number of elements. Create a set of unique indexes.

var random = new Random();
var indexes = new HashSet<int>();
while (indexes.Count < k) {
    indexes.Add(random.Next(elements.Count));
}

Now you can read the elements from the list

var randomElements = indexes.Select(i => elements[i]);

I assume that the DB contains unique elements. If this is not the case, you will have to create a HashSet<Elements> instead or to append .Distinct() when querying from the DB.


UPDATE

As Patricia Shanahan says, this method will work well if k is small compared to n. If it is not the case, I suggest selecting a set n - k indexes to be excluded

var random = new Random();
var indexes = new HashSet<int>();
IEnumerable<Element> randomElements;

if (k <= elements.Count / 2) {
    while (indexes.Count < k) {
        indexes.Add(random.Next(elements.Count));
    }
    randomElements = indexes.Select(i => elements[i]);
} else {
    while (indexes.Count < elements.Count - k) {
        indexes.Add(random.Next(elements.Count));
    }
    randomElements = elements
        .Select((e,i) => indexes.Contains(i) ? null : elements[i])
        .Where(e => e != null);
}
Olivier Jacot-Descombes
  • 104,806
  • 13
  • 138
  • 188