2

I have a binary file which contains more than 100 millions of objects and I read the file using BinaryReader and return (Yield) the object (File reader and IEnumerable implementation is here: Performance comparison of IEnumerable and raising event for each item in source? )

One of object's properties indicates the object rank (like A5). Assume that I want to get sorted top n objects based on the property.

I saw the code for OrderBy function: it uses QuickSort algorithm. I tried to sort the IEnumerable result with OrderBy and Take(n) function together, but I got OutOfMemory exception, because OrderBy function creates an array with size of total objects count to implement Quicksort.

Actually, the total memory I need is n so there is no need to create a big array. For instance, if I get Take(1000) it will return only 1000 objects and it doesn't depend on the total count of whole objects.

How can I get the result of OrderBy function with Take function? In another word, I need a limited or blocked sorted list with the capacity which is defined by end-user.

Community
  • 1
  • 1
Amir Pournasserian
  • 1,600
  • 5
  • 22
  • 46

2 Answers2

1

If you want top N from ordered source with default LINQ operators, then only option is loading all items into memory, sorting them and selecting first N results:

items.Sort(condition).Take(N) // Out of memory

If you want to sort only top N items, then simply take items first, and sort them:

items.Take(N).Sort(condition)

UPDATE you can use buffer for keeping N max ordered items:

public static IEnumerable<T> TakeOrdered<T, TKey>(
    this IEnumerable<T> source, int count, Func<T, TKey> keySelector)
{
    Comparer<T, TKey> comparer = new Comparer<T,TKey>(keySelector);
    List<T> buffer = new List<T>();
    using (var iterator = source.GetEnumerator())
    {
        while (iterator.MoveNext())
        {
            T current = iterator.Current;
            if (buffer.Count == count)
            {
                // check if current item is less than minimal buffered item
                if (comparer.Compare(current, buffer[0]) <= 0)
                    continue;

                buffer.Remove(buffer[0]); // remove minimual item
            }
            // find index of current item
            int index = buffer.BinarySearch(current, comparer);
            buffer.Insert(index >= 0 ? index : ~index, current);
        }
    }

    return buffer;
}

This solution also uses custom comparer for items (to compare them by keys):

public class Comparer<T, TKey> : IComparer<T>
{
    private readonly Func<T, TKey> _keySelector;
    private readonly Comparer<TKey> _comparer = Comparer<TKey>.Default;

    public Comparer(Func<T, TKey> keySelector)
    {
        _keySelector = keySelector;
    }

    public int Compare(T x, T y)
    {
        return _comparer.Compare(_keySelector(x), _keySelector(y));
    }
}

Sample usage:

string[] items = { "b", "ab", "a", "abcd", "abc", "bcde", "b", "abc", "d" };
var top5byLength = items.TakeOrdered(5, s => s.Length);
var top3byValue = items.TakeOrdered(3, s => s);
Sergey Berezovskiy
  • 232,247
  • 41
  • 429
  • 459
  • I don't want to sort only first N objects, I want to get top N from sorted results. – Amir Pournasserian Nov 08 '13 at 12:25
  • @AmirPournasserian then you need to load all items and sort them, how else you will get sorted sequence without loading all items? – Sergey Berezovskiy Nov 08 '13 at 12:26
  • @AmirPournasserian hold on.. I have an idea – Sergey Berezovskiy Nov 08 '13 at 12:27
  • What will be the O value? As I know for BinarySearch the O is N*Log(N), it seems that in the worst case yours will be N*N*Log(N) – Amir Pournasserian Nov 13 '13 at 11:50
  • Sorry, if the total count of all objects is M and the result count is N, then in the worse case it will be M*N*Log(N) which is so too slow. – Amir Pournasserian Nov 16 '13 at 05:44
  • @AmirPournasserian that is much better than `M*M` I think (if you will sort whole sequence). And `M*N*log(N)` is the worst case - when you have all items sorted in ascending order. With unordered sequence you you will skip binary search for all items which are less than first item in buffer. In perfect world it will be `M+N*log(N)`. Also it stores only N items in memory instead of M items. – Sergey Berezovskiy Nov 19 '13 at 17:25
1

LINQ does not have a built-in class that lets you take the top n elements without loading the whole collection into memory, but you can definitely build it yourself.

One simple approach would be using a SortedDictionary of lists: keep adding elements to it until you hit the limit of n. After that, check each element that you are about to add with the smallest element that you have found so far (i.e. dict.Keys.First()). If the new element is smaller, discard it; otherwise, remove the smallest element, and add a new one.

At the end of the loop your sorted dictionary will have at most n elements, and they would be sorted according to the comparator that you set on the dictionary.

Sergey Kalinichenko
  • 714,442
  • 84
  • 1,110
  • 1,523
  • How about the duplicate ranks? If the IComparer(of T) returns 0 the objects rank will be the same, then the same key will be added to the SortedDictionary (Exception will be raised) or will be omitted (result is not correct). I can use SortedDictionary(of integer, List(Of MyModel)) to resolve the issue, but the performance is extremely low. I'm looking for another solution to get better performance (if exists). – Amir Pournasserian Nov 08 '13 at 12:41