5

My code processes a huge number of values and I'm looking for an efficient structure to keep track of the top (N) values, where N is less than 10, so collecting ALL numbers then sorting the list and taking the first (N) is probably not the most efficient way.

To do that, I'm building a collection of fixed size N, to keep the top (N) values sorted in descending order. The Add(T value) method of the sorted collection would add the value to the collection if value is higher than any of the existing values (in which case the last element is removed) or if the collection is not full.

I was able to implement what I wanted using a doubly LinkedList<T> since it has fast insertion and removal, but I was wondering if using SortedDictionary<TKey, TValue> or a priority queue would be better ?

Thank you.

alhazen
  • 1,907
  • 3
  • 22
  • 43
  • possible duplicate of [priority queue with limited space: looking for a good algorithm](http://stackoverflow.com/questions/2933758/priority-queue-with-limited-space-looking-for-a-good-algorithm) –  Aug 20 '10 at 16:05
  • 3
    If you're talking about N<10, the performance differences between data structures are likely negligible. I would choose whatever is simplest to implement and easiest to understand. – Dan Bryant Aug 20 '10 at 16:09
  • Please stick with your initial implementation. For this problem, a LinkedList & binary replacement would be the best! – instcode Aug 20 '10 at 17:49

8 Answers8

7

I would simply use a heap with a limited depth. I do not know whether there already exists a library for that, but it should be easy to implement.

Svante
  • 50,694
  • 11
  • 78
  • 122
  • 2
    if not, you could also sort the heap in the opposite order and remove the smallest item whenever the size is greater than N after an insert. – mihi Aug 20 '10 at 16:03
  • 1
    @Svante How do you implement a heap with a limited depth? After the size is greater than N, you have to evict the smallest element. Finding the smallest element in a max heap takes O(N) time. The only thing I can think of is it to have two heaps: a max heap and a min heap. Insert values into both heaps until you have N values. When you want to insert another value, pop the min value from the min heap. Find that value in the max heap (typically takes O(N) time, but you can maintain a hash map of value=>index for a heap) and remove it. Add the new value to both heaps. Is there another way? – John Kurlak Aug 03 '14 at 21:41
  • @Svante Hmm, I guess we could use a min-max heap :) – John Kurlak Aug 03 '14 at 21:51
4

The main advantage to use a SortedDictionary or SortedList it is that you can skip the sorting intelligence because they handle it for you( e.g. You just have to remove the (n + 1)th element every time you add a value). But on the other hands adopt that sort of complex structure for 10 elements resembles to use a nuke to kill a fly...

Maybe the linked list is a good way, and also a simple linear comparison for inserting values in order is not so slower than binary search (we still speak about max 10 comparisons against ~3, current CPUs not event feel the difference).

EDIT:

fixed arrays can be used to build prioriry queues with binary heaps, that probably is the right way to implement this

digEmAll
  • 56,430
  • 9
  • 115
  • 140
  • Linked list is suboptimal because insertion is O(N) operation. – olegz Aug 20 '10 at 16:38
  • 1
    To be precise, insertion is O(1) because doesn't require shifting; to find where to insert requires O(n) (at most). I mentioned it when I said "linear comparison". Anyway I've just edited it, I've changed "best way" with "good way" ;) – digEmAll Aug 20 '10 at 16:51
  • 1
    Linked list uses one chunk of memory per item, and one new chunk need to be reserved for each new item; this is somewhat more expensive than an array, that can be reserved once for all (since its maximal size is known). While garbage collection works well, it is probably even better to avoid it. – user192472 Aug 20 '10 at 16:57
  • @fred-hh: yes you're right. In fact probably the best way is a fixed array as a binary heap. Anyway,I think that with at most 10 elements we'll never notice the difference... – digEmAll Aug 20 '10 at 17:02
3

For such a small number, just keep an array. Scan the array keeping track of the smallest value and its position. If your new number is larger than the smallest on in the set, replace it. You should of course scan for the lowest value once after you insert a number, then just compare new numbers to that and only take action if you have something larger (replace and rescan).

phkahler
  • 5,687
  • 1
  • 23
  • 31
  • 1
    Although the set is quite small (10 elements only) but this sounds very inefficient to me. What if the new element is bigger than the biggest one in the list? You will replace the smallest with it, and the list becomes unsorted. Now you have to use a linear search through all the elements to find the smallest in the next insertions :-s – instcode Aug 20 '10 at 17:46
  • Right, the list is UNsorted. You only scan it after you swap the new value. You store the lowest value and it's position. Checking subsequent values is just a comparison with that stored lowest one. If you've got a larger one, then swap and rescan. – phkahler Aug 21 '10 at 14:04
3

The performance may really change.

For N < 10 any overly complex data structure will likely drag performance significantly (though perhaps not catastrophically) so I'd use an array to store the items.

Then there are 3 main possibilities on how to arrange the items in the array:

  1. sorted is probably the best choice to keep things simple:
    • constant time to determine whether to insert a new item (compare with lowest)
    • O(N) time to insert - but this only happens for items that are in the N best-so-far. And if your input is sufficiently random, the average time will be even lower because most insertions will only move a few elements at the bottom of the top.
  2. unsorted:
    • O(N) time for each input element, that's too much compared to "sorted"
  3. binary heap that implements a priority queue: more complex to implement but maybe even faster than "sorted"
    • constant time to determine whether to insert a new item (compare with lowest)
    • O(log N) time to insert - and this only happens for items that are in the N best-so-far
user192472
  • 715
  • 7
  • 15
2

Unless you have a solid reason to do otherwise, I'd use a priority queue.

There is one trick that can simplify the logic quite a bit. Most people's first idea is to look at each incoming item, and insert it into the collection iff the collection contains fewer items than desired, or the new item is larger than the smallest item currently in the collection.

You can simplify things quite a bit if you leave room for one extra item in the collection. Always insert each incoming item into the collection, and then if the collection is too large, remove the smallest item.

While a priority queue is arguably overkill for only 10 items, it keeps the logic simple, and is efficient both in terms of space and time, so if you ever need N=10000 (or whatever) it'll still work nicely.

Jerry Coffin
  • 476,176
  • 80
  • 629
  • 1,111
  • 1
    I would avoid "always insert each incoming item" if speed is really important, because insertion is not (that) cheap (O(log N)) and, anyway, after the N first incoming items, the necessary comparison with the top item in the priority queue (= worst of the N best) suffices. – user192472 Aug 20 '10 at 17:06
1

Edit:

If only the first N values are needed and the others are not of any interest, a plain old array will get the work done cheaply.

Keep it sorted and test against the biggest. And only if it needs to be stored, insert it correctly and shift the remaining elements. With small sizes this is a cheap operation, and my guess is it won't be done often.

Eiko
  • 25,601
  • 15
  • 56
  • 71
1

If you have a fix size of 10, why not simply use a sorted array of length 10 and binary search? But I am not sure if at this size, binary search is not a huge win over a dumb search along the array due to some overhead.

Frank
  • 2,628
  • 15
  • 14
  • In this case, the search would be fast, but inserting an element is expensive because it requires copying the array into a new one. – alhazen Aug 20 '10 at 16:01
  • No, as I understood the size should always be limited to 10 items, you would shift off the last one. You only would have to move all items after the position where you insert one further. And you completely avoid the object instantiation of a linked list insertion. – Frank Aug 20 '10 at 16:06
0

Use binary insertion sort on a raw array, pushing the smallest value off the end. This is routinely the fastest method used to maintain small sorted arrays and, for example, is generally used as a special case for various sorting algorithms (e.g. MergeSort).

Rex Kerr
  • 166,841
  • 26
  • 322
  • 407