
I am looking for a data structure which keeps the top n elements, similar to this question, but with the added requirement of maintaining sort order. Obviously I could just sort at the end but there might be a more efficient way to do it on the fly. There will only be insertions, never removals, and then an iteration through the top n elements at the end.

This question is language-agnostic, but the implementation will be in C#, so an answer that uses native .NET collections is preferable.

EDIT: I should clarify that the sort order only matters at the very end when the top n elements are iterated over. Along the way as there are insertions the sort order is irrelevant as long as the top n elements are retained.

Craig W
  • how about a priority queue? If backed by a heap, it'll be pretty efficient. Not sure if it's a native .NET collection though – alrikai Feb 20 '13 at 00:24
  • If N is relatively small, there are only insertions, and you want to keep them in sorted order, a bubble sort on each insertion isn't that bad. – Hot Licks Feb 20 '13 at 03:38
  • Is it possible to keep track of the items that actually change value when they are updated ? Are they only updated once per iteration (before the final top-N sort) ? – wildplasser Feb 20 '13 at 17:42

6 Answers


You'll have to give us a clue on the order of n and the number of items to be inserted.

I would think the sort order is relevant; how else will you know which elements are part of the top n? Just because you only want the top n at the end of all insertions may be creating a bias for / against certain structures or algorithms.

Why keep any of the items that aren't in the top n? You could use a sorted, bounded collection of size n+1. When adding, insert the new item at the correct location; once the collection reaches size n+1, each insert is followed by a delete of the last item. This way you always have the top n items without growing the data structure with items that will never be retrieved.

In addition, if you keep the value of the bottom element (the last of n, call it b) then you can use that to skip the insert altogether. This limits the number of comparisons to O(1) when new item x is larger than b.
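This scheme can be sketched in C# with a `List<T>` (a minimal sketch; the class and member names are mine, not from the answer):

```csharp
using System;
using System.Collections.Generic;

// Sketch of the idea above: keep at most n items sorted ascending, and use
// the current worst (last) element to reject new items in O(1).
public class BoundedSortedList<T>
{
    private readonly List<T> _items = new List<T>();
    private readonly int _n;
    private readonly IComparer<T> _cmp;

    public BoundedSortedList(int n, IComparer<T> cmp = null)
    {
        _n = n;
        _cmp = cmp ?? Comparer<T>.Default;
    }

    public void Add(T item)
    {
        // O(1) rejection: the list is full and item is no better than the worst.
        if (_items.Count == _n && _cmp.Compare(item, _items[_items.Count - 1]) >= 0)
            return;

        int i = _items.BinarySearch(item, _cmp);
        if (i < 0) i = ~i;                        // bitwise complement gives the insertion point
        _items.Insert(i, item);                   // O(n) shift
        if (_items.Count > _n)
            _items.RemoveAt(_items.Count - 1);    // drop the new worst
    }

    public IReadOnlyList<T> Items => _items;
}
```

Here "top n" means the n smallest values; reverse the comparer for the n largest.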

Kelly S. French
  • *n* will be on the order of 1-100, and typically around 10. On average there will be a few hundred to a few thousand insertions or attempted insertions. – Craig W Feb 20 '13 at 18:02
  • The problem with tracking only the bottom element is worst case complexity, after n insertions, every new insertion will be in the top n, forcing you to get rid of your current bottom and search for the new bottom at a cost of O(n). We don't really save much, then. – AndyG Feb 20 '13 at 19:46
  • Wouldn't it average out to O(n/2) except for the case where the items being inserted are pre-sorted in reverse order? For any items > b we still have O(1). Combining those two populations would require more knowledge about the set of potential insertions. – Kelly S. French Feb 20 '13 at 21:11
  • I liked Kelly's idea of indexing the last element, and if you look at my answer it is simple to do if you are using a T[] array and an index (int) to the last element. Adding a new last element is then O(1). – Moop Feb 20 '13 at 21:29

If you really need to keep them sorted all the time, you have to use a self-balanced binary search tree. But take into account that this (keeping the elements sorted) is not an optimization, but a luxury that has a cost.

A self-balanced binary search tree is slower than an implicit heap by a constant factor.

How do you want to access the sorted elements? If you just want to iterate through the sorted sequence, a normal self-balanced binary search tree will be enough.

If you want to access any element by position in the sorted sequence at any time (another luxury...) then you need to augment the tree. Basically, every node would have an additional field counting the number of nodes in its subtree, including itself.
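For reference, .NET's SortedSet<T> is backed by a self-balancing (red-black) tree, so the non-augmented case can be sketched directly. This is my sketch, not part of the answer, and note that SortedSet<T> silently discards duplicates, so real data may need a tie-breaking comparer:

```csharp
using System;
using System.Collections.Generic;

public static class TopNExample
{
    // Keep the n smallest values in a red-black tree (SortedSet<T>).
    public static SortedSet<int> TopN(IEnumerable<int> source, int n)
    {
        var set = new SortedSet<int>();
        foreach (var v in source)
        {
            set.Add(v);                  // O(log n) tree insert
            if (set.Count > n)
                set.Remove(set.Max);     // evict the current worst
        }
        return set;                      // enumerates in sorted order
    }
}
```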

comocomocomocomo

This answer is similar to Kelly's, but includes a tested code example. Since N is small (< 100), I used a simple insertion sort, switching to a binary-search lookup for the insertion point once the number of items exceeds some (non-optimized) threshold (e.g. 20 items). I have included a sample console app (C#) to show its use. I tested it briefly to make sure it works, but I haven't done a full analysis of it. This structure is optimized to reduce memory usage.

public class TopNStructure<T> : IEnumerable<T> where T : IComparable<T>
{
    private const int SizeForLinearOrBinaryInsert = 20;

    private int _maxSize;
    private int _currentSize;
    private T[] _items;
    private IComparer<T> _comparer;

    /// <summary>
    /// The number of items
    /// </summary>
    public int Count { get { return _currentSize; } }

    public TopNStructure(int maxSize, IComparer<T> comparer)
    {
        if (maxSize <= 0)
        {
            throw new ArgumentOutOfRangeException("maxSize", "Max size must be a positive, non-zero value");
        }
        _maxSize = maxSize;
        _currentSize = 0;
        _items = new T[maxSize];
        _comparer = comparer;
    }

    public TopNStructure(int maxSize)
        : this(maxSize, Comparer<T>.Default) { }

    /// <summary>
    /// Adds an item to this structure
    /// </summary>
    /// <param name="item">The item to add</param>
    /// <returns>True if the item was added, false otherwise</returns>
    public bool Add(T item)
    {
        if (_currentSize == 0)
        {
            _items[0] = item;              
        }
        else if (_currentSize == _maxSize)
        {
            if (_comparer.Compare(_items[_currentSize - 1], item) <= 0)
            {
                return false;
            }
            else
            {
                Insert(item);
                return true;
            }
        }
        else if (_currentSize == 1)
        {   
            if (_comparer.Compare(_items[0], item) <= 0)
            {
                _items[1] = item;
            }
            else
            {
                _items[1] = _items[0];
                _items[0] = item;
            }               
        } 
        else 
        {
            if (_comparer.Compare(_items[_currentSize - 1], item) <= 0)
            {
                _items[_currentSize] = item;
            }
            else
            {
                Insert(item);
            }
        }
        _currentSize++;
        return true;
    }

    /// <summary>
    /// Insert the item into the data structure
    /// </summary>
    /// <param name="item">The item to insert</param>
    private void Insert(T item)
    {
        int start = 0;
        if (_currentSize >= SizeForLinearOrBinaryInsert)
        {
            start = Array.BinarySearch<T>(_items, 0, _currentSize, item, _comparer);
            if (start < 0)
            {
                start = ~start;
            }
            ShiftAndInsert(item, start, _currentSize);                
            return;
        }
        else
        {
            for (int i = start; i < _currentSize; i++)
            {
                if (_comparer.Compare(_items[i], item) > 0)
                {
                    ShiftAndInsert(item, i, _currentSize);                      
                    return;
                }
            }
            _items[_currentSize] = item;
        }                           
    }

    /// <summary>
    /// Shifts items one position to the right and inserts the new item
    /// </summary>
    /// <param name="item">The item to insert</param>
    /// <param name="index">The index to insert the item at</param>
    /// <param name="maxIndex">The index of the last slot to shift into</param>
    private void ShiftAndInsert(T item, int index, int maxIndex)
    {
        if (maxIndex >= _maxSize)
        {
            maxIndex = _maxSize - 1;
        }
        for (int i = maxIndex; i > index; i--)
        {
            _items[i] = _items[i - 1];
        }
        _items[index] = item;
    }


    public IEnumerator<T> GetEnumerator()
    {
        // Only enumerate the items actually added, not the whole backing array
        for (int i = 0; i < _currentSize; i++)
        {
            yield return _items[i];
        }
    }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

static void Main(string[] args)
{
    TopNStructure<double> data = new TopNStructure<double>(25);

    Random rand = new Random(132151);
    for (int i = 0; i < 50; i++)
    {
        double value = rand.NextDouble();
        data.Add(value);
    }

    int j = 0;
    foreach (double value in data)
    {
        Console.WriteLine("{0} {1}", j, value);
        j++;
    }
    Console.ReadKey();
}
Moop

It may be worth it to simply keep everything and sort once at the end, just before you iterate over the top k elements. If k is sufficiently small (say, less than half the total number of elements), this will be cheaper than maintaining a sorted list that you never query along the way.

Look into pre-built partial sorting methods, like the STL's std::partial_sort.

Creating a sorted list of all elements takes O(n log n) for comparison-based sorting, whether it's done on the fly or not. A partial sort will be slightly better.
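.NET has no direct equivalent of std::partial_sort, but the sort-at-the-end approach itself is trivial (a sketch of this answer's idea; the names are mine, and LINQ's OrderBy/Take would work equally well):

```csharp
using System;
using System.Collections.Generic;

public static class SortAtEnd
{
    // Collect everything, sort once at the end, keep the n smallest.
    public static List<int> TopN(List<int> all, int n)
    {
        all.Sort();                                      // O(m log m), m = total items
        return all.GetRange(0, Math.Min(n, all.Count));  // the top n, already sorted
    }
}
```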

AndyG
  • Excuse my ignorance but how do you make sure you are retaining the top *n* elements without keeping them sorted along the way? – Craig W Feb 20 '13 at 17:37
  • @CraigW, you'd keep everything, and then sort at the end and toss off what you don't need. If memory is an issue, then maintaining a balanced binary tree is probably the best way. – AndyG Feb 20 '13 at 17:42
  • If you don't have enough memory for the whole list the balanced binary tree won't help either, however memory efficent it is. Also, keeping all elements causes overhead for all the items not in the top _n_, whether storing them or finally sorting them. – Kelly S. French Feb 20 '13 at 17:47
  • Memory is definitely an issue. There will be thousands of these data structures at a time, so I can only afford to ever keep the top *n*. – Craig W Feb 20 '13 at 17:52
  • @KellyS.French, my intent was that, knowing a priori your n elements, you'd know how big your tree should be. If an insertion causes an element to be pushed outside this size, it's deleted. You can do the same thing with any kind of sorted data structure. – AndyG Feb 20 '13 at 17:53
  • @CraigW: You're making a time-space tradeoff here. If your case is unique enough, you may want to hack together your own data structure. For example, allocate space for 2n elements. When it gets full, re-sort and cut out the lower half. You can guarantee sorts only happen after n insertions, and you're sorting on partially sorted data. This may be quicker than maintaining a sorted data structure the whole time. – AndyG Feb 20 '13 at 18:01
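The 2n-buffer trick from the last comment might look like this (a sketch under my own naming; the buffer is sorted and truncated back to n at most once per n insertions):

```csharp
using System;
using System.Collections.Generic;

public class DoubleBufferTopN
{
    private readonly List<int> _buf;
    private readonly int _n;

    public DoubleBufferTopN(int n)
    {
        _n = n;
        _buf = new List<int>(2 * n);
    }

    public void Add(int item)
    {
        _buf.Add(item);
        if (_buf.Count == 2 * _n)
            Compact();   // at most one sort per n insertions
    }

    // Sort the partially sorted buffer and keep only the n smallest.
    private void Compact()
    {
        _buf.Sort();
        if (_buf.Count > _n)
            _buf.RemoveRange(_n, _buf.Count - _n);
    }

    public List<int> Result()
    {
        Compact();   // buffer may still hold up to 2n - 1 items
        return _buf;
    }
}
```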

Here is a data structure similar to my first one, this time using an internal linked list to improve insertion speed. You lose some speed because you cannot binary-search for the insertion point, but node insertion and removal (of items beyond n) are O(1), which should balance out the lack of binary searching. This structure uses more memory, since each linked-list node carries two additional pointers.

public class TopNStructureLinkedList<T> : IEnumerable<T> where T : IComparable<T>
{
    private int _maxSize;
    private int _currentSize;
    private LinkedList<T> _items;
    private IComparer<T> _comparer;
    private LinkedListNode<T> _largestItemNode;

    /// <summary>
    /// The number of items
    /// </summary>
    public int Count { get { return _currentSize; } }

    public TopNStructureLinkedList(int maxSize, IComparer<T> comparer)
    {
        _maxSize = maxSize;
        _currentSize = 0;
        _items = new LinkedList<T>();
        _comparer = comparer;
        _largestItemNode = null;
    }

    public TopNStructureLinkedList(int maxSize)
        : this(maxSize, Comparer<T>.Default) { }

    /// <summary>
    /// Adds an item to this structure
    /// </summary>
    /// <param name="item">The item to add</param>
    /// <returns>True if the item was added, false otherwise</returns>
    public bool Add(T item)
    {
        if (_currentSize == 0)
        {
            _largestItemNode = _items.AddFirst(item);               
        }
        else if (_currentSize == 1)
        {
            if (_comparer.Compare(_largestItemNode.Value, item) <= 0)
            {
                _largestItemNode = _items.AddAfter(_largestItemNode, item);                   
            }
            else
            {
                _items.AddBefore(_largestItemNode, item);                   
            }
        }
        else if (_currentSize == _maxSize)
        {
            if (_comparer.Compare(_largestItemNode.Value, item) <= 0)
            {
                return false;
            }
            else
            {
                Insert(item);
                _largestItemNode = _items.Last.Previous;
                _items.RemoveLast();
                return true;
            }
        }
        else
        {
            if (_comparer.Compare(_largestItemNode.Value, item) <= 0)
            {
                _largestItemNode = _items.AddAfter(_largestItemNode, item);       
            }
            else
            {
                Insert(item);
            }
        }
        _currentSize++;
        return true;
    }

    /// <summary>
    /// Insert the item into the data structure
    /// </summary>
    /// <param name="item">The item to insert</param>
    private void Insert(T item)
    {
        LinkedListNode<T> node = _largestItemNode.Previous;
        while (node != null)
        {              
            if(_comparer.Compare(node.Value, item) <= 0) {
                _items.AddAfter(node, item);
               return;
            }
            node = node.Previous;               
        }
        _items.AddFirst(item);

    }

    public IEnumerator<T> GetEnumerator()
    {
        return ((IEnumerable<T>)_items).GetEnumerator();
    }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return _items.GetEnumerator();
    }
}
Moop

If memory isn't a problem, it is still better to partially sort the entire array at the end. Even if you use an O(log k) insertion data structure, the best complexity you can achieve that way is O(n log k): n insertions costing O(log k) each.

Using a selection algorithm to find the k top elements of an array gives you a complexity of O(n + k log n): building the heap is linear and each of the k pops costs O(log n). Since k is less than n, that's better.

There is an implementation at the Wikipedia article for quickselect. Also, using a pure priority queue (in a different way than most people mentioned here) is easier. At a glance:

create_heap(array) // O(n)
for(int i=0; i<k; i++)
    sorted[i] = heap_pop(array) //O(log n)
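In modern .NET (6 and later, so well after this answer was written), PriorityQueue<TElement, TPriority> makes the same heapify-then-pop approach concrete (a sketch; the collection-taking constructor heapifies the input):

```csharp
using System;
using System.Collections.Generic;

public static class HeapTopN
{
    // Heapify all m items, then pop the k smallest in sorted order.
    public static List<int> TopK(int[] array, int k)
    {
        // Construct the queue from (element, priority) pairs; this heapifies.
        var heap = new PriorityQueue<int, int>(
            Array.ConvertAll(array, v => (v, v)));
        var sorted = new List<int>(k);
        for (int i = 0; i < k && heap.Count > 0; i++)
            sorted.Add(heap.Dequeue());   // O(log m) per pop
        return sorted;
    }
}
```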
Juan Lopes