1

I am looking for a suitably efficient storage structure for processing data that arrives in random order, but must be processed and/or removed from the stack in a specified order.

To make this clearer:

Each item x has an index i, a time-stamp t and is processed as follows (assuming the storage structure is already populated).

repeat
  1) Process (remove) the item with the smallest time stamp.
  2) Add new items (0 or more).
  3) Remove (0 or more) items (referenced by their index).
until false

It is guaranteed that each item will have unique timestamp and index. But it is impossible to predict how many items will be added in step 2) until step 1) is completed, or how many items will be removed in step 3) until steps 1) and 2) are completed within each loop of the algorithm. The distribution of timestamps for the incoming items cannot be predicted (except that new timestamps will be in the 'future'), and may vary with time - i.e. there could be a bunch of randomly distributed timestamps followed by a bunch with timestamps all greater (or less than) any of the items remaining in the list. There will be a maximum number of elements waiting to be processed at any one time, but this could be fairly large ~10^6-10^8, and I need to process them as fast as possible - note that the actual processing time is fairly small so for large sets of data I expect that the 'scheduling' will dominate the rate that I can process data.

If I add each item to a linked sorted list as it arrives in step 2), then step 1) is O(0), but step 2) is O(n). If I use a binary tree then step 1) is still O(1) and now step 2) is initially O(log n), but if the timestamps are not nicely distributed the tree could become unbalanced very quickly which would slow step 2) down significantly (eventually no better than O(n) if I don't regularly re-balance the tree).

My guess is that something along the lines of a binary tree with regular re-balancing should provide O(log n) if the re-balancing is done at the right intervals, but I assume that this sort of problem is well-solved, so could someone direct me to a suitable reference or give me a little advice to help me avoid re-inventing the wheel.

Penguino
  • 2,136
  • 1
  • 14
  • 21
  • 2
    So .. I stopped reading all that at about "to make this clear", but why not a [Priority Queue](http://en.wikipedia.org/wiki/Priority_queue)? (Look under the implementations.) –  Aug 02 '12 at 03:13
  • What you want is a time-scheduling queue, which is a form of Priority Queue. – RBarryYoung Aug 02 '12 at 03:19

1 Answers1

4

With the data structure heap you can easily do this with all operations in O(logN).

http://en.wikipedia.org/wiki/Heap_(data_structure)

The heap takes the time-stamp as the key. and you will need an additional array(or dict) to store the pair Q=(item.Index, the position of the item in the heap).

As it's a heap, operations 1) and 2) will cost O(logN) for each item.

And for operation 3), you will need to remove random items from a heap. Luckily, it's easy, as mentioned here. Because the item.index is not the actual place it's in the heap, you will need the dictionary Q mentioned above , to look for the position of the item in the heap by its item.Index in O(1) (for hash map), or it will cost O(N) to look for that position. And as the position of an item could change during the operations (including operation 1 and 2), remember to change the value in Q each time an item is moved in the heap.

As someone mentioned about Priority Queue, I'm going to add some more words here.

A Priority Queue is an abstract data type, with some abstract interfaces. And those interfaces don't include "removing random items" in common sense.

A heap is a data structure.

A priority queue could be implemented with a heap. But a heap could support much more operations than a priority queue.

Community
  • 1
  • 1
lavin
  • 2,276
  • 2
  • 13
  • 15