
I have a collection of items (big rationals) that I'll be processing. In each case, processing will consist of removing the smallest item in the collection, doing some work, and then adding 0-2 new items (which will always be larger than the removed item). The collection will be initialised with one item, and work will continue until it is empty. I'm not sure what size the collection is likely to reach, but I'd expect in the range 1M-100M items. I will not need to locate any item other than the smallest.

I'm currently planning to use a red-black tree, possibly tweaked to keep a pointer to the smallest item. However I've never used one before, and I'm unsure whether my pattern of use fits its characteristics well.

1) Is there a danger that the pattern of deletion from the left plus random insertion will affect performance, e.g. by requiring a significantly higher number of rotations than random deletion would? Or will delete and insert operations still be O(log n) with this pattern of use?

2) Would some other data structure give me better performance, either because of the deletion pattern or taking advantage of the fact I only ever need to find the smallest item?

Update: glad I asked, the binary heap is clearly a better solution for this case, and as promised turned out to be very easy to implement.
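
A minimal sketch of the pattern, assuming Python's `heapq` module and `fractions.Fraction` for the big rationals (the `process` function here is a hypothetical placeholder for the actual work):

```python
import heapq
from fractions import Fraction

def process(item):
    """Hypothetical stand-in for the real work: returns 0-2 new items,
    each strictly larger than the item that was removed."""
    if item > Fraction(4):
        return []                          # no new items: the heap shrinks
    return [item + 1, item + Fraction(1, 2)]

heap = [Fraction(1)]                       # collection initialised with one item
while heap:
    smallest = heapq.heappop(heap)         # remove the minimum, O(log n)
    for new_item in process(smallest):     # 0-2 new items, always larger
        heapq.heappush(heap, new_item)     # insert, O(log n)
```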

Hugo

Hugo van der Sanden
  • Unless you know for certain that nodes which should be logically deleted are not going to be needed by newly computed values, you might want to ignore or delay deletions. A Halt & Sweep approach should work for the latter, where the roots of sub-trees that have gotten too messy are visited by the rebalancing code and rebalanced en masse. This prevents gross degeneration, while still offering the likely prospect of delete-less performance. –  Feb 09 '13 at 22:51

3 Answers


A binary heap is a lot better for what you want. It is easier to implement and faster since you only care about locating the smallest element and insertions. Locating the smallest element is O(1), removing it is O(log N), and an insertion is also O(log N).
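To the "easier to implement" point: an array-backed binary min-heap is only a few dozen lines. A minimal Python sketch (illustrative, not production code):

```python
class MinHeap:
    """Array-backed binary min-heap: peek is O(1), push/pop are O(log n)."""

    def __init__(self):
        self._a = []

    def __len__(self):
        return len(self._a)

    def peek(self):
        return self._a[0]                  # smallest element, O(1)

    def push(self, item):
        a = self._a
        a.append(item)
        i = len(a) - 1
        while i > 0:                       # sift up until the parent is smaller
            parent = (i - 1) // 2
            if a[parent] <= a[i]:
                break
            a[parent], a[i] = a[i], a[parent]
            i = parent

    def pop(self):
        a = self._a
        smallest = a[0]
        last = a.pop()
        if a:                              # move the last leaf to the root,
            a[0] = last                    # then sift down toward the smaller child
            i, n = 0, len(a)
            while True:
                left, right = 2 * i + 1, 2 * i + 2
                m = i
                if left < n and a[left] < a[m]:
                    m = left
                if right < n and a[right] < a[m]:
                    m = right
                if m == i:
                    break
                a[i], a[m] = a[m], a[i]
                i = m
        return smallest
```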

IVlad
  • Actually, if he knows he's always inserting a larger item than the removed one, a binary heap (treap) will end up being very unbalanced at some point. 100M records is a lot, so this can get unbalanced enough that it's no longer O(log(n)), but rather O(n) - for example, if the height of the treap ended up being 160k when n=100M, then that's O(n/((lg n)^2)) – Etai Jun 12 '13 at 16:33
  • @Etai - a binary heap is always `O(log N)` for the operations I have mentioned. I don't know why you mentioned treaps, my answer refers to binary heaps, not treaps. Heaps do indeed play a role in the structure of treaps, but the two are different data structures. – IVlad Jun 12 '13 at 16:42
  • Binary heap insertion is `O(1)` average (worst case for Brodal), and that is the major reason to use it over BST: http://stackoverflow.com/a/29548834/895245 – Ciro Santilli OurBigBook.com Jun 20 '15 at 06:24

A heap will give you O(1) access to the smallest item, O(log n) removal, and O(log n) insertion, and is much easier to implement than a red-black tree.

BlueRaja - Danny Pflughoeft

It's good to know how to build the more complicated data structures when you have to. Generally, though, your best bet is to start off as simple as you can, and only reach for something more complex when it turns out you need it.

The only time I ever implemented a self-balancing tree was when I happened to know that my tree was going to be really big (over 10,000 elements) and the data was going to come in sorted spurts. That meant that if I'd used a normal binary tree, I would have ended up with what was essentially a linked list.
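
A quick way to see that degeneration concretely (a Python sketch using a hypothetical list-based node layout; the printed heights are what you'd typically observe, not exact guarantees):

```python
import random

def insert(root, key):
    """Insert into a plain, unbalanced BST stored as [key, left, right] lists."""
    if root is None:
        return [key, None, None]
    node = root
    while True:
        i = 1 if key < node[0] else 2      # 1 = left child, 2 = right child
        if node[i] is None:
            node[i] = [key, None, None]
            return root
        node = node[i]

def height(root):
    """Iterative height, so a degenerate tree doesn't blow the call stack."""
    h, stack = 0, [(root, 1)] if root else []
    while stack:
        node, d = stack.pop()
        h = max(h, d)
        for child in (node[1], node[2]):
            if child is not None:
                stack.append((child, d + 1))
    return h

n = 2000
sorted_root = None
for k in range(n):                         # sorted input: every key goes right
    sorted_root = insert(sorted_root, k)

keys = list(range(n))
random.shuffle(keys)
random_root = None
for k in keys:                             # random input: height stays logarithmic
    random_root = insert(random_root, k)

print(height(sorted_root))                 # 2000 - effectively a linked list
print(height(random_root))                 # typically around 30
```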

If your data is being entered in a random-ish order, you really shouldn't bother with a balancing algorithm.

T.E.D.
  • Agree in general about KISS first, and complex only if needed. There are many ways to get around the requirement for self-balancing, such as creating an index to read data in random order, but the caveat is that this only works if you know the requirement, i.e. not for general-purpose use, like in creating a library. Also very bad etiquette to leave this chore for some poor bastard who has to maintain your code later. That said, I generally agree with your philosophy. –  Feb 09 '13 at 22:43