4

Can I have a set where the average add/remove operation is O(1) (typical for hashtable-based sets) and the worst-case max/min is less than O(n), ideally O(log n) (typical for tree-based sets)?

upd Hmm, it seems that in the simplest case I can just rescan ALL N elements every time the max/min disappears, and in general this gives me O(1). But I apply my algorithm to stock trading, where changes near the min/max are much more likely, so I just don't want to rescan everything every time the max or min disappears; I need something smarter than a full rescan, which costs O(n).

upd2 In my case the set contains 100-300 elements. Changes to the max/min elements are very likely, so the max/min changes often, and I need to track it. I still want O(1) for add/remove.
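
For concreteness, here is a minimal sketch (in Python; all names are mine) of the full-rescan approach described above - a plain hash set with cached extremes that falls back to an O(n) rescan only when the cached max or min is removed:

```python
class RescanMinMaxSet:
    """Hash set with cached min/max; does a full O(n) rescan only when
    the cached extreme is removed."""

    def __init__(self):
        self._items = set()
        self._min = None
        self._max = None

    def add(self, x):                       # O(1) expected
        self._items.add(x)
        if self._min is None or x < self._min:
            self._min = x
        if self._max is None or x > self._max:
            self._max = x

    def remove(self, x):                    # O(1) expected, O(n) if x was an extreme
        self._items.remove(x)
        if not self._items:
            self._min = self._max = None
        elif x == self._min or x == self._max:
            self._min = min(self._items)    # the full rescan described above
            self._max = max(self._items)

    def get_min(self):                      # O(1)
        return self._min

    def get_max(self):                      # O(1)
        return self._max
```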

Oleg Vazhnev
  • I assume you want efficient deletion as well? – John Dvorak Jan 21 '14 at 17:33
  • @JanDvorak yes, I want something like a hashtable, with O(1) for all basic operations plus fast max/min; changes to elements near the max/min are much more likely. – Oleg Vazhnev Jan 21 '14 at 17:41
  • How about a sorted skip list? Doesn't give O(1) for adding/removing but still quite fast for all operations – Łukasz Kidziński Jan 21 '14 at 17:44
  • O(1) for add/remove is mandatory because it's the main operation. – Oleg Vazhnev Jan 21 '14 at 17:46
  • You can get O(1) insertion, O(1) find-min and find-max, and O(log(n)) deletion with a combination of a hash table and a specialized heap. That's not quite what you asked for, but it's pretty close. The heap might have a bad constant factor, though. – user2357112 Jan 21 '14 at 17:46
  • @user2357112 no no, don't touch the basic operations - add and remove must be O(1). I just don't want to rescan every time the max/min disappears, because in my case the max/min disappears very often. – Oleg Vazhnev Jan 21 '14 at 17:47
  • If a structure with O(1) delete, add and find-max were discovered, no one would care about BSTs or heaps ;) – Łukasz Kidziński Jan 21 '14 at 17:48
  • What are the element types? (integers? numbers? complex elements?) And do you know anything about the distribution of the operations? (How often is seek-max called? What is the distribution of added `x` as a function of `x`?) If your elements are numbers in a known range with a known distribution, you might be able to find a pretty good solution for your specific case. – amit Jan 21 '14 at 17:52
  • It's unlikely I can have O(1) on average for max/min, but can I have, for example, O(log n) for max/min in the worst case? – Oleg Vazhnev Jan 21 '14 at 17:55
  • @amit these are stock exchange prices converted to int64, so they are close to each other, but only loosely so. The set can be something like "500, 1000 and 1500", but can't be "1 50000 100000"; so it's guaranteed that `max / min < 5`, but nothing else. I don't think this can be helpful. – Oleg Vazhnev Jan 21 '14 at 17:57
  • can you guarantee the elements will be removed in the same order they will be added? – John Dvorak Jan 21 '14 at 18:34
  • no, a set is a set, order is not defined – Oleg Vazhnev Jan 22 '14 at 01:31
  • cross-linking this other question: http://stackoverflow.com/a/21278762/819272 – TemplateRex Jan 22 '14 at 09:40

2 Answers

6

Here's an impossibility result with bad constants for worst-case, non-amortized bounds in a deterministic model where keys can be compared and hashed but nothing else. (Yes, that's a lot of stipulations. I second the recommendation of a van Emde Boas tree.)

As is usually the case with comparison lower bounds, this is an adversary argument. The adversary's game plan is to insert many keys while selectively deleting the ones about which the algorithm has the most information. Eventually, the algorithm will be unable to handle a call to max().

The adversary decides key comparisons as follows. Associated with each key is a binary string. Each key initially has an empty string. When two keys are compared, their strings are extended minimally so that neither is a prefix of the other, and the comparison is decided according to the dictionary order. For example, with keys x, y, z, we could have:

x < y:  string(x) is now 0, string(y) is now 1
x < z:  string(z) is now 1
y < z:  string(y) is now 10, string(z) is now 11.
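
As an illustration of this rule, here is a small sketch (Python; names are mine) of one way the adversary could answer comparisons. For simplicity it always makes the first key smaller whenever it has a free choice; running it on the x, y, z example above reproduces exactly those strings.

```python
# Each key carries a bit string, initially empty.  On a comparison the
# strings are extended minimally so that neither is a prefix of the other,
# and the answer follows the dictionary order.
strings = {}   # key -> bit string

def compare(a, b):
    sa, sb = strings.get(a, ""), strings.get(b, "")
    if sa == sb:                          # equal strings: extend both (+2 bits total)
        sa += "0"
        sb += "1"
    elif sa.startswith(sb):               # sb is a proper prefix of sa (+1 bit)
        sb += "1" if sa[len(sb)] == "0" else "0"
    elif sb.startswith(sa):               # sa is a proper prefix of sb (+1 bit)
        sa += "1" if sb[len(sa)] == "0" else "0"
    strings[a], strings[b] = sa, sb
    return -1 if sa < sb else 1           # dictionary order decides the comparison
```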

Let k be a worst-case upper bound on the number of key comparisons made by one operation. Each key comparison increases the total string length by at most two, so for every sequence of at most 3 * n insert/delete operations, the total string length is at most 6 * k * n. If we insert 2 * n distinct keys with interspersed deletions whenever there is a key whose string has length at least 6 * k, then we delete at most n keys on the way to a set with at least n keys where each key has a string shorter than 6 * k bits.

Extend each key's string arbitrarily to 6 * k bits. The (6 * k)-bit prefix of a key's string is the bucket to which the key belongs. Right now, the algorithm has no information about the relative order of keys within a bucket. There are 2 ** (6 * k) buckets, which we imagine laid out left to right in the increasing order dictated by the (6 * k)-bit prefixes. For n sufficiently large, there exists a bucket with a constant (depending on k) fraction of the keys and at least 2 * k times as many keys as the combined buckets to its right. Delete the latter keys, and max() requires a linear number of additional comparisons to sort out the big bucket that now holds the maximum, because at most a little more than half of the required work has been done by the deletions.

David Eisenstat
2

Well, you know that max/min < CONST, and the elements are all numbers. Based on this you can get O(1) insertion and O(k + n/k) find min/max.(1)

Have an array of size k; each element of the array is a hash set. On insertion, insert element x into array[floor((x - MIN) / (MAX - MIN) * k)] (with a special case for x = MAX). Assuming a uniform distribution of elements, each hash set holds an expected n/k elements.
On deletion, remove the element from the relevant set in the same way.

findMax() is now done as follows: find the largest index whose set is not empty - this takes O(k) in the worst case - and then find the maximal element of that first non-empty set, which takes expected O(n/k).
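
A sketch of this structure (Python; the class name and parameters are mine, and it assumes the MIN/MAX bounds are fixed up front):

```python
class BucketedMinMaxSet:
    """k hash-set buckets over the value range [lo, hi]; expected O(1)
    add/remove and O(k + n/k) find-min/max under a roughly uniform
    distribution of elements."""

    def __init__(self, lo, hi, k):
        self.lo, self.hi, self.k = lo, hi, k
        self.buckets = [set() for _ in range(k)]

    def _index(self, x):
        if x == self.hi:                          # special case for x == MAX
            return self.k - 1
        return int((x - self.lo) / (self.hi - self.lo) * self.k)

    def add(self, x):                             # expected O(1)
        self.buckets[self._index(x)].add(x)

    def remove(self, x):                          # expected O(1)
        self.buckets[self._index(x)].discard(x)

    def find_max(self):                           # O(k) scan + expected O(n/k) max
        for bucket in reversed(self.buckets):
            if bucket:
                return max(bucket)
        return None

    def find_min(self):                           # O(k) scan + expected O(n/k) min
        for bucket in self.buckets:
            if bucket:
                return min(bucket)
        return None
```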

Finding optimal k:

We need to minimize k+n/k.

d(k+n/k)/dk = 1-n/k^2 = 0
n = k^2
k = sqrt(n)

This gives us O(sqrt(n) + n/sqrt(n)) = O(sqrt(n)) find min/max on average, with O(1) insertion/deletion.

From time to time you might need to 'reset' the table due to extreme changes of the max and min, but given a 'safe boundary' I believe in most cases this won't be an issue.
Just make sure your MAX is something like 2*max, and your MIN something like min/2, every time you 'reset' the DS.
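
A hypothetical reset, building on the sketch above, could simply rebuild the buckets with those widened bounds:

```python
def reset(s):
    # Rebuild with MIN = min/2 and MAX = 2*max of the current contents,
    # leaving a 'safe boundary' on both sides.
    items = [x for bucket in s.buckets for x in bucket]
    if not items:
        return s
    fresh = BucketedMinMaxSet(min(items) / 2, 2 * max(items), s.k)
    for x in items:
        fresh.add(x)
    return fresh
```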


(1) Assuming all elements come from a known distribution. In my answer I assume a uniform distribution - so P(x) = P(y) for every x, y - but it is fairly easy to modify it for any known distribution.

amit
  • Very nice approach. What would be the complexity if this is done recursively, say, two levels deep with `cbrt(n)` buckets per level? I guess it would primarily affect min/max. – John Dvorak Jan 21 '14 at 19:14
  • @JanDvorak Started it but got a bit stuck; basically it will make insertion/deletion `O(d)` - where `d` is the depth. min/max will now be `O(kd + n/k^d)`. Taking the derivative to find the minimum gives `k = n^(1/(d+1))`, which gives us `O(n^(1/(d+1))*d + n/(n^(1/(d+1)))^d)`. Do you think we can simplify it better? – amit Jan 21 '14 at 19:35
  • It will only work on a non-existent computer with a HUGE CPU cache. How big is k? If k is big (i.e. the array is big, up to one array item per number), then the array will occupy a lot of memory. If the array occupies a lot of memory, you will often access main memory instead of the CPU cache. Each access to main memory takes 60-100 ns, compared to several ticks to access the CPU cache. If you make k small enough to fit the entire structure into the CPU cache, you kill the idea of your algorithm - each bucket will contain too many items. – Oleg Vazhnev Jan 22 '14 at 01:45
  • if I have 10 numbers `1, 2, 3, ..., 10`, MAX-MIN is `9`, so all numbers will get hashed to the same location 0. How will you overcome this? – Aseem Goyal Mar 26 '14 at 11:19
  • maybe you meant `floor(((x-MIN)*k)/(MAX-MIN))` – Aseem Goyal Mar 26 '14 at 11:21
  • @aseem The division in `floor((x-MIN)/(MAX-MIN)*k)` is NOT integer division, it's regular division, so they will not all get hashed to 0 according to it. – amit Mar 26 '14 at 12:23