15

I ran across an assertion that HashSet<T>.Contains() is an O(1) operation. This surprised me since every discussion of hashing I've encountered mentions the possibility of collisions, potentially leading to O(n) run time.

Being curious, I looked at the documentation for HashSet<T>.Contains and also Hashtable.Contains. The documentation for both methods makes that same claim.

When I look in Reflector, HashSet<T>.Contains() is implemented with a for loop, going through a list of slots containing values that have the same hash.
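
Roughly, the shape of it is something like the following toy version (my own paraphrase for illustration only, not the actual decompiled source; the member names and details are approximate):

```csharp
using System;
using System.Collections.Generic;

// Toy, chained hash set to show the general shape -- NOT the real HashSet<T>
// implementation; names and layout are approximate.
class ToyHashSet<T>
{
    private struct Slot { public int HashCode; public T Value; public int Next; }

    private readonly IEqualityComparer<T> comparer = EqualityComparer<T>.Default;
    private readonly int[] buckets = new int[17];  // bucket -> (slot index + 1); 0 means empty
    private readonly Slot[] slots = new Slot[17];  // chains linked through Next
    private int count;

    public void Add(T item)
    {
        if (count == slots.Length)
            throw new InvalidOperationException("Toy version: fixed capacity, no resizing.");
        int hash = comparer.GetHashCode(item) & 0x7FFFFFFF;
        int bucket = hash % buckets.Length;
        slots[count] = new Slot { HashCode = hash, Value = item, Next = buckets[bucket] - 1 };
        buckets[bucket] = ++count;                 // new slot becomes the head of its chain
    }

    // The loop in question: walk only the slots whose hash codes landed in the
    // same bucket, comparing the stored hash first and then the value.
    public bool Contains(T item)
    {
        int hash = comparer.GetHashCode(item) & 0x7FFFFFFF;
        for (int i = buckets[hash % buckets.Length] - 1; i >= 0; i = slots[i].Next)
        {
            if (slots[i].HashCode == hash && comparer.Equals(slots[i].Value, item))
                return true;
        }
        return false;
    }
}
```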

Now admittedly, those same discussions of hashing have also mentioned that a good hashing algorithm avoids collisions and under those circumstances the lookup will indeed be O(1). But my understanding of Big O notation is that it's the worst-case running time, not the best case.

So is the O(1) claim incorrect? Or am I missing something?

ThatBlairGuy
  • I hate big O notation =] – Luiscencio Jul 21 '10 at 17:21
  • @Luiscencio Big O notation is simply the words that let you tell another programmer how a function will scale. What words do you suggest that will quickly give another programmer a semi-accurate idea of how well a given function scales? – Bill K Jul 21 '10 at 17:29
  • [joke] what about "your function is f*****g eating the f*****g processor" – Luiscencio Jul 21 '10 at 17:48
  • http://c-sharp-snippets.blogspot.com/2010/03/runtime-complexity-of-net-generic.html – Jaroslav Jandek Jul 21 '10 at 17:50
  • It'd be fun to post an assertion that hash table lookups have O(n!) time complexity, which is technically correct albeit a bit misleading, and see how many down-votes it gets. – Staffan Jul 21 '10 at 17:52
  • Possible duplicate of [Can hash tables really be O(1)?](http://stackoverflow.com/questions/2771368/can-hash-tables-really-be-o1) – DavidRR May 14 '17 at 17:43

9 Answers

9

But my understanding of Big O notation is that it's the worst-case running time, not the best case.

Unfortunately, there is no "standard" for Big-O when describing algorithms. Often, it's used to describe the general or average case - not the worst case.

From Wikipedia:

...this notation is now frequently also used in the analysis of algorithms to describe an algorithm's usage of computational resources: the worst case or average case...

In this case, it's describing the typical case, given proper hashing. If you have proper hashing in place, the limiting behavior will be constant as N grows, hence O(1).
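
As a rough illustration (a quick, unscientific Stopwatch sketch rather than a proper benchmark; absolute numbers will vary by machine and runtime), timing the same number of lookups against sets of very different sizes should show the per-lookup cost staying roughly flat:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class LookupScaling
{
    static void Main()
    {
        const int lookups = 1_000_000;

        foreach (int size in new[] { 1_000, 100_000, 10_000_000 })
        {
            var set = new HashSet<int>();
            for (int i = 0; i < size; i++)
                set.Add(i);

            bool sink = false;                     // keeps the calls from being optimized away
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < lookups; i++)
                sink ^= set.Contains(i % size);    // always a hit; misses behave similarly
            sw.Stop();

            Console.WriteLine($"n = {size,10}: {lookups} lookups took {sw.ElapsedMilliseconds} ms (sink={sink})");
        }
    }
}
```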

Reed Copsey
  • Yep. Another prominent example is Quicksort — O(n^2) worst case, but often considered to be O(n log n) as this is the average complexity. – kennytm Jul 21 '10 at 17:46
  • When I learned it, big O is used to denote the limit, without regard to best/worst/average case; however, in times when the best, worst, and average cases have significant disconnect, big O is typically used for average case analysis. Use big theta for worst case. – Razor Storm Jul 21 '10 at 18:25
  • That's surprising, I would have expected worst case to be the more typical use, though (particularly for hashing) having the worst case come up frequently would probably be a motivation to look for a better algorithm. I can certainly see where the general/average case would be useful though. In the case of hashing, I would expect O(1) most of the time. – ThatBlairGuy Jul 22 '10 at 13:06
  • Hash table lookup is only O(1) *amortized*, see [Staffan's answer](http://stackoverflow.com/questions/3301983/o1-hash-look-ups/3302044#3302044). And even that requires some assumptions. – Gilles 'SO- stop being evil' Jul 23 '10 at 22:59
  • @Gilles: Most hashing structures are not amortized, however. They do not modify the original hash to prevent collisions in the future, which would be required. (There are some that do, but .NET's does not...) – Reed Copsey Jul 23 '10 at 23:02
  • @Gilles: It is possible, with a very bad hash, to **guarantee** the worst case every time with a .NET hash. If you don't believe me, try using HashSet with a custom class that implements GetHashCode as `return 1;` – Reed Copsey Jul 23 '10 at 23:03
  • @Reed: yup, in fact it's possible to guarantee the worst case with most hashes, see my comment below Staffan's answer. But if you override the hash calculation with a maximally bad `GetHashCode`, it's your fault if hashing is slow! – Gilles 'SO- stop being evil' Jul 23 '10 at 23:29
7

In general, it's O(1).

SLaks
  • Even considering the known poor performance of the built-in `GetHashCode`? I wouldn't depend on it being O(1)... – Stephen Cleary Jul 21 '10 at 17:40
  • @Stephen: What are you talking about? Also, even if `GetHashCode` takes an hour to return, it's still O(1) – the performance of `GetHashCode` doesn't scale with the size of the set. – SLaks Jul 21 '10 at 17:43
  • @SLaks, I'd guess Stephen was referring to the default implementation's poor suitability for hashing. See http://stackoverflow.com/questions/720177/default-implementation-for-object-gethashcode/720196#720196 – Ben M Jul 21 '10 at 17:50
  • @Slaks: Ben is correct. It's not that `GetHashCode` takes a long time to return, it's that it's a poor hashing algorithm. Which causes collisions. Which pushes the "O(1)" reflex answer towards incorrectness, because it's no longer true on average. – Stephen Cleary Jul 21 '10 at 17:56
  • `HashSet` uses the `GetHashCode` method of the `IEqualityComparer`, which you can specify in the constructor and so influence the performance (for good or bad). – Jaroslav Jandek Jul 21 '10 at 18:27
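
To illustrate that last comment, here is a minimal sketch of passing a custom `IEqualityComparer<T>` to the `HashSet<T>` constructor (the `Point` type and `PointComparer` are made up purely for illustration):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical element type and comparer, for illustration only.
class Point
{
    public int X, Y;
    public Point(int x, int y) { X = x; Y = y; }
}

class PointComparer : IEqualityComparer<Point>
{
    public bool Equals(Point a, Point b) => a.X == b.X && a.Y == b.Y;

    // A reasonable hash: distinct points usually get distinct codes, so they
    // spread across buckets and Contains stays cheap. Returning a constant
    // here instead would funnel every element into a single bucket.
    public int GetHashCode(Point p) => (p.X * 397) ^ p.Y;
}

class ComparerDemo
{
    static void Main()
    {
        // The comparer passed to the constructor is what HashSet<T> consults
        // for both hashing and equality.
        var points = new HashSet<Point>(new PointComparer());
        points.Add(new Point(1, 2));
        Console.WriteLine(points.Contains(new Point(1, 2)));   // True
    }
}
```
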
6

For a properly implemented hash table, lookups have amortized constant time complexity.

In practice, a single lookup can be O(n) in the case of collisions, as you say. However, if you perform a large number of lookups, the average time complexity per operation is constant.

Quoting wikipedia:

Amortized analysis differs from average-case performance in that probability is not involved; amortized analysis guarantees the time per operation over worst-case performance.

The method requires knowledge of which series of operations are possible. This is most commonly the case with data structures, which have state that persists between operations. The basic idea is that a worst case operation can alter the state in such a way that the worst case cannot occur again for a long time, thus "amortizing" its cost.

Staffan
  • Indeed, amortized complexity must be mentioned in a good description of hash table complexity. But note that amortized O(1) complexity requires an assumption that the keys are sufficiently randomly distributed. If an attacker chooses the keys to add to the hash, he can force a collision every time. This could be avoided by using a cryptographic hash, but these are very expensive, so you'd get constant time with a prohibitively large constant. Another way is to include a random seed in the hash (perl did this at some point). – Gilles 'SO- stop being evil' Jul 23 '10 at 23:05
5

No, Big O does not define "worst case"; it defines a limit. Hash-based lookups (with good hashing algorithms that provide efficient value distribution and a low collision rate) progress toward a constant value as the number of items increases (they will never actually reach that constant value, but that's the point of it being a limit).

Adam Robinson
2

I believe it means O(1) on average.

kennytm
1

No, Big-O notation is not necessarily restricted to the worst case. Typically you will see Big-O published for best-case, average-case, and worst-case behavior. It is just that most people tend to focus on the worst case. In the case of a hash table, though, the worst case rarely happens, so quoting the average case tends to be more useful.

Yes, a good hash function reduces the probability of a collision. A bad hash function may cause the clustering effect (where different values hash to the exact same value or close to the same value). It is easy to demonstrate that HashSet can indeed become O(n) by implementing the GetHashCode function in such a manner that it returns the same value all of the time.
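
For example, something along these lines (a contrived, made-up class; not from the original post) is enough to force the worst case:

```csharp
using System;
using System.Collections.Generic;

// Contrived class whose GetHashCode always returns the same value, so every
// instance collides into a single bucket.
class DegenerateKey
{
    public int Id;
    public DegenerateKey(int id) { Id = id; }

    public override bool Equals(object obj) => obj is DegenerateKey other && other.Id == Id;
    public override int GetHashCode() => 42;    // all keys collide
}

class WorstCaseDemo
{
    static void Main()
    {
        var set = new HashSet<DegenerateKey>();
        for (int i = 0; i < 10_000; i++)
            set.Add(new DegenerateKey(i));      // each Add already walks the one long chain

        // This lookup has to compare against every colliding element:
        // O(n) rather than O(1), purely because of the hash function.
        Console.WriteLine(set.Contains(new DegenerateKey(12345)));   // False, after a full scan
    }
}
```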

In a nutshell, yes, HashSet and Dictionary can be described as having O(1) runtime complexity because the emphasis is on the average-case scenario.

By the way, Big-O can be used to analyze amortized complexity as well. Amortized complexity is how a sequence of separate (and sometimes even different) operations behaves when grouped together as if they were one big operation. For example, a splay tree is said to have amortized O(log(n)) search, insert, and delete complexity even though the worst case for each could be O(n) and the best case is O(1).

Brian Gideon
0

My understanding of Big O is that the "worst case" generally refers to the number of elements involved. So if a function were to perform in O(n) with 10 elements, but in O(n^2) with 100 or more (not sure such an algorithm actually exists), then the algorithm is considered O(n^2).

Nick
0

O(1) does not necessarily mean "worst case". For hashes, one usually says that the "expected" lookup time is O(1), since the probability of hash collisions is small.

Ferdinand Beyer
  • That's what surprised me -- the phrasing in the various places I found references to the look up didn't say "expected" or "typical." They said "is," which implies always. – ThatBlairGuy Jul 22 '10 at 13:08
0

Hash tables not only have average-case performance of O(1), but if the hash function is random, then for any given percentage P < 100%, the performance that can be obtained P% of the time from a properly designed hash table is O(1). Although the extreme parasitic cases become more and more severe as N increases, that is balanced by the fact that even moderately parasitic cases become less and less likely.

supercat