A query on a non-indexed column results in an O(n) lookup because it has to scan the entire table. Adding an index allows an O(log n) lookup thanks to binary search. Is there a way for databases to achieve O(1), using the same technique a hash table uses (or perhaps another way), if you are searching on a unique key?
-
Are you asking about a specific database, or is this just general curiosity about how databases might be implemented? The latter seems more appropriate for a computer science site, not SO. – Barmar Oct 25 '13 at 08:32
-
There's nothing preventing you from implementing a database index using a hash table instead of a tree. But it will only be useful for exact matching, not matching prefixes with `LIKE` or relational comparisons (< and >). – Barmar Oct 25 '13 at 08:34
-
Clustered indices are mostly the best you can do, typically also at O(log n) lookup time (although obviously faster than non-clustered). O(1) is certainly possible, but not practical enough that database designers thought it was a good idea. – Bernhard Barker Oct 25 '13 at 08:35
-
@Barmar I am curious whether a practical solution for existing database systems exists (and if not, then why). I use hash tables all the time when programming, but always as in-memory, non-persistent storage. After reading this question (http://stackoverflow.com/questions/4363539/how-does-hashing-have-an-o1-search-time) to better understand how a hash table achieves O(1), I started to wonder why databases don't use something similar. It's understandable that LIKE operations would not work, but searching for a unique key is a very common operation. – Despertar Oct 25 '13 at 08:50
-
While searching for a unique key is common, so is sorting and searching for keys less than or greater than a value (e.g. date fields). It's not generally possible to have one index that satisfies all these needs, so a compromise is necessary. O(log n) is usually good enough, and it allows those other operations to be efficient as well. – Barmar Oct 25 '13 at 08:53
1 Answer
Hash-based indices are supported by some RDBMS under certain conditions. For example, MySQL supports the syntax `CREATE INDEX indexname USING HASH ON tablename (cols…)`, but only if the named table is stored in memory (e.g. with the MEMORY storage engine), not if it is stored on disk. Clustered tables are a special case.
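As a minimal sketch of what that looks like in practice (the `session_tokens` table and its columns are invented for illustration), assuming MySQL's MEMORY storage engine:

```sql
-- Hypothetical example: an in-memory lookup table with a hash index.
-- The MEMORY storage engine keeps all rows in RAM, so a HASH index is allowed.
CREATE TABLE session_tokens (
    token   CHAR(32)     NOT NULL,
    user_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (token) USING HASH   -- expected O(1) lookup on exact matches
) ENGINE = MEMORY;

-- Equality lookups on the unique key can use the hash index...
SELECT user_id FROM session_tokens WHERE token = 'abc123...';

-- ...but prefix LIKE and range predicates cannot, and fall back to scanning.
SELECT user_id FROM session_tokens WHERE token LIKE 'abc%';
```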
I guess the main reason against widespread use of hash indices in RDBMS is that they scale poorly. Since disk I/O is expensive, a very thin (sparsely filled) index will require lots of I/O for little gain in information. Therefore you would prefer a rather densely populated index (e.g. keeping the filled portion between ⅓ and ⅔ at all times), which can be problematic in terms of hash collisions. But even more problematic: as you insert values, such a dense index might become too full, and you would have to increase the size of the hash table fairly often. Doing so means completely rebuilding the hash table. That is an expensive operation, which requires lots of disk I/O and will likely block all concurrent queries on that table. Not something to look forward to.
A B-tree, on the other hand, can be extended without too much overhead. Even increasing its depth, which is the closest analogue to growing the hash table, can be done more cheaply, and will be required less often. Since B-trees tend to be shallow, and since disk I/O tends to outweigh anything you do in memory, they are still the preferred solution for most practical applications. Not to mention the fact that they provide cheap access to ranges of values, which isn't possible with a hash.
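To make the range point concrete, here is a hedged sketch (the `orders` table and its columns are invented for illustration) of the kind of query a B-tree index serves well and a hash index cannot help with:

```sql
-- Hypothetical orders table with an ordinary B-tree secondary index on the date.
CREATE TABLE orders (
    id        INT UNSIGNED  NOT NULL AUTO_INCREMENT,
    placed_at DATE          NOT NULL,
    total     DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (id),
    KEY idx_placed_at (placed_at)   -- B-tree by default for on-disk engines
);

-- A B-tree stores keys in sorted order, so this range scan walks a small
-- contiguous part of the index; a hash index scatters keys and offers no help.
SELECT id, total
FROM   orders
WHERE  placed_at BETWEEN '2013-01-01' AND '2013-03-31'
ORDER  BY placed_at;
```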

-
Thanks for the thorough answer. Why would a thin index require lots of IO? – Despertar Oct 26 '13 at 09:36
-
@Despertar: In a thin index, every page of the hash table would contain only very few keys. So caches would be almost useless, and most hash lookups would require fetching a page from disk. Only a dense index has a chance of being cached properly. – MvG Oct 26 '13 at 10:45