
Disclaimer: I realize the totally obvious answer to this question is HashSet<string>. It is absurdly fast, it is unordered, and its values are unique.

But I'm just wondering: HashSet&lt;T&gt; is a mutable class (it has Add, Remove, etc.), and I'm not sure whether the underlying data structure that makes those operations possible sacrifices some performance on read operations -- in particular, I'm concerned with Contains.

Basically, I'm wondering what the absolute fastest-performing data structures in existence are that can supply a Contains method for objects of type string -- within or outside of the .NET Framework itself.

I'm interested in all kinds of answers, regardless of their limitations. For example, I can imagine that some structure might be restricted to strings of a certain length, or might be optimized depending on the problem domain (e.g., the range of possible input values). If it exists, I want to hear about it.

One last thing: I'm not restricting this to read-only data structures. Obviously any read-write data structure could be embedded inside a read-only wrapper. The only reason I even mentioned the word "read-only" is that I don't have any requirement for a data structure to allow adding, removing, etc. If it has those functions, though, I won't complain.


UPDATE:

Moron's answer is an excellent example of the sort of thing I'm looking for. A Trie* definitely seems like a great possibility for the following reason: HashSet<T>.Contains depends on the GetHashCode function of some IEqualityComparer<string>, which, as far as I can tell, is O(n)** by default in .NET. In other words, every character in a string must be examined for HashSet<string>.Contains to return either true or false. For a Trie, only a return value of true would take O(n) to determine; a return value of false could potentially return much more quickly.

This is of course hypothetical. So far I have not written or come across a Trie implementation in .NET that can beat a HashSet<string> at Contains (though an implementation I wrote myself came quite close for the alphabet 'a' to 'z'). I'm just saying, it seems possible.

*That link, by the way, also led me to another intriguing/similar possibility: DAWG.
**Here "n" is referring to the length of the string.

Dan Tao
  • But with a trie it's not a simple yes-or-no, you have to search through the branches until you find the node you want (or find out it isn't there). So while your comparisons may be quicker, you may make more of them, and you'll have additional overhead (node dereferencing) for each comparison. So you've got the same O(n) comparison for the final node you get with a hash, plus the comparisons you made before you got to the final node. So the only way a trie is better is if it's only got strings you're *not* looking for... – TMN Jun 17 '10 at 18:34
  • @TMN: Right, my point was that for the negative case (the string isn't there), you could potentially figure that out much more quickly. Like if your string is "abcdefghijklmnop," but your Trie doesn't have any strings that start with "abc," you could *potentially* get a result faster than you would with a hashset. Your point about the additional overhead of node dereferencing, etc. is duly noted, though. – Dan Tao Jun 17 '10 at 18:43
  • @TMN: P.S., your closing remark actually captures pretty well why I'm asking this question in the first place: "...strings you're *not* looking for" -- the scenario is this: the structure may be queried *many, many, many* times *very, very, very* quickly, and a **lot** of the inputs might not be in the collection, thus returning `false`. So if a data structure exists that can often or sometimes outperform a hashset in the negative case (false), even if it's drastically unmatched in the positive case (true), it still is worth it for me to investigate it. – Dan Tao Jun 17 '10 at 18:47
  • @TMN: Don't agree the node lookup will make tries worse. What would a hash function do? There must be some computation involved with each character! And then there are additional lookups in the hash chains. Since it is read-only, we can optimize the heck out of the tries' layout (we can do the same with hash too, I admit). – Moron Jun 17 '10 at 20:28
  • @Moron: I was considering more the fact that entries in overflow chains in hash tables tend to be adjacent, so the likelihood of multiple entries being already in the D-cache is higher than with a trie. Tries always cluster in insertion order, and of course if you do any balancing all bets are off. ;) – TMN Jun 18 '10 at 11:53
  • @TMN. Well, hash tables could screw with the cache too. If the queries which come in are 'close', like hell/help etc., then hashing will jump around (a feature of good hashing!), while tries will keep it close. Depends on the access patterns, I suppose. Like the OP said, a DAWG might be even better. Anyway, this is just talk. Things need to be measured :-). – Moron Jun 18 '10 at 12:03

4 Answers


Tries are good for doing a Contains, especially for strings from a finite alphabet. Given a string s, the time complexity for Contains on a trie is O(|s|) (|s| = length of s), which is optimal.
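
To make this concrete, here is a minimal trie sketch in C#, restricted to the lowercase alphabet 'a' to 'z' (the alphabet the OP experimented with). The type and member names are made up for illustration, and a real implementation would need input validation and probably a wider alphabet:

```csharp
// Minimal trie over 'a'..'z'. Illustrative sketch only; characters
// outside that range are not validated here.
public sealed class Trie
{
    private sealed class Node
    {
        public readonly Node[] Children = new Node[26];
        public bool IsWord;
    }

    private readonly Node _root = new Node();

    public void Add(string word)
    {
        Node current = _root;
        foreach (char c in word)
        {
            int i = c - 'a';
            current = current.Children[i] ?? (current.Children[i] = new Node());
        }
        current.IsWord = true;
    }

    public bool Contains(string word)
    {
        Node current = _root;
        foreach (char c in word)
        {
            current = current.Children[c - 'a'];
            if (current == null)
                return false;   // early exit: nothing stored with this prefix
        }
        return current.IsWord;
    }
}
```

Contains touches at most |s| nodes, and in the negative case it stops at the first character with no matching child, which is the early-exit behavior discussed in the question's update.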

  • I was actually thinking of something just like this myself. Definitely seems like a good option: fixed on the characters 'a' to 'z' (I attempted an implementation that was case-insensitive), I was able to get something together that performed about *as* fast as a `HashSet`. Obviously, though, if it's only *as* fast, it's not really worth implementing. Still, despite the limitations, this definitely seems like a promising option. I'll have to explore it further. – Dan Tao Jun 17 '10 at 17:57
  • @Dan. I am surprised that it wasn't faster than HashSet. Does HashSet's/string's hashing function not look at all the characters in the string? Anyway, not sure what tests you did, but I suppose it is relevant to your scenarios. In general, what you say is kind of surprising. It all depends on the hash function used, though. For instance, if the hash function looks at each character in the string, tries _will probably_ outperform hash (if implemented properly). – Moron Jun 17 '10 at 20:26
  • @Moron: It could of course just be that my implementation sucks ;) I do intend to revisit this soon, in any case. – Dan Tao Jun 17 '10 at 21:00

Your wondering aside, HashSet is the fastest collection.

There's no faster method, because the underlying hash table allows O(1) read/write access.

Tobias P.
  • There might nevertheless be a faster collection, e.g. one which is always twice as fast. – ChrisW Jun 17 '10 at 16:26
  • O(1) doesn't mean "as fast as possible"; it just means that the cost doesn't increase with the size of the input (in this case, the size of the collection). I'm certainly open to the possibility that you're right, though. But I'm hoping to learn about some other options. – Dan Tao Jun 17 '10 at 16:30
  • Yeah, you're right, but direct access - which a hashtable provides - is nevertheless the fastest access method. – Tobias P. Jun 17 '10 at 16:32

A hashing container approaches O(1) for insert and retrieval, so from an order-of-magnitude perspective you can't get much better than that.

Within a hash container, your performance over time is going to be related to two things: how good a distribution your hash function provides, and how quickly it can be computed. These are not equivalent: a poorly distributed function (one that produces a lot of collisions) will hurt performance far more than a slower but better-distributed hash function.

Thus, if you could come up with a perfect hash function that was also extremely fast to compute, that would be an improvement. It's possible that constraining the data in specific ways may make that easier. But odds are, whatever you come up with will not be as good as what already exists.
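
One concrete way to experiment with this in .NET is to supply a custom IEqualityComparer&lt;string&gt; to the HashSet&lt;string&gt; constructor. The comparer below is a hypothetical sketch for a constrained domain (short lowercase ASCII keys whose first few characters are distinctive); whether it actually beats the default comparer is an assumption that would have to be verified by benchmarking:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer for a constrained domain. Hashing only a prefix
// is faster per call but risks more collisions (the exact trade-off
// described above), so measure before adopting anything like this.
public sealed class ShortAsciiComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        return string.Equals(x, y, StringComparison.Ordinal);
    }

    public int GetHashCode(string s)
    {
        unchecked
        {
            int hash = 17;
            int len = Math.Min(s.Length, 8);    // fold at most 8 characters
            for (int i = 0; i < len; i++)
            {
                hash = hash * 31 + s[i];
            }
            return hash;
        }
    }
}
```

Usage would be `var set = new HashSet<string>(words, new ShortAsciiComparer());`, which is a standard constructor overload. If the first eight characters don't discriminate well in your domain, this will be slower than the default, not faster.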

Joe

Hash tables are amortized O(1) for lookup. Can't do better than that; O(1/n) algorithms are perpetual motion devices. There are only two things that make them behave poorly:

  • A poor hash function that causes many collisions. The worst one will degenerate lookup to O(n). You'll have no trouble with strings; they hash really well. String.GetHashCode() does a terrific job.
  • A collection that is heavily mutated, with many removed items that were added early. This can cause many empty hash buckets that need to be skipped by iterators. Degradation to O(n) is technically possible, although quite rare. A simple workaround is to rebuild the collection by reassigning the reference (like `table = new HashSet<string>(table);`); a sketch follows below.

These kinds of problems are rare. You don't design for them up front (other than the hash function); you start considering them only when you detect perf problems in the program.
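
For what it's worth, the rebuild workaround from the second bullet can be wrapped up as below; the method name is made up, but the constructor overload and the Comparer property are standard HashSet&lt;T&gt; members:

```csharp
using System.Collections.Generic;

static class HashSetMaintenance
{
    // Copying into a fresh instance compacts the internal buckets after
    // heavy churn. The original comparer is carried over so that lookup
    // semantics stay identical.
    public static HashSet<string> Rebuild(HashSet<string> table)
    {
        return new HashSet<string>(table, table.Comparer);
    }
}
```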

Hans Passant
  • 922,412
  • 146
  • 1,693
  • 2,536