whats the fastest string collection structure/algorithm for startswith and/or contains searches

Question

I have the following situation: I have a big collection of strings (lets say 250.000+) of average length of maybe 30. What I have to do is to do many searches within these .. mostly those will be of StartsWith and Contains kind.

The collection is static at runtime. Which means the initial reading and filling of the collection of choice is done only once .. therefore the performance of building the datastructure is absolutely not important. Memory is also not a problem: which also means that I don't mind having two collections with the same data in each if needed (like one for the startswith and another for contains). Only thing that matters is performance of the searches which should return all elements which match the searchcondition.

For startswith I came upon a Trie or Radix-tree .. but maybe there are even better choices?

For contains .. I have no good idea yet at all (beside running a linq query on a list which wont be very fast with that amount of data).

Thanks in advance everyone!

update: I forgot an important part: with Contains i mean no exact matches in the collection .. but i want to find all strings in the collection which contain the given searchstring

Will the substring for your Contains search deal with words, or individual characters? I wonder if building an index would make sense for that one. — Andrew Arnott, Mar 03 '13 at 22:58
It should support characters. Although for performance reasons i could imagine to give a minimum length of 3 or more chars before searching. (can think of it like an autocomplete in an textbox which kicks in only after some characters are entered already) — Mikk, Mar 03 '13 at 23:07
Search the web for "Rabin Karp". This should get you started as it has several search algorithms linked... http://www.stoimen.com/blog/2012/04/02/computer-algorithms-rabin-karp-string-searching/ Also think about using a bloom filter and preloading it with your strings at startup. — JimR, Mar 03 '13 at 23:16
thanks for the hints .. especially the bloom filter is new for me. i will read more about those tomorrow, time for bed here now. — Mikk, Mar 03 '13 at 23:35
Starts with. Store the strings in a sorted array and use binary search, easy to implement, understand and fast. No effect on contains though. — rlb, Mar 04 '13 at 08:48

score 4 · Accepted Answer · edited May 23 '17 at 11:57

4

Building a suffix tree will allow you to do a substring search on all your strings in parallel in O(1). The pedantic in me can't help but note that it's really O(n + m) where n is the number of strings that match your substring and m is the size of the substring being queried.

So what's a suffix tree you ask? In its most basic implementation, it's a trie with a fancier insert method: in addition to adding a string, it also adds every possible suffix of that string to the trie. On this data structure, a substring search becomes a prefix search of all possible suffixes. Since you also want to do prefix searches, you'll want to add a special character in front of each inserted string and the query substrings. The special character will allow you to differentiate between a suffix and a full string.

While this implementation of a suffix tree is remarkably simple, it's also very inefficient (O(n^2) space and build time). Fortunately, there are other more efficient implementations which can greatly reduce the space and time bounds. One of these, Ukkonen's algorithm, is very well explained in this SO answer and brings the space bound down to O(n). You may also want to look into suffix arrays which are an equivalent but more efficient representation of suffix trees.

While I know there are many many more implementations of suffix trees out there (one of which would probably hit the sweet spot for your use case) I just don't know them. I recommend you do some research on the subject before you settle on an implementation.

edited May 23 '17 at 11:57

Community

1
1

answered Mar 03 '13 at 23:59

Ze Blob

2,817
22
16

You are wrong about the inefficiency of suffix tree. A good implementation can improve to O(n) or O(n log n) time and O(n) space. http://en.wikipedia.org/wiki/Suffix_tree – nhahtdh Mar 04 '13 at 08:00
this sounds great so far! especially the idea with the special char for differentiating between suffix and prefixes! – Mikk Mar 04 '13 at 13:08
I will read more about it and try this for sure. Will there be a downside about the suffix arrays? If they are more efficient then i probably will focus on them right away. – Mikk Mar 04 '13 at 13:15
@nhahtdh I was refering to the plain and simple implementation that I described in the second paragraph. I am aware that the `O(n^2)` bound can be greatly improved using smarter algorithms which is why I added links to two alternative implementations in my answer. Turns out that suffix tree are pretty heavily researched at the moment and you can actually beat the `O(n)` space bound if you use a more heavily compressed variant which is not mentionned in the wikipedia article. – Ze Blob Mar 04 '13 at 14:00
@ZeBlob: You should mention that naive implementation is bad. Your post may cause confusion that suffix tree data structure always has bad performance. – nhahtdh Mar 04 '13 at 14:04
@nhahtdh Did a quick edit which hopefully should clarifies things a bit. I'll take another stab at it tonight when I have a bit more time. – Ze Blob Mar 04 '13 at 14:18
as I have a collection of strings (opposed to one big string/text) it seems the "normal" suffix tree wont be enough. I would need a generalized suffix tree/array to support the multiple strings I have. unfortunately I couldn't find a real .NET implementation of it so far. but http://www.cs.iastate.edu/~cs548/suffix.pdf is describing the theory very detailed. this is probably a really interesting riddle to solve .. i just need to see where i can find the time for it or if i have to settle with a faster(implementation) solution – Mikk Mar 04 '13 at 17:41
@Uwe Suffix tree would work just fine for a set of strings. The trick is to add a value to each leaf which indicates the string the leaf belongs to. That and the special character at the beginning should be all you need. – Ze Blob Mar 04 '13 at 20:57
thanks a lot! i will play with this soon i think .. it should become interesting. getting reminded of my uni-times and how i was sitting over red-black-trees.. – Mikk Mar 07 '13 at 12:13

whats the fastest string collection structure/algorithm for startswith and/or contains searches

1 Answers1

Linked