4

I need to search incoming not-very-long pieces of text for occurrences of given strings. The strings are constant for the whole session and there are not many of them (~10). An additional simplification is that none of the strings is contained in any other.

I am currently using Boost regex matching with `str1 | str2 | ...`. The performance of this task is important, so I wonder if I can improve it. Not that I can program better than the Boost guys, but perhaps a dedicated implementation would be more efficient than a general one.
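Roughly what this looks like now (a sketch with placeholder strings; the real needles would need escaping if they contained regex metacharacters):

```cpp
#include <boost/regex.hpp>
#include <iostream>
#include <string>

int main() {
    // Built once per session and reused for every incoming piece of text.
    // "abcx", "bcy", "cz" are just the toy strings from the example below.
    const boost::regex re("abcx|bcy|cz");

    std::string text = "...bcy...";
    boost::smatch what;
    if (boost::regex_search(text, what, re))
        std::cout << "hit: " << what[0] << " at offset " << what.position(0) << '\n';
}
```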

As the strings stay constant over a long time, I can afford to build a data structure, such as a state transition table, upfront.

e.g., if the strings are abcx, bcy and cz, and I've read abc so far, I should be in a combined state that means I'm either 3 chars into string 1, 2 chars into string 2, or 1 char into string 3. Then reading x next will move me to the "string 1 matched" state etc., and any char other than x, y or z will move me to the initial state, and I will not need to retract back to b.

Any ideas or references are appreciated.

davka
  • Are you working with a precompiled regex object? – jdv Aug 24 '10 at 19:46
  • I don't know about Boost, but most languages that use regular expressions build the equivalent of a finite state machine from the expression and use it to parse the text, so it is quite efficient. – Martin York Aug 24 '10 at 20:01
  • Please post the regex you are using. There might be room for improvement there. – John Dibling Aug 24 '10 at 20:04
  • @John Dibling: as indicated, I just concatenate the strings with the OR operator: `str1 | str2 | ...` – davka Aug 25 '10 at 08:30
  • @jdv: I dunno :( I guess yes, I create a `boost::regex` object from the string and use it to match. – davka Aug 25 '10 at 08:32
  • As @maxschlepzig suggested, you should use the Aho-Corasick algorithm, which is optimal for this case and runs in O(|str_1| + ... + |str_n|) preprocessing time plus O(|text|) to find all matches. – Daniel Dec 09 '16 at 16:12

8 Answers

6

Check out the Aho–Corasick string matching algorithm!
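For illustration, here is a minimal sketch of the idea (not Boost's internals; all names are mine): build the trie and failure links for the needles once, then scan each incoming text in a single linear pass.

```cpp
#include <iostream>
#include <queue>
#include <string>
#include <vector>

class AhoCorasick {
    struct Node {
        int next[256];   // full goto function over bytes; -1 = not yet defined
        int fail = 0;    // failure link
        int match = -1;  // index of the needle ending here, if any
        Node() { for (int& n : next) n = -1; }
    };
    std::vector<Node> nodes_;

public:
    explicit AhoCorasick(const std::vector<std::string>& needles) {
        nodes_.emplace_back();  // root
        // 1. Build the trie of all needles.
        for (int p = 0; p < (int)needles.size(); ++p) {
            int s = 0;
            for (unsigned char c : needles[p]) {
                if (nodes_[s].next[c] == -1) {
                    nodes_[s].next[c] = (int)nodes_.size();
                    nodes_.emplace_back();
                }
                s = nodes_[s].next[c];
            }
            // Because the question guarantees no needle is contained in another,
            // one match index per node suffices; the general algorithm keeps a
            // whole output set merged along the failure links.
            nodes_[s].match = p;
        }
        // 2. Breadth-first pass: fill failure links and complete the transitions,
        //    turning the trie into a DFA (no backtracking needed while scanning).
        std::queue<int> q;
        for (int c = 0; c < 256; ++c) {
            int t = nodes_[0].next[c];
            if (t == -1) nodes_[0].next[c] = 0;
            else { nodes_[t].fail = 0; q.push(t); }
        }
        while (!q.empty()) {
            int s = q.front(); q.pop();
            for (int c = 0; c < 256; ++c) {
                int t = nodes_[s].next[c];
                if (t == -1) nodes_[s].next[c] = nodes_[nodes_[s].fail].next[c];
                else { nodes_[t].fail = nodes_[nodes_[s].fail].next[c]; q.push(t); }
            }
        }
    }

    // Calls report(needle_index, end_position) for every occurrence in text.
    template <class Callback>
    void scan(const std::string& text, Callback&& report) const {
        int s = 0;
        for (std::size_t i = 0; i < text.size(); ++i) {
            s = nodes_[s].next[(unsigned char)text[i]];
            if (nodes_[s].match != -1) report(nodes_[s].match, i);
        }
    }
};

int main() {
    AhoCorasick ac({"abcx", "bcy", "cz"});  // the toy strings from the question
    ac.scan("xxabcyacz abcx", [](int needle, std::size_t end) {
        std::cout << "needle " << needle << " ends at " << end << '\n';
    });
}
```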

maxschlepzig
4

Take a look at Suffix Tree.

wilhelmtell
1

I've been looking at the answers, but none seems quite explicit... they mostly boil down to a couple of links.

What intrigues me here is the uniqueness of your problem: the solutions proposed so far do not capitalize at all on the fact that we are looking for several needles at once in the haystack.

I would take a look at KMP / Boyer-Moore, for sure, but I would not apply them blindly (at least if you have some time on your hands), because they are tailored for a single needle. I am pretty convinced we could capitalize on the fact that we have several strings and check all of them at once, with a custom state machine (or custom tables for BM).

Of course, it's unlikely to improve the big O (Boyer-Moore runs in about 3n comparisons per string, so it'll be linear anyway), but you could probably gain on the constant factor.
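For illustration only, here is roughly what such a shared table could look like: a sketch of the Horspool simplification of Boyer-Moore extended to several needles, where the scanning window is the length of the shortest needle and the shift table is the pointwise minimum over all needles. All names are mine and the code is untuned.

```cpp
#include <algorithm>
#include <array>
#include <iostream>
#include <string>
#include <vector>

struct MultiHorspool {
    std::vector<std::string> needles;    // assumes at least one non-empty needle
    std::size_t m_min;                   // length of the shortest needle
    std::array<std::size_t, 256> shift;  // shared bad-character shift table

    explicit MultiHorspool(std::vector<std::string> pats) : needles(std::move(pats)) {
        m_min = needles.front().size();
        for (const auto& p : needles) m_min = std::min(m_min, p.size());
        shift.fill(m_min);               // default: jump a whole window
        for (const auto& p : needles)
            for (std::size_t i = 0; i + 1 < m_min; ++i)
                shift[(unsigned char)p[i]] =
                    std::min(shift[(unsigned char)p[i]], m_min - 1 - i);
    }

    // Position of the first occurrence of any needle, or std::string::npos.
    std::size_t find(const std::string& text) const {
        std::size_t pos = 0;
        while (pos + m_min <= text.size()) {
            for (const auto& p : needles)                 // verify every needle here
                if (text.compare(pos, p.size(), p) == 0) return pos;
            pos += shift[(unsigned char)text[pos + m_min - 1]];  // skip ahead
        }
        return std::string::npos;
    }
};

int main() {
    MultiHorspool m({"abcx", "bcy", "cz"});
    std::cout << m.find("xx bcy zz") << '\n';  // prints 3
}
```

The shorter the shortest needle, the smaller the skips, so the gain depends a lot on the needle set (this is essentially the idea behind the Set Horspool / Wu-Manber family).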

Matthieu M.
1

Look at this: http://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/boost_regex/configuration/algorithm.html

The existence of a recursive/non-recursive distinction is a pretty strong suggestion that Boost is not necessarily using a linear-time deterministic finite-state machine. Therefore, there's a good chance you can do better for your particular problem.

The best answer depends quite a bit on how many haystacks you have and the minimum size of a needle. If the smallest needle is longer than a few characters, you may be able to do a little bit better than a generalized regex library.

Basically all string searches work by testing for a match at the current position (cursor), and if none is found, then trying again with the cursor slid farther to the right.

One tactic is to build a DFSM out of the string (or strings) for which you are searching, so that the test and the cursor motion are combined in a single operation; this is essentially what Knuth-Morris-Pratt does for one needle and what Aho-Corasick does for several. The single-needle version would need extra handling if one match could ever be a proper prefix of another. (Remember that for when you want to reuse your code.)

Another tactic is to slide the cursor more than one character to the right if at all possible. Boyer-Moore does this. It's normally built for a single needle. Construct a table of all characters and the rightmost position at which they appear in the needle (if at all). Now, position the cursor at len(needle)-1. The table entry will tell you either (a) at what leftward offset from the cursor the needle might be found, or (b) that you can move the cursor len(needle) characters farther to the right.

When you have more than one needle, the construction and use of your table grows more complicated, but it may still save you an order of magnitude on probes. You might still want to build a DFSM, but instead of calling a general search method, you call `does_this_DFSM_match_at_this_offset()`.

Another tactic is to test more than 8 bits at a time. There's a spam-killer tool that looks at 32-bit machine words at a time. It computes a simple hash to squeeze each word into 12 bits and then looks in a table to see whether there's a hit. Each pattern gets four entries (for offsets of 0, 1, 2, and 3 from the start of the pattern), and this way, despite thousands of patterns in the table, only one or two probes are made per 32-bit word of the subject line.

So in general, yes, you can go faster than regexes WHEN THE NEEDLES ARE CONSTANT.

Ian
0

Regex engine initialization is expected to have some overhead, so if no real regular expressions are involved, plain C `memcmp()` should do fine.
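For what it's worth, the kind of baseline I have in mind is just a sliding `memcmp` over every offset, with no preprocessing at all (a sketch for benchmarking purposes only):

```cpp
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

// Returns true if any needle occurs anywhere in text
// (naive O(|text| * total needle length) scan).
bool contains_any(const std::string& text, const std::vector<std::string>& needles) {
    for (std::size_t pos = 0; pos < text.size(); ++pos)
        for (const auto& n : needles)
            if (n.size() <= text.size() - pos &&
                std::memcmp(text.data() + pos, n.data(), n.size()) == 0)
                return true;
    return false;
}

int main() {
    std::cout << contains_any("xx bcy zz", {"abcx", "bcy", "cz"}) << '\n';  // prints 1
}
```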

If you can tell the file sizes and give some specific use cases, we could build a benchmark (I consider this very interesting).

Interesting: memcmp explorations and timing differences

Regards

rbo

rubber boots
  • At the very end of the `memcmp` article: http://news.ycombinator.com/item?id=607954 which shows that Boyer-Moore beats the crap out of an optimized `memcmp`... and I would probably not advise `memcmp` in C++; we have a `std::string` class with a `find` method after all, no need to deal with the nasties. – Matthieu M. Aug 25 '10 at 06:13
  • @Matthieu: good point, I'd like to test this in a scenario that matches the conditions of the O.P. - if he would describe them in some detail. – rubber boots Aug 25 '10 at 09:16
0

There is always Boyer-Moore.

doron
0

Besides the Rabin-Karp algorithm and the Knuth-Morris-Pratt algorithm, my algorithms book suggests a finite state machine for string matching. You need to build such a finite state machine for every search string.
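For illustration, a sketch of such a textbook string-matching automaton for one search string (one automaton per string, as suggested above; built with the usual KMP-style restart-state trick, all names mine):

```cpp
#include <array>
#include <iostream>
#include <string>
#include <vector>

// dfa[q][c] = next state after reading byte c in state q;
// reaching state == pattern length means a match just ended.
std::vector<std::array<int, 256>> build_dfa(const std::string& p) {
    const int m = (int)p.size();
    std::vector<std::array<int, 256>> dfa(m + 1);
    for (auto& row : dfa) row.fill(0);
    if (m == 0) return dfa;
    dfa[0][(unsigned char)p[0]] = 1;
    int x = 0;  // restart state: where the automaton would be on p[1..q-1]
    for (int q = 1; q <= m; ++q) {
        for (int c = 0; c < 256; ++c) dfa[q][c] = dfa[x][c];  // mismatch: act like restart state
        if (q < m) {
            dfa[q][(unsigned char)p[q]] = q + 1;              // expected character: advance
            x = dfa[x][(unsigned char)p[q]];
        }
    }
    return dfa;
}

int main() {
    const std::string pat = "bcy";
    const auto dfa = build_dfa(pat);
    const std::string text = "abcyxbcy";
    int q = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        q = dfa[q][(unsigned char)text[i]];
        if (q == (int)pat.size())
            std::cout << "match ending at " << i << '\n';  // prints 3 and 7
    }
}
```

Running ~10 such automata side by side works, but merging them into a single combined machine (which is what Aho-Corasick does) avoids scanning the text once per string.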

Christian Ammer
  • Do you say that because you've read the source to BOOST, or is there some other reason you know that BOOST doesn't do depth-first non-deterministic matching the way its user documentation would suggest? – Ian Aug 27 '10 at 12:19
0

You can do it with the very popular Lex & Yacc tools, or their Flex and Bison counterparts. Use Lex to tokenize the string, compare your pre-defined strings against the tokens returned from the lexer, and when a match is found, perform the desired action. There are many sites that describe Lex and Yacc; one such site is http://epaperpress.com/lexandyacc/

bjskishore123