4

I need to search incoming not-very-long pieces of text for occurrences of given strings. The strings are constant for the whole session and there are not many of them (~10). An additional simplification is that none of the strings is contained in any other.

I am currently using Boost regex matching with `str1 | str2 | ...`. The performance of this task is important, so I wonder if I can improve it. Not that I can program better than the Boost guys, but perhaps a dedicated implementation would be more efficient than a general one.
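Roughly what this looks like now (a sketch with placeholder strings; the real needles would need escaping if they contained regex metacharacters):

```cpp
#include <boost/regex.hpp>
#include <iostream>
#include <string>

int main() {
    // Built once per session and reused for every incoming piece of text.
    // "abcx", "bcy", "cz" are just the toy strings from the example below.
    const boost::regex re("abcx|bcy|cz");

    std::string text = "...bcy...";
    boost::smatch what;
    if (boost::regex_search(text, what, re))
        std::cout << "hit: " << what[0] << " at offset " << what.position(0) << '\n';
}
```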

As the strings stay constant over a long time, I can afford to build a data structure, such as a state transition table, upfront.

e.g., if the strings are abcx, bcy and cz, and I've read abc so far, I should be in a combined state that means I'm either 3 chars into string 1, 2 chars into string 2, or 1 char into string 3. Then reading x next will move me to the "string 1 matched" state etc., and any char other than x, y or z will move me to the initial state, and I will not need to retract back to b.

Any ideas or references are appreciated.

davka
  • Are you working with a precompiled regex object? – jdv Aug 24 '10 at 19:46
  • I don't know about Boost, but most languages that use regular expressions build the equivalent of a finite state machine from the expression and use it to parse the text, so it is quite efficient. – Martin York Aug 24 '10 at 20:01
  • Please post the regex you are using. There might be room for improvement there. – John Dibling Aug 24 '10 at 20:04
  • @John Dibling: as indicated, I just concatenate the strings with the OR operator: `str1 | str2 | ...` – davka Aug 25 '10 at 08:30
  • @jdv: I dunno :( I guess yes, I create a `boost::regex` object from the string and use it to match. – davka Aug 25 '10 at 08:32
  • As @maxschlepzig suggested, you should use the Aho-Corasick algorithm, which is optimal for this case and runs in O(|str_1| + ... + |str_n|) preprocessing time plus O(|text|) to find all matches. – Daniel Dec 09 '16 at 16:12

8 Answers

6

Check out the Aho–Corasick string matching algorithm!
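For illustration, here is a minimal sketch of the idea (not Boost's internals; all names are mine): build the trie and failure links for the needles once, then scan each incoming text in a single linear pass.

```cpp
#include <iostream>
#include <queue>
#include <string>
#include <vector>

class AhoCorasick {
    struct Node {
        int next[256];   // full goto function over bytes; -1 = not yet defined
        int fail = 0;    // failure link
        int match = -1;  // index of the needle ending here, if any
        Node() { for (int& n : next) n = -1; }
    };
    std::vector<Node> nodes_;

public:
    explicit AhoCorasick(const std::vector<std::string>& needles) {
        nodes_.emplace_back();  // root
        // 1. Build the trie of all needles.
        for (int p = 0; p < (int)needles.size(); ++p) {
            int s = 0;
            for (unsigned char c : needles[p]) {
                if (nodes_[s].next[c] == -1) {
                    nodes_[s].next[c] = (int)nodes_.size();
                    nodes_.emplace_back();
                }
                s = nodes_[s].next[c];
            }
            // Because the question guarantees no needle is contained in another,
            // one match index per node suffices; the general algorithm keeps a
            // whole output set merged along the failure links.
            nodes_[s].match = p;
        }
        // 2. Breadth-first pass: fill failure links and complete the transitions,
        //    turning the trie into a DFA (no backtracking needed while scanning).
        std::queue<int> q;
        for (int c = 0; c < 256; ++c) {
            int t = nodes_[0].next[c];
            if (t == -1) nodes_[0].next[c] = 0;
            else { nodes_[t].fail = 0; q.push(t); }
        }
        while (!q.empty()) {
            int s = q.front(); q.pop();
            for (int c = 0; c < 256; ++c) {
                int t = nodes_[s].next[c];
                if (t == -1) nodes_[s].next[c] = nodes_[nodes_[s].fail].next[c];
                else { nodes_[t].fail = nodes_[nodes_[s].fail].next[c]; q.push(t); }
            }
        }
    }

    // Calls report(needle_index, end_position) for every occurrence in text.
    template <class Callback>
    void scan(const std::string& text, Callback&& report) const {
        int s = 0;
        for (std::size_t i = 0; i < text.size(); ++i) {
            s = nodes_[s].next[(unsigned char)text[i]];
            if (nodes_[s].match != -1) report(nodes_[s].match, i);
        }
    }
};

int main() {
    AhoCorasick ac({"abcx", "bcy", "cz"});  // the toy strings from the question
    ac.scan("xxabcyacz abcx", [](int needle, std::size_t end) {
        std::cout << "needle " << needle << " ends at " << end << '\n';
    });
}
```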

maxschlepzig
4

Take a look at Suffix Tree.

wilhelmtell
1

I've been looking at the answers, but none seems quite explicit... they mostly boil down to a couple of links.

What intrigues me here is the uniqueness of your problem: the solutions proposed so far do not capitalize at all on the fact that we are looking for several needles at once in the haystack.

I would take a look at KMP / Boyer-Moore, for sure, but I would not apply them blindly (at least if you have some time on your hands), because they are tailored for a single needle. I am pretty convinced we could capitalize on the fact that we have several strings and check all of them at once, with a custom state machine (or custom tables for BM).

Of course, it's unlikely to improve the big O (Boyer-Moore runs in about 3n comparisons per string, so it'll be linear anyway), but you could probably gain on the constant factor.
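For illustration only, here is roughly what such a shared table could look like: a sketch of the Horspool simplification of Boyer-Moore extended to several needles, where the scanning window is the length of the shortest needle and the shift table is the pointwise minimum over all needles. All names are mine and the code is untuned.

```cpp
#include <algorithm>
#include <array>
#include <iostream>
#include <string>
#include <vector>

struct MultiHorspool {
    std::vector<std::string> needles;    // assumes at least one non-empty needle
    std::size_t m_min;                   // length of the shortest needle
    std::array<std::size_t, 256> shift;  // shared bad-character shift table

    explicit MultiHorspool(std::vector<std::string> pats) : needles(std::move(pats)) {
        m_min = needles.front().size();
        for (const auto& p : needles) m_min = std::min(m_min, p.size());
        shift.fill(m_min);               // default: jump a whole window
        for (const auto& p : needles)
            for (std::size_t i = 0; i + 1 < m_min; ++i)
                shift[(unsigned char)p[i]] =
                    std::min(shift[(unsigned char)p[i]], m_min - 1 - i);
    }

    // Position of the first occurrence of any needle, or std::string::npos.
    std::size_t find(const std::string& text) const {
        std::size_t pos = 0;
        while (pos + m_min <= text.size()) {
            for (const auto& p : needles)                 // verify every needle here
                if (text.compare(pos, p.size(), p) == 0) return pos;
            pos += shift[(unsigned char)text[pos + m_min - 1]];  // skip ahead
        }
        return std::string::npos;
    }
};

int main() {
    MultiHorspool m({"abcx", "bcy", "cz"});
    std::cout << m.find("xx bcy zz") << '\n';  // prints 3
}
```

The shorter the shortest needle, the smaller the skips, so the gain depends a lot on the needle set (this is essentially the idea behind the Set Horspool / Wu-Manber family).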

Matthieu M.
1

Look at this: http://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/boost_regex/configuration/algorithm.html

The existence of a recursive/non-recursive distinction is a pretty strong suggestion that Boost is not necessarily using a linear-time deterministic finite-state machine. Therefore, there's a good chance you can do better for your particular problem.

The best answer depends quite a bit on how many haystacks you have and the minimum size of a needle. If the smallest needle is longer than a few characters, you may be able to do a little bit better than a generalized regex library.

Basically all string searches work by testing for a match at the current position (cursor), and if none is found, then trying again with the cursor slid farther to the right.

One tactic is to build a DFSM out of the string (or strings) for which you are searching, so that the test and the cursor motion are combined in a single operation; this is essentially what Knuth-Morris-Pratt does for one needle and what Aho-Corasick does for several. The single-needle version would need extra handling if one match could ever be a proper prefix of another. (Remember that for when you want to reuse your code.)

Another tactic is to slide the cursor more than one character to the right if at all possible. Boyer-Moore does this. It's normally built for a single needle. Construct a table of all characters and the rightmost position at which they appear in the needle (if at all). Now, position the cursor at len(needle)-1. The table entry will tell you either (a) at what leftward offset from the cursor the needle might be found, or (b) that you can move the cursor len(needle) characters farther to the right.

When you have more than one needle, the construction and use of your table grows more complicated, but it may still save you an order of magnitude on probes. You might still want to build a DFSM, but instead of calling a general search method, you call `does_this_DFSM_match_at_this_offset()`.

Another tactic is to test more than 8 bits at a time. There's a spam-killer tool that looks at 32-bit machine words at a time. It computes a simple hash to squeeze each word into 12 bits and then looks in a table to see whether there's a hit. Each pattern gets four entries (for offsets of 0, 1, 2, and 3 from the start of the pattern), and this way, despite thousands of patterns in the table, only one or two probes are made per 32-bit word of the subject line.

So in general, yes, you can go faster than regexes WHEN THE NEEDLES ARE CONSTANT.

Ian
0

Regex engine initialization is expected to have some overhead, so if no real regular expressions are involved, plain C `memcmp()` should do fine.
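For what it's worth, the kind of baseline I have in mind is just a sliding `memcmp` over every offset, with no preprocessing at all (a sketch for benchmarking purposes only):

```cpp
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

// Returns true if any needle occurs anywhere in text
// (naive O(|text| * total needle length) scan).
bool contains_any(const std::string& text, const std::vector<std::string>& needles) {
    for (std::size_t pos = 0; pos < text.size(); ++pos)
        for (const auto& n : needles)
            if (n.size() <= text.size() - pos &&
                std::memcmp(text.data() + pos, n.data(), n.size()) == 0)
                return true;
    return false;
}

int main() {
    std::cout << contains_any("xx bcy zz", {"abcx", "bcy", "cz"}) << '\n';  // prints 1
}
```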

If you can tell the file sizes and give some specific use cases, we could build a benchmark (I consider this very interesting).

Interesting: memcmp explorations and timing differences

Regards

rbo

rubber boots
  • At the very end of the `memcmp` article: http://news.ycombinator.com/item?id=607954 which shows that Boyer-Moore beats the crap out of an optimized `memcmp`... and I would probably not advise `memcmp` in C++; we have a `std::string` class with a `find` method after all, no need to deal with the nasties. – Matthieu M. Aug 25 '10 at 06:13
  • @Matthieu: good point, I'd like to test this in a scenario that matches the conditions of the O.P. - if he would describe them in some detail. – rubber boots Aug 25 '10 at 09:16
0

There is always Boyer-Moore.

doron
0

Besides the Rabin-Karp algorithm and the Knuth-Morris-Pratt algorithm, my algorithms book suggests a finite state machine for string matching. You need to build such a finite state machine for every search string.
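For illustration, a sketch of such a textbook string-matching automaton for one search string (one automaton per string, as suggested above; built with the usual KMP-style restart-state trick, all names mine):

```cpp
#include <array>
#include <iostream>
#include <string>
#include <vector>

// dfa[q][c] = next state after reading byte c in state q;
// reaching state == pattern length means a match just ended.
std::vector<std::array<int, 256>> build_dfa(const std::string& p) {
    const int m = (int)p.size();
    std::vector<std::array<int, 256>> dfa(m + 1);
    for (auto& row : dfa) row.fill(0);
    if (m == 0) return dfa;
    dfa[0][(unsigned char)p[0]] = 1;
    int x = 0;  // restart state: where the automaton would be on p[1..q-1]
    for (int q = 1; q <= m; ++q) {
        for (int c = 0; c < 256; ++c) dfa[q][c] = dfa[x][c];  // mismatch: act like restart state
        if (q < m) {
            dfa[q][(unsigned char)p[q]] = q + 1;              // expected character: advance
            x = dfa[x][(unsigned char)p[q]];
        }
    }
    return dfa;
}

int main() {
    const std::string pat = "bcy";
    const auto dfa = build_dfa(pat);
    const std::string text = "abcyxbcy";
    int q = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        q = dfa[q][(unsigned char)text[i]];
        if (q == (int)pat.size())
            std::cout << "match ending at " << i << '\n';  // prints 3 and 7
    }
}
```

Running ~10 such automata side by side works, but merging them into a single combined machine (which is what Aho-Corasick does) avoids scanning the text once per string.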

Christian Ammer
  • Do you say that because you've read the source to BOOST, or is there some other reason you know that BOOST doesn't do depth-first non-deterministic matching the way its user documentation would suggest? – Ian Aug 27 '10 at 12:19
0

You can do it with the very popular Lex & Yacc tools, or their Flex and Bison counterparts. Use Lex to tokenize the string, compare your pre-defined strings against the tokens returned from the lexer, and when a match is found, perform the desired action. There are many sites that describe Lex and Yacc; one such site is http://epaperpress.com/lexandyacc/

bjskishore123