3

I am looking for an efficient search algorithm, that, for a given set of strings searches a large buffer for any one match from the set of strings. Currently i know a few efficient single-string algorithms (i have used the Knuth before), but i don't know if they really help.

Here is what i am actually doing:

  • I have around 6-10 predefined strings, each around 200-300 characters (actually bytes, since i`m processing binary data)
  • The input is a large, sometimes few megabyte buffer
  • I would like to process the buffer, and when i have a match, i would like to stop the search

I have looked for multiple-string searching algorithms using a finite set of predefined patterns, but they all seem to revolve around matching ALL of the predefined strings in the buffer.

This post: Fast algorithm for searching for substrings in a string, suggested using the Aho–Corasick or the Rabin–Karp alogirthm.

I thought, that since i only need one match, i could find other methods, that are similar to the mentioned algorithms, but the constrains given by the problem can improve the performance.

user2281752
  • 145
  • 1
  • 9
  • You could probably construct a regular expression from the set of strings, and use that to search the data. I have no idea if that would be "efficient" though, it was just the first idea that popped into my mind. – Some programmer dude Dec 12 '14 at 12:13
  • @JoachimPileborg Alas, that probably won't work if the data is binary... – anandthakker Dec 12 '14 at 12:15
  • A regular expression will ultimately build a finite state machine and essentially conduct incremental matching by brute-force. So if you have such a library it will be easy to implement but not efficient. Certainly not as efficient a the algorithms the OP has already found. – Persixty Dec 12 '14 at 12:19
  • 1
    Similar question: http://stackoverflow.com/q/3183582/3004881 – Dan Getz Dec 12 '14 at 12:26
  • 1
    Boyer-Moore *might* help here. http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm Notice that algorithm gets an advantage from long patterns that contain a restricted set of values not common in the text. Your patterns are relatively long ~1% of the length of the text and if you choose an 'alphabet' of `long` the second condition might hold unless there's some close relationship between the patterns and text. NB: If your patterns are 'odd' lengths (not 0 mod `sizeof(long)` you will need to do a bit of last gasp checking on the trailing odd ends. Hope that makes sense. – Persixty Dec 12 '14 at 12:37
  • 2
    I was thinking that it might be possible to combine Aho–Corasick with Boyer-Moore. If there are no short search patterns, then a combination could be really fast. The result could make a really good research paper! – Klas Lindbäck Dec 12 '14 at 12:44
  • 1
    @DanAllen You would also need to handle alignment issues and a very large jump table. If you have the pattern "ABC" and a two-byte word, you would need to put both "AB" and "BC" into the jump table (note that fixing the alignment issue also fixes the 'odd length' issue). – Klas Lindbäck Dec 12 '14 at 12:57
  • @KlasLindbäck I've upvoted your comment because I think I am glossing over a significant amount of spherical myalgia. – Persixty Dec 12 '14 at 14:51
  • The problem is that even Aho-Corasick complexity is linear. If you manage to construct simplier algorithm that would end as soon as first match is found, the worst case scenario would still have linear complexity. Such solution would be simply more sensitive to data set. – Kyborek Dec 12 '14 at 15:08
  • "matching ALL of the predefined strings"-- No, you have misunderstood the Aho-Corasick algorithm. It finds the first occurrence of any string in the set. It's exactly what you want, IOW. – j_random_hacker Dec 12 '14 at 15:40

1 Answers1

1

Aho-Corasick is a good choice here. After building an automaton the input string is traversed from left to right so it is possible to stop immediately after the first match is found. The time complexity is O(sum of lengths of all patterns + the position of the first occurrence). It is optimal because it is not possible to find the first match without reading all patterns and all the bytes from the buffer before the first occurrence.

kraskevich
  • 18,368
  • 4
  • 33
  • 45
  • thanks for clearing it up. i will use this method then – user2281752 Dec 14 '14 at 19:52
  • just a quick note: implemented a test application, which works as i expected, but the following statement from the original paper bothers be a bit: "Our problem is to locate and identify all substrings of x which are keywords in K." Doesn't stat still mean, that it will go through all the input text and does not stop at the first match? – user2281752 Dec 17 '14 at 13:09
  • @user2281752 If you implement this algorithm by yourself, you can just stop after the first match. If you are using an existing implementation, it can look for all matches(but there nothing in the nature of this algorithm that prevents you from looking for the first match only). – kraskevich Dec 17 '14 at 13:33