Algorithm to find first occurrence of each of the given pattern in each of the given input text

Question

I am looking for algorithm for which implementation is readily available in C. My input consists of many patterns & many texts. I want to find first occurrence of each pattern in each of the texts.

I am exploring string matching algorithms from here: http://igm.univ-mlv.fr/~lecroq/string/

But not sure of the best possible solution. Does anybody know the best matching algorithm for this use case?

My patterns will be in the range of 10-15 chars & texts will be in the range of 30-40 chars.

Also some of the stackoverflow answers mention that Boyes-Moore & KMP don't necessarily performs better than the strstr() because of modern day HW architectures. Will that be true for my peculiar use case as well?

Here is another list of algorithms. http://www.dmi.unict.it/~faro/smart/algorithms.php

more info here: http://stackoverflow.com/questions/7938033/string-pattern-matching-in-java — aquemini, Feb 03 '14 at 12:02
If the patterns are a simple strings (without regular expression wildcards) algorithms requiring a preprocessing of the pattern won't have any advantage as compared to a simple strstr, given that the texts have only up to four times that length. — laune, Feb 03 '14 at 13:14
Right now I will have strings with simple characters. But suppose if the problem extends to patterns containing wildcards or regex how it can be tackled? — Ajay Bodhe, Feb 03 '14 at 14:28

score 0 · Answer 1 · edited Feb 03 '14 at 13:28

0

Use modification of the Rabin-Karp algorithm:

Compute minimal pattern length for your patterns (10 in the your example)
For each pattern, create "snake-hash" with prefix length = 10. Create hashtable for your patterns.
For each text, "move snake length=10", and recompute hash. If hash match any prefix - compare full string.

So, this algorithm is ~O(n+m); N=texts length; M=patterls length.

For hashing, I recommend to use pair "cyclic_shift + XOR".

edited Feb 03 '14 at 13:28

Bernhard Barker

54,589
14
104
138

answered Feb 03 '14 at 13:13

olegarch

3,670
1
20
19

1

Thanks. How about the Aho-Corasick algorithm for this case? To the best of my understanding it generates tries out of patterns & enable parallel searching in text with optimal jump decisions? – Ajay Bodhe Feb 03 '14 at 14:25
It will work, also. But, I think, Rabin-Carp for your case more stable (it more resistant to common sub-strings in patterns), and more quick in linear case - move forward to next char is just two XORs and one shift. Do not need reset FSA, etc. Especially, if you will use uint32_t hash accumulator, shift by 4, and pattern-length=8. – olegarch Feb 03 '14 at 19:39

Algorithm to find first occurrence of each of the given pattern in each of the given input text

1 Answers1