1

I am looking for algorithm for which implementation is readily available in C. My input consists of many patterns & many texts. I want to find first occurrence of each pattern in each of the texts.

I am exploring string matching algorithms from here: http://igm.univ-mlv.fr/~lecroq/string/

But not sure of the best possible solution. Does anybody know the best matching algorithm for this use case?

My patterns will be in the range of 10-15 chars & texts will be in the range of 30-40 chars.

Also some of the stackoverflow answers mention that Boyes-Moore & KMP don't necessarily performs better than the strstr() because of modern day HW architectures. Will that be true for my peculiar use case as well?

Here is another list of algorithms. http://www.dmi.unict.it/~faro/smart/algorithms.php

Ajay Bodhe
  • 51
  • 1
  • 11
  • more info here: http://stackoverflow.com/questions/7938033/string-pattern-matching-in-java – aquemini Feb 03 '14 at 12:02
  • If the patterns are a simple strings (without regular expression wildcards) algorithms requiring a preprocessing of the pattern won't have any advantage as compared to a simple strstr, given that the texts have only up to four times that length. – laune Feb 03 '14 at 13:14
  • Right now I will have strings with simple characters. But suppose if the problem extends to patterns containing wildcards or regex how it can be tackled? – Ajay Bodhe Feb 03 '14 at 14:28

1 Answers1

0

Use modification of the Rabin-Karp algorithm:

  1. Compute minimal pattern length for your patterns (10 in the your example)

  2. For each pattern, create "snake-hash" with prefix length = 10. Create hashtable for your patterns.

  3. For each text, "move snake length=10", and recompute hash. If hash match any prefix - compare full string.

So, this algorithm is ~O(n+m); N=texts length; M=patterls length.

For hashing, I recommend to use pair "cyclic_shift + XOR".

Bernhard Barker
  • 54,589
  • 14
  • 104
  • 138
olegarch
  • 3,670
  • 1
  • 20
  • 19
  • 1
    Thanks. How about the Aho-Corasick algorithm for this case? To the best of my understanding it generates tries out of patterns & enable parallel searching in text with optimal jump decisions? – Ajay Bodhe Feb 03 '14 at 14:25
  • It will work, also. But, I think, Rabin-Carp for your case more stable (it more resistant to common sub-strings in patterns), and more quick in linear case - move forward to next char is just two XORs and one shift. Do not need reset FSA, etc. Especially, if you will use uint32_t hash accumulator, shift by 4, and pattern-length=8. – olegarch Feb 03 '14 at 19:39