3

I have this problem to solve in my PHP project where some keywords (from a few hundreds to a few thousands, lengths can vary) need to be searched in a string about 100-300 characters long, sometimes of lesser length 30-50 chars. I can preprocess the keywords for reusing for new instances of search strings. I am kind of new to PHP and did not find a way to do this in the PHP library. Doing a bit of searching, I found a few good candidates in Aho Corasick algorithm and then this improvement by Sun Wu and Udi Manber, which also seems to be known as agrep (or is a part of agrep): http://webglimpse.net/pubs/TR94-17.pdf

There is Rabin Karp, Suffix Trees etc too but they did not look quite suitable as first was for fixed length keywords and latter seems quite generic and will need rather a lot of work.

Can anyone let me know if implementing the Agrep/Sun Wu-Manber on my own in php is a good way to solve this problem? Any other feedback?

EDIT: as I mentioned below in a comment, there are hundreds or more of distinct search keywords, so regex will not help. So that response is not helpful.

aditya
  • 143
  • 6
  • There are a lot of ways to do it, the first question is: where the words coming from? – GodFather Jul 12 '11 at 12:54
  • the keywords is a fixed set, the searchable string will be different for each execution of the search. So I can do some preprocessing on the keywords. – aditya Jul 12 '11 at 12:57
  • 1
    What you can try doing is this. Put the keywords in an array; then do a customer-comparator search of the array. Your comparator function would check your searchable string against the individual keywords. Another option would be to prepare one big regex with all your keywords - it would look like this: `/keyword1|keyword2|keyword3|.../` - of course, you'd need to do proper escaping of special chars. Then you can simply use `preg_match()` to see if you have a match. – Aleks G Jul 12 '11 at 13:04
  • I Agree, if you have a fixed set of words you can use preg_match, but if you has a big set to search maybe it can be a little slow. – GodFather Jul 12 '11 at 13:12
  • creating a single regex expression for all the keywords is actually similar approach to the Aho-Corasick algorithm I mentioned. It will be non-trivial to do it in a standard algo way, since they are unrelated distinct words. – aditya Jul 12 '11 at 13:12
  • Another way is put the set on a database into a field type text and implement a fulltext search, will be more faster and the result can be more accurate. – GodFather Jul 12 '11 at 13:13

2 Answers2

1

I think you can solve this problem by using "Levenshtein distance" metric.

From wikipedia;

In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences.

Plus, PHP has a levenshtein() method. Use your keyword list as array & searchable string as input and iterate over your array and use levenshtein() in each iteration for matching.

edigu
  • 9,878
  • 5
  • 57
  • 80
0

As of PHP 5.5, PHP's strtr uses the Wu-Manbers algorithm for multi-pattern matching. See commit ccf15cf2 in the PHP git repository for details about the implementation. It is quite efficient, in my experience.

A pure-PHP implementation of the Aho-Corasick algorithm is available here: https://packagist.org/packages/wikimedia/aho-corasick

Ori
  • 4,132
  • 1
  • 25
  • 27