4

I have a large set of Words and Phrases (a lexicon or dictionary) which includes wildcards. I need to find all instances of those Words and Phrases within a much smaller String (~150 characters at the moment).

Initially, I wanted to run the operation in reverse; that is, to check whether each word in my smaller string exists within the Lexicon, which could be implemented as a Hash Table. The problem is that some of the values in my Lexicon are not single words, and many are wildcards (e.g. substri*).

I'm thinking of using the Rabin-Karp algorithm but I'm not sure this is the best choice.

What's an efficient algorithm or method to perform this operation?

Sample Data:

The dictionary contains hundreds of words and can potentially expand. These words may end with wildcard characters (asterisks). Here are some random examples:

  • good
  • bad
  • freed*
  • careless*
  • great loss

The text we are analyzing (at this point) consists of short, informal (grammar-wise) English statements. The prime example of text (again, at this point in time) would be a Twitter Tweet. These are limited to roughly 140 characters. For example:

Just got the Google nexus without a contract. Hands down its the best phone 
I've ever had and the only thing that could've followed my N900.

While it may be helpful to note that we are performing very simple Sentiment Analysis on this text, the Sentiment Analysis technique itself is not my concern. I'm simply migrating an existing solution to a "real-time" processing system and need to perform some optimizations.

Kurtis
  • Rabin-Karp is pretty easy to use (has good implementations in many languages) and is pretty fast, I think it's a fair choice. If you're trying to do full-text search you might want to look into engines that do that like Lucene. Good luck! – Benjamin Gruenbaum Mar 28 '13 at 19:29
  • Also related: http://stackoverflow.com/questions/3183582/what-is-the-fastest-substring-search-algorithm?rq=1 – Benjamin Gruenbaum Mar 28 '13 at 19:30
  • I'll give Rabin-Karp a try first and see how it performs. I'll also read more into that long, but informative Stackoverflow post you linked. Thanks – Kurtis Mar 28 '13 at 19:33
  • No problem, you should wait for an answer though, I'm sure someone with more experience than me will have more insight on what fits best in _your_ case – Benjamin Gruenbaum Mar 28 '13 at 19:34
  • @Kurtis can you modify your question, add what patterns you need .. and for every pattern, expected outputs .. Abbreviation of words, Abbreviation of phrases, short forms of words, key-based guess, consent-based guess, prefix match // if your dictionary is pattern-based, then your options are limited – Khaled.K Apr 06 '13 at 14:02
  • Khaled, I'm not quite sure I understand what you're looking for. I can modify my question accordingly but I'd like to make sure I understand your needs first. So, let's say for example we had a Lexicon that included the set of Strings [good, bad, grat*] and a String "It was a good day. Too bad things are not going as planned. I'm still grateful, though.". I'd like to come up with the output of: [good: 3, bad: 9, grat*: 15]. I didn't emphasize location matching, which might be helpful, overall, but I don't want to sway away from the original question too much. Let me know if that helps at all. – Kurtis Apr 07 '13 at 00:48
  • I'm going to award the bounty in 2 days. Unfortunately, I have only had time to evaluate one possible solution and so far it has worked out really well for me. Hopefully this will give people time to vote, add answers, or otherwise offer their opinion. Thanks everybody for helping me out. – Kurtis Apr 09 '13 at 17:35

7 Answers

6

I think this is an excellent use case for the Aho-Corasick string-matching algorithm, which is specifically designed to find all matches of a large set of strings in a single string. It runs in two phases - a first phase in which a matching automaton is created (which can be done in advance and requires only linear time), and a second phase in which the automaton is used to find all matches (which requires only linear time, plus time proportional to the total number of matches). The algorithm can also be adapted to support wildcard searching.
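
For concreteness, here is a minimal, trie-based Aho-Corasick sketch in Java (chosen because the asker mentions running on a JVM elsewhere in this thread). It is an illustration, not a library: the class and method names are made up, and trailing-asterisk entries such as freed* are handled simply by indexing the stem, since a trailing wildcard is just a prefix match.

import java.util.*;

// Minimal Aho-Corasick sketch: build the automaton once from the lexicon,
// then scan each tweet in time linear in its length plus the number of hits.
public class AhoCorasick {
    private static class Node {
        Map<Character, Node> next = new HashMap<>();
        Node fail;                                // failure link
        List<String> output = new ArrayList<>();  // lexicon entries ending here
    }

    private final Node root = new Node();

    public AhoCorasick(Collection<String> lexicon) {
        for (String entry : lexicon) {
            String stem = entry.endsWith("*")
                    ? entry.substring(0, entry.length() - 1) : entry;
            Node n = root;
            for (char c : stem.toLowerCase().toCharArray()) {
                n = n.next.computeIfAbsent(c, k -> new Node());
            }
            n.output.add(entry);                  // remember the original entry
        }
        // Breadth-first pass to set failure links and merge outputs.
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node cur = queue.remove();
            for (Map.Entry<Character, Node> e : cur.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = cur.fail;
                while (f != null && !f.next.containsKey(c)) f = f.fail;
                child.fail = (f == null) ? root : f.next.get(c);
                child.output.addAll(child.fail.output);
                queue.add(child);
            }
        }
    }

    // Returns lexicon entry -> start offsets of its matches in the text.
    public Map<String, List<Integer>> findAll(String text) {
        Map<String, List<Integer>> hits = new HashMap<>();
        String s = text.toLowerCase();
        Node state = root;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            while (state != root && !state.next.containsKey(c)) state = state.fail;
            state = state.next.getOrDefault(c, root);
            for (String entry : state.output) {
                int stemLen = entry.endsWith("*") ? entry.length() - 1 : entry.length();
                hits.computeIfAbsent(entry, k -> new ArrayList<>()).add(i - stemLen + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        AhoCorasick ac = new AhoCorasick(
                Arrays.asList("good", "bad", "freed*", "careless*", "great loss"));
        System.out.println(ac.findAll(
                "Good men can only lose freedom to bad people through carelessness."));
    }
}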

Hope this helps!

templatetypedef
  • I'm going to try this algorithm first as it has the most up-votes. I foresee that wildcards may add additional complexity (both computing-wise and development-wise), but I believe if I restrict my wildcard usage to only being allowed at the end of a string then it may very well keep things simple. I will report back with my results. Thanks for the suggestion! – Kurtis Apr 08 '13 at 16:05
  • I'm running out of time to evaluate all of the answers posted here. There are a lot of really great ideas. However, I tried out this method first and I believe it meets all of my requirements and is extremely efficient (once I fully understood how it works). I think I even have room for more improvement so I'm very happy with this suggestion. Thanks a lot! – Kurtis Apr 09 '13 at 17:23
  • I'm glad A-C is working for you, but want to point out that it is designed to find all matches including overlapping ones. You will actually have to "dumb down" the algorithm (defeat this overlapping string-finding capability by anchoring on word boundaries) to do unique token matching, which is what a scanner DFA is designed to do. – Gene Apr 11 '13 at 02:27
  • @Gene Very good point. I decided to go ahead and keep this algorithm as generic as possible since we are dealing with human generated text. I'll leave it up to whoever is generating the lexicon to decide how they'd like to split the text. For example, if they want to introduce a phrase such as "good" but it should be surrounded by spaces then they'll simply have to insert it as " good ". Probably an ad-hoc and even inefficient method *but* it's something I can get working immediately without too much difficulty. I'm still going to look into the scanner you noted just as soon as I can. Thanks! – Kurtis Apr 11 '13 at 15:09
  • @Gene Also, I'll do some pre-processing on the text input to remove punctuation not exactly used for this purpose. For example: periods, semi-colons, etc... Of course this will all be done outside of my class which implements A-C. Hopefully this will yield good results – Kurtis Apr 11 '13 at 15:11
3

One answer I wanted to throw out there is the Boyer-Moore search algorithm. It is the algorithm that grep uses, and grep is probably one of the fastest search tools available. In addition, you could use something like GNU Parallel to run grep in parallel, really speeding up the search.

In addition, here is a good article that you might find interesting.
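
If you want to try the idea without shelling out to grep, a small sketch of the Boyer-Moore-Horspool variant (a simplified form of Boyer-Moore) in Java might look like the following; the tweet is scanned once per lexicon entry, the search is case-sensitive, and the names are purely illustrative.

import java.util.*;

// Boyer-Moore-Horspool sketch: one bad-character shift table per pattern.
public class Horspool {
    // Returns all start offsets of pattern in text (case-sensitive).
    static List<Integer> search(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        List<Integer> hits = new ArrayList<>();
        if (m == 0 || n < m) return hits;

        // How far we may shift when the text character aligned with the
        // last pattern position is a given character.
        int[] shift = new int[Character.MAX_VALUE + 1];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) shift[pattern.charAt(i)] = m - 1 - i;

        int i = 0;
        while (i <= n - m) {
            int j = m - 1;
            while (j >= 0 && text.charAt(i + j) == pattern.charAt(j)) j--;
            if (j < 0) hits.add(i);
            i += shift[text.charAt(i + m - 1)];
        }
        return hits;
    }

    public static void main(String[] args) {
        String tweet = "Just got the Google nexus without a contract.";
        for (String entry : Arrays.asList("good", "bad", "nexus", "contract")) {
            System.out.println(entry + " -> " + search(tweet, entry));
        }
    }
}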

Icemanind
  • I've read the wikipedia entry and the fact that Grep (which, in my opinion, is extremely fast) takes advantage of this algorithm is a huge positive in my mind. Unfortunately, I don't have time to try and evaluate it before the bounty is up. However, you've definitely earned a +1 for including a valid algorithm, a link to the algorithm, and the suggestion (which, for me, is unusable but still appreciated) on Gnu Parallel. I also enjoyed reading the article and seeing some evaluation performed by someone else. Thank you very much. +1 – Kurtis Apr 09 '13 at 17:31
2

You might still use your original idea of checking every word in the text against the dictionary. However, in order to run it efficiently you need to index the dictionary to make lookups really fast. A trick used in information retrieval systems is to store a so-called permuterm index (http://nlp.stanford.edu/IR-book/html/htmledition/permuterm-indexes-1.html).

Basically, what you want to do is store in a dictionary every rotation of each word, terminated with a $ (e.g. for house):

house$
ouse$h
use$ho
...
e$hous

This index can then be used to quickly check against wildcard queries. If for example you have ho*e you can look in the permuterm index for a term beginning with e$ho and you would quickly find a match with house.

The search itself is usually done with some logarithmic search strategy (binary search or B-trees) and is thus usually very fast.
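
As a rough illustration (assuming at most one asterisk per query, as in the lexicon from the question), a tiny permuterm index in Java could look like this, using a sorted map for the prefix lookup; the class is hypothetical, not a library.

import java.util.*;

// Permuterm-index sketch: every rotation of word + "$" is stored in a sorted
// map, so a single-asterisk wildcard query becomes one prefix lookup.
public class PermutermIndex {
    private final TreeMap<String, String> rotations = new TreeMap<>();

    public void add(String word) {
        String w = word + "$";
        for (int i = 0; i < w.length(); i++) {
            // rotation starting at position i, mapped back to the original word
            rotations.put(w.substring(i) + w.substring(0, i), word);
        }
    }

    // Supports queries containing exactly one '*', e.g. "ho*e", "freed*", "*ing".
    public Set<String> query(String wildcard) {
        int star = wildcard.indexOf('*');
        String q = wildcard + "$";
        // Rotate so the '*' ends up at the end: "ho*e$" -> "e$ho*", prefix "e$ho".
        String prefix = q.substring(star + 1) + q.substring(0, star);
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, String> e : rotations.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) break;
            result.add(e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        PermutermIndex idx = new PermutermIndex();
        for (String w : Arrays.asList("house", "horse", "hose", "freedom")) idx.add(w);
        System.out.println(idx.query("ho*e"));   // [horse, hose, house]
        System.out.println(idx.query("freed*")); // [freedom]
    }
}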

decden
  • Unfortunately, I don't have time to evaluate this method. From my initial reading though, I think it's a viable option and will definitely look into it in the future. Thanks a lot! +1 – Kurtis Apr 09 '13 at 17:24
2

So long as the patterns are for complete words (you don't want "or" to match inside "storage"; spaces and punctuation are match anchors), a simple way to go is to translate your lexicon into scanner-generator input (for example, you could use flex), generate a scanner, and then run it over your input.

Scanner generators are designed to identify occurrences of tokens in input, where each token type is described by a regular expression. Flex and similar programs create scanners rapidly. Flex handles up to 8k rules (in your case lexicon entries) by default, and this can be expanded. The generated scanners run in linear time and in practice are very fast.

Internally, the token regular expressions are transformed in a standard "Kleene's theorem" pipeline: first to an NFA, then to a DFA. The DFA is then transformed to its unique minimum form. This is encoded in an HLL table, which is emitted inside a wrapper that implements the scanner by referring to the table. This is what flex does, but other strategies are possible. For example, the DFA can be translated to goto code where the DFA state is represented implicitly by the instruction pointer as the code runs.

The reason for the spaces-as-anchors caveat is that scanners created by programs like flex are normally incapable of identifying overlapping matches: "strangers" could not match both "strangers" and "range", for example.

Here is a flex scanner that matches the example lexicon you gave:

%option noyywrap
%%
"good"                    { return 1; }
"bad"                     { return 2; }
"freed"[[:alpha:]]*       { return 3; }
"careless"[[:alpha:]]*    { return 4; }
"great"[[:space:]]+"loss" { return 5; }
.                         { /* match anything else except newline */ }
"\n"                      { /* match newline */ }
<<EOF>>                   { return -1; }
%%
int main(int argc, char *argv[])
{
  yyin = argc > 1 ? fopen(argv[1], "r") : stdin;
  for (;;) {
    int found = yylex();
    if (found < 0) return 0;
    printf("matched pattern %d with '%s'\n", found, yytext);
  }
}

And to run it:

$ flex -i foo.l
$ gcc lex.yy.c
$ ./a.out
Good men can only lose freedom to bad
matched pattern 1 with 'Good'
matched pattern 3 with 'freedom'
matched pattern 2 with 'bad'
through carelessness or apathy.
matched pattern 4 with 'carelessness'
Gene
  • While I can't exactly incorporate flex into my system (it runs on JVM and uses a framework), you have given a great answer and I have learned a lot! You've actually provided enough information that I might be able to develop my own implementation given the time. Further, using spaces in my situation is an important requirement that some algorithms do not explicitly address. I appreciate the very detailed answer. +1 – Kurtis Apr 09 '13 at 17:28
  • I should have said that there are many scanner generators out there. Flex is just the C example. For the JVM there is also http://www.cs.princeton.edu/~appel/modern/java/JLex/. – Gene Apr 10 '13 at 01:25
1

This doesn't answer the algorithm question exactly, but check out the re2 library. There are good interfaces in Python, Ruby and assorted other programming languages. In my experience, it's blindingly fast, and removed very similar bottlenecks in my code with little fuss and virtually no extra code on my part.

The only complexity comes with overlapping patterns. If you want the patterns to start at word boundaries, you should be able to partition the dictionary into a set of regular expressions r_1, r_2, ..., r_k of the form \b(foobar|baz baz\S*|...) where each group in r_{i+1} has a prefix in r_i. You can then short-circuit evaluation, since if r_{i+1} matches then r_i must have matched.
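
For reference, here is a sketch of the basic single-alternation construction (without the r_1 ... r_k short-circuit refinement) in Java, using java.util.regex since the asker is on a JVM; the same construction should carry over to an re2 binding. The class name, the \b anchoring and the translation of a trailing * to \w* are my own choices, not anything a library prescribes.

import java.util.*;
import java.util.regex.*;

// Compile the whole lexicon into one word-boundary-anchored alternation.
public class RegexLexicon {
    private final Pattern pattern;

    public RegexLexicon(Collection<String> lexicon) {
        StringJoiner alternation = new StringJoiner("|", "\\b(?:", ")");
        for (String entry : lexicon) {
            if (entry.endsWith("*")) {
                // "freed*" -> \Qfreed\E\w*
                alternation.add(Pattern.quote(entry.substring(0, entry.length() - 1)) + "\\w*");
            } else {
                alternation.add(Pattern.quote(entry));
            }
        }
        pattern = Pattern.compile(alternation.toString(), Pattern.CASE_INSENSITIVE);
    }

    // Returns offset -> matched text for every hit in the input.
    public Map<Integer, String> findAll(String text) {
        Map<Integer, String> hits = new LinkedHashMap<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) hits.put(m.start(), m.group());
        return hits;
    }

    public static void main(String[] args) {
        RegexLexicon lex = new RegexLexicon(
                Arrays.asList("good", "bad", "freed*", "careless*", "great loss"));
        System.out.println(lex.findAll(
                "Good men can only lose freedom to bad people through carelessness."));
    }
}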

Unless you're implementing your algorithm in highly-optimized C, I'd bet that this approach will be faster than any of the (algorithmically superior) answers elsewhere in this thread.

Lucas Wiman
  • "I'd bet that this approach will be faster than any of the (algorithmically superior) answers elsewhere in this thread." You could very well be right :) Unfortunately, not only did I not have time to really evaluate this but I am also stuck on a JVM so it would add further complication to the system that I'm trying to avoid. Not that you didn't answer the original question :) At first, I was a little worried of Regex because there could be quite a few search terms. You included extra information which is very informative. I'll be referencing this in the future as I prefer Python. Thanks! +1 – Kurtis Apr 09 '13 at 17:34
0

Let me get this straight. You have a large set of queries and one small string and you want to find instances of all those queries in that string.

In that case I suggest you index that small document like crazy so that your search time is as short as possible. Hell. With that document size I would even consider doing small mutations (to match wildcards and so on) and would index them as well.
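
A sketch of what that could look like in Java, under the assumptions that wildcards only appear as a trailing asterisk and that phrases are at most two words: every token, every token prefix (the small "mutations"), and every adjacent-token pair of the tweet goes into a hash map, so each lexicon entry costs one lookup. All names are illustrative.

import java.util.*;

// Index one short document (a tweet) so lexicon lookups are O(1) each.
public class TweetIndex {
    private final Map<String, Integer> index = new HashMap<>();

    public TweetIndex(String tweet) {
        String[] tokens = tweet.toLowerCase().split("[^a-z0-9']+");
        for (int i = 0; i < tokens.length; i++) {
            String t = tokens[i];
            if (t.isEmpty()) continue;
            index.putIfAbsent(t, i);                           // whole word
            for (int len = 1; len <= t.length(); len++) {      // prefixes, for "pre*" entries
                index.putIfAbsent(t.substring(0, len) + "*", i);
            }
            if (i + 1 < tokens.length && !tokens[i + 1].isEmpty()) {
                index.putIfAbsent(t + " " + tokens[i + 1], i); // two-word phrases
            }
        }
    }

    // Returns the token position of the first hit, or -1 if the entry is absent.
    public int lookup(String lexiconEntry) {
        return index.getOrDefault(lexiconEntry.toLowerCase(), -1);
    }

    public static void main(String[] args) {
        TweetIndex idx = new TweetIndex(
                "Just got the Google nexus without a contract. Hands down its the best phone");
        System.out.println(idx.lookup("nexus"));      // 4
        System.out.println(idx.lookup("hands down")); // 8
        System.out.println(idx.lookup("contr*"));     // 7
        System.out.println(idx.lookup("freed*"));     // -1
    }
}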

ElKamina
  • You've got it for the most part. The only specification I would change is that I am "querying" multiple small strings but only one at a time. Maybe that makes a difference in my approach? I thought about breaking the small strings into various sub-strings and even generating some permutations of various sub-strings (1 word, 2 word, etc...). Basically along the lines of what you would consider doing. I'm just not sure on whether or not that's a good idea and what an efficient method of performing that operation would be. I'd be happy to hear more details! – Kurtis Mar 28 '13 at 19:58
  • @Kurtis '....but only one at a time.' Why should that make a difference? You have n queries and you want to minimize total processing time for all those queries. Whatever indexing you are building can be reused for each of them. Regarding the second point, I think it is a good idea. – ElKamina Mar 28 '13 at 20:09
  • @elkamina The example given was Tweets. You have to assume there are massively more tweets (possibly billions of times as many) than patterns in the dictionary, so precomputations on the dictionary will yield much bigger benefits than precomputations on an individual tweet. (Though both might be fruitful.) – Lucas Wiman Apr 05 '13 at 07:18
-1

I had a very similar task. Here's how I solved it; the performance is unbelievable: http://www.foibg.com/ijita/vol17/ijita17-2-p03.pdf

  • Perhaps you can add some details from the link to your answer? Link-only answers are discouraged because they become useless if the link dies in the future. – PeterJ Mar 18 '15 at 08:25