
I'm looking at finding very short substrings (pattern, needle) in many short lines of text (haystack). However, I'm not quite sure which method to use outside the naive, brute force method.

Background: I'm doing a side project for fun where I receive text messaging chat logs of multiple users (anywhere from 2000-15000 lines of text and 2-50 users), and I want to find all the various pattern matches in the chat logs based on predetermined words that I've come up with. So far I have about 1600 patterns that I'm looking for, but I may look for more.

So for example, I want to find the number of food-related words used in an average text message log, such as "hamburger", "pizza", "coke", "lunch", "dinner", "restaurant", "McDonalds". While I gave English examples, I will actually be using Korean for my program. Each of these designated words has its own score, which I store in a hashmap as a key-value pair (word as key, score as value). I then show the top scorers for food-related words, as well as the most frequent food words used by those users.

My current method is to split each line of text on whitespace, and to process each individual word from the haystack using the contains method (which relies on indexOf and the naive substring search algorithm) to check whether the word contains the pattern:

wordFromInput.contains(wordFromPattern);

To give an example, with 17 users in chat, 13000 lines of text, and the 1600 patterns, I've found that this whole program took 12-13 seconds with this method. And on the Android app that I'm developing, it took 2 minutes and 30 seconds to process, which is far too slow.

Originally, I tried to use a hash map and merely get the pattern instead of searching for it in an ArrayList, but I then realized that isn't possible with a hash table for what I am trying to do with substrings.

I've looked around Stack Overflow and found a lot of helpful and related questions, such as these two:

1 and 2. I'm now somewhat more familiar with the various string-matching algorithms (Boyer-Moore, KMP, etc.).

I initially thought that the naive method would of course be the worst type of algorithm for my case, but having found this question, I've realized that the naive method might actually be more effective for my case (short patterns, short text). But I wanted to know if there was something I was neglecting completely.

Here is a snippet of my code though if anyone wants to see my issue more concretely.

While I removed large parts of the code to simplify it, the primary method that I use to actually match substrings is there in the method matchWords().
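In rough outline, the matching logic I described looks like this (simplified, with illustrative names; the real matchWords() has more loops than this):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified reconstruction of the approach described above;
// not the actual matchWords() from the omitted snippet.
static Map<String, Integer> matchWords(List<String> chatLines,
                                       Map<String, Integer> patternScores) {
    Map<String, Integer> tallies = new HashMap<>();
    for (String line : chatLines) {
        for (String word : line.split("\\s+")) {              // split on whitespace
            for (Map.Entry<String, Integer> p : patternScores.entrySet()) {
                if (word.contains(p.getKey())) {              // naive substring search
                    Integer prev = tallies.get(p.getKey());
                    tallies.put(p.getKey(), (prev == null ? 0 : prev) + p.getValue());
                }
            }
        }
    }
    return tallies;
}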

I know that's really ugly and bad code (5 for loops...), so if there are any suggestions for that, I'm happy to hear it as well.

So to clean it up:

  • haystack: lines of text from chat logs (2,000-10,000+)
  • needles: 1,600+ patterns
  • mostly Korean characters, although some English is included
  • the brute-force naive method is simply too slow, but I'm debating whether there are other alternatives and, even if there are, whether they are practical given the short patterns and texts

I just want some input on my thought process, and possibly some general advice. But additionally, I would like some specific suggestion for a particular algorithm or method if that is possible.

  • How does `java.util.regex` fit in? – aliteralmind Mar 01 '14 at 21:49
  • Sorry, because I deleted a lot of parts outside matchWords(), there are a lot of things that might look a bit confusing. But I used a regular expression in the beginning to eliminate whitespace in sentences so I can process individual words. EDIT: Or are you suggesting that I use java.util.regex? – Nopiforyou Mar 01 '14 at 21:52
  • As I understand it, this question is arguably a better fit for Programmers, in the sense that it is a white-board discussion (versus "why does this code not work?"). – Michael Easter Mar 01 '14 at 21:57
  • I'm wondering if regex is a possible solution, but I don't quite have my mind around your problem yet. I didn't want to get too far into thinking about it if it was excluded to begin with. – aliteralmind Mar 01 '14 at 22:03
  • Have you considered using a multi-pattern matching algorithm like http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm or building a generalized suffix tree of your pattern set? – Niklas B. Mar 01 '14 at 22:16
  • So let me reiterate. You have a set of patterns (a considerable amount of them, say `p`) that are of similar size (say `m`) and don't change often. You want to optimize the time to find occurrences of any of those patterns in a given string of size `n`. Is that about correct? If yes, http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm is exactly what you need. Basically you need an augmented DFA that remembers in its end states which patterns the end state corresponds to. With `O(p*m)` preprocessing you get `O(n)` search time per haystack string, which is optimal. – Niklas B. Mar 01 '14 at 22:28
  • @NiklasB. Yes, your assumptions are correct. I am looking intently at the Aho-Corasick algorithm, and I think I'll give it a shot. It does look like it fits a lot of my needs. My one concern is: would there be any severe disadvantages to using this with Korean? In Korean, the individual characters might be more complex than those in English. 박근혜 is three characters. The first two characters each have one vowel and two consonants, and the last has one vowel and one consonant. Or would a few more internal nodes in the FSM not matter? – Nopiforyou Mar 01 '14 at 22:37
  • @Nopiforyou: Well you can always use hash tables or binary search trees instead of lookup tables for the transitions, although that might reduce the performance a bit. That said, you only have at most `n*p` nodes in any case, the problem is more that of labeling the edges and finding them when given a character – Niklas B. Mar 01 '14 at 22:41
  • By the way, if you can tokenize your text so that the substrings you are looking for actually appear as tokens, that is obviously preferable. – Niklas B. Mar 01 '14 at 23:03
  • @NiklasB. So you've mentioned the idea of tokenizing my text with substrings a few times now. I just wanted to clarify about this. If I used String.split with any of the substrings in the line of text, along with the whitespace that I'm already splitting with, wouldn't that be nearly the same usage of regex as zmbq was suggesting? Would String.split scale well with so many different tokens (substrings) to look for? – Nopiforyou Mar 02 '14 at 04:22
  • @Nopiforyou: That's not what I meant. Say your text is "bli bla blub-trala". After tokenizing you would get {"bli", "bla", "blub", "trala"} and you can look those tokens up *literally* without any string matching. I don't see why it's not possible in your case to canonically split the text into words that you can look up in the dictionary. Do you have word combinations in there or what is the problem? – Niklas B. Mar 02 '14 at 04:29
  • @NiklasB. Sorry, but I still don't think I understand you clearly. Could you show just a couple of lines of code on pastebin and link it if possible? At the moment, I split by whitespace to get individual words from the String (line of text): String[] words = s.split("\\s+"); And then I use contains on each word with the pattern words that I'm looking for. But it seems like you're saying I can skip this last step and "look those tokens up literally without any string matching." How would I do that, though, unless it's split with each and every word from the bank of words I'm looking for? – Nopiforyou Mar 02 '14 at 04:45
  • Niklas is likely referring to the [StringTokenizer class](http://docs.oracle.com/javase/7/docs/api/java/util/StringTokenizer.html) in Java. – Nuclearman Mar 02 '14 at 05:36
  • @Nopiforyou: What I don't understand is why you are not *literally* matching the split words. In what cases can it happen that a pattern is a *substring* of a word, but not the word itself? I mean, for example, if your pattern is "banana" and your word is "asdasdbanana", why are you interested in that? Why don't you just split the string in such a way that you can check all the words for *equality* with any of the dictionary words, rather than matching proper substrings? – Niklas B. Mar 02 '14 at 06:27
  • In other words, why can't you just do [something like this: http://pastie.org/8816329](http://pastie.org/8816329) – Niklas B. Mar 02 '14 at 06:36
  • Oh, okay it's explained in your other answer. You should have included all the details here as well because the top-voted answer assumes that such a splitting is possible. – Niklas B. Mar 02 '14 at 06:38

7 Answers


You can replace the hashtable with a Trie.

Split the line of text into words using white space to separate words. Then check if the word is in the Trie. If it is in the Trie, update a counter associated with the word. Ideally, the counter would be integrated into the Trie.

This approach is O(C), where C is the number of characters in the text. It's highly unlikely that you can avoid checking each character at least once, so this approach should be as good as you can get, at least in terms of big O.

However, it sounds like you may not want to list all of the possible words you are searching for. Therefore, you could simply build a counting Trie from all of the words. If nothing else, that'll probably make it easier for any pattern matching algorithm you use, although it might require some modifications to the Trie.
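For illustration, a minimal counting Trie might look like this (an illustrative sketch; it assumes the text has already been split into words):

import java.util.HashMap;
import java.util.Map;

// Minimal counting trie: insert each pattern word once, then look up each
// whitespace-separated word and bump its counter on an exact match.
class CountingTrie {
    private final Map<Character, CountingTrie> children = new HashMap<>();
    private boolean isWord;
    private int count;

    void insert(String word) {
        CountingTrie node = this;
        for (char c : word.toCharArray()) {
            CountingTrie child = node.children.get(c);
            if (child == null) {
                child = new CountingTrie();
                node.children.put(c, child);
            }
            node = child;
        }
        node.isWord = true;
    }

    // Walks the trie; increments the counter only on an exact word match.
    boolean countIfPresent(String word) {
        CountingTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        if (node.isWord) {
            node.count++;
        }
        return node.isWord;
    }
}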

– Nuclearman
  • Well hash tables have the same complexity as a trie, at least with high probability (if implemented correctly). I think the problem is that OP is *not* currently tokenizing his string, for whatever reason. – Niklas B. Mar 02 '14 at 02:01
  • A fair point about the complexity. Although, I was thinking more along the lines of the possibility of faster rejection with a Trie, but the need to check each character makes it something of a minor improvement. Certainly though, not tokenizing is going to be rather problematic for the complexity. – Nuclearman Mar 02 '14 at 02:22
  • @Nuclear: Since you're already at it, I wonder why you didn't suggest a generalized suffix tree / Aho-Corasick automaton of the pattern strings, which *also* lets you solve queries in `O(n)` but with a true substring search (no tokenization). You can also compute aggregations over the matches, like match count or a total "score" associated with the matches (as OP does in his `O(p*n)` implementation). – Niklas B. Mar 02 '14 at 02:36
  • I wasn't familiar with the algorithm, and didn't care to take the time to extend the algorithm into something that would be pretty much identical at the time I was writing it since it wasn't clear it would be needed and may have had a significant space cost depending on how it was implemented. However, looking at it a bit more closely (shortly before you posted), it does seem ideal for the purpose. – Nuclearman Mar 02 '14 at 02:41

What you're describing sounds like an excellent use case for the Aho-Corasick string-matching algorithm. This algorithm finds all matches of a set of pattern strings inside of a source string and does so in linear time (plus the time to report the matches). If you have a fixed set of strings to search for, you can do linear preprocessing work up front on the patterns to search for all matches very quickly.

There's a Java implementation of Aho-Corasick available here. I haven't tried it out, but it might be a good match.
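To give a feel for the mechanics, here is a compact, illustrative sketch of an Aho-Corasick automaton in plain Java (this is not the linked implementation, just a minimal version of the idea):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Aho-Corasick sketch: a trie over the patterns plus BFS "failure" links,
// so a single pass over the haystack reports every pattern occurrence.
class AhoCorasick {
    private static class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>(); // patterns ending here
        Node fail;
    }

    private final Node root = new Node();

    AhoCorasick(Collection<String> patterns) {
        // 1. Build the trie of all patterns.
        for (String p : patterns) {
            Node node = root;
            for (char c : p.toCharArray()) {
                Node child = node.next.get(c);
                if (child == null) {
                    child = new Node();
                    node.next.put(c, child);
                }
                node = child;
            }
            node.outputs.add(p);
        }
        // 2. Add failure links breadth-first.
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node child = e.getValue();
                Node f = node.fail;
                while (f != null && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                child.fail = (f == null) ? root : f.next.get(e.getKey());
                child.outputs.addAll(child.fail.outputs); // inherit suffix matches
                queue.add(child);
            }
        }
    }

    // One linear pass; every returned element is a pattern that occurred.
    List<String> matches(String text) {
        List<String> found = new ArrayList<>();
        Node node = root;
        for (char c : text.toCharArray()) {
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            Node next = node.next.get(c);
            node = (next == null) ? root : next;
            found.addAll(node.outputs);
        }
        return found;
    }
}

You would build the automaton once from the 1600 patterns and then call matches(line) on every chat line.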

Hope this helps!

– templatetypedef

I'm pretty sure String.contains is already highly optimized, so replacing it with something else is not going to do you a lot of good.

So the way to go, I suspect, is not to look for each and every bank-word in your chat words, but rather do multiple comparisons at once.

The first way to do it would be to create one huge regular expression that will match all your bank-words. Compile it and hope the regular expression package is efficient enough (chances are - it is). You will have a rather lengthy setup stage (the regex compilation), but matches should be a lot faster.
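For example (a sketch; Pattern.quote escapes any regex metacharacters in the bank words):

import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Build one big alternation of all bank words, compile it once, reuse it.
static Pattern compileBank(List<String> bankWords) {
    StringBuilder sb = new StringBuilder();
    for (String w : bankWords) {
        if (sb.length() > 0) {
            sb.append('|');
        }
        sb.append(Pattern.quote(w)); // escape regex metacharacters
    }
    return Pattern.compile(sb.toString());
}

// Count every bank-word occurrence in a line.
static int countMatches(Pattern bank, String line) {
    Matcher m = bank.matcher(line);
    int count = 0;
    while (m.find()) {
        count++;
    }
    return count;
}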

– zmbq
  • I have been pretty afraid of using regex simply because I'm not well versed in it. Additionally, I'm a bit concerned that the performance might not improve all that significantly. – Nopiforyou Mar 01 '14 at 22:05
  • Well, the regex is pretty simple - if your bank consists of 'red', 'blue' and 'apple', your regex should be `red|blue|apple`. – zmbq Mar 01 '14 at 22:07
  • 1
    Oh, and you won't know about performance until you try it. So give it a try. – zmbq Mar 01 '14 at 22:08
  • @zmbq: It's optimized for its use case, but not for the OP's use case, where you want to find multiple patterns in the string – Niklas B. Mar 01 '14 at 22:14
  • Once you find a pattern, you can look again in the rest of the string - maybe even using the old slower way, as patterns will generally *not* be found. – zmbq Mar 01 '14 at 22:15
  • @zmbq: What I'm saying is, say you have `p` patterns of length `m`. There are algorithms that require a one-time preprocessing of `O(p*m)` and then find all the occurrences of any of the patterns in `O(n)` for any given input of size `n`. Using a "dumb" string matching algorithm (even a good one), you will need at least `O(p*n)` per haystack to achieve the same. – Niklas B. Mar 01 '14 at 22:20
  • @zmbq: And as you can see from the question, OP already measured the matching speed and considers it too slow. – Niklas B. Mar 01 '14 at 22:21
  • @Niklas, that is exactly what I'm suggesting - creating a regular expression that will match *all* the patterns together, not one by one. Once that is done, matching it will take O(N) instead of O(P*N). – zmbq Mar 01 '14 at 22:23
  • @zmbq: Hm, I only saw your suggestion about `String.contains`. I doubt that the regex engine in Java will actually create a DFA for this particular regex, at least from my experience it will be more like `O(p*n)` – Niklas B. Mar 01 '14 at 22:24

You can build an index of the words you need to match and count them as you process them. If you can use a HashMap to look up the patterns for each word, the cost will be O(n * m).

You can use a HashMap for all the possible words, and you can then dissect the words later.

e.g. say you need to match red and apple, you can combine the sum of

redapple = 1
applered = 0
red = 10
apple = 15

This means that red is actually 11 (10 + 1), and apple is 16 (15 + 1)
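A sketch of that idea (with illustrative names): count each unique word once, then dissect the counts per pattern.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// First pass: count every word in the haystack; the map holds one entry
// per unique word, so later pattern matching only touches unique words.
static Map<String, Integer> countWords(List<String> lines) {
    Map<String, Integer> counts = new HashMap<>();
    for (String line : lines) {
        for (String word : line.split("\\s+")) {
            Integer prev = counts.get(word);
            counts.put(word, (prev == null ? 0 : prev) + 1);
        }
    }
    return counts;
}

// Second pass: a pattern's total is the sum over the unique words that
// contain it, so "red" picks up "redapple" and "applered" as well.
static int totalFor(String pattern, Map<String, Integer> counts) {
    int total = 0;
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
        if (e.getKey().contains(pattern)) {
            total += e.getValue();
        }
    }
    return total;
}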

– Peter Lawrey
  • I'm a bit confused. So the words that I need to match come from the text. So in this case, red and apple are from the user text (haystack) and aren't the pattern (needle). But wouldn't this be difficult to scale with all of the word combinations that I have to make from a sentence? I might be misunderstanding your answer. – Nopiforyou Mar 01 '14 at 22:10
  • @Nopiforyou you only need all the words you see, and then you pattern-match against those. You only need to consider the number of unique words, not all the words. – Peter Lawrey Mar 01 '14 at 22:19

I don't know Korean, so I imagine the same strategies used to tinker with strings in English aren't necessarily possible in Korean, but perhaps this strategy in pseudocode can be applied with your knowledge of Korean to make it work. (Java is of course still the same, but, for example, is it still highly likely in Korean for the letters "ough" to appear in succession? Are there even letters for "ough"?) With that being said, hopefully the principle can be applied.

I would use String.toCharArray to create a two-dimensional array (or an ArrayList if a variable size is needed). The idea in pseudocode:

if (first letter of word matches keyword's first letter)//we have a candidate
    skip to last letter of the current word //see comment below
    if(last letter of word matches keyword's last letter)//strong candidate
        iterate backwards to start+1 checking remainder of letters

The reason I suggest skipping to the last letter is that, statistically, a "consonant, vowel" start for the first two letters of a word is significantly common, especially for nouns, which will make up a lot of your keywords since any food is a noun (almost all the keyword examples you gave match that consonant-vowel structure). And since there are only 5 vowels (plus y), the second letter "i" in the keyword "pizza" is inherently likely to show up, yet after that point there is still a good chance that the word may turn out not to be a match.

However, if you know that the first letter and the last letter match, then you probably have a much stronger candidate and can then iterate in reverse. I think over larger sets of data this would eliminate candidates much faster than checking letters in order, which basically lets too many fake candidates past the second iteration, increasing your overall number of conditional operations. It might sound like something small, but in a project like this there's lots of reiterating, so micro-optimizations accumulate very quickly.

If this approach can be applied to a language that's probably structurally very different from English (I'm speaking from ignorance here, though), then I think it might provide some efficiency for you, whether you make it happen by iterating over a char array, with a scanner, or with any other construct.
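One possible Java rendering of the pseudocode above (a sketch; it assumes word and keyword are compared for full equality, so the lengths must match):

// Cheap checks at both ends of the word before paying for a full
// comparison, as described above.
static boolean matchesKeyword(char[] word, char[] keyword) {
    if (word.length != keyword.length) {
        return false;
    }
    if (word[0] != keyword[0]) {
        return false;                                // not even a candidate
    }
    if (word[word.length - 1] != keyword[keyword.length - 1]) {
        return false;                                // candidate, but not a strong one
    }
    for (int i = word.length - 2; i >= 1; i--) {     // iterate backwards
        if (word[i] != keyword[i]) {
            return false;
        }
    }
    return true;
}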

  • I think this optimization would be pretty limited by Korean, although it might work for English. A lot of the characters at the end, especially for verbs, make it hard to distinguish the word itself. – Nopiforyou Mar 01 '14 at 22:19

The trick is to realise that if you can describe the string you are searching for as a regular expression you can also, by definition, describe it with a state machine.

At every character in your message start a state machine for every one of your 1600 patterns and pass the character through it. This sounds scary, but believe me, most of them will terminate immediately anyway, so you aren't really doing a huge amount of work. Bear in mind that a state machine can usually be encoded with a simple switch/case or a ch == s.charAt at each step, so they are close to the ultimate in light-weight.

Obviously you know what to do whenever one of your search machines terminates at the end of their search. Any that terminate before full-match can be discarded immediately.

import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

private static class Matcher {
    private final int where;
    private final String s;
    private int i = 0;

    public Matcher ( String s, int where ) {
        this.s = s;
        this.where = where;
    }

    public boolean match(char ch) {
        return s.charAt(i++) == ch;
    }

    public int matched() {
        return i == s.length() ? where: -1;
    }
}

// Words I am looking for.
String[] watchFor = new String[] {"flies", "like", "arrow", "banana", "a"};
// Test string to search.
String test = "Time flies like an arrow, fruit flies like a banana";

public void test() {
    // Use a LinkedList because it is O(1) to remove anywhere.
    List<Matcher> matchers = new LinkedList<> ();
    int pos = 0;
    for ( char c : test.toCharArray()) {
        // Fire off all of the matchers at this point.
        for ( String s : watchFor ) {
            matchers.add(new Matcher(s, pos));
        }
        // Discard all matchers that fail here.
        for ( Iterator<Matcher> i = matchers.iterator(); i.hasNext(); ) {
            Matcher m = i.next();
            // Should it be removed?
            boolean remove = !m.match(c);
            if ( !remove ) {
                // Still matches! Is it complete?
                int matched = m.matched();
                if ( matched >= 0 ) {
                    // Todo - Should use getters.
                    System.out.println("    "+m.s +" found at "+m.where+" active matchers "+matchers.size());
                    // Complete!
                    remove = true;
                }
            }
            // Remove it where necessary.
            if ( remove ) {
                i.remove();
            }
        }
        // Step pos to keep track.
        pos += 1;
    }
}

prints

flies found at 5 active matchers 6
like found at 11 active matchers 6
a found at 16 active matchers 2
a found at 19 active matchers 2
arrow found at 19 active matchers 6
flies found at 32 active matchers 6
like found at 38 active matchers 6
a found at 43 active matchers 2
a found at 46 active matchers 3
a found at 48 active matchers 3
banana found at 45 active matchers 6
a found at 50 active matchers 2

There are several simple optimisations. With some simple pre-processing the most obvious is to use the current character to determine which matchers may be applicable.
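For example (a sketch), the watch list could be bucketed by first character up front, so a matcher is only started when the current character can actually begin a pattern:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Pre-processing: bucket the watch list by first character.
Map<Character, List<String>> byFirstChar = new HashMap<>();
for (String s : watchFor) {
    List<String> bucket = byFirstChar.get(s.charAt(0));
    if (bucket == null) {
        bucket = new ArrayList<>();
        byFirstChar.put(s.charAt(0), bucket);
    }
    bucket.add(s);
}

// In test(), replace the "fire off all of the matchers" loop with:
//     List<String> candidates = byFirstChar.get(c);
//     if (candidates != null) {
//         for (String s : candidates) {
//             matchers.add(new Matcher(s, pos));
//         }
//     }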

– OldCurmudgeon
  • More interesting is a DFA that actually matches all the patterns at once. That's exactly what the [Aho-Corasick](http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm) automaton does. – Niklas B. Mar 02 '14 at 02:03
  • Also your implementation has to manage up to `O(m)` matchers per pattern where `m` is the pattern size. So you basically do an `O(n*m)` string matching with a very high constant factor (lots of allocations), when `O(n)` matching is pretty easy to do (for example using `String.contains`). You'd need to build a failure function into the automaton so that you at least only need one matcher per pattern (this gives you the Knuth-Morris-Pratt algorithm if implemented well) – Niklas B. Mar 02 '14 at 02:12
  • @NiklasB. - You are correct - there are better algorithms. This implementation is not optimal, it is primarily for demonstration of the technique of presenting each character to a nest of matchers instead of asking each matcher to search the text in turn. – OldCurmudgeon Mar 02 '14 at 18:00

This is a pretty broad question, so I won't go into too much detail, but roughly:

Pre-process the haystacks using something like a broad lemmatizer to create "topic word only" versions of the messages by noting which topics all words in each message cover. For example, any occurrences of "hamburger", "pizza", "coke", "lunch", "dinner", "restaurant", or "McDonalds" would cause the "topic" word "food" to be collected for that message. Some words may have multiple topics, e.g. "McDonalds" may be in the topics "food" and "business". Most words won't have any topic.

After this process, you'll have haystacks consisting of only "topic" words. Then create a Map<String, Set<Integer>> and populate it with each topic word and the Set of chat message ids that contain it. This is a reverse index from topic word to the chat messages that contain it.
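Populating the index might look something like this (a sketch; wordTopics is the hypothetical word-to-topics mapping produced by the lemmatizing step):

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Build the reverse index from topic word to the ids of the messages
// containing it. wordTopics comes from the lemmatizing step, e.g.
// "McDonalds" -> {"food", "business"} (hypothetical data).
static Map<String, Set<Integer>> buildIndex(List<String> messages,
                                            Map<String, Set<String>> wordTopics) {
    Map<String, Set<Integer>> index = new HashMap<>();
    for (int id = 0; id < messages.size(); id++) {
        for (String word : messages.get(id).split("\\s+")) {
            Set<String> topics = wordTopics.get(word);
            if (topics == null) continue; // most words have no topic
            for (String topic : topics) {
                Set<Integer> hits = index.get(topic);
                if (hits == null) {
                    hits = new HashSet<Integer>();
                    index.put(topic, hits);
                }
                hits.add(id);
            }
        }
    }
    return index;
}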

The runtime code to find all documents that contain all n words is then trivial and super fast - near O(#terms):

private Map<String, Set<Integer>> index; // pre-populated

Set<Integer> search(String... topics) {
    Set<Integer> results = null;
    for (String topic : topics) {
        Set<Integer> hits = index.get(topic);
        if (hits == null)
            return Collections.emptySet();
        if (results == null)
            results = new HashSet<Integer>(hits);
        else
            results.retainAll(hits);
        if (results.isEmpty())
            return Collections.emptySet(); // exit early
    }
    return results;
}

This will perform near O(1), and tell you which messages share all search terms. If you just want the number, use the trivial size() of the returned Set.

– Bohemian