Efficient Regular Expression for big data, if a String contains a word

Question

I have a code that works but is extremely slow. This code determines whether a string contains a keyword. The requirements I have need to be efficient for hundreds of keywords that I will search for in thousands of documents.

What can I do to make finding the keywords (without falsely returning a word that contains the keyword) efficiently?

For example:

String keyword="ac"; 
String document"..."  //few page long file

If i use :

if(document.contains(keyword) ){
//do something
}

It will also return true if document contains a word like "account";

so I tried to use regular expression as follows:

String pattern = "(.*)([^A-Za-z]"+ keyword +"[^A-Za-z])(.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(document);
if(m.find()){
   //do something
}

Summary:

This is the summary: Hopefully it will be useful to some one else:

My regular expression would work but extremely impractical while working with big data. (it didn't terminate)
@anubhava perfected the regular expression. it was easy to understand and implement. It managed to terminate which is a big thing. but it was still a bit slow. (Roughly about 240 seconds)
@Tomalak solution is abit complex to implement and understand but it was the fastest solution. so hats off mate.(18 seconds)

so @Tomalak solution was ~15 times faster than @anubhava.

Catastrophic backtracking is the keyword you are looking for. Also *"if a string contains a word"* is checked with `indexOf()`, not with regular expressions. — Tomalak, Jul 10 '14 at 10:38
If you want to match "ac" but not "account", wouldn't your keyword be " ac "? — Kayaman, Jul 10 '14 at 10:38
@Kayaman yes but its possible to have something like: "ac;" which also be matched — nafas, Jul 10 '14 at 10:39
Well you could multi-thread / parallelise it, but essentially the only real way to do this is index the documents and search the index which will of course be slowish, but only needed once per document. — Tony Hopkinson, Jul 10 '14 at 10:39
@Tomalak I'm not quite sure what you mean by using "indexOf()". could you explain abit further please? — nafas, Jul 10 '14 at 10:40
@nafas http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#indexOf(java.lang.String) - in other words, can you explain why you use regex for this? — Tomalak, Jul 10 '14 at 10:42
@Tomalak well if I use indexOf, I would have the same problem with contains. as I said I if I search for "ac" and document contains "account", it would still matches. — nafas, Jul 10 '14 at 10:46
Wow, that *is* one improvement. If you really care, try my suggestion with the `Character` class. It would take some thought what amounts to a non-word character (for your input data), but I think you could squeeze out a bit more performance, too. — Tomalak, Jul 10 '14 at 12:25
@Tomalak I just used **{"," , " " , ";" , "\n" , "" , "\"" , "'" , " "}** as delimiters in a set. and improve the results by 4 seconds :D. so this time it took 14 seconds. — nafas, Jul 10 '14 at 12:55
Also see this Gist: https://gist.github.com/Tomalak/31c1c55c1c79430be5c7 — Tomalak, Jul 10 '14 at 12:59

anubhava · Answer 1 · 2014-07-10T10:48:46.303

4

Don't think you need to have .* in your regex.

Try this regex:

String pattern = "\\b"+ Pattern.quote(keyword) + "\\b";

Here \\b is used for word boundary. If the keyword can contain special characters, make sure they are not at the start or end of the word, or the word boundaries will fail to match.

Also you must be using Pattern.quote if your keyword contains special regex characters.

EDIT: You might use this regex if your keywords are separated by space.

String pattern = "(?<=\\s|^)"+ Pattern.quote(keyword) + "(?=\\s|$)";

edited Jul 10 '14 at 10:48

answered Jul 10 '14 at 10:38

anubhava

761,203
64
569
643

(and if the keyword can contain those characters, make sure they are not at the start or end of the word, or the word boundaries will fail to match). – Tim Pietzcker Jul 10 '14 at 10:39
1

@anubhava I tried your solution. its going much much faster. I'll update you in a bit :) – nafas Jul 10 '14 at 10:44
@anubhava `Pattern.quote()` is not optional, but *absolutely mandatory* in this situation. It's better to say "unless you have good reasons against it" than to indicate that it would somehow only be needed under certain conditions. – Tomalak Jul 10 '14 at 10:47
1

Thanks @Tomalak: I have recommended it strongly and added in the code also. – anubhava Jul 10 '14 at 10:49
1

@anubhava Thank you very much, The results are absolutely wonderful. Its still slow but at least I finished :D. I guess if I run it on a cloud I can reduce the time consumption into real time now. – nafas Jul 10 '14 at 10:51
2

For fun and profit I'm linking to this encyclopedic answer about Unicode and `\b` (among other things) in Java. http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions – Tomalak Jul 10 '14 at 10:52
@nafas: Glad to know it helped reducing overall time taken in running your program. – anubhava Jul 10 '14 at 11:02
@anubhava I thought you might be interested in this. check the summary on overall profermance – nafas Jul 10 '14 at 12:18

Tomalak · Accepted Answer · 2014-07-10T13:18:40.773

The fastest-possible way to find substrings in Java is to use String.indexOf().

To achieve "entire-word-only" matches, you would need to add a little bit of logic to check the characters before and after a possible match to make sure they are non-word characters:

public class IndexOfWordSample {
    public static void main(String[] args) {
        String input = "There are longer strings than this not very long one.";
        String search = "long";
        int index = indexOfWord(input, search);

        if (index > -1) {
            System.out.println("Hit for \"" + search + "\" at position " + index + ".");
        } else {
            System.out.println("No hit for \"" + search + "\".");
        }
    }

    public static int indexOfWord(String input, String word) {
        String nonWord = "^\\W?$", before, after;               
        int index, before_i, after_i = 0;

        while (true) {
            index = input.indexOf(word, after_i);
            if (index == -1 || word.isEmpty()) break;

            before_i = index - 1;
            after_i = index + word.length();
            before = "" + (before_i > -1 ? input.charAt(before_i) : "");            
            after = "" + (after_i < input.length() ? input.charAt(after_i) : "");

            if (before.matches(nonWord) && after.matches(nonWord)) {
                return index;
            }
        }
        return -1;
    }
}

This would print:

Hit for "long" at position 44.

This should perform better than a pure regular expressions approach.

Think if ^\W?$ already matches your expectation of a "non-word" character. The regular expression is a compromise here and may cost performance if your input string contains many "almost"-matches.

For extra speed, ditch the regex and work with the Character class, checking a combination of the many properties it provides (like isAlphabetic, etc.) for before and after.

I've created a Gist with an alternative implementation that does that.

this was great mate, I almost got same results but even faster using this. (almost is more than perfect while working with big data) — nafas, Jul 10 '14 at 11:55

Efficient Regular Expression for big data, if a String contains a word

2 Answers2

Linked