First off, every answer should be tested under production conditions. When performance is the question, RAM size, cache sizes, bus speeds, etc. come into play and make things hard to predict. Another factor is how many times the code will run: the JVM initially runs an interpreted version of it, and only after the code has been executed enough times will it be replaced with a compiled (and faster) version.
Having said that, here are some pointers:
- If you have a lot of keywords, consider parallelizing the task using an executor or a parallel stream. This only pays off for roughly 100+ keywords; for smaller sets the overhead will make your code slower.
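As an illustration of the parallel-stream approach, here's a minimal sketch (the class and method names are made up for this example):

```java
import java.util.List;

public class ParallelKeywordSearch {

    // Count how many of the keywords occur in the text. For a handful of
    // keywords the fork/join overhead outweighs any gain; this only helps
    // with a large keyword list.
    static long countMatches(String text, List<String> keywords) {
        return keywords.parallelStream()
                       .filter(text::contains)
                       .count();
    }

    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog";
        List<String> keywords = List.of("fox", "cat", "dog");
        System.out.println(countMatches(text, keywords)); // prints 2
    }
}
```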
- If the keywords are used often enough, consider an algorithm that searches for all of them at once, e.g. one built on a prefix tree (a.k.a. trie). Note that such structures can use memory inefficiently: the node objects may be scattered across the heap and thus cause cache misses while being traversed. This is why an ArrayList is faster than a LinkedList in practice, even though they have similar properties in theory.
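A bare-bones trie sketch, to make the structure concrete (real multi-keyword matchers such as Aho-Corasick build on exactly this, adding failure links):

```java
import java.util.HashMap;
import java.util.Map;

public class Trie {
    // Each node is a separate heap allocation, which is precisely the
    // scattered-memory layout that causes cache misses during traversal.
    private final Map<Character, Trie> children = new HashMap<>();
    private boolean isWord;

    public void insert(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Trie());
        }
        node.isWord = true;
    }

    public boolean contains(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false;
        }
        return node.isWord;
    }
}
```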
- Try switching to byte arrays (i.e. using `String.getBytes`), and then using the methods of the `Arrays` class to find each word. This has the advantage of memory locality. Note that Unicode may be tricky here, so you may need to normalize the text first.
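A naive sketch of the byte-array idea, using the range overload of `Arrays.equals` (Java 9+); the `indexOf` and `toBytes` helpers are invented for this example, and NFC normalization is one way to handle the Unicode caveat (composed vs. decomposed accents):

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Arrays;

public class ByteSearch {

    // Naive linear scan over a contiguous byte array: good memory
    // locality, but still O(n*m) in the worst case.
    static int indexOf(byte[] haystack, byte[] needle) {
        for (int i = 0; i <= haystack.length - needle.length; i++) {
            if (Arrays.equals(haystack, i, i + needle.length,
                              needle, 0, needle.length)) {
                return i;
            }
        }
        return -1;
    }

    // Normalize before encoding so that equal-looking strings compare
    // equal byte-for-byte.
    static byte[] toBytes(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC)
                         .getBytes(StandardCharsets.UTF_8);
    }
}
```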
But above all, test. Just make sure you're doing your micro-benchmarks properly.
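For anything serious, use a harness like JMH, which deals with JIT warm-up and dead-code elimination for you. As a bare minimum, a hand-rolled timing loop should at least warm the code up first; a crude sketch (the method name is made up):

```java
public class MicroBench {

    // Run the task enough times for the JIT to compile it, then time it.
    // This avoids measuring the interpreted version, but it does NOT
    // protect against dead-code elimination or other JIT effects; JMH does.
    static long timeNanos(Runnable task) {
        for (int i = 0; i < 10_000; i++) task.run(); // warm-up
        long start = System.nanoTime();
        for (int i = 0; i < 100_000; i++) task.run();
        return System.nanoTime() - start;
    }
}
```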