what is the fastest substring search method in Java

Question

I need to implement a way to search substring (needles) in a list of string (haystack) using Java.

More specifically, my app has a list of user profiles. If I type some letters, for example, "Ja", and then search, then all the users whose name contains "ja" should show up. For instance, the result could be "Jack", "Jackson", "Jason", "Dijafu".

In Java, as I know, there are 3 build-in method to see search substring in a string.

string.contains()
string.indexOf()
regular expression. it is something like string.matches("ja"))

My question is: What are the runtimes of each method above? which one is the fastest or most efficient or most popular way to check if the list of string contains a given substring.

I know there exists some algorithms that do the same thing, such as Boyer–Moore string search algorithm, Knuth–Morris–Pratt algorithm and so on. I do not want to use them because I just have a small list of strings, and I think using them is kind of overkill for me right now. Also I have to type a lot of extra coding for such a non-build-in algorithm. If you think my thoughts is not correct, please feel free to correct me.

Why do you think that the substring search is a performance problem? — chrylis -cautiouslyoptimistic-, Aug 20 '13 at 16:19
good one here http://stackoverflow.com/questions/5296268/fastest-way-to-check-a-string-contain-another-substring-in-javascript — Venkata Krishna, Aug 20 '13 at 16:20
It should not be to complicated to set up some simple performance tests yourself! — FrankPl, Aug 20 '13 at 16:23
You may also want to look into a trie: http://en.wikipedia.org/wiki/Trie — yshavit, Aug 20 '13 at 16:38

score 30 · Answer 1 · answered Aug 15 '16 at 06:33

The accepted answer is not correct and not complete.

indexOf() does a naive string search using backtracking on mismatches. This is quite fast on small patterns/texts but shows very poor performance on large texts
contains("ja") should be comparable to indexOf (because it delegates to it)
matches("ja") will not deliver the correct result, because it searches for an exact match (only the string "ja" will match exactly)
Pattern p = Pattern.compile("ja"); Matcher m = p.matcher("jack"); m.find(); would be the correct way to find texts with regular expressions. In practice (using large texts) it will be the most efficient way using only the java api. This is because a constant pattern (like "ja") will not be processed by the regex engine (which is slow) but by an Boyer-Moore-Algorithm (which is fast)

If you want `indexOf`/`contains` like semantics, you should use `Pattern.compile("ja", Pattern.LITERAL)` to ensure that the string is treated with no regex pattern interpretation. Even if the string does not contain special chars, it may save some CPU cycles, as the `compile` method won’t even search for them. — Holger, Mar 09 '23 at 12:48

chrylis -cautiouslyoptimistic- · Answer 2 · 2013-08-20T16:29:42.397

6

As far as the three you asked about, a regular expression is going to be much slower because it requires putting together a full state machine when you have a much simpler target. For contains vs indexOf...

2114 public boolean contains(CharSequence s) {
2115     return indexOf(s.toString()) > -1;
2116 }

(i.e., contains just calls indexOf, but you might incur an extra String creation on each invocation. This is just one implementation of contains, but since the contract of contains is a simplification of indexOf, this is probably how every implementation will work.)

edited Aug 20 '13 at 16:29

answered Aug 20 '13 at 16:22

chrylis -cautiouslyoptimistic-

75,269
21
115
152

3

There is no extra String creation - either `s` is a `String`, and `toString()` will return `this` (so no object creation) - or `s` is not a `String` in which case you need to turn it into a `String` before calling `indexOf` anyway. So if what you need is `contains`, just use `contains`. – assylias May 21 '18 at 14:59

score 4 · Accepted Answer · edited May 21 '18 at 16:34

4

String[] names = new String[]{"jack", "jackson", "jason", "dijafu"};
long start = 0;
long stop = 0;

//Contains
start = System.nanoTime();
for (int i = 0; i < names.length; i++){
    names[i].contains("ja");
}
stop = System.nanoTime();
System.out.println("Contains: " + (stop-start));

//IndexOf
start = System.nanoTime();
for (int i = 0; i < names.length; i++){
    names[i].indexOf("ja");
}
stop = System.nanoTime();
System.out.println("IndexOf: " + (stop-start));

//Matches
start = System.nanoTime();
for (int i = 0; i < names.length; i++){
    names[i].matches("ja");
}
stop = System.nanoTime();
System.out.println("Matches: " + (stop-start));

Output:

Contains: 16677
IndexOf: 4491
Matches: 864018

edited May 21 '18 at 16:34

Daxtron2

1,239
1
11
19

answered Aug 20 '13 at 16:26

Brinnis

906
5
12

6

To be fair, you should compile a `Pattern` once and reuse it. Calling `String.matches(String)` in a loop for the same regex is inefficient. `Pattern p = Pattern.compile("ja"); for(String s : names) p.matcher(s).matches();` – Dev Aug 20 '13 at 16:30
1

Since it is only 4 it really makes to significant difference. The variation between runs is larger than the difference switching to creating the pattern outside of the for loop. – Brinnis Aug 20 '13 at 16:38
5

This solution is - even if accepted - not correct. First: `matches()` is used in a wrong way. Second the test samples are biased (preferring indexOf). Third: The benchmark is handwritten (see http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java). I will write a separate solution to correct these facts. – CoronA Aug 15 '16 at 06:17
10

This benchmark isn't worth anything. An incorrectly built micro benchmark which will make `contains()` perform worse than it really does, due to it being the first to be tested and there's no warm up for the JVM. – Kayaman Aug 19 '16 at 09:45
This answer is one of the most useful thing i have read recently – Luca Tampellini Aug 16 '18 at 15:42
I've made a more updated answer using what you guys wrote, and wrote in Kotlin for Android. It should work with similar results as on PC with Java, of course. Link: https://stackoverflow.com/a/57508798/878126 – android developer Aug 15 '19 at 11:13
The tests may not be accurate due to a lot of effects removing unused code, the JIT compiler/optimizer. https://openjdk.java.net/projects/code-tools/jmh/ the author of this tool has many write ups and presentations explaining this. It is cool, it is deep stuff too. – fergjo Apr 20 '20 at 01:47

score 2 · Answer 4 · edited Jul 26 '22 at 10:26

2

If you are searching a large amount of Strings I've read the Aho-Corasick algorithm is pretty fast, but it's a natively implemented in Java. It's the same algorithm used by GREP in Unix-based systems if that helps and it's pretty efficient. Here is a Java implementation courtesy of Berkley.

See also: https://stackoverflow.com/a/1765616/59087

edited Jul 26 '22 at 10:26

Oleg Ushakov

1,200
14
27

answered Aug 20 '13 at 16:28

Skylion

2,696
26
50

score 1 · Answer 5 · answered Aug 20 '13 at 16:28

From the example in your question, I assume you want to do case insensitive comparisons. Those slow down the process considerably. Hence, if you can live with some inaccuracies - which might depend on the locale in which you need to do the comparison, and your long text is searched again and again, it might make sense to convert the long text one time to uppercase, and the search string as well, and then search case-insensitive.

Veronica Cornejo · Answer 6 · 2013-08-20T16:32:43.410

This is dependent on specific JRE (and even JDK) make/version. It also depends / could depend on factors as string length, probability of being contained, in what position, etc. The only way of obtaining precise performance data requires setting up your exact context.

However, in general aString.contains() and aString.indexOf() should be exactly the same. And even if a regular expression was superbly optimized, it wouldn't exceed the performance of the first two.

No, Java doesn't use extremely specialized algorithms either.

score 0 · Answer 7 · answered Aug 15 '19 at 11:12

A benchmark in Kotlin (which uses Java anyway, so results are about the same), on Android, using similar approach as stated above, shows that indeed contains is similar to indexOf, but for some reason faster, even though it uses it.

As for regular expressions, because it creates real objects, and allows to go further, it's slower.

Sample results (time in ms) :

Contains: 0
IndexOf: 5
Matches: 45

Code:

class MainActivity : AppCompatActivity() {
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_main)

        AsyncTask.execute {
            val itemsCount = 1000
            val minStringLength = 1000
            val maxStringLength = 1000
            val list = ArrayList<String>(itemsCount)
            val r = Random()
            val stringToSearchFor = prepareFakeString(r, 5, 10, ALPHABET_LOWERCASE + ALPHABET_UPPERCASE + DIGITS)
            for (i in 0 until itemsCount)
                list.add(prepareFakeString(r, minStringLength, maxStringLength, ALPHABET_LOWERCASE + ALPHABET_UPPERCASE + DIGITS))
            val resultsContains = ArrayList<Boolean>(itemsCount)
            val resultsIndexOf = ArrayList<Boolean>(itemsCount)
            val resultsRegEx = ArrayList<Boolean>(itemsCount)
            //Contains
            var start: Long = System.currentTimeMillis()
            var stop: Long = System.currentTimeMillis()
            for (str in list) {
                resultsContains.add(str.contains(stringToSearchFor))
            }
            Log.d("AppLog", "Contains: " + (stop - start))
            //IndexOf
            start = System.currentTimeMillis()
            for (str in list) {
                resultsIndexOf.add(str.indexOf(stringToSearchFor) >= 0)
            }
            stop = System.currentTimeMillis()
            Log.d("AppLog", "IndexOf: " + (stop - start))
            //Matches
            val regex = stringToSearchFor.toRegex()
            start = System.currentTimeMillis()
            for (str in list) {
                resultsRegEx.add(regex.find(str) != null)
            }
            stop = System.currentTimeMillis()
            Log.d("AppLog", "Matches: " + (stop - start))
            Log.d("AppLog", "checking results...")
            var foundIssue = false
            for (i in 0 until itemsCount) {
                if (resultsContains[i] != resultsIndexOf[i] || resultsContains[i] != resultsRegEx[i]) {
                    foundIssue = true
                    break
                }
            }
            Log.d("AppLog", "done. All results are the same?${!foundIssue}")
        }
    }

    companion object {
        const val ALPHABET_LOWERCASE = "qwertyuiopasdfghjklzxcvbnm"
        const val ALPHABET_UPPERCASE = "QWERTYUIOPASDFGHJKLZXCVBNM"
        const val DIGITS = "1234567890"

        fun prepareFakeString(r: Random, minLength: Int, maxLength: Int, charactersToChooseFrom: String): String {
            val length = if (maxLength == minLength) maxLength else r.nextInt(maxLength - minLength) + minLength
            val sb = StringBuilder(length)
            for (i in 0 until length)
                sb.append(charactersToChooseFrom[r.nextInt(charactersToChooseFrom.length)])
            return sb.toString()
        }
    }
}

what is the fastest substring search method in Java

7 Answers7

Linked

Related