What is the Big-O of String.contains() in Java?

Question

I'm working on a project, and need to optimize the running time. Is String.contains() runtime the same as TreeSet.contains(), which is O(logN)?

The reason I'm asking is I'm building a TreeMap<String, TreeSet<Song>>, where Songs contain a String of lyrics. Depending on the efficiency, I am considering including a Set of the lyric words inside the Song and running searches on that rather than the String.

Not trying to be a jerk or anything but: Why not profile it? — Belmin Fernandez, Nov 04 '10 at 01:38
If I have the time for tests, maybe. There is another test I want to run with the project: runtime variations between treeset and hashset. If there were 30 hrs in a day, there still wouldn't be enough time for everything! — Jason, Nov 04 '10 at 02:20

score 36 · Accepted Answer · edited Jan 10 '17 at 23:42

36

One of the best known algorithms is the Boyer-Moore string searching algorithm which is O(n) although it can give sublinear performance in the best case.

Which algorithm is used in Java depends on which implemetation you download. It seems that for example OpenJDK uses a naive algorithm that runs in O(nm) and linear performance in the best case. See lines 1770-1806 here.

edited Jan 10 '17 at 23:42

NickL

4,258
2
21
38

answered Nov 03 '10 at 17:05

Mark Byers

811,555
193
1,581
1,452

3

the article you linked to said it's O(n) as it makes at most 3n comparisons. "worst case O(n)" is a tautology - by definition O(n) is the worst case :) – Nicholas White Nov 03 '10 at 17:27
The jdk1.6.0_23 had the same `String.indexOf()` implementation as a contemporary OpenJDK, according to https://programmers.stackexchange.com/questions/65558/why-does-javas-string-class-not-implement-a-more-efficient-indexof . Someone could tell you if that's true for `String.contains()` – rakslice Jul 09 '14 at 21:18
5

@NicholasWhite O(n) is an *upper bound*. It could be an upper bound of the worst case, average case, or best case performance. Upper and lower bound are orthogonal to best/average/worst case. Another orthogonal dimension is *loose* vs. *tight*. O(n) is a loose upper bound. Θ(n) is a tight upper and lower bound. – John Kugelman Jan 10 '17 at 22:25

score 6 · Answer 2 · answered Apr 05 '20 at 16:27

6

.contains() definitely uses naive approach and equivalent to O(nm) time complexity.

Boyer-moore takes O(nm) time in the worst case.
KMP takes O(n) time in the worst case.

In one of the problems related to pattern matching, I used .contains() and it took 70 ms whereas replacing that line with patternSearch() //KMP search reduced the time to 14 ms.

Java source code | KMP search vs .contains()

answered Apr 05 '20 at 16:27

Nishant Thapliyal

1,540
17
28

KMP worst case is O(n + m) since you need to create the lsp table beforehand. – sjkm Oct 20 '20 at 08:11
True but even in worst-case where `n==m` -> O(n+m) -> O(2n) -> O(n) – Nishant Thapliyal Oct 21 '20 at 06:04

score 0 · Answer 3 · answered Nov 03 '10 at 17:36

0

You could also try using a Trie instead of a TreeMap (although Java has no built-in Trie implementation)

answered Nov 03 '10 at 17:36

dty

18,795
6
56
82

[Trie reference](https://www.topcoder.com/community/data-science/data-science-tutorials/using-tries/) – cellepo Oct 22 '15 at 15:26
1

Seems Trie only supports "starts with", not "contains"? – coderz May 06 '19 at 17:25

What is the Big-O of String.contains() in Java?

3 Answers3

Linked