The answer probably exists somewhere, but I can't find it. I came to this question from an algorithm I am creating: essentially a .contains(String s1, String s2)
that returns true if s1 contains s2, ignoring Greek/English character differences. For example, the string 'nai, of course' contains the string 'ναι'. However, this is mostly irrelevant to my question.
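(For context only, a minimal sketch of that idea could look like the following. The names normalize and containsIgnoreScript are mine, not my actual algorithm, and the mapping covers only the letters of the example.)

// Hypothetical sketch: map a few Greek letters to Latin look-alikes,
// then do a naive scan comparing normalized characters.
static char normalize(char c) {
    switch (c) {
        case 'ν': return 'n';
        case 'α': return 'a';
        case 'ι': return 'i';
        // ... remaining mappings omitted
        default:  return c;
    }
}

static boolean containsIgnoreScript(String s1, String s2) {
    outer:
    for (int i = 0; i <= s1.length() - s2.length(); i++) {
        for (int k = 0; k < s2.length(); k++) {
            if (normalize(s1.charAt(i + k)) != normalize(s2.charAt(k))) {
                continue outer; // mismatch, try the next offset in s1
            }
        }
        return true; // every character of s2 matched starting at offset i
    }
    return false;
}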
The contains() method of String uses the naive approach, and I use the same for my algorithm. What contains() essentially does is call the static
indexOf(char[] source, int sourceOffset, int sourceCount, char[] target, int targetOffset, int targetCount, int fromIndex)
that lives in java.lang.String, with the correct parameters.
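For reference, the call chain in the JDK 8 sources looks roughly like this (paraphrased; value is the char[] backing the String):

public boolean contains(CharSequence s) {
    return indexOf(s.toString()) > -1;
}

public int indexOf(String str) {
    return indexOf(str, 0);
}

public int indexOf(String str, int fromIndex) {
    // Delegates to the static char[]-based indexOf shown below.
    return indexOf(value, 0, value.length, str.value, 0, str.value.length, fromIndex);
}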
While I was doing various kinds of benchmarking and testing on my algorithm, I removed all the Greek/English logic to see how fast it behaves with English-only strings. It was slower: about 2 times slower than the s1.contains(s2)
that comes with the JDK.
So I took the time to copy and paste this indexOf
method into my class and call it 15 million times, the same way the JDK calls it for a String.
The class is the following:
public class TestIndex {

    public static int fromJdk;
    public static int fromMe;

    public static void main(String[] args) {
        String s1 = "papapatita";
        String s2 = "patita";
        final char[] s1Arr = s1.toCharArray();
        int s1Length = s1Arr.length;
        final char[] s2Arr = s2.toCharArray();
        int s2Length = s2Arr.length;
        long be4 = System.nanoTime();
        for (int i = 0; i < 15_000_000; i++) {
            fromJdk = s1.indexOf(s2);
            // fromMe = indexOf(s1Arr, 0, s1Length, s2Arr, 0, s2Length, 0);
        }
        long after = System.nanoTime();
        System.out.println((after - be4) / 1000000.0);
    }

    static int indexOf(char[] source, int sourceOffset, int sourceCount, char[] target, int targetOffset,
            int targetCount, int fromIndex) {
        if (fromIndex >= sourceCount) {
            return (targetCount == 0 ? sourceCount : -1);
        }
        if (fromIndex < 0) {
            fromIndex = 0;
        }
        if (targetCount == 0) {
            return fromIndex;
        }
        char first = target[targetOffset];
        int max = sourceOffset + (sourceCount - targetCount);
        for (int i = sourceOffset + fromIndex; i <= max; i++) {
            /* Look for first character. */
            if (source[i] != first) {
                while (++i <= max && source[i] != first)
                    ;
            }
            /* Found first character, now look at the rest of v2 */
            if (i <= max) {
                int j = i + 1;
                int end = j + targetCount - 1;
                for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++)
                    ;
                if (j == end) {
                    /* Found whole string. */
                    return i - sourceOffset;
                }
            }
        }
        return -1;
    }
}
While the fromJdk = s1.indexOf(s2);
line is uncommented, the result is (on my machine) about ~150 ms. If I comment that line out and uncomment fromMe = indexOf(s1Arr, 0, s1Length, s2Arr, 0, s2Length, 0);
instead, I would expect roughly the same result, but I get almost double the time: about ~300 ms.
So, the question is: why? The indexOf
method is exactly as it exists in the JDK. I know that measuring performance in Java is hard to do properly (should I be using JMH?), but the difference is big. Does indexOf
from the JDK get special treatment under the hood?
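For what it's worth, a minimal JMH sketch of the comparison could look like this (class and method names are mine; it assumes the indexOf copy above is reachable as TestIndex.indexOf):

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class IndexOfBenchmark {

    String s1 = "papapatita";
    String s2 = "patita";
    char[] s1Arr = s1.toCharArray();
    char[] s2Arr = s2.toCharArray();

    @Benchmark
    public int jdkIndexOf() {
        // Returning the result lets JMH consume it and prevents dead-code elimination.
        return s1.indexOf(s2);
    }

    @Benchmark
    public int copiedIndexOf() {
        return TestIndex.indexOf(s1Arr, 0, s1Arr.length, s2Arr, 0, s2Arr.length, 0);
    }
}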
If the answer already exists somewhere, please point me to it.