
The answer probably exists somewhere, but I can't find it. I came to this question from an algorithm I am creating: essentially a .contains(String s1, String s2) that returns true if s1 contains s2, ignoring Greek/English character differences. For example, the string 'nai, of course' contains the string 'ναι'. However, this is mostly irrelevant to my question.
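
Just for context, a rough sketch of the kind of check I mean (the character mapping and the method names below are only an illustration, not my actual algorithm):

// Illustration only: naive contains() that treats a few Greek letters
// as equivalent to their Latin look-alikes.
static boolean containsIgnoringGreek(String s1, String s2) {
    for (int i = 0; i + s2.length() <= s1.length(); i++) {
        int j = 0;
        while (j < s2.length() && normalize(s1.charAt(i + j)) == normalize(s2.charAt(j))) {
            j++;
        }
        if (j == s2.length()) {
            return true; // every character of s2 matched at offset i
        }
    }
    return false;
}

// Maps a few Greek letters to Latin look-alikes; anything else compares as-is.
static char normalize(char c) {
    switch (c) {
        case 'ν': return 'n';
        case 'α': return 'a';
        case 'ι': return 'i';
        default:  return c;
    }
}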

The contains() method of String uses the naive approach, and I use the same for my algorithm. What contains() essentially does is call the static indexOf(char[] source, int sourceOffset, int sourceCount, char[] target, int targetOffset, int targetCount, int fromIndex) that lives in java.lang.String, with the correct parameters.
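
From memory, the JDK 8 call chain looks roughly like this (paraphrased, not the verbatim source):

// Roughly the JDK 8 delegation (paraphrased, not the verbatim source):
public boolean contains(CharSequence s) {
    return indexOf(s.toString()) > -1;
}

public int indexOf(String str) {
    return indexOf(str, 0);
}

public int indexOf(String str, int fromIndex) {
    // 'value' is the char[] backing this String instance
    return indexOf(value, 0, value.length, str.value, 0, str.value.length, fromIndex);
}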

While I was doing different kinds of benchmarking and tests on my algorithm, I removed all the Greek/English logic to see how fast it behaves with only English strings. And it was slower: about 2 times slower than the s1.contains(s2) that comes from the JDK.

So, I took the time to copy and paste this indexOf method into my class and call it 15 million times, in the same way the JDK calls it for a string.

The class is the following:

public class TestIndex {
    public static int fromJdk;
    public static int fromMe;

    public static void main(String[] args) {
        String s1 = "papapatita";
        String s2 = "patita";
        final char[] s1Arr = s1.toCharArray();
        int s1Length = s1Arr.length;

        final char[] s2Arr = s2.toCharArray();
        int s2Length = s2Arr.length;

        long be4 = System.nanoTime();
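        // Assigning the result to a static field keeps the call from being eliminated as dead code.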
        for (int i = 0; i < 15_000_000; i++) {
            fromJdk = s1.indexOf(s2);
            //          fromMe = indexOf(s1Arr, 0, s1Length, s2Arr, 0, s2Length, 0);
        }
        long after = System.nanoTime();
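        // Elapsed time in milliseconds.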
        System.out.println((after - be4) / 1000000.0);
    }

    static int indexOf(char[] source, int sourceOffset, int sourceCount, char[] target, int targetOffset,
            int targetCount, int fromIndex) {
        if (fromIndex >= sourceCount) {
            return (targetCount == 0 ? sourceCount : -1);
        }
        if (fromIndex < 0) {
            fromIndex = 0;
        }
        if (targetCount == 0) {
            return fromIndex;
        }

        char first = target[targetOffset];
        int max = sourceOffset + (sourceCount - targetCount);

        for (int i = sourceOffset + fromIndex; i <= max; i++) {
            /* Look for first character. */
            if (source[i] != first) {
                while (++i <= max && source[i] != first)
                    ;
            }

            /* Found first character, now look at the rest of v2 */
            if (i <= max) {
                int j = i + 1;
                int end = j + targetCount - 1;
                for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++)
                    ;

                if (j == end) {
                    /* Found whole string. */
                    return i - sourceOffset;
                }
            }
        }
        return -1;
    }
}

With the line fromJdk = s1.indexOf(s2); uncommented, the result is (on my machine) about ~150ms. If I comment that line out and uncomment fromMe = indexOf(s1Arr, 0, s1Length, s2Arr, 0, s2Length, 0); instead, I would expect the same result. Instead, it takes almost double the time: about ~300ms.

So, the question is: why? The indexOf method is exactly as it exists in the JDK. I know that measuring performance in Java is hard to do properly (probably a job for JMH), but the difference is big. Does indexOf from the JDK get special treatment under the hood?
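
For completeness, a minimal JMH sketch of the same comparison might look like this (assuming the org.openjdk.jmh dependency is on the classpath; the class and method names are my own, not part of the original code):

// Minimal JMH sketch (assumes the org.openjdk.jmh dependency is available;
// class and method names are illustrative).
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class IndexOfBenchmark {
    String s1 = "papapatita";
    String s2 = "patita";
    char[] s1Arr = s1.toCharArray();
    char[] s2Arr = s2.toCharArray();

    @Benchmark
    public int jdkIndexOf() {
        return s1.indexOf(s2); // the JDK call
    }

    @Benchmark
    public int copiedIndexOf() {
        // TestIndex.indexOf is the copied method from the class above
        // (assuming it is accessible from the same package).
        return TestIndex.indexOf(s1Arr, 0, s1Arr.length, s2Arr, 0, s2Arr.length, 0);
    }
}

Returning the result from each benchmark method lets JMH consume the value, so the JIT cannot discard the calls as dead code.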

If the answer already exists somewhere, please point me to it.

George Z.
  • Which Java version are you using? – M A Feb 17 '21 at 16:27
  • Does this answer your question? [How do I write a correct micro-benchmark in Java?](https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java) – luk2302 Feb 17 '21 at 16:28
  • 150 vs 300 ms is not a big difference, 150 ms vs. 15s would be a big difference. A factor of 2 is meaningless. – luk2302 Feb 17 '21 at 16:29
  • @MAnouti Java 8. – George Z. Feb 17 '21 at 16:29
  • @luk2302 You are right, I know. But the point of this question is not to reduce the difference. The question is why? Why does this happen? I run the same code. Also, about benchmarking, 150 vs 300 is big enough to see with your eyes. If the difference were nanoseconds or something like this, yes, I would go with JMH and all the advice in the answer you linked. In other words, your proposed answer does not answer my question. – George Z. Feb 17 '21 at 16:33
  • 2
    @GeorgeZ. What Kayaman answered, and I would add if you look at the source code of Java 11 (latest LTS version), you can see an annotation called `@HotSpotIntrinsicCandidate` that indicates the JVM may replace the method with machine code that is more efficient. I would imagine this behavior was present in Java 8 and most likely before that too. – M A Feb 17 '21 at 16:39
  • 1
    @MAnouti Nice to hear that as well. Thanks for info. I will keep it in mind. – George Z. Feb 17 '21 at 16:43
  • 1
    The same problem as [here](https://stackoverflow.com/questions/45912510/does-java-jit-cheat-when-running-jdk-code) – apangin Feb 17 '21 at 17:44

1 Answer


Because String.indexOf() is an intrinsic method, the JDK call ends up in a native implementation, while your copied call runs as a plain Java implementation.

The JVM doesn't actually execute the Java code you see; it knows it can replace it with a far more efficient version. When you copy the code, it goes through regular JIT compilation, which makes it less efficient. This is just one of the dozens of tricks the JVM uses to make things more performant, often without the developer even realizing it.
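
If you want to see this on your own machine, one way (assuming a HotSpot JVM; these are diagnostic flags, so the exact output varies between builds) is to run the benchmark with intrinsic or compilation logging and look for entries mentioning indexOf:

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintIntrinsics TestIndex
java -XX:+PrintCompilation TestIndex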

Kayaman