11

Looking into UTF8 decoding performance, I noticed the performance of protobuf's UnsafeProcessor::decodeUtf8 is better than String(byte[] bytes, int offset, int length, Charset charset) for the following non ascii string: "Quizdeltagerne spiste jordbær med flØde, mens cirkusklovnen".

I tried to figure out why, so I copied the relevant code in String and replaced the array accesses with unsafe array accesses, same as UnsafeProcessor::decodeUtf8. Here are the JMH benchmark results:

Benchmark                       Mode  Cnt    Score   Error  Units
StringBenchmark.safeDecoding    avgt   10  127.107 ± 3.642  ns/op
StringBenchmark.unsafeDecoding  avgt   10  100.915 ± 4.090  ns/op

I assume the difference is due to missing bounds checking elimination which I expected to kick in, especially since there is an explicit bounds check in the form of a call to checkBoundsOffCount(offset, length, bytes.length) in the beginning of String(byte[] bytes, int offset, int length, Charset charset).

Is the issue really a missing bounds check elimination?

Here's the code I benchmarked using OpenJDK 17 & JMH. Note that this is only part of the String(byte[] bytes, int offset, int length, Charset charset) constructor code, and works correctly only for this specific German String. The static methods were copied from String. Look for the // the unsafe version: comments that indicate where I replaced the safe access with unsafe.

    private static byte[] safeDecode(byte[] bytes, int offset, int length) {
        checkBoundsOffCount(offset, length, bytes.length);
        int sl = offset + length;
        int dp = 0;
        byte[] dst = new byte[length];
        while (offset < sl) {
            int b1 = bytes[offset];
            // the unsafe version:
            // int b1 = UnsafeUtil.getByte(bytes, offset);
            if (b1 >= 0) {
                dst[dp++] = (byte)b1;
                offset++;
                continue;
            }
            if ((b1 == (byte)0xc2 || b1 == (byte)0xc3) &&
                    offset + 1 < sl) {
                // the unsafe version:
                // int b2 = UnsafeUtil.getByte(bytes, offset + 1);
                int b2 = bytes[offset + 1];
                if (!isNotContinuation(b2)) {
                    dst[dp++] = (byte)decode2(b1, b2);
                    offset += 2;
                    continue;
                }
            }
            // anything not a latin1, including the repl
            // we have to go with the utf16
            break;
        }
        if (offset == sl) {
            if (dp != dst.length) {
                dst = Arrays.copyOf(dst, dp);
            }
            return dst;
        }

        return dst;
    }

Followup

Apparently if I change the while loop condition from offset < sl to 0 <= offset && offset < sl I get similar performance in both versions:

Benchmark                       Mode  Cnt    Score    Error  Units
StringBenchmark.safeDecoding    avgt   10  100.802 ± 13.147  ns/op
StringBenchmark.unsafeDecoding  avgt   10  102.774 ± 3.893  ns/op

Conclusion

This question was picked up by HotSpot developers as https://bugs.openjdk.java.net/browse/JDK-8278518.

Optimizing this code ended up giving a 2.5x boost to decoding the above Latin1 string.

This C2 optimization closes the unbelievable more than 7x gap between commonBranchFirst and commonBranchSecond in the below benchmark and will land in Java 19.

Benchmark                         Mode  Cnt     Score    Error  Units
LoopBenchmark.commonBranchFirst   avgt   25  1737.111 ± 56.526  ns/op
LoopBenchmark.commonBranchSecond  avgt   25   232.798 ± 12.676  ns/op
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class LoopBenchmark {

  private final boolean[] mostlyTrue = new boolean[1000];

  @Setup
  public void setup() {
    for (int i = 0; i < mostlyTrue.length; i++) {
      mostlyTrue[i] = i % 100 > 0;
    }
  }

  @Benchmark
  public int commonBranchFirst() {
    int i = 0;
    while (i < mostlyTrue.length) {
      if (mostlyTrue[i]) {
        i++;
      } else {
        i += 2;
      }
    }
    return i;
  }

  @Benchmark
  public int commonBranchSecond() {
    int i = 0;
    while (i < mostlyTrue.length) {
      if (!mostlyTrue[i]) {
        i += 2;
      } else {
        i++;
      }
    }
    return i;
  }
}
Amir Hadadi
  • 421
  • 4
  • 13
  • Interesting, I always believed JIT would optimize bounds check, especially providing that `bytes` and `offset` are not modified within the method. I think you should use LinuxPerfAsmProfiler to see executed assembly – Sergey Tsypanov Dec 08 '21 at 16:05
  • I printed the generated assembly, but I can't make much of it.. My main point is that I expect UTF8 decoding optimized by HotSpot to be at least as fast as protobuf (or other 3rd party alternatives for that matter). Currently only decoding strings containing pure ascii is highly optimized, but the case of a non ascii latin1 string can be improved by a big margin. – Amir Hadadi Dec 09 '21 at 10:19
  • I'try to test `0 <= offset && offset < sl` on JDK code base with provided String and let you know on the results – Sergey Tsypanov Dec 09 '21 at 13:22
  • I've answered the question below, see the answer with links. You were right – Sergey Tsypanov Dec 10 '21 at 13:43

1 Answers1

3

To measure the branch you are interested in and particularly the scenario when while loop becomes hot, I've used the following benchmark:

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class StringConstructorBenchmark {
  private byte[] array;

  @Setup
  public void setup() {
    String str = "Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen. Я";
    array = str.getBytes(StandardCharsets.UTF_8);
  }

  @Benchmark
  public String newString()  {
      return new String(array, 0, array.length, StandardCharsets.UTF_8);
  }
}

And indeed, with modified constructor it does give significant improvement:

//baseline
Benchmark                             Mode  Cnt    Score   Error  Units
StringConstructorBenchmark.newString  avgt   50  173,092 ± 3,048  ns/op

//patched
Benchmark                             Mode  Cnt    Score   Error  Units
StringConstructorBenchmark.newString  avgt   50  126,908 ± 2,355  ns/op

This is likely to be a HotSpot issue: optimizing compiler for some reason failed to eliminate array bounds check within while-loop. I guess the reason is that offset is modified within the loop:

while (offset < sl) {
    int b1 = bytes[offset];
    if (b1 >= 0) {
        dst[dp++] = (byte)b1;
        offset++;                     // <---
        continue;
    }
    if ((b1 == (byte)0xc2 || b1 == (byte)0xc3) &&
            offset + 1 < sl) {
        int b2 = bytes[offset + 1];
        if (!isNotContinuation(b2)) {
            dst[dp++] = (byte)decode2(b1, b2);
            offset += 2;
            continue;
        }
    }
    // anything not a latin1, including the repl
    // we have to go with the utf16
    break;
}

Also I've looked into the code via LinuxPerfAsmProfiler, here's the link for baseline https://gist.github.com/stsypanov/d2524f98477d633fb1d4a2510fedeea6 and this one is for patched constructor https://gist.github.com/stsypanov/16c787e4f9fa3dd122522f16331b68b7

What should one pay attention to? Let's find the code corresponding int b1 = bytes[offset]; (line 538). In baseline we have this:

  3.62%           ││            │  0x00007fed70eb4c1c:   mov    %ebx,%ecx
  2.29%           ││            │  0x00007fed70eb4c1e:   mov    %edx,%r9d
  2.22%           ││            │  0x00007fed70eb4c21:   mov    (%rsp),%r8                   ;*iload_2 {reexecute=0 rethrow=0 return_oop=0}
                  ││            │                                                            ; - java.lang.String::&lt;init&gt;@107 (line 537)
  2.32%           ↘│            │  0x00007fed70eb4c25:   cmp    %r13d,%ecx
                   │            │  0x00007fed70eb4c28:   jge    0x00007fed70eb5388           ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                   │            │                                                            ; - java.lang.String::&lt;init&gt;@110 (line 537)
  3.05%            │            │  0x00007fed70eb4c2e:   cmp    0x8(%rsp),%ecx
                   │            │  0x00007fed70eb4c32:   jae    0x00007fed70eb5319
  2.38%            │            │  0x00007fed70eb4c38:   mov    %r8,(%rsp)
  2.64%            │            │  0x00007fed70eb4c3c:   movslq %ecx,%r8
  2.46%            │            │  0x00007fed70eb4c3f:   mov    %rax,%rbx
  3.44%            │            │  0x00007fed70eb4c42:   sub    %r8,%rbx
  2.62%            │            │  0x00007fed70eb4c45:   add    $0x1,%rbx
  2.64%            │            │  0x00007fed70eb4c49:   and    $0xfffffffffffffffe,%rbx
  2.30%            │            │  0x00007fed70eb4c4d:   mov    %ebx,%r8d
  3.08%            │            │  0x00007fed70eb4c50:   add    %ecx,%r8d
  2.55%            │            │  0x00007fed70eb4c53:   movslq %r8d,%r8
  2.45%            │            │  0x00007fed70eb4c56:   add    $0xfffffffffffffffe,%r8
  2.13%            │            │  0x00007fed70eb4c5a:   cmp    (%rsp),%r8
                   │            │  0x00007fed70eb4c5e:   jae    0x00007fed70eb5319
  3.36%            │            │  0x00007fed70eb4c64:   mov    %ecx,%edi                    ;*aload_1 {reexecute=0 rethrow=0 return_oop=0}
                   │            │                                                            ; - java.lang.String::&lt;init&gt;@113 (line 538)
  2.86%            │           ↗│  0x00007fed70eb4c66:   movsbl 0x10(%r14,%rdi,1),%r8d       ;*baload {reexecute=0 rethrow=0 return_oop=0}
                   │           ││                                                            ; - java.lang.String::&lt;init&gt;@115 (line 538)
  2.48%            │           ││  0x00007fed70eb4c6c:   mov    %r9d,%edx
  2.26%            │           ││  0x00007fed70eb4c6f:   inc    %edx                         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                   │           ││                                                            ; - java.lang.String::&lt;init&gt;@127 (line 540)
  3.28%            │           ││  0x00007fed70eb4c71:   mov    %edi,%ebx
  2.44%            │           ││  0x00007fed70eb4c73:   inc    %ebx                         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                   │           ││                                                            ; - java.lang.String::&lt;init&gt;@134 (line 541)
  2.35%            │           ││  0x00007fed70eb4c75:   test   %r8d,%r8d
                   ╰           ││  0x00007fed70eb4c78:   jge    0x00007fed70eb4c04           ;*iflt {reexecute=0 rethrow=0 return_oop=0}
                               ││                                                            ; - java.lang.String::&lt;init&gt;@120 (line 539)

and in patched code the corresponding part is

 17.28%           ││  0x00007f6b88eb6061:   mov    %edx,%r10d                   ;*iload_2 {reexecute=0 rethrow=0 return_oop=0}
                  ││                                                            ; - java.lang.String::&lt;init&gt;@107 (line 537)
  0.11%           ↘│  0x00007f6b88eb6064:   test   %r10d,%r10d
                   │  0x00007f6b88eb6067:   jl     0x00007f6b88eb669c           ;*iflt {reexecute=0 rethrow=0 return_oop=0}
                   │                                                            ; - java.lang.String::&lt;init&gt;@108 (line 537)
  0.39%            │  0x00007f6b88eb606d:   cmp    %r13d,%r10d
                   │  0x00007f6b88eb6070:   jge    0x00007f6b88eb66d0           ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                   │                                                            ; - java.lang.String::&lt;init&gt;@114 (line 537)
  0.66%            │  0x00007f6b88eb6076:   mov    %ebx,%r9d
 13.70%            │  0x00007f6b88eb6079:   cmp    0x8(%rsp),%r10d
  0.01%            │  0x00007f6b88eb607e:   jae    0x00007f6b88eb6671
  0.14%            │  0x00007f6b88eb6084:   movsbl 0x10(%r14,%r10,1),%edi       ;*baload {reexecute=0 rethrow=0 return_oop=0}
                   │                                                            ; - java.lang.String::&lt;init&gt;@119 (line 538)
  0.37%            │  0x00007f6b88eb608a:   mov    %r9d,%ebx
  0.99%            │  0x00007f6b88eb608d:   inc    %ebx                         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                   │                                                            ; - java.lang.String::&lt;init&gt;@131 (line 540)
 12.88%            │  0x00007f6b88eb608f:   movslq %r9d,%rsi                    ;*bastore {reexecute=0 rethrow=0 return_oop=0}
                   │                                                            ; - java.lang.String::&lt;init&gt;@196 (line 548)
  0.17%            │  0x00007f6b88eb6092:   mov    %r10d,%edx
  0.39%            │  0x00007f6b88eb6095:   inc    %edx                         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                   │                                                            ; - java.lang.String::&lt;init&gt;@138 (line 541)
  0.96%            │  0x00007f6b88eb6097:   test   %edi,%edi
  0.02%            │  0x00007f6b88eb6099:   jl     0x00007f6b88eb60dc           ;*iflt {reexecute=0 rethrow=0 return_oop=0}
                   │                                                            ; - java.lang.String::&lt;init&gt;@124 (line 539)

In baseline between if_icmpge and aload_1 byte-code instructions we have bounds check, but we don't have one in patched code.

So your original assumptions is correct: it is about missing bounds check elimination.

UPD I must correct my answer: it turned out that bounds check is still there:

13.70%            │  0x00007f6b88eb6079:   cmp    0x8(%rsp),%r10d
 0.01%            │  0x00007f6b88eb607e:   jae    0x00007f6b88eb6671

and the code I've pointed out is something that the compiler introduces, but it does nothing. The issue itself is still about bounds check as its explicit declaration solves the issue ad hoc.

Sergey Tsypanov
  • 3,265
  • 3
  • 8
  • 34
  • 1
    Thanks for investigating this Sergey. Note that the string I used is latin1 but not ascii, so it doesn’t enter the branch you claimed it enters. No need to modify it by adding a Russian character. The branch you referred to is only for ascii strings (all bytes are non negative). Also, can you highlight the relevant assembly parts? – Amir Hadadi Dec 10 '21 at 15:17
  • 1
    The relevant part starts with 3.05% │ │ 0x00007fed70eb4c2e: cmp 0x8(%rsp),%ecx Also I've filed the issue for this https://bugs.openjdk.java.net/browse/JDK-8278518. It looks like the same behaviour is observed in other code snippets like this one. If you don't mind I'd like to create a PR for this as well. – Sergey Tsypanov Dec 10 '21 at 16:08
  • Can you fix the parts about the string I gave triggering the ascii code path? Than I’ll accept your answer. Also contrary to what you wrote in the bug report I don’t suggest this 0 <= offset condition as the right fix to this issue. I think the right fix is either getting c2 smarter about this situation, or thinking much more radically, totally eliminating bounds check for code that is deemed safe in the jdk source code (which may simply be all the jdk source code). But I suppose that is highly debatable. – Amir Hadadi Dec 10 '21 at 17:19
  • 1
    I've removed that part from the answer. I've also reworded the ticket. I totally agree with you about the point that this is a compiler issue and the proper fix must be done there. We could suggest Java fix along with full explanation of the issue and let Oracle team decide on this. Btw, the similar behaviour is observed in `String.translateEscapes()`. – Sergey Tsypanov Dec 10 '21 at 19:04