Looking into UTF-8 decoding performance, I noticed that protobuf's UnsafeProcessor::decodeUtf8 performs better than String(byte[] bytes, int offset, int length, Charset charset) for the following non-ASCII string: "Quizdeltagerne spiste jordbær med flØde, mens cirkusklovnen".

I tried to figure out why, so I copied the relevant code from String and replaced the array accesses with unsafe array accesses, the same way UnsafeProcessor::decodeUtf8 does.
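For reference, protobuf's UnsafeUtil.getByte is essentially a thin wrapper around sun.misc.Unsafe that reads the array element without a bounds check. A rough sketch of the relevant part (simplified; the initialization details differ in protobuf):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

final class UnsafeUtil {
    private static final Unsafe UNSAFE = getUnsafe();
    private static final long BYTE_ARRAY_BASE_OFFSET = UNSAFE.arrayBaseOffset(byte[].class);

    // reads target[index] directly, skipping the JVM's array bounds check
    static byte getByte(byte[] target, long index) {
        return UNSAFE.getByte(target, BYTE_ARRAY_BASE_OFFSET + index);
    }

    private static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
}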
Here are the JMH benchmark results:
Benchmark                       Mode  Cnt    Score   Error  Units
StringBenchmark.safeDecoding    avgt   10  127.107 ± 3.642  ns/op
StringBenchmark.unsafeDecoding  avgt   10  100.915 ± 4.090  ns/op
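The harness itself is not shown here in full; a minimal sketch of what it looks like, with safeDecode (listed further down) and unsafeDecode (the same method with the unsafe accesses swapped in) as static methods of the same class:

import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class StringBenchmark {

    private byte[] bytes;

    @Setup
    public void setup() {
        bytes = "Quizdeltagerne spiste jordbær med flØde, mens cirkusklovnen"
                .getBytes(StandardCharsets.UTF_8);
    }

    @Benchmark
    public byte[] safeDecoding() {
        return safeDecode(bytes, 0, bytes.length);
    }

    @Benchmark
    public byte[] unsafeDecoding() {
        return unsafeDecode(bytes, 0, bytes.length);
    }
}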
I assume the difference is due to a missing bounds-check elimination, which I expected to kick in, especially since there is already an explicit bounds check in the form of a call to checkBoundsOffCount(offset, length, bytes.length) at the beginning of String(byte[] bytes, int offset, int length, Charset charset).

Is the issue really a missing bounds-check elimination?
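For comparison, bounds-check elimination reliably kicks in for loops where C2 can prove the index is always in range, as in this textbook case (an illustrative example, not taken from the String code):

// C2 can eliminate the per-access bounds check here, because the loop
// condition already guarantees 0 <= i < a.length for every access.
static int sum(int[] a) {
    int s = 0;
    for (int i = 0; i < a.length; i++) {
        s += a[i];
    }
    return s;
}

In safeDecode below, by contrast, the index starts at a caller-supplied offset and advances by either 1 or 2 per iteration, so the range proof is less obvious.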
Here is the code I benchmarked, using OpenJDK 17 and JMH. Note that it is only part of the String(byte[] bytes, int offset, int length, Charset charset) constructor code and works correctly only for this specific Danish string. The static methods were copied from String. Look for the // the unsafe version: comments, which mark where I replaced the safe accesses with unsafe ones.
private static byte[] safeDecode(byte[] bytes, int offset, int length) {
    checkBoundsOffCount(offset, length, bytes.length);
    int sl = offset + length;
    int dp = 0;
    byte[] dst = new byte[length];
    while (offset < sl) {
        int b1 = bytes[offset];
        // the unsafe version:
        // int b1 = UnsafeUtil.getByte(bytes, offset);
        if (b1 >= 0) {
            dst[dp++] = (byte) b1;
            offset++;
            continue;
        }
        if ((b1 == (byte) 0xc2 || b1 == (byte) 0xc3) &&
                offset + 1 < sl) {
            int b2 = bytes[offset + 1];
            // the unsafe version:
            // int b2 = UnsafeUtil.getByte(bytes, offset + 1);
            if (!isNotContinuation(b2)) {
                dst[dp++] = (byte) decode2(b1, b2);
                offset += 2;
                continue;
            }
        }
        // anything not latin1, including the repl,
        // we have to go with the utf16
        break;
    }
    if (offset == sl) {
        if (dp != dst.length) {
            dst = Arrays.copyOf(dst, dp);
        }
        return dst;
    }
    // not reached for the benchmarked string; the real constructor
    // would continue with utf16 decoding here
    return dst;
}
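For completeness, the static helpers referenced above, copied from java.lang.String (checkBoundsOffCount is paraphrased; the JDK delegates to a Preconditions utility with equivalent behavior):

private static boolean isNotContinuation(int b) {
    return (b & 0xc0) != 0x80;
}

// merges the low bits of a 2-byte UTF-8 sequence into a single char
private static char decode2(int b1, int b2) {
    return (char) (((b1 << 6) ^ b2) ^
                   (((byte) 0xC0 << 6) ^
                    ((byte) 0x80 << 0)));
}

private static void checkBoundsOffCount(int offset, int count, int length) {
    if (offset < 0 || count < 0 || offset > length - count) {
        throw new StringIndexOutOfBoundsException(
                "offset " + offset + ", count " + count + ", length " + length);
    }
}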
Followup
Apparently, if I change the while loop condition from offset < sl to 0 <= offset && offset < sl, I get similar performance in both versions:
Benchmark                       Mode  Cnt    Score    Error  Units
StringBenchmark.safeDecoding    avgt   10  100.802 ± 13.147  ns/op
StringBenchmark.unsafeDecoding  avgt   10  102.774 ±  3.893  ns/op
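The change is only in the loop condition. Distilled to a minimal self-contained loop with the same 1-or-2 stride pattern as safeDecode (illustrative, not the benchmarked code):

static int scan(byte[] bytes, int offset, int length) {
    int sl = offset + length;
    int n = 0;
    // was: while (offset < sl)
    // the added lower-bound clause is always true here, so behavior is unchanged
    while (0 <= offset && offset < sl) {
        if (bytes[offset] >= 0) {
            offset++;
        } else {
            offset += 2;
        }
        n++;
    }
    return n;
}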
Conclusion

This question was picked up by the HotSpot developers as https://bugs.openjdk.java.net/browse/JDK-8278518. Optimizing this code ended up giving a 2.5x boost to decoding the Latin1 string above. The resulting C2 optimization will land in Java 19, and it also closes the remarkable more-than-7x gap between commonBranchFirst and commonBranchSecond in the benchmark below.
Benchmark                         Mode  Cnt     Score    Error  Units
LoopBenchmark.commonBranchFirst   avgt   25  1737.111 ± 56.526  ns/op
LoopBenchmark.commonBranchSecond  avgt   25   232.798 ± 12.676  ns/op
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class LoopBenchmark {

    // every 100th element is false, so the "true" branch is by far the more common one
    private final boolean[] mostlyTrue = new boolean[1000];

    @Setup
    public void setup() {
        for (int i = 0; i < mostlyTrue.length; i++) {
            mostlyTrue[i] = i % 100 > 0;
        }
    }

    @Benchmark
    public int commonBranchFirst() {
        int i = 0;
        while (i < mostlyTrue.length) {
            if (mostlyTrue[i]) {
                i++;
            } else {
                i += 2;
            }
        }
        return i;
    }

    @Benchmark
    public int commonBranchSecond() {
        // same loop, but with the common branch tested second
        int i = 0;
        while (i < mostlyTrue.length) {
            if (!mostlyTrue[i]) {
                i += 2;
            } else {
                i++;
            }
        }
        return i;
    }
}
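To reproduce, the benchmark can be launched with JMH's standard runner (a minimal sketch; 5 forks with JMH's default 5 measurement iterations each gives the Cnt of 25 shown above):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class LoopBenchmarkRunner {
    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(LoopBenchmark.class.getSimpleName())
                .forks(5)
                .build();
        new Runner(options).run();
    }
}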