
I recently decided to play around with Java's new incubating Vector API, to see how fast it can get. I implemented two fairly simple methods, one for parsing an int and one for finding the index of a character in a string. In both cases, my vectorized methods were incredibly slow compared to their scalar equivalents.

Here's my code:

import java.nio.charset.StandardCharsets;

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;

public class SIMDParse {

private static IntVector mul = IntVector.fromArray(
        IntVector.SPECIES_512,
        new int[] {0, 0, 0, 0, 0, 0, 1000000000, 100000000, 10000000, 1000000, 100000, 10000, 1000, 100, 10, 1},
        0
);
private static byte zeroChar = (byte) '0';
private static int width = IntVector.SPECIES_512.length();
private static byte[] filler;

static {
    filler = new byte[16];
    for (int i = 0; i < 16; i++) {
        filler[i] = zeroChar;
    }
}

public static int parseInt(String str) {
    boolean negative = str.charAt(0) == '-';
    byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
    if (negative) {
        bytes[0] = zeroChar;
    }
    bytes = ensureSize(bytes, width);
    ByteVector vec = ByteVector.fromArray(ByteVector.SPECIES_128, bytes, 0);
    vec = vec.sub(zeroChar);
    IntVector ints = (IntVector) vec.castShape(IntVector.SPECIES_512, 0);
    ints = ints.mul(mul);
    return ints.reduceLanes(VectorOperators.ADD) * (negative ? -1 : 1);
}

public static byte[] ensureSize(byte[] arr, int per) {
    int mod = arr.length % per;
    if (mod == 0) {
        return arr;
    }
    int length = arr.length - (mod);
    length += per;
    byte[] newArr = new byte[length];
    System.arraycopy(arr, 0, newArr, per - mod, arr.length);
    System.arraycopy(filler, 0, newArr, 0, per - mod);
    return newArr;
}

public static byte[] ensureSize2(byte[] arr, int per) {
    int mod = arr.length % per;
    if (mod == 0) {
        return arr;
    }
    int length = arr.length - (mod);
    length += per;
    byte[] newArr = new byte[length];
    System.arraycopy(arr, 0, newArr, 0, arr.length);
    return newArr;
}

public static int indexOf(String s, char c) {
    byte[] b = s.getBytes(StandardCharsets.UTF_8);
    int width = ByteVector.SPECIES_MAX.length();
    byte bChar = (byte) c;
    b = ensureSize2(b, width);
    for (int i = 0; i < b.length; i += width) {
        ByteVector vec = ByteVector.fromArray(ByteVector.SPECIES_MAX, b, i);
        int pos = vec.compare(VectorOperators.EQ, bChar).firstTrue();
        if (pos != width) {
            return pos + i;
        }
    }
    return -1;
}

}

I fully expected my int parsing to be slower, since it won't ever be handling more than the vector size can hold (an int can never be more than 10 digits long).

By my benchmarks, parsing 123 as an int 10k times took 3081 microseconds with Integer.parseInt, and 80601 microseconds with my implementation. Searching for 'a' in a very long string ("____".repeat(4000) + "a" + "----".repeat(193)) took 7709 microseconds, versus 7 for String#indexOf.

Why is it so unbelievably slow? I thought the entire point of SIMD is that it's faster than the scalar equivalents for tasks like these.

Redempt
  • What hardware (and JVM version) did you test on? Also, you should update this with info from your comments on my answer, that apparently the long-string test was only one repeat. – Peter Cordes Jun 21 '21 at 03:35

3 Answers


You picked something SIMD is not great at (string->int), and something that JVMs are very good at optimizing out of loops. And you made an implementation with a bunch of extra copying work if the inputs aren't exact multiples of the vector width.


I'm assuming your times are totals (for 10k repeats each), not a per-call average.

7 us is impossibly fast for that.

"____".repeat(4000) is 16k chars (32k bytes) before the 'a', which I assume is what you're searching for. Even a well-tuned / unrolled wmemchr (aka indexOf) running at 2x 32-byte vectors per clock cycle, on a 4GHz CPU, would take 1250 us for 10k reps. (32000B / (64B/c) * 10000 reps / 4000 MHz), assuming that 32kB string stayed hot in 32KiB L1d cache.

I'd hope and expect a JVM would either call the native wmemchr or use something equally efficient for a commonly-used core library function like String#indexOf. For example, glibc's avx2 memchr is pretty well-tuned with loop unrolling. (Java strings are actually UTF-16, but C on Linux wchar_t is 4 bytes wide, unlike Windows, so JVMs would need their own implementation.)

Built-in String indexOf is also something the JIT "knows about". It's apparently able to hoist it out of loops when it can see that you're using the same string repeatedly as input. (But then what's it doing for the rest of those 7 us? I guess doing a not-quite-so-great memchr and then doing an empty 10k iteration loop at 1/clock could take about 7 microseconds, especially if your CPU isn't as fast as 4GHz.)

See Idiomatic way of performance evaluation? - if doubling the repeat-count to 20k doesn't double the time, your benchmark is broken and not measuring what you think it does.
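That sanity check can be sketched in plain Java (the class and method names here are illustrative, not from the question):

```java
public class BenchSanity {
    // Time `reps` calls of String.indexOf; return elapsed nanoseconds.
    static long timeIndexOf(String haystack, char needle, int reps) {
        int sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            sink += haystack.indexOf(needle);
        }
        long elapsed = System.nanoTime() - start;
        // Use sink so the JIT can't discard the loop body entirely.
        if (sink == Integer.MIN_VALUE) System.out.println(sink);
        return elapsed;
    }

    public static void main(String[] args) {
        String s = "____".repeat(4000) + "a" + "----".repeat(193);
        timeIndexOf(s, 'a', 20_000);            // warm-up: measure JITed code
        long t1 = timeIndexOf(s, 'a', 10_000);
        long t2 = timeIndexOf(s, 'a', 20_000);  // double the repeat count
        System.out.println("10k reps: " + t1 + " ns, 20k reps: " + t2 + " ns");
        // If t2 isn't roughly 2x t1, the benchmark isn't measuring real work.
    }
}
```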


But you say 7 us is the per-iteration time? That would be improbably slow, except for a non-optimized first pass, so it's probably a sign of faulty benchmarking methodology, like a lack of warm-up runs.

If IndexOf was checking one char at a time, 16k * 0.25 ns/char would be 4000 nanoseconds, or 4 microseconds on a 4GHz CPU. 7 us is in that ballpark of checking 1 char per cycle, which is pathetically slow on a modern x86. I think it's unlikely that mainstream JVMs would use such a slow implementation once they were done JIT optimizing.


Your manual SIMD indexOf is very unlikely to get optimized out of a loop. It makes a copy of the whole array every time, if the size isn't an exact multiple of the vector width!! (In ensureSize2). The normal technique is to fall back to scalar for the last size % width elements, which is obviously much better for large arrays. Or even better, do an unaligned load that ends at the end of the array (if the total size is >= vector width) for something where overlap with previous work is not a problem.
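The fallback-to-scalar pattern might look like this (a sketch against the incubator API; run with `--add-modules jdk.incubator.vector`, and the class name is made up for illustration):

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class IndexOfNoCopy {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    // Process full vectors, then fall back to scalar for the remaining
    // length % width bytes -- no copy of the input array is ever made.
    static int indexOf(byte[] b, byte target) {
        int i = 0;
        int upper = SPECIES.loopBound(b.length); // largest multiple of the width
        for (; i < upper; i += SPECIES.length()) {
            ByteVector vec = ByteVector.fromArray(SPECIES, b, i);
            int pos = vec.compare(VectorOperators.EQ, target).firstTrue();
            if (pos != SPECIES.length()) {       // firstTrue() == length means no match
                return i + pos;
            }
        }
        for (; i < b.length; i++) {              // scalar tail
            if (b[i] == target) return i;
        }
        return -1;
    }
}
```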

A decent memchr on modern x86 (using an algorithm like your indexOf without unrolling) should go at about 1 vector (16/32/64 bytes) per maybe 1.5 clock cycles, with data hot in L1d cache, without loop unrolling or anything. (Checking both the vector compare and the pointer bound as possible loop exit conditions takes extra asm instructions vs. a simple strlen, but see this answer for some microbenchmarks of a simple hand-written strlen that assumes aligned buffers). Your indexOf loop probably bottlenecks on front-end throughput on a CPU like Skylake, with its pipeline width of 4 uops/clock.

So let's guess that your implementation takes 1.5 cycles per 16-byte vector; perhaps you're on a CPU without AVX2? You didn't say.

16kB / 16B = 1000 vectors. At 1 vector per 1.5 clocks, that's 1500 cycles. On a 3GHz machine, 1500 cycles take 500 ns = 0.5 us per call, or 5000 us per 10k reps. But since the 16773-byte array isn't a multiple of 16, you're also copying the whole thing every call, so that costs some more time, and could plausibly account for your 7709 us total time.
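The back-of-the-envelope estimate above, written out (the constants are the ones assumed in this paragraph):

```java
public class ThroughputEstimate {
    public static void main(String[] args) {
        double bytes = 16_000;          // ~16kB of string data
        double vectorWidth = 16;        // bytes per 128-bit vector
        double cyclesPerVector = 1.5;   // guessed loop throughput
        double ghz = 3.0;               // assumed clock speed

        double vectors = bytes / vectorWidth;          // 1000 vectors
        double cycles = vectors * cyclesPerVector;     // 1500 cycles
        double nsPerCall = cycles / ghz;               // 500 ns per call
        double usPer10k = nsPerCall * 10_000 / 1_000;  // 5000 us per 10k reps
        System.out.println(vectors + " vectors, " + cycles + " cycles, "
                + nsPerCall + " ns/call, " + usPer10k + " us per 10k reps");
    }
}
```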


What SIMD is good for

"I thought the entire point of SIMD is that it's faster than the scalar equivalents for tasks like these."

No, "horizontal" stuff like ints.reduceLanes is something SIMD is generally slow at. And even with something like How to implement atoi using SIMD? using x86 pmaddwd to multiply and add pairs horizontally, it's still a lot of work.

Note that to make the elements wide enough to multiply by place-values without overflow, you have to unpack, which costs some shuffling. ints.reduceLanes takes about log2(elements) shuffle/add steps, and if you're starting with 512-bit AVX-512 vectors of int, the first 2 of those shuffles are lane-crossing, 3 cycle latency (https://agner.org/optimize/). (Or if your machine doesn't even have AVX2, then a 512-bit integer vector is actually 4x 128-bit vectors. And you had to do separate work to unpack each part. But at least the reduction will be cheap, just vertical adds until you get down to a single 128-bit vector.)

Peter Cordes
  • I should have clarified - it was only 10k calls for the int parsing. For indexOf, each was timed for a single call. – Redempt Jun 21 '21 at 03:08
  • Just wrote a basic test with a method for scalar multiply and vector multiply on an array - it seems the runtime is nearly identical for both small and large arrays. both taking about 9ms on an array size 2560000. https://hastebin.com/pexakenihe.java – Redempt Jun 21 '21 at 03:21
  • @Redempt: Oh, then 7 us for *one* call is pathetic. Maybe HotSpot doesn't handle indexOf specially? But IDK why / how one call to your manual intrinsics version could take so long either, unless your benchmark methodology is totally flawed and you're missing a warm-up run. I'd suggest using a repeat loop. – Peter Cordes Jun 21 '21 at 03:30
  • @Redempt: `out[i] = arr[i] * scalar;` - That's *so* easy to auto-vectorize that I think modern HotSpot will actually do it for you while JITting. ([Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?](https://stackoverflow.com/a/20267515) says it's been in since Java 7u40) At least that's good confirmation that the SIMD API isn't making things worse when used for normal vertical-SIMD things over arrays. (Or at least that both ways bottleneck on memory bandwidth.) – Peter Cordes Jun 21 '21 at 03:32
  • Update: I think when I was writing this answer, I forgot that Java Strings used UTF-16, so that's twice the cache footprint. 16k characters won't fit in a 32KiB L1d cache, so it might be running at the bandwidth of L2 cache. But should still be vastly faster. If `IndexOf` was checking one char at a time, 16k * 0.25 ns/char would be 4000 nanoseconds, or 4 microseconds on a 4GHz CPU. 7 us is in that ballpark of checking 1 char per cycle, which is pathetically slow on a modern x86. – Peter Cordes Mar 27 '23 at 15:38

This is an updated version of the post I originally put up on this subject on 29-Jan-2022. I'm very much obliged to @JimmyB and @rapaio for pointing out some major flaws there, and think I've addressed them now. Also, I'm in a position to compare Java 19 with Java 20.

I found this post because I thought I had hit something strange with Vector performance on something it should ostensibly be ideal for: multiplying two double arrays. As updated (after correcting errors) the main routine is this (for the Vector case):

  // SPECIES (a DoubleVector species) and ARRAY_LENGTH are class-level constants (not shown).
  static private void doVector(int iteration, double[] input1, double[] input2, double[] output) {
    long start = System.nanoTime();
    for (int i = 0; i < SPECIES.loopBound(ARRAY_LENGTH); i += SPECIES.length()) {
      DoubleVector va = DoubleVector.fromArray(SPECIES, input1, i);
      DoubleVector vb = DoubleVector.fromArray(SPECIES, input2, i);
      va.mul(vb).intoArray(output, i);
    }
    long finish = System.nanoTime();
    System.out.println("  vector (intoArray) duration (ns)\t" + iteration + "\t" + (finish - start));
  }

and this (for the scalar case):

  static private void doScalar(int iteration, double[] input1, double[] input2, double[] output) {
    long start = System.nanoTime();
    for (int i = 0; i < ARRAY_LENGTH; ++i) {
      output[i] = input1[i] * input2[i];
    }
    long finish = System.nanoTime();
    System.out.println("  scalar duration (ns)\t" + iteration + "\t" + (finish - start));
  }

For testing I'm using an array length of 65536 (random numbers) and 1024 iterations. The species length comes out at 4 on my machine (CPU is Intel i7-7700HQ at 2.8 GHz).

The timings take a while to settle down. Firstly, the first 2 scalar and 6 Vector iterations are considerably (approx. 10 times) slower than the later ones; secondly, the first 180 or so iterations are very erratic in timing. There are occasional spikes thereafter (presumably due to other things going on on the machine/JVM).

If I compare the median times (using median rather than mean to avoid outliers, and ignoring the first 200 iterations to remove the startup effects), in Java 19 the Vector method is roughly 8% faster than the scalar method, and in Java 20 roughly 9% faster. Which all means, I'm still not seeing "four times faster". I wonder whether the optimizer is distributing the scalar calculations... (or maybe I'm doing something else wrong this time :-) )
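A pure-vertical multiply like this is easy for the JIT to auto-vectorize from the scalar loop, which may explain the small gap. A reduction such as a dot product is a different story: with strict FP semantics the scalar sum can't be auto-vectorized (FP addition isn't associative), so the manual version has more room to win. A sketch (incubator API; the class and method names are illustrative):

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class DotProduct {
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    static double dotVector(double[] a, double[] b) {
        DoubleVector acc = DoubleVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            DoubleVector va = DoubleVector.fromArray(SPECIES, a, i);
            DoubleVector vb = DoubleVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc);      // acc += va * vb, lane-wise
        }
        double sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) {     // scalar tail for the remainder
            sum += a[i] * b[i];
        }
        // NB: lane-wise accumulation changes the FP summation order vs. a
        // scalar loop, so results can differ in the last few bits.
        return sum;
    }
}
```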

Tim V
  • Do _not_ use `Instant.now()` (or `System.currentTimeMillis()`) for timing these things! As you have observed, it tends to have too low a resolution, typically about 10ms; which explains the 'random' results flipping between '0' and '10+'ms. For stuff like this, _always_ use [`System.nanoTime()`](https://docs.oracle.com/en/java/javase/12/docs/api/java.base/java/lang/System.html#nanoTime()). – JimmyB Oct 08 '22 at 22:17
  • Thanks @JimmyB and d'oh of course you're quite correct in what you say, and I've changed to use System.nanoTime(). That has removed the 0 times as you predicted, but I'm still seeing that the vector times are still more than the scalar ones - for iterations 17-32 (i.e., after the "warm up" I mention above) I'm seeing around 280,000 ns for scalar and 430,000 ns for vector :-( (running on jdk 19 now) – Tim V Oct 10 '22 at 19:08
  • Your main problem here is line with System.arraycopy which creates a new small array for intermediate result. You should use instead va.mult(vb).intoArray(SPECIES, output, i) – rapaio Feb 04 '23 at 08:35
  • 1
    Many thanks @rapaio. Looking into the docs I now have va.mul(vb).intoArray(output, i) which is now running at just about the same speed (after a much slower warmup) as the scalar version. I'll post some more detailed results in due course as an update to my original (with the code update, for which i'm much obliged to you). – Tim V Feb 05 '23 at 16:57
  • `out[i] = a[i] * b[i]` is so easy to *auto*-vectorize that HotSpot is probably JITing vectorized asm from your scalar code. It just has to check for overlap of either input with the output before entering the loop. [Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?](https://stackoverflow.com/a/20267515) says it's been a feature since Java 7u40, for simple pure-vertical loops. With strict FP math, a reduction like summing an array or a dot-product can't be auto-vectorized because FP addition isn't associative. So you could try that. – Peter Cordes Mar 27 '23 at 15:45
  • Also, max turbo for an i7-7700HQ is 3.8GHz. Unless you disabled turbo, it's probably not accurate to say "at 2.8 GHz", unless even a single active thread heats it up enough to limit the clock to "base" frequency of 2.8. – Peter Cordes Mar 27 '23 at 15:46

Here are some tests I ran; I get terrible results from the SIMD Vector versions.

Results:
WARNING: Using incubator modules: jdk.incubator.vector
Species[long, 4, S_256_BIT]
Scalar: 112
Vectorized: 415
Species[double, 4, S_256_BIT]
Scalar: 35
Vectorized: 125
Species[int, 8, S_256_BIT]
Scalar: 50
Vectorized: 257
Species[byte, 32, S_256_BIT]
Scalar: 99
Vectorized: 230
Species[float, 8, S_256_BIT]
Scalar: 41
Vectorized: 207

Java Code:

package testingVectorSIMD;

import static org.junit.Assert.assertArrayEquals;

import java.time.Duration;
import java.time.Instant;
import java.util.Random;

import org.junit.jupiter.api.Test;

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.LongVector;

public class SpeciesTest {

     
    @Test
    public void doBytes() {

        var spc = ByteVector.SPECIES_PREFERRED;
        System.out.println(spc.toString());

        Random rand = new Random();
        int size = 1000000;
        var a = new byte[size];
        var b = new byte[size];
        for (int ix = 0; ix < size; ix++) {
            byte by = (byte) (rand.nextInt());
            a[ix] = by;
            by = (byte) (rand.nextInt());
            b[ix] = by;
        }

        Instant start = Instant.now();
        var nonresult = new byte[a.length];
        for (int ix = 0; ix < a.length; ix++) {
            nonresult[ix] = (byte) (a[ix] * b[ix]) ;
        }
        Instant stop = Instant.now();
        System.out.println("Scalar: " + Duration.between(start, stop).toMillis());

        var vecresult = new byte[a.length];
        start = Instant.now();
        int ix = 0;
        for (; ix < spc.loopBound(a.length); ix += spc.length()) {
            var va = ByteVector.fromArray(spc, a, ix);
            var vb = ByteVector.fromArray(spc, b, ix);
            var result = va.mul(vb);
            result.intoArray(vecresult, ix);
        }
        for (int ix2 = ix; ix2 < a.length; ix2++)
            vecresult[ix2] = (byte) (a[ix2] * b[ix2]);
        stop = Instant.now();
        System.out.println("Vectorized: " + Duration.between(start, stop).toMillis());

        assertArrayEquals(nonresult, vecresult);
         

    }

    @Test
    public void doInt() {

        var spc = IntVector.SPECIES_PREFERRED;
        System.out.println(spc.toString());
        Random rand = new Random();
        int size = 1000000;
        var a = new int[size];
        var b = new int[size];
        for (int ix = 0; ix < size; ix++) {
            int by = rand.nextInt();
            a[ix] = by;
            by = rand.nextInt();
            b[ix] = by;
        }

        Instant start = Instant.now();
        var nonresult = new int[a.length];
        for (int ix = 0; ix < a.length; ix++) {
            nonresult[ix] = a[ix] * b[ix];
        }
        Instant stop = Instant.now();
        System.out.println("Scalar: " + Duration.between(start, stop).toMillis());

        var vecresult = new int[a.length];
        start = Instant.now();
        int ix = 0;
        for (; ix < spc.loopBound(a.length); ix += spc.length()) {
            var va = IntVector.fromArray(spc, a, ix);
            var vb = IntVector.fromArray(spc, b, ix);
            var result = va.mul(vb);
            result.intoArray(vecresult, ix);
        }
        for (int ix2 = ix; ix2 < a.length; ix2++)
            vecresult[ix2] = a[ix2] * b[ix2];
        stop = Instant.now();
        System.out.println("Vectorized: " + Duration.between(start, stop).toMillis());
        assertArrayEquals(nonresult, vecresult);

    }

    @Test
    public void doLong() {

        var spc = LongVector.SPECIES_PREFERRED;
        System.out.println(spc.toString());
        Random rand = new Random();
        int size = 1000000;
        var a = new long[size];
        var b = new long[size];
        for (int ix = 0; ix < size; ix++) {
            long by = rand.nextLong();
            a[ix] = by;
            by = rand.nextLong();
            b[ix] = by;
        }

        Instant start = Instant.now();
        var nonresult = new long[a.length];
        for (int ix = 0; ix < a.length; ix++) {
            nonresult[ix] = a[ix] * b[ix];
        }
        Instant stop = Instant.now();
        System.out.println("Scalar: " + Duration.between(start, stop).toMillis());

        var vecresult = new long[a.length];
        start = Instant.now();
        int ix = 0;
        for (; ix < spc.loopBound(a.length); ix += spc.length()) {
            var va = LongVector.fromArray(spc, a, ix);
            var vb = LongVector.fromArray(spc, b, ix);
            var result = va.mul(vb);
            result.intoArray(vecresult, ix);
        }
        for (int ix2 = ix; ix2 < a.length; ix2++)
            vecresult[ix2] = a[ix2] * b[ix2];
        stop = Instant.now();
        System.out.println("Vectorized: " + Duration.between(start, stop).toMillis());
        assertArrayEquals(nonresult, vecresult);

    }

    @Test
    public void doFloat() {

        var spc = FloatVector.SPECIES_PREFERRED;
        System.out.println(spc.toString());
        Random rand = new Random();
        int size = 1000000;
        var a = new float[size];
        var b = new float[size];
        for (int ix = 0; ix < size; ix++) {
            float by = rand.nextFloat();
            a[ix] = by;
            by = rand.nextFloat();
            b[ix] = by;
        }

        Instant start = Instant.now();
        var nonresult = new float[a.length];
        for (int ix = 0; ix < a.length; ix++) {
            nonresult[ix] = a[ix] * b[ix];
        }
        Instant stop = Instant.now();
        System.out.println("Scalar: " + Duration.between(start, stop).toMillis());

        var vecresult = new float[a.length];
        start = Instant.now();
        int ix = 0;
        for (; ix < spc.loopBound(a.length); ix += spc.length()) {
            var va = FloatVector.fromArray(spc, a, ix);
            var vb = FloatVector.fromArray(spc, b, ix);
            var result = va.mul(vb);
            result.intoArray(vecresult, ix);
        }
        for (int ix2 = ix; ix2 < a.length; ix2++)
            vecresult[ix2] = a[ix2] * b[ix2];
        stop = Instant.now();
        System.out.println("Vectorized: " + Duration.between(start, stop).toMillis());

        assertArrayEquals(nonresult, vecresult, 0);

    }

    @Test
    public void doDouble() {

        var spc = DoubleVector.SPECIES_PREFERRED;
        System.out.println(spc.toString());
        Random rand = new Random();
        int size = 1000000;
        var a = new double[size];
        var b = new double[size];
        for (int ix = 0; ix < size; ix++) {
            double by = rand.nextDouble();
            a[ix] = by;
            by = rand.nextDouble();
            b[ix] = by;
        }

        Instant start = Instant.now();
        var nonresult = new double[a.length];
        for (int ix = 0; ix < a.length; ix++) {
            nonresult[ix] = a[ix] * b[ix];
        }
        Instant stop = Instant.now();
        System.out.println("Scalar: " + Duration.between(start, stop).toMillis());

        var vecresult = new double[a.length];
        start = Instant.now();
        int ix = 0;
        for (; ix < spc.loopBound(a.length); ix += spc.length()) {
            var va = DoubleVector.fromArray(spc, a, ix);
            var vb = DoubleVector.fromArray(spc, b, ix);
            var result = va.mul(vb);
            result.intoArray(vecresult, ix);
        }
        for (int ix2 = ix; ix2 < a.length; ix2++)
            vecresult[ix2] = a[ix2] * b[ix2];
        stop = Instant.now();
        System.out.println("Vectorized: " + Duration.between(start, stop).toMillis());
        assertArrayEquals(nonresult, vecresult, 0);

    }

}
inkredusk
  • Did you look at [the previous answer](https://stackoverflow.com/a/70904331/224132) using simple multiplication? They reported that timing took many warm-up iterations before settling down. Also, comments under that answer indicate `Instant.now()` has very coarse resolution, at least on some systems, like as bad as 10 ms. Also, you're allocating a new output array for every test, and touching it for the first time inside the timed region. So your timings might include page faults, at least until the JVM warms up some and has some dirty memory in its free list. – Peter Cordes Apr 27 '23 at 10:57
  • The timing logic is outside the scope of work and shows equivalent/consistent measurements. Where's the solution if we need to wait for the JVM to warm up? – AI Stock Trends With WEKA TA-L Apr 27 '23 at 18:09
  • Benchmark methodology is 100% relevant, especially for a language like Java where JIT warm-up is a huge deal beyond the usual factors. See [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) – Peter Cordes Apr 27 '23 at 18:13
  • I wonder if one of the reasons it's so slow is the need to create the vector object ("FloatVector.fromArray") and then copy the results back to the array afterwards. This, in my opinion, is adding an unreasonable amount of overhead, effectively making the API useless in its current form. However, it says "This JEP also clarifies that alignment with Project Valhalla is a critical part of completing the Vector API." - I guess by using value objects, introduced in Valhalla, it will be possible to get around this bottleneck. – Tony Weston May 21 '23 at 08:27