
I have followed the Intel tutorial for SIMD in Java with Project Panama. I want to do some simple operations on arrays.

Here are the scalar and vector loops from the website:

public static void scalarComputation(float[] a, float[] b, float[] c) {
    for (int i = 0; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}

public static void vectorComputation(float[] a, float[] b, float[] c) {
    int i = 0;
    // speciesFloat is the 256-bit float species declared as in the tutorial
    for (; i < (a.length & ~(speciesFloat.length() - 1));
         i += speciesFloat.length()) {
        FloatVector<Shapes.S256Bit> va = speciesFloat.fromArray(a, i);
        FloatVector<Shapes.S256Bit> vb = speciesFloat.fromArray(b, i);
        FloatVector<Shapes.S256Bit> vc = va.mul(va)
                .add(vb.mul(vb))
                .neg();
        vc.intoArray(c, i);
    }

    for (; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}

When I measure the time:

float[] A = new float[N];
float[] B = new float[N];
float[] C = new float[N];

for (int i = 0; i < C.length; i++) {
    C[i] = 2.0f;
    A[i] = 2.0f;
    B[i] = 2.0f;
}

long start = System.nanoTime();
for (int i = 0; i < 200; i++) {
    // uncomment one of the two calls per run
    //scalarComputation(C, A, B);
    //vectorComputation(C, A, B);
}
long end = System.nanoTime();
System.out.println(end - start);

I always get a higher time for the vector version than for the scalar one. Do you have any idea why? Thank you.

Jorn Vernee
K.Vu
  • What is the value of `N` in your benchmarks? – Ramón J Romero y Vigil Jun 07 '18 at 11:54
  • For small (enough) vectors, the overheads of passing information to the vector processing engine will exceed the savings. – Stephen C Jun 07 '18 at 11:59
  • N is equal to 345600 = 480*720 – K.Vu Jun 07 '18 at 11:59
  • @StephenC: I assume the OP is on x86-64, where all current microarchitectures integrate the vector ALUs *very* tightly. It's worth using SIMD to copy 16 bytes, if the source wasn't already in integer registers! Scalar FP math uses the same XMM0..15 registers as vector math anyway, like [`addss` (scalar single precision)](http://felixcloutier.com/x86/ADDSS.html) instead of [`addps` (packed single-precision)](http://felixcloutier.com/x86/ADDPS.html). 256-bit vectors will need YMM registers (unless Panama emulates wide vectors on top of 128-bit SIMD...). XMM0..15 are the low halves of YMM0..15. – Peter Cordes Jun 07 '18 at 12:15
  • If you're only repeating this 200 times, you might be seeing AVX 256-bit startup effects. If this runs at ~8 floats per clock (1 vmulps + 1 vfmsubps), your benchmark interval might be as short as 2 ms on a 4GHz Haswell or Skylake. That's still 8 mil clock cycles, so you're probably ok as far as upper halves of execution units powering up. Agner Fog observed the warm-up period to be 14 us on Skylake: http://www.agner.org/optimize/blog/read.php?i=415#415. (And BTW, 8 floats mul+FMA per clock isn't actually possible here because of the front-end bottleneck from load + store instructions.) – Peter Cordes Jun 07 '18 at 12:23
  • @K.Vu: did you warm up the JVM so it has time to JIT-compile the hot loop? And did you check that the scalar version doesn't optimize away the work in the loop? (Nothing reads the result.) A good sanity check is that the time scales linearly with the repeat count. If it scales but not linearly, startup overhead is a problem. If it doesn't scale, your benchmark optimized away. If it is linear, then you might be measuring what you intended (but still no guarantee). – Peter Cordes Jun 07 '18 at 12:32
  • What hardware are you testing on? If you include the JIT-compiled asm for the inner loop in your question ([How to see JIT-compiled code in JVM?](https://stackoverflow.com/q/1503479)), and tag with `[x86]`, I can tell you why the Panama version is slower (if it wasn't a measurement error), assuming you're on x86 (http://agner.org/optimize/). e.g. maybe the JIT compiler auto-vectorizes your simple scalar loop using a better choice of instructions, if your Panama code forces an actual multiply by `-1.0` instead of a SUB or FMSUB instead of ADD or FMADD. Or misaligned arrays on Sandybridge? – Peter Cordes Jun 07 '18 at 12:51
  • I have an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz. L1 cache is 128kB, L2 is 512kB and L3 is 4096kB. I have just set -XX:TypeProfileLevel=121 -XX:+UseVectorApiIntrinsics in my run configuration. – K.Vu Jun 07 '18 at 13:00
  • I have already done an HPC project in C with [SSE and SSE2](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE) and I was curious if Java could do the same. – K.Vu Jun 07 '18 at 13:04
  • You really need to write a [proper benchmark](https://stackoverflow.com/a/513259/581205). Java needs quite some time (seconds) until it runs properly fast, and your benchmark seems to last just milliseconds. Forget it, use [JMH](http://openjdk.java.net/projects/code-tools/jmh/); anything else is wasting time. – maaartinus Jun 07 '18 at 13:19
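Following the advice in the comments above, a minimal hand-rolled sanity check would warm up the JVM first, consume the result so the JIT can't eliminate the loop, and confirm that the time scales linearly with the repeat count. A sketch (the class name and constants here are illustrative, not from the question):

```java
import java.util.Arrays;

public class BenchSanityCheck {
    static final int N = 345_600; // 480 * 720, as in the question

    // Same kernel as the question's scalar loop
    static void scalarComputation(float[] a, float[] b, float[] c) {
        for (int i = 0; i < a.length; i++) {
            c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
        }
    }

    public static void main(String[] args) {
        float[] a = new float[N], b = new float[N], c = new float[N];
        Arrays.fill(a, 2.0f);
        Arrays.fill(b, 2.0f);

        // Warm up so the JIT compiles the hot loop before we measure
        for (int i = 0; i < 1_000; i++) {
            scalarComputation(a, b, c);
        }

        // Time two repeat counts: the second should take roughly 2x the
        // first if startup costs are amortized and nothing is optimized away
        float sink = 0;
        for (int reps : new int[] {200, 400}) {
            long start = System.nanoTime();
            for (int i = 0; i < reps; i++) {
                scalarComputation(a, b, c);
            }
            long elapsed = System.nanoTime() - start;
            sink += c[0]; // consume the result so the loop stays live
            System.out.println(reps + " reps: " + elapsed + " ns");
        }
        System.out.println("checksum: " + sink);
    }
}
```

This is still no substitute for JMH, which handles warm-up, dead-code elimination, and statistical reporting for you.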

1 Answer


You are using the wrong branch: build from the `vectorIntrinsics` branch instead. You also need to use JMH to get proper measurements; there are third-party benchmarks written for the Vector API that you can use as a starting point.

For the difference the Vector API makes to a dot-product calculation, see here.
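For reference, in the current incubating form of the API (module `jdk.incubator.vector`, compiled and run with `--add-modules jdk.incubator.vector`), the kernel from the question can be sketched roughly like this; `SPECIES_PREFERRED` and `loopBound` replace the tutorial's hand-rolled species generics and bit-masking:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class VectorComputation {
    // Widest shape the hardware supports (e.g. 256-bit on AVX2)
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    public static void vectorComputation(float[] a, float[] b, float[] c) {
        int i = 0;
        int upperBound = SPECIES.loopBound(a.length);
        for (; i < upperBound; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            // -(a*a + b*b), matching the scalar loop
            va.mul(va).add(vb.mul(vb)).neg().intoArray(c, i);
        }
        // Scalar tail for the remaining elements
        for (; i < a.length; i++) {
            c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
        }
    }
}
```

`loopBound(a.length)` returns the largest multiple of the species length not exceeding `a.length`, so the scalar tail handles the leftovers exactly as in the question's code.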