
I have followed the Intel tutorial for SIMD in Java with Project Panama. I want to do some simple operations on arrays.

Here are the scalar and vector loops from the website:

public static void scalarComputation(float[] a, float[] b, float[] c) {
    for (int i = 0; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}

public static void vectorComputation(float[] a, float[] b, float[] c) {
    int i = 0;
    // speciesFloat is the 256-bit float species declared as in the tutorial
    for (; i < (a.length & ~(speciesFloat.length() - 1));
         i += speciesFloat.length()) {
        FloatVector<Shapes.S256Bit> va = speciesFloat.fromArray(a, i);
        FloatVector<Shapes.S256Bit> vb = speciesFloat.fromArray(b, i);
        FloatVector<Shapes.S256Bit> vc = va.mul(va)
                .add(vb.mul(vb))
                .neg();
        vc.intoArray(c, i);
    }

    for (; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}

When I measure the time:

float[] A = new float[N];
float[] B = new float[N];
float[] C = new float[N];

for (int i = 0; i < C.length; i++) {
    C[i] = 2.0f;
    A[i] = 2.0f;
    B[i] = 2.0f;
}

long start = System.nanoTime();
for (int i = 0; i < 200; i++) {
    // uncomment one of the two calls per run
    //scalarComputation(C, A, B);
    //vectorComputation(C, A, B);
}
long end = System.nanoTime();
System.out.println(end - start);

I always get a higher time for the vector version than for the scalar one. Do you have any idea why? Thank you.

Jorn Vernee
K.Vu
  • What is the value of `N` in your benchmarks? – Ramón J Romero y Vigil Jun 07 '18 at 11:54
  • For small (enough) vectors, the overheads of passing information to the vector processing engine will exceed the savings. – Stephen C Jun 07 '18 at 11:59
  • N is equal to 345600 = 480*720 – K.Vu Jun 07 '18 at 11:59
  • @StephenC: I assume the OP is on x86-64, where all current microarchitectures integrate the vector ALUs *very* tightly. It's worth using SIMD to copy 16 bytes, if the source wasn't already in integer registers! Scalar FP math uses the same XMM0..15 registers as vector math anyway, like [`addss` (scalar single precision)](http://felixcloutier.com/x86/ADDSS.html) instead of [`addps` (packed single-precision)](http://felixcloutier.com/x86/ADDPS.html). 256-bit vectors will need YMM registers (unless Panama emulates wide vectors on top of 128-bit SIMD...). XMM0..15 are the low halves of YMM0..15. – Peter Cordes Jun 07 '18 at 12:15
  • If you're only repeating this 200 times, you might be seeing AVX 256-bit startup effects. If this runs at ~8 floats per clock (1 vmulps + 1 vfmsubps), your benchmark interval might be as short as 2 ms on a 4GHz Haswell or Skylake. That's still 8 mil clock cycles, so you're probably ok as far as upper halves of execution units powering up. Agner Fog observed the warm-up period to be 14 us on Skylake: http://www.agner.org/optimize/blog/read.php?i=415#415. (And BTW, 8 floats mul+FMA per clock isn't actually possible here because of the front-end bottleneck from load + store instructions.) – Peter Cordes Jun 07 '18 at 12:23
  • @K.Vu: did you warm up the JVM so it has time to JIT-compile the hot loop? And did you check that the scalar version doesn't optimize away the work in the loop? (Nothing reads the result.) A good sanity check is that the time scales linearly with the repeat count. If it scales but not linearly, startup overhead is a problem. If it doesn't scale, your benchmark optimized away. If it is linear, then you might be measuring what you intended (but still no guarantee). – Peter Cordes Jun 07 '18 at 12:32
  • What hardware are you testing on? If you include the JIT-compiled asm for the inner loop in your question ([How to see JIT-compiled code in JVM?](https://stackoverflow.com/q/1503479)), and tag with `[x86]`, I can tell you why the Panama version is slower (if it wasn't a measurement error), assuming you're on x86 (http://agner.org/optimize/). e.g. maybe the JIT compiler auto-vectorizes your simple scalar loop using a better choice of instructions, if your Panama code forces an actual multiply by `-1.0` instead of a SUB or FMSUB instead of ADD or FMADD. Or misaligned arrays on Sandybridge? – Peter Cordes Jun 07 '18 at 12:51
  • I have an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz. L1 cache is 128kB, L2 is 512kB and L3 is 4096kB. I have just set -XX:TypeProfileLevel=121 -XX:+UseVectorApiIntrinsics in my run configuration. – K.Vu Jun 07 '18 at 13:00
  • I have already done an HPC project in C with [SSE and SSE2](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE) and I was curious if Java could do the same. – K.Vu Jun 07 '18 at 13:04
  • You really need to write a [proper benchmark](https://stackoverflow.com/a/513259/581205). Java needs quite some time (seconds) until it runs properly fast, and your benchmark seems to last just milliseconds. Forget it, use [JMH](http://openjdk.java.net/projects/code-tools/jmh/); anything else is wasting time. – maaartinus Jun 07 '18 at 13:19
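Following the advice in the comments above, a minimal hand-rolled sanity check would warm up the JVM first, consume the result so the JIT can't eliminate the loop, and confirm that the time scales linearly with the repeat count. A sketch (the class name and constants here are illustrative, not from the question):

```java
import java.util.Arrays;

public class BenchSanityCheck {
    static final int N = 345_600; // 480 * 720, as in the question

    // Same kernel as the question's scalar loop
    static void scalarComputation(float[] a, float[] b, float[] c) {
        for (int i = 0; i < a.length; i++) {
            c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
        }
    }

    public static void main(String[] args) {
        float[] a = new float[N], b = new float[N], c = new float[N];
        Arrays.fill(a, 2.0f);
        Arrays.fill(b, 2.0f);

        // Warm up so the JIT compiles the hot loop before we measure
        for (int i = 0; i < 1_000; i++) {
            scalarComputation(a, b, c);
        }

        // Time two repeat counts: the second should take roughly 2x the
        // first if startup costs are amortized and nothing is optimized away
        float sink = 0;
        for (int reps : new int[] {200, 400}) {
            long start = System.nanoTime();
            for (int i = 0; i < reps; i++) {
                scalarComputation(a, b, c);
            }
            long elapsed = System.nanoTime() - start;
            sink += c[0]; // consume the result so the loop stays live
            System.out.println(reps + " reps: " + elapsed + " ns");
        }
        System.out.println("checksum: " + sink);
    }
}
```

This is still no substitute for JMH, which handles warm-up, dead-code elimination, and statistical reporting for you.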

1 Answer


You are using the wrong branch: build from the `vectorIntrinsics` branch instead. You also need to use JMH to get proper measurements; there are third-party benchmarks written for the Vector API that you can use as a starting point.

For the difference the Vector API makes to a dot-product calculation, see here.
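For reference, in the current incubating form of the API (module `jdk.incubator.vector`, compiled and run with `--add-modules jdk.incubator.vector`), the kernel from the question can be sketched roughly like this; `SPECIES_PREFERRED` and `loopBound` replace the tutorial's hand-rolled species generics and bit-masking:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class VectorComputation {
    // Widest shape the hardware supports (e.g. 256-bit on AVX2)
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    public static void vectorComputation(float[] a, float[] b, float[] c) {
        int i = 0;
        int upperBound = SPECIES.loopBound(a.length);
        for (; i < upperBound; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            // -(a*a + b*b), matching the scalar loop
            va.mul(va).add(vb.mul(vb)).neg().intoArray(c, i);
        }
        // Scalar tail for the remaining elements
        for (; i < a.length; i++) {
            c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
        }
    }
}
```

`loopBound(a.length)` returns the largest multiple of the species length not exceeding `a.length`, so the scalar tail handles the leftovers exactly as in the question's code.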