
I am trying to understand the possible benefits of compiling C++ code with NEON flags enabled in GCC. To that end, I wrote a small program that iterates through an array and performs simple arithmetic operations.

I changed the code so that anyone can compile and run it. If anyone would be kind enough to perform this test and share their results, I'd much appreciate it :)

EDIT: I'd really ask that someone who happens to have a Cortex-A9 board nearby perform this test and check whether the result is the same. I'd really appreciate it.

#include <ctime>

int main()
{
    unsigned long long arraySize = 30000000;

    unsigned short* arrayShort = new unsigned short[arraySize];

    std::clock_t begin;

    for (unsigned long long n = 0; n < arraySize; n++)
    {
        *arrayShort = rand() % 100 + 1;
        arrayShort++;
    }

    arrayShort -= arraySize;

    begin = std::clock();
    for (unsigned long long n = 0; n < arraySize; n++)
    {
        *arrayShort += 10;
        *arrayShort /= 3;

        arrayShort++;
    }

     std::cout << "Time: " << (std::clock() - begin) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;

    arrayShort -= arraySize;
    delete[] arrayShort;

    return 0;
}

Basically, I fill a 30,000,000-element array with random numbers between 1 and 100, then go through every element, add 10, and divide by 3. I was expecting that compiling this code with NEON flags enabled would lead to big improvements, since NEON can perform multiple array operations at a time.

I am compiling this code to run on a Cortex-A9 ARM board, using the Linaro toolchain with GCC 4.8.3. I compiled the code with and without the following flags:

-O3 -mcpu=cortex-a9 -ftree-vectorize -mfloat-abi=hard -mfpu=neon 
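For completeness, the full invocations look roughly like this (the cross-compiler name arm-linux-gnueabihf-g++ and the file name main.cpp are placeholders for my setup):

arm-linux-gnueabihf-g++ -O3 main.cpp -o test_plain
arm-linux-gnueabihf-g++ -O3 -mcpu=cortex-a9 -ftree-vectorize -mfloat-abi=hard -mfpu=neon main.cpp -o test_neon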

I also replicated the code to run with arrays of type unsigned int, float, and double, and these are the results in seconds:

Array type unsigned short: 
With NEON flags: 0.07s
Without NEON flags: 0.089s

Array type unsigned int: 
With NEON flags: 0.524s
Without NEON flags: 0.529s

Array type float: 
With NEON flags: 0.65s
Without NEON flags: 0.673s

Array type double: 
With NEON flags: 0.955s
Without NEON flags: 0.927s

You can see that for the most part there is almost no improvement from using the NEON flags, and they even lead to worse results in the case of the array of doubles.

I really feel that I'm doing something wrong here; perhaps you can help me interpret these results.

Pedro Batista
  • What -O flag are you using? If it's "none" then note that benchmarking unoptimised code is utterly meaningless. I tried compiling to see what the assembly looks like, but being a bit C++-challenged I don't know where to find `Timer` and `RNG`. – Notlikethat Feb 17 '15 at 19:13
  • -O3, or don't bother. The array of doubles will not benefit from ARMv7 NEON, and even if you shrink it down to float (which can), you need -ffast-math. – unixsmurf Feb 17 '15 at 20:03
  • NEON does not support integer division, so there's nothing to vectorize. Try a multiply instead (see the sketch after these comments). – Feb 17 '15 at 21:11
  • I've done this sort of test and the latest GCC still doesn't vectorize properly. Microsoft's ARM compiler can do some NEON vectorization. If you want fast ARM/NEON code, write assembly language. Depending on the compiler for optimized performance is rarely the right option (in my experience). – BitBank Feb 18 '15 at 07:45
  • Auto-vectorizations are utterly useless most of the time, regardless of compiler. – Jake 'Alquimista' LEE Feb 18 '15 at 10:09
  • Timer was a custom-made class and RNG belongs to the OpenCV library, sorry about that. I'll change my code so anyone can copy, paste, and compile it. I was already compiling this code with the -O3 flag, sorry I forgot to mention. Adding the -ffast-math flag does indeed improve the processing time of the array of doubles, from 0.927s to 0.88s. Changing the division to a multiply leads to an improvement only in the float array process; it goes to 0.55s. – Pedro Batista Feb 18 '15 at 10:29
  • @Jake'Alquimista'LEE Is that so? So there is no point in trying to use these flags to optimize code for ARM? That's not what I read in my previous question at http://stackoverflow.com/questions/28547697/coding-for-arm-neon-how-to-start/28549883?noredirect=1#comment45451709_28549883. People even state that in some cases the compiler beats handwritten ARM NEON assembly code. – Pedro Batista Feb 18 '15 at 10:31
  • In any case, my point was to give the compiler a very straightforward, easy case for it to optimize; what would be easier than a per-element arithmetic operation on an array? I really was expecting a drastic change, to the point that I still believe I am doing something wrong rather than these being the expected results. – Pedro Batista Feb 18 '15 at 10:33
  • I changed the code to only contain functions from the std lib. – Pedro Batista Feb 18 '15 at 12:53
  • It's really astounding how this myth came about. Compilers are much worse than Google Translate. When compiler-generated code runs a little faster than hand-written assembly, it means the assembly code is lackluster. If the former wipes the floor with the latter, something is terribly wrong with the test code in the first place, like embedding the test data within the code. In that case, the compiler removes the iteration altogether, just returning the pre-calculated result at build time, while the hand-written version executes exactly what the programmer wrote. – Jake 'Alquimista' LEE Feb 18 '15 at 16:07
  • Also note that even disregarding compiler loop-flattening, trivial calculations on large buffers are generally a rubbish benchmark - if (as is fairly likely on a decent modern CPU) the scalar code can keep up with the external memory bandwidth, then it doesn't matter how much faster a couple of vector instructions execute when the remaining equivalent cycles are spent stalled waiting for a cache line fill. – Notlikethat Feb 18 '15 at 23:00
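For clarity, the multiply variant suggested in the comments amounts to replacing the timed loop with the following (a sketch; the rest of the program is unchanged):

// Multiply by 3 instead of dividing by 3, since NEON has vector integer
// multiplies but no vector integer divide instruction.
for (unsigned long long n = 0; n < arraySize; n++)
{
    *arrayShort += 10;
    *arrayShort *= 3;

    arrayShort++;
}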

2 Answers


I had to fix up your code with:

#include <iostream>
#include <cstdlib>

After which, GCC 5.0 auto-vectorizes your loop like so:

.L7:
    vld1.64 {d16-d17}, [r1:64]  @ load 8 x u16 from the array
    adds    r4, r4, #1          @ increment the 64-bit loop counter (low word)...
    vadd.i16    q8, q8, q11     @ add the splatted constant 10 to all 8 lanes
    adc r5, r5, #0              @ ...and carry into the high word
    cmp r3, r5
    add r1, r1, #16             @ advance the load pointer
    vmull.u16 q9, d16, d20      @ widening multiply by a magic constant: the /3
    cmpeq   r2, r4              @ 64-bit compare against arraySize
    vmull.u16 q8, d17, d21
    add lr, lr, #16             @ advance the store pointer
    vuzp.16 q9, q8              @ unzip to collect the high halves of the products
    vshr.u16    q8, q8, #1      @ final >>1, completing the divide by 3
    vstr    d16, [lr, #-16]     @ store the 8 results...
    vstr    d17, [lr, #-8]      @ ...in two 64-bit halves
    bhi .L7

So yes, the compiler can autovectorize the code, but is it any good? On a Cortex-A7 board I have nearby, I see the following times:

g++ ~/foo.cpp -O3
./a.out 
Time: 129.355 ms

g++ ~/foo.cpp -O3 -fno-tree-vectorize
./a.out 
Time: 430.405 ms

That looks like about what you would hope for from a 4x vectorization factor (4 x 16-bit values).

In this case, I think the data and the generated assembly speak for themselves and refute some of the claims in the comments above. The compiler can, and will, perform auto-vectorisation, and the performance you can achieve from it is a meaningful speedup.

Also of note, the compiler has beaten one of the expert assembly programmers from the comments!

NEON does not support integer division, so there's nothing to vectorize. Try a multiply instead.

True in the general case, yes. But efficient sequences do exist for dividing by particular constants using NEON, and 3 happens to be one of those constants!
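Out of curiosity, here is roughly what that trick looks like written with intrinsics (a minimal sketch, not the compiler's exact output; the function name div3_u16 is mine). It relies on the identity x / 3 == (x * 0xAAAB) >> 17, which holds for every 16-bit x because 3 * 0xAAAB == 2^17 + 1:

#include <arm_neon.h>

// Divide 8 unsigned shorts by 3: widening multiply by 0xAAAB, keep the
// high halves of the 32-bit products (>>16), then shift right once more.
uint16x8_t div3_u16(uint16x8_t x)
{
    const uint16x4_t magic = vdup_n_u16(0xAAAB);
    uint32x4_t lo = vmull_u16(vget_low_u16(x), magic);  // 16x16 -> 32-bit products
    uint32x4_t hi = vmull_u16(vget_high_u16(x), magic);
    uint16x8_t q = vcombine_u16(vshrn_n_u32(lo, 16),    // narrow, keeping bits 16..31
                                vshrn_n_u32(hi, 16));
    return vshrq_n_u16(q, 1);                           // total shift of 17
}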

My Linaro/Ubuntu GCC 4.8.2 system compiler also vectorizes this code, producing very similar code to the above, with similar timings.

James Greenhalgh
  • Thanks for your answer. A few questions: 1 - I didn't quite understand what you changed in my code; you only added those two include directives? What exactly changed between my code and yours? 2 - For which array did you test that loop? For the array of shorts, my version only takes 70 ms (about 90 ms without NEON flags) to perform the calculations, whilst yours takes 129 ms. Could you clarify that? 3 - Can you explain why NEON can generally vectorize multiplications but not divisions, since a division can always be replaced by a multiplication (3 / 5 = 3 * 0.2)? – Pedro Batista Feb 20 '15 at 02:47
  • Ah, never mind point number 2, I just realized you are using an A7 board. – Pedro Batista Feb 20 '15 at 02:52
  • 1: Yes, all I did was copy-paste the code in the question, then add the two extra includes to get access to `rand()` and `std::cout`. 3: Forming the reciprocal for multiplication requires either a division or repeated application of the `vrecpe`/`vrecps` instructions to estimate and refine its value (see the sketch below these comments). Each of these approaches is only applicable to floating-point values, so wouldn't help with the integer code in your question. – James Greenhalgh Feb 20 '15 at 07:41
  • Well, then I guess for the case of a Cortex-A9 the compiler cannot reduce the time of execution. – Pedro Batista Feb 20 '15 at 16:56
  • In your timings above, you have a ~20% performance improvement for the "short" case, so the compiler has done something to improve performance. – James Greenhalgh Feb 20 '15 at 18:03
  • That's a little pale compared to your 400% :p – Pedro Batista Feb 20 '15 at 20:34
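As an aside, a minimal sketch of the vrecpe/vrecps reciprocal refinement mentioned in the comments above (float-only, and the function name approx_recip_f32 is just illustrative):

#include <arm_neon.h>

// Approximate 1/d for 4 floats: vrecpe gives a rough estimate, and each
// vrecps step performs one Newton-Raphson refinement of that estimate.
float32x4_t approx_recip_f32(float32x4_t d)
{
    float32x4_t r = vrecpeq_f32(d);       // initial estimate of 1/d
    r = vmulq_f32(r, vrecpsq_f32(d, r));  // refinement step 1
    r = vmulq_f32(r, vrecpsq_f32(d, r));  // refinement step 2
    return r;
}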

I attempted to re-write this code using the arm_neon.h intrinsics, and the results are very surprising, so much so that I need some help interpreting them.

Here is the code:

#include <ctime>
#include <cstdlib>
#include <iostream>
#include <arm_neon.h>

int main()
{
    unsigned long long arraySize = 125000000;

    std::clock_t begin;

    unsigned short* arrayShort = new unsigned short[arraySize];

    for (unsigned long long n = 0; n < arraySize; n++)
    {
        *arrayShort = rand() % 100 + 1;
        arrayShort++;
    }

    arrayShort -= arraySize;

    uint16x8_t vals;
    uint16x8_t constant1 = {10, 10, 10, 10, 10, 10, 10, 10};
    uint16x8_t constant2 = {3, 3, 3, 3, 3, 3, 3, 3};

    begin = std::clock();
    for (unsigned long long n = 0; n < arraySize; n+=8)
    {
        vals = vld1q_u16(arrayShort);
        vals = vaddq_u16(vals, constant1);
        vals = vmulq_u16(vals, constant2);

//      std::cout << vals[0] <<  "   " << vals[1] <<  "   " << vals[2] <<  "   " << vals[3] <<  "   " << vals[4] <<  "   " << vals[5] <<  "   " << vals[6] <<  "   " << vals[7] <<  std::endl;

        arrayShort += 8;
    }

    std::cout << "Time: " << (std::clock() - begin) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;

    arrayShort -= arraySize;
    delete[] arrayShort;

    return 0;
}

So, now I am creating a 125-million-element array of unsigned shorts. Then I go over 8 elements at a time, add 10, and multiply by 3.

On a Cortex-A9 board, the plain C++ version of this code takes 270 milliseconds to process that array, while this NEON code takes only 20 milliseconds.

Now, my expectations before seeing the results weren't too high, but the best scenario in my head was an 8x time reduction. I cannot explain how this leads to a 13.5x reduction in execution time, and I'd appreciate some help interpreting these results.

I have obviously checked the output of the math being done, and I can assure you the code is working and the results are coherent.

Pedro Batista
  • You probably meant to store the result somewhere after calculating it into vals. As currently written, there is no reason for the compiler to perform any of the operations in the loop, as the result is unused. – James Greenhalgh Feb 24 '15 at 21:07
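For reference, a minimal sketch of the loop body with the store James describes added (same variable names as the code above):

vals = vld1q_u16(arrayShort);
vals = vaddq_u16(vals, constant1);
vals = vmulq_u16(vals, constant2);
vst1q_u16(arrayShort, vals);  // write the result back so the work cannot be optimized away

arrayShort += 8;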