
I'm trying to optimize some code in order to reduce execution time as much as possible. This is the code:

    int shifter=0;
    int t1, t2;   // comparison operands (both int)

    // now iterate through all the pairings
    UINT32_ALIAS* ptr2=(UINT32_ALIAS*)ptr;
    const BriskShortPair* max=shortPairs_+noShortPairs_;
    for(BriskShortPair* iter=shortPairs_; iter<max;++iter){
        t1=*(_values+iter->i);
        t2=*(_values+iter->j);
        if(t1>t2){
            *ptr2|=((1)<<shifter);

        } // else already initialized with zero
        // take care of the iterators:
        ++shifter;
        if(shifter==32){
            shifter=0;
            ++ptr2;
        }
    }

I was wondering if it's possible to somehow parallelize this using NEON. Is it possible? Thank you

EDIT: The context of this code is the BRISK feature detector (http://www.asl.ethz.ch/people/lestefan/personal/BRISK). I'm trying to optimize this code for an ARM architecture. The piece of code I'm referring to has the following structure:

- an outer for loop that scans a certain number of points

- for each of these points there is a fixed number of other points around it, each with an associated intensity value

- in an inner for loop, fixed pairs of points are compared on the basis of their intensity values; the result of each comparison can be 0 or 1, and this value is put into a vector

The code I posted here is the inner for loop.
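The inner loop above can be restated in portable C so that the data-parallel pattern (32 independent compares producing one output word) is explicit. This is only a sketch; the names `Pair` and `pack_comparisons` are illustrative and not taken from BRISK:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative stand-in for BriskShortPair: two indices into the values array. */
typedef struct { int i; int j; } Pair;

/* Compare values[p.i] > values[p.j] for each pair and pack the 0/1
 * results into 32-bit words, least-significant bit first -- the same
 * layout the original shifter loop produces. */
static void pack_comparisons(const Pair *pairs, size_t n,
                             const int *values, uint32_t *out)
{
    uint32_t word = 0;
    int shifter = 0;
    for (size_t k = 0; k < n; ++k) {
        if (values[pairs[k].i] > values[pairs[k].j])
            word |= (uint32_t)1 << shifter;
        if (++shifter == 32) {
            *out++ = word;
            word = 0;
            shifter = 0;
        }
    }
    if (shifter != 0)          /* flush a partial last word */
        *out = word;
}
```

Written this way, each group of 32 comparisons is independent, which is exactly the shape a SIMD unit can exploit.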

user2696208
  • I'm not asking for code, I'm just trying to understand if there is a way to optimise the original code through NEON. Since I'm no NEON expert I don't even know if what I'm asking is possible. If it is, I'm going to implement it myself. What I need is just some kind of general advice about the feasibility. – user2696208 Oct 04 '13 at 09:14
  • Without any more context / compilable code, the only thing I can suggest at this point is to _hint_ to the compiler that this should be vectorized / would benefit from vectorization. See http://stackoverflow.com/questions/14256156/how-to-give-hint-to-gcc-about-loop-count for the options. – FrankH. Oct 04 '13 at 09:18
  • I edited the question, adding more context. – user2696208 Oct 04 '13 at 09:40
  • Yes from the look of the scalar code it's certainly doable, but you have a steep learning curve to deal with if you have not done any SIMD programming before. It looks like you're just generating an array of bits whose values depend on the comparison of two vectors. I suggest looking at some of the NEON answers on SO as a starting point. BTW, what are the types of `iter->i` and `iter->j` ? – Paul R Oct 04 '13 at 10:36
  • @PaulR The types are int. "It looks like you're just generating an array of bits whose values depend on the comparison of two vectors." That's exactly what the code does. Do you think I could obtain a significant increase in performance? Since it's not an easy thing for me to do, I'd like to understand if it's worth the effort. – user2696208 Oct 04 '13 at 10:53
  • Assuming you mean 32 bit ints, then yes, there's a theoretical 4x improvement possible, although I would expect it to be in the 2x to 3x range. Also note that memory bandwidth may be a limiting factor if your data set is large, as you have very few ALU operations per load/store. – Paul R Oct 04 '13 at 10:59
  • @PaulR Ok, thank you. Should I use NEON intrinsics? I actually don't know where to start from. What is SO? – user2696208 Oct 04 '13 at 11:14
  • Yes, definitely use NEON intrinsics initially - you can go to asm later if needed but I hope for your sake that is not the case. "SO" is where you are right now. ;-) Search for the `neon` tag and look at some of the answers where scalar code has been converted to SIMD using NEON intrinsics. – Paul R Oct 04 '13 at 12:54
  • Specifically, the [tag:neon] [vtst](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHIFIHC.html) instruction will be useful to convert the `if(t1>t2)` condition to SIMD. You won't have to initialize to *zero* if you handle 32 items at a time and don't have to do the shifting; use the byte lane instructions to form your result directly. – artless noise Oct 04 '13 at 15:10
  • @PaulR Ok, so if I got it right I can do 4 comparisons in each cycle, correct? I can use 128-bit registers, right? For example a uint32x4_t. It's not clear to me whether, when using operations like vcgtq_u32 to compare operands, the comparison is done on the single 32-bit blocks of the vector as I need. I also don't understand why the suggestion to use the vtst instruction instead of vcgt. – user2696208 Oct 04 '13 at 20:06
  • @PaulR You aren't exactly doing a favor by recommending NEON intrinsics to newcomers. For people not familiar with SIMD and NEON instructions, intrinsics are rather confusing and not very well documented. Asm, on the other hand, indeed has a steep learning curve, but the time invested in it is worth every second considering the performance and the absolute immunity to so many strange toolchain behaviors. If you google, you'll find lots of (unanswered) questions complaining about disappointing performance and strange errors related to NEON intrinsics. What a waste of resources. – Jake 'Alquimista' LEE Oct 07 '13 at 15:09
  • @Jake: yes, I have some experience with NEON intrinsics and disappointing performance - however for a noob I still think they are the quickest and easiest way to get a reasonable performance improvement, and then you still have the option to go to asm if you need better performance still. – Paul R Oct 07 '13 at 15:34
  • @PaulR You might think intrinsics to be easy because you already know the instructions and how SIMD works. For people without any experience in either of them, intrinsics are nothing more than a wild collection of confusing macros all looking alike. It's just like MFC on Windows: You might get the first hello-world very quickly when following a tutorial, but you are left clueless beyond that due to the lack of fundamentals. On SO, I hardly answer questions related to intrinsics because all I can do is tell them what I guess and not what's wrong exactly. – Jake 'Alquimista' LEE Oct 07 '13 at 15:59
  • Well, each to their own, but I use intrinsics on x86, POWER/PowerPC, Cell, ARM and a couple more esoteric platforms - in all cases but ARM, intrinsics are hard to beat for performance, even given an expert asm programmer. The ARM deficiency I am sure will improve with time as the compilers improve, and getting things working first with intrinsics makes for a less painful transition to asm, if it turns out to be needed. – Paul R Oct 07 '13 at 16:35
  • @PaulR I know quite a few people hopelessly stumped by intrinsics. (There are many on the web as well) Once I taught them some NEON instructions in assembly they finally understood how to get things to work. I think intrinsics are nothing for beginners, but a convenient tool for experienced SIMD programmers like you when the performance isn't that crucial. But even then, I'll pass since I don't want to be harassed by my ex-clients every time they build their projects with a newer version of GCC or whatever. The "absolute immunity" granted by hand-written assembly is truly a godsend. – Jake 'Alquimista' LEE Oct 07 '13 at 18:11
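On the lane-semantics question raised in the comments: yes, vcgtq_u32 compares each of the four 32-bit lanes independently, setting each result lane to all-ones when the comparison holds and all-zeros otherwise. A plain-C model of that behavior (an illustration of the semantics only, not NEON code; the type and function names are invented for this sketch):

```c
#include <stdint.h>

/* Plain-C model of NEON's vcgtq_u32: four independent 32-bit lanes,
 * each producing all-ones (0xFFFFFFFF) when a > b, else all-zeros. */
typedef struct { uint32_t lane[4]; } u32x4;

static u32x4 model_vcgtq_u32(u32x4 a, u32x4 b)
{
    u32x4 r;
    for (int k = 0; k < 4; ++k)
        r.lane[k] = (a.lane[k] > b.lane[k]) ? 0xFFFFFFFFu : 0u;
    return r;
}
```

The all-ones / all-zeros masks are what instructions like vbsl or vsli then combine into packed result bits.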

1 Answer


EDIT: I initially misunderstood the original source code. Here is the correct version, completely rewritten. (55 cycles / iteration)

Although it's not as easy as assumed with the initial version below, NEON can handle this extremely well, resulting in an eye-popping performance boost compared to the original C implementation.

With the right tweaks, you might gain additional performance (under 50 cycles / iteration), though readability will suffer heavily then.

Have fun!

    AREA    BRISK_ASM_NEON, CODE, READONLY
    EXPORT  yourFunction
    CODE32

yourFunction    FUNCTION

loop
    pld     [r0, #192]
    vld2.32     {q8, q9}, [r0]!
    vld2.32     {q10, q11}, [r0]!
    pld     [r0, #192]
    vld2.32     {q12, q13}, [r0]!
    vld2.32     {q14, q15}, [r0]!

    vcgt.u32    q8, q8, q9
    vcgt.u32    q9, q10, q11
    vcgt.u32    q10, q12, q13
    vcgt.u32    q11, q14, q15

    pld     [r0, #192]
    vld2.32     {q12, q13}, [r0]!
    vld2.32     {q14, q15}, [r0]!
    pld     [r0, #192]
    vld2.32     {q0, q1}, [r0]!
    vld2.32     {q2, q3}, [r0]!

    vcgt.u32    q12, q12, q13
    vcgt.u32    q13, q14, q15
    vcgt.u32    q14, q0, q1
    vcgt.u32    q15, q2, q3

    vsli.32     q8, q10, #8
    vsli.32     q9, q11, #8
    vsli.32     q8, q12, #16
    vsli.32     q9, q13, #16
    vsli.32     q8, q14, #24
    vsli.32     q9, q15, #24

    vsli.8      d16, d17, #2
    vsli.8      d18, d19, #2
    vsli.8      d16, d18, #4

    vmov.i8     d20, #0x55
    vand        d16, d16, d20
    vshr.u64    d17, d16, #31
    vorr        d16, d16, d17

    vst1.32     {d16[0]}, [r1]!

    subs        r2, r2, #32
    bgt     loop

    bx  lr

    ENDFUNC
    END

=============================================================================

!!!!!!! The code below is INVALID !!!!!!!!

=============================================================================

It's a piece of cake with NEON.

Here's your "miracle":

prototype: void yourFunction(unsigned int *pPair, unsigned int *ptr2, unsigned int count);

    AREA    BRISK_ASM_NEON, CODE, READONLY
    EXPORT  yourFunction
    CODE32

yourFunction    FUNCTION
    adr r12, shifter_table
    vpush   {q4-q7}
    vldmia  r12, {q0-q7}

loop
    vld1.32 {q8, q9}, [r1]
    vorr    q10, q8, q0
    vorr    q11, q9, q1
    vld2.32 {q12, q13}, [r0]!
    vld2.32 {q14, q15}, [r0]!
    vcgt.u32    q12, q12, q13
    vcgt.u32    q13, q14, q15
    vbsl    q12, q10, q8
    vbsl    q13, q11, q9
    vst1.32 {q12, q13}, [r1]!

    vld1.32 {q8, q9}, [r1]
    vorr    q10, q8, q2
    vorr    q11, q9, q3
    vld2.32 {q12, q13}, [r0]!
    vld2.32 {q14, q15}, [r0]!
    vcgt.u32    q12, q12, q13
    vcgt.u32    q13, q14, q15
    vbsl    q12, q10, q8
    vbsl    q13, q11, q9
    vst1.32 {q12, q13}, [r1]!

    vld1.32 {q8, q9}, [r1]
    vorr    q10, q8, q4
    vorr    q11, q9, q5
    vld2.32 {q12, q13}, [r0]!
    vld2.32 {q14, q15}, [r0]!
    vcgt.u32    q12, q12, q13
    vcgt.u32    q13, q14, q15
    vbsl    q12, q10, q8
    vbsl    q13, q11, q9
    vst1.32 {q12, q13}, [r1]!

    vld1.32 {q8, q9}, [r1]
    vorr    q10, q8, q6
    vorr    q11, q9, q7
    vld2.32 {q12, q13}, [r0]!
    vld2.32 {q14, q15}, [r0]!
    vcgt.u32    q12, q12, q13
    vcgt.u32    q13, q14, q15
    vbsl    q12, q10, q8
    vbsl    q13, q11, q9
    vst1.32 {q12, q13}, [r1]!

    subs    r2, #32
    bgt loop

    vpop    {q4-q7}
    bx  lr

    ENDFUNC

shifter_table
    DCD (1<<00), (1<<01), (1<<02), (1<<03), (1<<04), (1<<05), (1<<06), (1<<07), (1<<08), (1<<09), (1<<10), (1<<11), (1<<12), (1<<13), (1<<14), (1<<15)
    DCD (1<<16), (1<<17), (1<<18), (1<<19), (1<<20), (1<<21), (1<<22), (1<<23), (1<<24), (1<<25), (1<<26), (1<<27), (1<<28), (1<<29), (1<<30), (1<<31)

    END

The code above is only moderately optimized (there are interlocks here and there), and it works only if count is a multiple of 32.

That's as far as I go while keeping things readable and working "unprofessionally".

47 cycles / iteration isn't bad. The rest is up to you.

Good luck!
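For checking results, a scalar C model of the same contract can help. This is a sketch under the answer's assumptions (count is a multiple of 32, and the t1/t2 values are stored interleaved in pPair, as the vld2 loads imply); note that the NEON routine's intra-word bit order depends on the vsli shuffles, so verify the packing order before treating this as a bit-exact oracle. The name `yourFunction_ref` is illustrative:

```c
#include <stdint.h>

/* Scalar reference with the same contract: pPair holds count (t1, t2)
 * value pairs stored contiguously, and one result bit per comparison
 * is packed LSB-first into ptr2, 32 comparisons per output word.
 * Assumes count is a multiple of 32, as the NEON version requires. */
static void yourFunction_ref(const uint32_t *pPair, uint32_t *ptr2,
                             unsigned int count)
{
    for (unsigned int k = 0; k < count; k += 32) {
        uint32_t word = 0;
        for (int b = 0; b < 32; ++b) {
            uint32_t t1 = *pPair++;
            uint32_t t2 = *pPair++;
            if (t1 > t2)
                word |= (uint32_t)1 << b;
        }
        *ptr2++ = word;
    }
}
```

If you call the assembly routine from C++, remember to declare it inside extern "C" { ... }, as discussed in the comments below.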

Jake 'Alquimista' LEE
  • Thank you! I really want to understand the code now and then try it. Is there a good manual for these instructions? – user2696208 Oct 05 '13 at 07:50
  • Visit my blog armneon.blogspot.com There are some links in addition to my own tutorials. – Jake 'Alquimista' LEE Oct 05 '13 at 07:59
  • @user2696208 I made a grave mistake in misinterpreting your code. I'm updating my answer with a "revised" version. – Jake 'Alquimista' LEE Oct 06 '13 at 01:12
  • Ok, thank you again. That will definitely improve performance. I'll try to understand the code now. – user2696208 Oct 06 '13 at 15:51
  • I'm trying to understand your code and I've got some questions. Would the prototype of the function be the same as in the answer below? If so, I don't understand why the first argument is a pointer to an int (my pair is stored in a struct with two fields, both int). Also, why #192 as an offset when you do the PLD? Excuse me but I'm really new to this. – user2696208 Oct 06 '13 at 18:37
  • 1) A structure consisting of two integers is internally nothing other than an integer array of size two, and an array of this structure is again an integer array as a whole. Simple typecasting will do. Should it contain other elements, it would be wise to separate them into an additional structure. 2) On ARMv7, a cache line typically consists of 64 bytes, and it's advised that you read three lines ahead. Since a single iteration is rather long in this function, it might be possible that reading two lines ahead instead of three results in better performance (128 instead of 192). – Jake 'Alquimista' LEE Oct 07 '13 at 01:38
  • Ok, got it. I think I understood all the code except the last part, in particular the vbic.i8 d16, #0xcc instruction. Why do I have to do that operation? I also have trouble compiling. I'm using a BeagleBoard with an ARM Cortex-A8. I compile with g++, passing -mfpu=neon as a parameter, but I get errors like: Error: bad instruction `area BRISK_ASM_NEON,CODE,READNOLY'. And also: Error: immediate value out of range vbic.i8 d16,#0xcc. Is it a problem with the parameters I pass to the compiler? – user2696208 Oct 07 '13 at 12:05
  • 1) In the final phase, the bottom half of d16 contains even bits twice while the top half contains odd bits twice. We have to interleave them. 0xcc is 10101010 in binary. vbic with 0xcc is equivalent to "bottom &= ~0xcc; top &= ~0xcc;" Then vshr combined with vorr is equivalent to "bottom |= (top<<1);" – Jake 'Alquimista' LEE Oct 07 '13 at 13:44
  • 2) The code I wrote is in intel syntax (DS-5 Professional). You have to convert this to AT&T syntax by hand in order to assemble it with GCC toolchain. And I'm not very familiar with the latter one (and too lazy, too). Further, you have to put the code in a separate assembly file with a .s extension (myFile.s). Just Google how to do this. It won't be that hard :) I'm quite surprised that you actually understood my code since instructions like vsli aren't meant to be in a beginner's course. – Jake 'Alquimista' LEE Oct 07 '13 at 13:56
  • I should be able to do it thank you.(I found that it can be done by simply adding the .intel_syntax directive, I'm going to try it).It wasn't easy for me to figure out the role of the vsli you're right, but in the end I got it. I'm still confused by the vbic though, isn't 11001100 the binary equivalent of 0xcc and not 10101010? – user2696208 Oct 07 '13 at 19:21
  • oops, shame on me..... you are right. It should be 0xaa. I'll change the code correspondingly. – Jake 'Alquimista' LEE Oct 08 '13 at 01:07
  • I managed to compile it, but when I execute the code I get a segmentation fault. I guess it has something to do with function arguments not being passed correctly, but I don't know how to debug it. – user2696208 Oct 08 '13 at 14:53
  • Are you calling this from a C++ function? In that case, you have to wrap the declaration with extern "C" { ... }. You'll find enough examples on the web. – Jake 'Alquimista' LEE Oct 09 '13 at 00:11
  • I finally managed to get it to work. The only problem is the vbic, because it seems that it cannot be used like that with an i8 data type. The imm value has to match a certain pattern (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHFBEDA.html). I don't understand how to use it correctly. – user2696208 Oct 14 '13 at 14:22