Handling an odd number of elements using NEON intrinsics

Question

I am new to NEON intrinsics. I have two arrays of 99 elements each, which I am trying to add element-wise using NEON intrinsics. Since 99 is not a multiple of 8, 16, or 32, 96 elements can be handled with full vectors, but how do I handle the remaining 3 elements? Please help. Here is the code I have written:

#include <arm_neon.h>
#define SIZE 99

void addition(unsigned char A[], unsigned char B[], unsigned short int *addres)
{
    uint8x8_t v, v1;
    int i = 0;
    for (i = 0; i < SIZE; i = i + 8) {
        v = vld1_u8(&A[i]);              // load 8 elements of A into a vector
        v1 = vld1_u8(&B[i]);             // load 8 elements of B into a vector
        uint16x8_t t = vaddl_u8(v, v1);  // widening add: 8 x u8 -> 8 x u16
        vst1q_u16(addres + i, t);        // store the vector back to memory
    }
}

Just add them manually. It won't get any simpler or more performant no matter what you do. These vectorized instructions usually are a lot more expensive and so if you can't fill it with at least 8 elements, normal code will be just as good, if not better. — , Mar 11 '22 at 11:44
There are multiple ways to handle vector fragments. Processing them with scalar code is one. Another is to use unaligned vector loads that load the final fragment plus some elements before it and use corresponding unaligned stores to store the results. This will overlap some elements, processing them twice, but sometimes that might be faster than scalar code… — Eric Postpischil, Mar 11 '22 at 12:22
Another is to rely on the fact that memory mapping works on pages, so a vector fragment beyond the end of an aligned vector must be mapped, so it is safe to load. So you can do the loads and use a mask to merge trailing elements from the destination that must not change. (This raises multithreading considerations if other code could be writing to that memory simultaneously.) What solution to use depends on circumstances. Another solution is to use buffers with extra space and process whole vectors even if some of the data in the final vector is not used. — Eric Postpischil, Mar 11 '22 at 12:23
Can you please tell me how to use unaligned vectors loads for this? not sure how it is done — R1608, Mar 11 '22 at 12:47
@rak: Given a start location `s` and a length in bytes `l`, the aligned 16-byte blocks would be at `s+0`, `s+16`, `s+32`, and so on. The last 16 bytes would start at `s+l-16`. So, if you perform an unaligned load from `s+l-16`, you will get the last 16 bytes of the vector. (Naturally, this requires `l` be at least 16. If it is less, you will need other code.) For example, if the vector is 99 bytes long, you would do loops using the start locations `s+0`, `s+16`, `s+32`, `s+48`, `s+64`, and `s+80`, then one separate sequence with location `s+99-16` = `s+83`… — Eric Postpischil, Mar 11 '22 at 14:09
… The `s+83` unaligned load would overlap the aligned load at `s+80` for 13 bytes, but that is okay (as long as the element processing is parallel); it will just compute the same results a second time. Maybe that would be faster than scalar processing for 3 elements or maybe not, but these things depend on circumstances. — Eric Postpischil, Mar 11 '22 at 14:11
@Sahsahae Stay with your Python and do not spread any nonsense. The neon version is at least four times as fast. Of course, you don't have to worry about the performance at all. It's Python, and there is no way you can accelerate anything. — Jake 'Alquimista' LEE, Mar 12 '22 at 14:38
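
Two of the options from the comments above, the scalar tail and the overlapping final vector, can be sketched roughly as follows. This is a minimal sketch, not code from the thread: the function names `add8`, `add_scalar_tail`, and `add_overlap_tail` are hypothetical, the block width is 8 elements to match the question's `vaddl_u8` code (the comments use 16-byte blocks), and the non-NEON branch is only a stand-in so the sketch also builds on other hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
/* Widening add of one 8-element block: dst[k] = a[k] + b[k], k = 0..7. */
static void add8(uint16_t *dst, const uint8_t *a, const uint8_t *b)
{
    vst1q_u16(dst, vaddl_u8(vld1_u8(a), vld1_u8(b)));
}
#else
/* Portable stand-in (assumption: only here so the sketch builds without NEON). */
static void add8(uint16_t *dst, const uint8_t *a, const uint8_t *b)
{
    for (int k = 0; k < 8; ++k)
        dst[k] = (uint16_t)(a[k] + b[k]);
}
#endif

/* Option 1: full 8-element blocks, then a scalar loop for the leftovers. */
void add_scalar_tail(uint16_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        add8(dst + i, a + i, b + i);
    for (; i < n; ++i)                 /* at most 7 remaining elements */
        dst[i] = (uint16_t)(a[i] + b[i]);
}

/* Option 2: full 8-element blocks, then one extra block that ends exactly
 * at n. It recomputes a few elements that were already stored.
 * Requires n >= 8, and dst must not alias a or b. */
void add_overlap_tail(uint16_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(n >= 8);
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        add8(dst + i, a + i, b + i);
    if (i < n) {                       /* n is not a multiple of 8 */
        size_t last = n - 8;           /* for n = 99 this is 91 */
        add8(dst + last, a + last, b + last);
    }
}
```

For 99 elements, both functions process twelve full blocks (elements 0 to 95). The first then adds elements 96 to 98 one at a time, while the second processes one more block covering elements 91 to 98. Which is faster depends on the circumstances, as the comments note.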

Jake 'Alquimista' LEE · Accepted Answer · 2022-03-12T12:39:52.913

3

The most efficient way of dealing with residuals in SIMD that I have come up with so far is what I call the "withhold and rewind" method (invented by me, I suppose).

#include <arm_neon.h>

void addl(uint16_t *pDst, uint8_t *pA, uint8_t *pB, intptr_t size)
{
    // assert(size >= 8);
    uint8x8_t a, b;
    uint16x8_t c;

    size -= 8; // withhold

    do {
        do {
            size -= 8;
            a = vld1_u8(pA);
            b = vld1_u8(pB);
            c = vaddl_u8(a, b);
            vst1q_u16(pDst, c);
            pA += 8;
            pB += 8;
            pDst += 8;
        } while (size >= 0);

        pA += size;      // and rewind
        pB += size;
        pDst += size;
    } while (size > -8);
}

size can be any number greater than or equal to 8.
There are three drawbacks, though:

- size HAS TO BE >= 8 (no problem in most cases)
- you can't use an alignment specifier in AArch32 assembly (no problem with intrinsics)
- it is not viable where the destination buffer also contains source data, such as alpha blending bitBLT (no problem in this case)

PS: size has to be of type intptr_t. The process will crash on 64-bit machines otherwise.
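
To see which blocks the loop above actually touches, the loop control can be replayed without any NEON code. The following is a minimal sketch under that assumption: `block_starts` is a hypothetical helper, not part of the answer, and a single `offset` stands in for the three pointers `pA`, `pB`, and `pDst`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: records the element offset at which each 8-wide
 * block starts, using the same loop control as addl() above.
 * `starts` needs room for size / 8 + 1 entries.
 * Returns the number of blocks processed. */
size_t block_starts(intptr_t size, intptr_t *starts)
{
    assert(size >= 8);

    size_t count = 0;
    intptr_t offset = 0;            /* stands in for pA, pB and pDst */

    size -= 8;                      /* withhold one block */
    do {
        do {
            size -= 8;
            starts[count++] = offset;   /* load/add/store would happen here */
            offset += 8;
        } while (size >= 0);

        offset += size;             /* rewind: size is negative at this point */
    } while (size > -8);

    return count;
}
```

For a size of 99 this records 13 block starts: 0, 8, ..., 88, and then 91, so the final block covers elements 91 to 98 and recomputes elements 91 to 95. For a size of 96 the rewind pass is skipped and exactly 12 blocks are processed.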

edited Mar 12 '22 at 12:39

answered Mar 12 '22 at 12:29


Other people have had the same idea (*[Jump back some iterations for vectorized remainder loop](https://stackoverflow.com/posts/comments/81692504)*), although this implementation of it in C looks good, if it actually compiles to asm like that. That's a nice invention. – Peter Cordes Sep 16 '22 at 07:30
I have used this method regularly - however I've also encountered some weird performance degradation. First, the generated code is suboptimal (see https://github.com/llvm/llvm-project/issues/56330), and secondly, on some devices I've used, instructions such as `ldr q0, [x0], x1` where x1==0 seemed to slow down considerably, which could be related to the cache line lengths or the CPU doing some weird cache line evicting. – Aki Suihkonen Sep 17 '22 at 06:57
@AkiSuihkonen I can confirm the same problem persists with Android Clang, even the most recent version. It seems to be a chronic bug of ARM LLVM, not related to this particular method. That's why I always prefer writing assembly codes - no evil surprise. – Jake 'Alquimista' LEE Sep 17 '22 at 09:10
