Convert ARM 32-bit neon to ARM 64-bit neon

Question

I have the following 32-bit NEON code that extracts a rectangular region from an image:

extractY8ImageARM(unsigned char *from, unsigned char *to, int left, int top, int width, int height, int stride)
from: pointer to the original image
to: pointer to the destination extracted image
left, top: position where to extract in the original image
width, height: size of the extracted image
stride: width of the original image
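For reference, the behaviour described by these parameters can be sketched in plain C. This is only an illustration under assumptions: the name extractY8ImageRef is hypothetical, the image is taken to be one byte per pixel, and the destination rows are assumed to be tightly packed (width bytes each).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Plain-C reference sketch (hypothetical name, not the original code).
 * Copies a width x height window whose top-left corner is at (left, top)
 * in a source image that has `stride` bytes per row. */
void extractY8ImageRef(const unsigned char *from, unsigned char *to,
                       int left, int top, int width, int height, int stride)
{
    assert(from != NULL && to != NULL);
    assert(left >= 0 && top >= 0 && width >= 0 && height >= 0);
    assert(left + width <= stride);

    /* First source byte: skip `top` full rows, then `left` pixels. */
    const unsigned char *src =
        from + (size_t)top * (size_t)stride + (size_t)left;

    for (int row = 0; row < height; ++row) {
        memcpy(to, src, (size_t)width); /* destination rows are packed */
        src += stride;                  /* source rows are `stride` bytes apart */
        to += width;
    }
}
```

A version like this is handy as a baseline to compare the assembly output against, byte for byte.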

and here is the assembly code:

.text
.arch armv7-a
.fpu neon
.type extractY8ImageARM, STT_FUNC
.global extractY8ImageARM

extractY8ImageARM:
from    .req r0
to  .req r1
left    .req r2
top .req r3
width   .req r4
height  .req r5
stride  .req r6
tmp .req r7

    push {r0-r7, lr}

//Let's get back the arguments
    ldr width, [sp, #(9 * 4)]
    ldr height, [sp, #(10 * 4)]
    ldr stride, [sp, #(11 * 4)]

//Update the from pointer. Advance left + stride * top
    add from, from, left
    mul tmp, top, stride
    add from, from, tmp

.loopV:
//We will copy width
    mov tmp, width

.loopH:
//Read and store data
    pld [from]
    vld1.u8 { d0, d1, d2, d3 }, [from]!

    pld [to]
    vst1.u8 { d0, d1, d2, d3 }, [to]!

    subs tmp, tmp, #32
    bgt .loopH

//We advance the from pointer for the next line
    add from, from, stride
    sub from, from, width

    subs height, height, #1
    bgt .loopV


    pop {r0-r7, pc}

.unreq from
.unreq to
.unreq left
.unreq top
.unreq width
.unreq height
.unreq stride
.unreq tmp

I need to port it to 64-bit NEON. Can anyone help me with the translation? I have read this white paper http://malideveloper.arm.com/downloads/Porting%20to%20ARM%2064-bit.pdf so I understand the differences more or less.

My code is simple, and it would be a good example of how to pass arguments and load/store data in a 64-bit NEON assembly file. I would prefer to avoid intrinsics.

See my answer here: http://stackoverflow.com/questions/28050300/porting-arm-neon-code-to-aarch64-many-questions/28050834#28050834 It really would be a good idea to use intrinsics. Your NEON code is not very optimized and would be portable to both ARM32 and ARM64 if you used intrinsics. — BitBank, Jan 19 '17 at 22:12
If you insist on writing ARM64 assembly language, you can see my github project here for a complete example: https://github.com/bitbank2/gcc_perf — BitBank, Jan 19 '17 at 22:14
Even with intrinsics, there are too many changes between 32-bit NEON and 64-bit NEON. — gregoiregentil, Jan 20 '17 at 04:46
I got the ld1 / st1 instructions. Thanks. That's useful. I'm unsure how to translate the push/pop. — gregoiregentil, Jan 20 '17 at 04:47
It would be interesting to understand why the 32-bit neon code is not optimized. This is just a loop with vld1 and vst1. What could you do more or differently? Note that the width needs to be divisible by 16, not 32. — gregoiregentil, Jan 20 '17 at 04:48
For push/pop in aarch64, see this SO answer: http://stackoverflow.com/questions/27941220/push-lr-and-pop-lr-in-arm-arch64. There really aren't a lot of changes between NEON 32 and 64-bit intrinsics unless you're using encryption/crc32 calculations. 99% of what you normally do in NEON will use exactly the same intrinsics. Your loop is not properly optimized for several reasons: your PLD doesn't do much because you need to prefetch ahead of where you're reading. You're also not interleaving enough registers to hide more of the load latency. Auto-increment addressing may be slower, etc. — BitBank, Jan 20 '17 at 12:55
You're right that intrinsics work from 32 to 64-bit. I would like to post an answer for the record, and I have code that works, but not in all situations. Any idea what's wrong in http://gentil.com/tmp/test.zip ? — gregoiregentil, Jan 21 '17 at 07:12
I looked at it, but you should post the code here. Your bug is that you're not adjusting the copy height based on the starting offset so you're going past the end of the buffer. 100+256 height = 356 last line :( — BitBank, Jan 21 '17 at 13:53
I don't think that's the problem, because the original image is 720 > 100 + 256 (I could add a check in the C code but it's fine). Note that left, top are the starting positions in the original image. width, height are the size of the final image, so everything is fine. The equivalent C code would be: for (j) for (i) to[j * width + i] = from[j * stride + i + left + top * stride]; — gregoiregentil, Jan 21 '17 at 16:06
I just noticed that you don't have an explicit return from your asm code. Surprising that it works in any conditions. — BitBank, Jan 21 '17 at 16:21
It seems correct. No crash when adding ret at the end. If you want to post the code as the answer (so that you can get an approved answer), I will mark it. Otherwise, I will post it myself later. — gregoiregentil, Jan 21 '17 at 17:02
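The intrinsics route suggested in the comments above can be sketched as follows. This is an illustration, not code from the thread: the name extractY8ImageIntr is hypothetical, it copies 16 bytes per step with vld1q_u8 / vst1q_u8 when the compiler targets NEON, and it falls back to memcpy for row tails (and for the whole row on targets without NEON), so the same source should build for ARM32, ARM64 and other hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical intrinsics-based variant of the extraction routine. */
void extractY8ImageIntr(const unsigned char *from, unsigned char *to,
                        int left, int top, int width, int height, int stride)
{
    assert(from != NULL && to != NULL);

    /* Same start-address arithmetic as the assembly: left + stride * top. */
    const unsigned char *src =
        from + (size_t)top * (size_t)stride + (size_t)left;

    for (int row = 0; row < height; ++row) {
        int x = 0;
#if HAVE_NEON
        /* 16 bytes per step; the same intrinsics exist on ARM32 and ARM64. */
        for (; x + 16 <= width; x += 16) {
            uint8x16_t v = vld1q_u8(src + x);
            vst1q_u8(to + x, v);
        }
#endif
        /* Tail bytes (or the full row when NEON is unavailable). */
        if (x < width)
            memcpy(to + x, src + x, (size_t)(width - x));

        src += stride;
        to += width;
    }
}
```

Because the tail is handled separately, this sketch does not require width to be a multiple of the vector size, unlike the assembly loops in this thread.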

score 0 · Accepted Answer · answered Feb 05 '17 at 01:20

The whole code looks like this:

.text
.arch armv8-a
.type extractY8ImageARM, STT_FUNC
.global extractY8ImageARM

extractY8ImageARM:
from    .req x0
to  .req x1
left    .req x2
top .req x3
width   .req x4
height  .req x5
stride  .req x6
tmp .req x9

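// The C prototype passes left, top, width, height and stride as 32-bit ints
// in w2-w6. AAPCS64 leaves the upper 32 bits of those registers unspecified,
// so sign-extend them before using the 64-bit views.
    sxtw left, w2
    sxtw top, w3
    sxtw width, w4
    sxtw height, w5
    sxtw stride, w6

// Update the from pointer. Advance left + stride * top
    add from, from, left
    mul tmp, top, stride
    add from, from, tmp

.loopV:
// tmp counts the bytes left to copy on this row
    mov tmp, width

.loopH:
// Copy 64 bytes per iteration (post-increment), so width should be a multiple of 64
    ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [from], #64

    st1 {v0.16b, v1.16b, v2.16b, v3.16b}, [to], #64

    subs tmp, tmp, #64
    bgt .loopH

// Advance the from pointer to the start of the next source row
    add from, from, stride
    sub from, from, width

    subs height, height, #1
    bgt .loopV

// Only argument registers, x9 and v0-v3 are used (all caller-saved),
// so no push/pop equivalent is needed before returning
    ret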


.unreq from
.unreq to
.unreq left
.unreq top
.unreq width
.unreq height
.unreq stride
.unreq tmp
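Since the inner loop above moves 64 bytes per iteration (the 32-bit version moved 32), the number of bytes actually read and written per row is width rounded up to the chunk size. A small sketch of that arithmetic, with hypothetical helper names, may help when sizing buffers or checking whether a given width is safe:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes one row of the assembly loop really copies: the loop body runs
 * before the counter is tested, so at least one chunk is always moved,
 * and a partial chunk is rounded up to a full one. */
size_t bytesCopiedPerRow(int width, int chunk)
{
    assert(chunk > 0);
    if (width <= 0)
        return (size_t)chunk;
    return ((size_t)width + (size_t)chunk - 1) / (size_t)chunk * (size_t)chunk;
}

/* The row-advance step (from += stride - width) only lands on the next
 * row when exactly `width` bytes were copied, i.e. when width is a
 * positive multiple of the chunk size. */
int widthIsSafe(int width, int chunk)
{
    return width > 0 && bytesCopiedPerRow(width, chunk) == (size_t)width;
}
```

For example, a width of 256 is fine for the 64-byte loop, while a width of 48 would copy 64 bytes per row and drift both pointers by 16 bytes on every line.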
