
I'm porting some ARM NEON code to 64-bit ARMv8, but I can't find good documentation for it.

Many features seem to be gone, and I don't know how to implement the same functions without them.

So, the general question is: where can I find a complete reference for the new SIMD implementation, including explanations of how to do the same simple tasks that the many ARM NEON tutorials cover?

Some questions about particular features:

1 - How do I load a value into all the lanes of a Dx register? The old code was

    mov R0, #42
    vdup.8 D0, R0

My guess is:

    mov W0, #42
    dup V0.8B, W0

2 - How do I load multiple Dx/Qx registers with interleaved data? In the old code this was:

    vld4.8 {D0-D3}, [R0]!

But I can't find anything in the new docs.

I understand it's a completely new model, but it's not very well documented (or at least I'm unable to find any reference with readable samples).

G B
  • I'd agree `dup` looks to be the equivalent of `vdup`; the equivalent of `vld4` seems to be, perhaps unsurprisingly, `ld4`. It might be worth trying to track down a copy of the old "ARMv8 Instruction Set Overview" PDF - it's gone from the ARM website since the proper ARMv8-A ARM was published, but was a lot easier to skim. – Notlikethat Jan 20 '15 at 16:43

1 Answer


The documentation on using ARMv8 on Android is not very good, but your specific questions are answered quite well in this document:

ARMv8 Instruction Set Overview

To answer your specific questions:

    mov R0, #42
    vdup.8 D0, R0

becomes

    mov w0, #42
    dup v0.8b, w0

and

    vld4.8 {d0-d3}, [r0]!

becomes

    ld4 {v0.8b, v1.8b, v2.8b, v3.8b}, [x0], #32
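If it helps to see what those two instructions do to the data, here is a minimal C sketch of the same operations. It uses the `vdup_n_u8`, `vld4_u8` and `vst1_u8` intrinsics from `arm_neon.h` when NEON is available and a plain scalar loop otherwise. The names `splat8`, `load4x8` and `HAVE_NEON` are made up for this illustration, and whether a compiler emits exactly `dup` and `ld4` for the intrinsics is an assumption worth checking in its assembly output.

```c
#include <stddef.h>
#include <stdint.h>

/* HAVE_NEON is a hypothetical helper macro for this sketch only. */
#if defined(__aarch64__) || defined(__ARM_NEON)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Copy one byte value into all 8 lanes, like "dup v0.8b, w0"
   (or "vdup.8 d0, r0" on 32-bit ARM). */
static void splat8(uint8_t value, uint8_t out[8])
{
#if HAVE_NEON
    vst1_u8(out, vdup_n_u8(value));
#else
    for (size_t lane = 0; lane < 8; ++lane)
        out[lane] = value;
#endif
}

/* De-interleave 32 bytes into four 8-byte planes, like
   "ld4 {v0.8b, v1.8b, v2.8b, v3.8b}, [x0], #32".
   Byte 4*lane + r of the source ends up in lane `lane` of register r.
   Returns src + 32 to mirror the post-increment of x0. */
static const uint8_t *load4x8(const uint8_t *src, uint8_t planes[4][8])
{
#if HAVE_NEON
    uint8x8x4_t v = vld4_u8(src);
    vst1_u8(planes[0], v.val[0]);
    vst1_u8(planes[1], v.val[1]);
    vst1_u8(planes[2], v.val[2]);
    vst1_u8(planes[3], v.val[3]);
#else
    for (size_t lane = 0; lane < 8; ++lane)
        for (size_t r = 0; r < 4; ++r)
            planes[r][lane] = src[4 * lane + r];
#endif
    return src + 32;
}
```

With RGBA pixel data, for example, `planes[0]` would receive the eight R bytes, `planes[1]` the G bytes, and so on, which is the usual reason to reach for `vld4`/`ld4` in the first place.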
artless noise
BitBank
  • Thanks for this first answer, now I'm trying to figure out the rest. For example, it seems that the simple PLD instruction has become a monster of complexity... – G B Jan 20 '15 at 16:58
  • No, wait... dup.8b and ld4.8b don't exist, the registers are suffixed, not the mnemonics. – G B Jan 20 '15 at 17:01
  • Glad I could help. PLD has mutated to support many options for reading/writing and L0/L1/L2/L3 cache. I tried to use it on Xcode, but it doesn't accept the syntax in the ARM documentation. Best way to get started on ARM64 is to have GCC output ASM source code from C and figure out what it does. – BitBank Jan 20 '15 at 17:01
  • On Xcode, it accepts the syntax I specified, on Android/GCC, it wants to see the size suffix on the destination vector register. Answer updated to reflect GCC preferred syntax. – BitBank Jan 20 '15 at 17:03
  • And that's the next problem: now it's working fine on Android and I need to compile the same code for iOS. – G B Jan 23 '15 at 08:48
  • It took a little while to get the same code working on Android and iOS. I gave up and forked the code due to the differences. If you ask a question on getting ARM64 code to work with both Android and iOS, I'll post a detailed answer with the specifics for each system. – BitBank Jan 23 '15 at 10:31
  • @MaximAkristiniy vld4.8 {d0-d3} loads 32 bytes also (4 x 64-bit registers) – BitBank Oct 05 '17 at 19:48
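On the PLD point raised in the comments: one way to sidestep the assembler syntax differences is to leave the prefetch to the compiler. The sketch below uses `__builtin_prefetch`, a GCC/Clang builtin, which on AArch64 is typically lowered to a `PRFM` instruction. The function name `sum_with_prefetch`, the 64-byte line size and the 256-byte lookahead are assumptions chosen for illustration, not tuned values.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum a byte buffer while hinting upcoming data into the cache.
   The 64-byte cache line and 256-byte lookahead are assumed values. */
static uint32_t sum_with_prefetch(const uint8_t *data, size_t len)
{
    uint32_t total = 0;
    for (size_t i = 0; i < len; ++i) {
        /* Once per assumed cache line, prefetch ahead for reading (rw = 0)
           with high temporal locality (3). Only form addresses that stay
           inside the buffer. */
        if ((i % 64) == 0 && i + 256 < len)
            __builtin_prefetch(data + i + 256, 0, 3);
        total += data[i];
    }
    return total;
}
```

The second and third arguments are how the builtin exposes the read/write and cache-level choices that `PRFM` adds over the old `PLD`. Compiling with `-S` shows which prefetch operation the compiler actually picked.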