AArch64 - Running ARM and ASIMD instructions in parallel

Question

I want to write assembly code that uses both ARM instructions and ASIMD instructions in parallel. My first question is whether this can be done on ARMv8. Based on this thread, it is possible on ARMv7; however, data transfer between NEON and ARM registers takes a considerable amount of time. Second, I am looking for a way to run the two kinds of instructions in parallel. Here is what I am trying to do:

.
.
.
<ASIMD instruction>
<ASIMD instruction>
<ASIMD instruction>
<Data MOV between ASIMD vectors and ARM Reg>
<ARM assembly instruction> ------- <ASIMD instruction>
<ARM assembly instruction> ------- <ASIMD instruction>
<ARM assembly instruction> ------- <ASIMD instruction>
<Data MOV between ARM Reg and ASIMD vectors>
<ARM assembly instruction> ------- <ASIMD instruction>
<ARM assembly instruction> ------- <ASIMD instruction>
<ARM assembly instruction> ------- <ASIMD instruction>
.
.
.

I am wondering if I can do this using two threads. I am working on an ARM Cortex-A53 processor. I also have access to an ARM Cortex-A57, but I think these platforms are roughly the same and have equal capabilities.

Cortex-A53 is a mostly-dual-issue in-order design; Cortex-A57 has out-of-order execution fed by a 3-wide decode/dispatch stage; they are anything but "roughly the same". — Notlikethat, Mar 09 '16 at 22:41
@Notlikethat Thanks for your clarification. I've done some research and now I understand that the A57 and A53 have completely different microarchitectures — A23149577, Mar 09 '16 at 22:57

score 4 · Accepted Answer · answered Mar 09 '16 at 22:41

I think your comments on threading are misplaced here, or you have a background in a hyper-threaded (or other simultaneous multithreading) architecture. Neither Cortex-A57 nor Cortex-A53 is an SMT microarchitecture, so at any time you will only have one thread executing on each core. This means your idea of having one thread for Advanced SIMD instructions and one thread for integer/A32/T32 instructions (what you call "ARM instructions") is not going to result in good overall utilisation of a multi-core system.

The thread you linked to discusses a model for the Cortex-A8 microarchitecture in which data dependencies carried through Neon instructions back to A32 instructions cause pipeline bubbles (note that the other comment saying this has to do with memories being synced is incorrect). While it is the case that there is some cost to moving data from Advanced SIMD registers to core registers, the cost is much lower than that thread suggests (see, for example, the Cortex-A57 Software Optimisation Guide, which gives latency numbers for each instruction).

The performance benefits you gain from making use of the vectorised Advanced SIMD instructions will depend on the blend of instructions you intend to use in the A32 and Advanced SIMD portions of your algorithm. Moving the data around too often will have the obvious impact on your execution speed - the more time you spend moving data, the less time you are spending doing the work you intend to do!

The instruction interleaving you propose above is a common way to expose instruction level parallelism, and is likely to work well within a single thread.
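
As a rough illustration of that interleaving, the C sketch below keeps an Advanced SIMD accumulation chain and an unrelated integer chain in the same loop body, so neither depends on the other. The function name `interleaved_chains`, the rotate-and-XOR integer work, and the requirement that `n` is a multiple of 4 are assumptions made for this example and do not come from the question. On targets other than AArch64 it falls back to plain scalar code.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical example: n must be a multiple of 4; s holds n/4 words. */
void interleaved_chains(const uint32_t *v, const uint64_t *s, size_t n,
                        uint32_t *vsum, uint64_t *sx)
{
    uint64_t x = 0;                     /* integer dependency chain */
#if defined(__aarch64__)
    uint32x4_t acc = vdupq_n_u32(0);    /* Advanced SIMD dependency chain */
    for (size_t i = 0; i < n; i += 4) {
        acc = vaddq_u32(acc, vld1q_u32(v + i));   /* vector add of 4 lanes */
        x = ((x << 1) | (x >> 63)) ^ s[i / 4];    /* scalar rotate + xor */
    }
    *vsum = vaddvq_u32(acc);            /* single vector-to-core move, after the loop */
#else
    uint32_t acc = 0;                   /* portable fallback, same results */
    for (size_t i = 0; i < n; i += 4) {
        acc += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
        x = ((x << 1) | (x >> 63)) ^ s[i / 4];
    }
    *vsum = acc;
#endif
    *sx = x;
}
```

The two chains only meet after the loop, so data crosses from the vector registers to the core registers once rather than on every iteration.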

Thanks for your answer. The main algorithm which I am trying to implement is multi-precision arithmetic like multi-word multiplication. In order to do that for relatively big integers, I multiply my operands using ASIMD vectors, but since there is no `ADCS` in ASIMD, I was thinking of using A32 instructions for carry propagation. I am not sure if it's a good approach, but I am going to give it a try and evaluate the performance — A23149577, Mar 09 '16 at 23:01
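
One way to picture the split described in this comment is to accumulate the partial products without any carries first (the part that could plausibly map onto widening vector multiplies) and then resolve all carries in a single scalar pass. The sketch below is portable C rather than assembly. The function name `mul_deferred_carry`, the 4-limb operand size, and the 32-bit limb width are assumptions chosen for illustration, and nothing here is tuned for Cortex-A53 or Cortex-A57.

```c
#include <stddef.h>
#include <stdint.h>

#define LIMBS 4   /* assumed operand size: 4 x 32-bit limbs, little-endian */

/* r (2*LIMBS limbs) = a * b, with carry propagation deferred to the end. */
void mul_deferred_carry(uint32_t r[2 * LIMBS],
                        const uint32_t a[LIMBS],
                        const uint32_t b[LIMBS])
{
    uint64_t col[2 * LIMBS] = {0};

    /* Phase 1: carry-free accumulation. Each 64-bit product is split into
     * 32-bit halves, so a column sums at most 2*LIMBS values below 2^32
     * and cannot overflow its 64-bit accumulator. */
    for (size_t i = 0; i < LIMBS; i++) {
        for (size_t j = 0; j < LIMBS; j++) {
            uint64_t p = (uint64_t)a[i] * b[j];
            col[i + j]     += (uint32_t)p;   /* low half */
            col[i + j + 1] += p >> 32;       /* high half */
        }
    }

    /* Phase 2: one sequential carry chain (the ADCS-like scalar part). */
    uint64_t carry = 0;
    for (size_t k = 0; k < 2 * LIMBS; k++) {
        uint64_t t = col[k] + carry;
        r[k] = (uint32_t)t;
        carry = t >> 32;
    }
}
```

Because the columns in phase 1 are independent of one another, only phase 2 is inherently serial, which keeps the number of transfers between vector and core registers down to one per column.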

score 2 · Answer 2 · answered Mar 09 '16 at 22:41

I am not sure what you mean by "in parallel". Neither Cortex-A53 nor Cortex-A57 supports hardware multithreading (although it is possible to have several CPUs on the same chip, which is a different matter).

What you can do, however, on Cortex-A57 (certainly less so on A53) is to use the fact that execution is mostly out-of-order. So if you don't have dependencies between the instructions, a long-latency instruction can execute, and during that time the shorter instructions can execute as well. But exploiting this deliberately is very difficult, and the best approach may be to trust that the CPU will do as much out-of-order execution as it can.
