NEON Assembly

I am trying to understand ARMv8 NEON. Let me give an example of what I am trying to do.

I load 16 bytes (pixels in uchar) from array A. Now I want to try a "lengthening add" to ushort. From the documentation, I see that UADDL and UADDL2 will do a lengthening add for the lower half and upper half of the source registers respectively. I could write the following code to get it working:

ld1 {V10.16B}, [x0]

uaddl V11.8H, V10.8B, V10.8B    
uaddl2 V12.8H, V10.16B, V10.16B 

st1 {V11.8H}, [x1], #16 
st1 {V12.8H}, [x1], #16 

NEON Intrinsics

Coming to NEON intrinsics, the syntax is as follows (refer to page 8):

uint16x8_t vaddl_u8 (uint8x8_t a, uint8x8_t b)
uint16x8_t vaddl_high_u8 (uint8x16_t a, uint8x16_t b)

Here, the inputs to the two functions are of different types.

So once I load a uint8x16_t variable, how am I supposed to pass this variable to vaddl_u8? Is there any casting I can do? Or do I have to copy the lower half to another variable? (That would mean an extra cost.)

So my question is, how can I implement this piece of assembly code with NEON intrinsics?


UPDATE

  1. I am using aarch64-linux-gnu-g++ (GCC 5.4.0) on Ubuntu 16.04.
Abid Rahman K

1 Answer

You should know that uint8x16_t and uint8x8_t are different data types.

Below is what I would do:

uint8x16_t a;
uint8x8_t low, high;
uint16x8_t b, c;
.
.
.
a = vld1q_u8(pSrc);

low = vget_low_u8(a);
high = vget_high_u8(a);

b = vaddl_u8(low, low);    // widening add of the lower 8 bytes
c = vaddl_u8(high, high);  // widening add of the upper 8 bytes

vst1q_u16(pDst, b);        // pDst is a uint16_t *
vst1q_u16(pDst + 8, c);
pDst += 16;

BTW, may I ask where you got vaddl_high_u8 from?

The auto-completion on Android Studio 3.0.1 doesn't show it as a viable option.

Jake 'Alquimista' LEE
  • Updated question. I am using gcc in Ubuntu 16.04 – Abid Rahman K Dec 11 '17 at 18:12
  • Hi, thanks for the answer. I didn't know about vget_low_u8. But it also leads to an extra cost, right? I mean, you need to call 2 extra instructions for rearranging the data. But in assembly, it is not required. So I wonder if there is anything in intrinsics that can directly pass the lower half of the register to the vaddl function. – Abid Rahman K Dec 11 '17 at 18:45
  • @AbidRahmanK Not at all, provided the compiler is halfway decent. Things like `vget_low/high` and `vreinterpret` don't translate to additional instructions. That's how it is supposed to be, at least. They are just like typecasting in C, which largely doesn't cost any extra cycles. If you can write assembly, why bother with intrinsics? There is really no point since even the most recent compilers generate FUBAR machine code. – Jake 'Alquimista' LEE Dec 11 '17 at 18:50
  • As per the ARM intrinsics guide, vget_low is mapped to `DUP`, which takes `8 cc` as per the Cortex-A57 optimization guide. Isn't that significant? Regarding assembly vs. intrinsics, I am a little confused about which one to choose. As per this video [https://youtu.be/NYFzidaS3Z4?t=1856], the compiler can generate faster code from intrinsics. But many people say the opposite. So I have been trying both intrinsics and asm, but I don't see much difference in performance for smaller examples. So I am clueless here. – Abid Rahman K Dec 11 '17 at 19:01
  • @AbidRahmanK Every assembly programmer chuckles upon hearing that compiler-generated code was faster: that's physically impossible. I don't mind these people bragging about their incompetence though. It merely means that the compiler-generated code is faster than THEIR lackluster assembly counterparts, to get the fact straight. And there you have your example. If `vget_low` gets translated to `vdup`, that particular compiler just sucks, because there is no other way around it. – Jake 'Alquimista' LEE Dec 11 '17 at 19:16
  • Mm. It seems I need to explore a little more before settling down. Actually, I found one more: `LD1 {V0.4S-V1.4S}` doesn't have a corresponding intrinsic in the compiler I am using, while the intrinsics guide says otherwise. By the way, thanks for your answer and comments. It helps a lot, especially at this beginning stage. (I will just keep this thread alive for one more day, just in case anyone else wants to add any further info. If not, I will accept your answer straight away.) – Abid Rahman K Dec 11 '17 at 19:22
  • @AbidRahmanK You just brought up another point. If you for example want to load a 4x4 matrix into a `float32x4x4_t` variable, you have to read it with `vld1q` line by line in intrinsics while it's a single instruction in assembly `ld1 {v0.4s-v3.4s}, [x0], #64`. I wouldn't mind typing four lines in C, but the problem is that these four lines get compiled to four instructions on Android Studio(Clang 4.9). What a BS. – Jake 'Alquimista' LEE Dec 11 '17 at 19:30
  • @AbidRahmanK: gcc6.3 also seems to be unusably horrible for this. `vget_low_u8` / high does compile to actual `dup` instructions instead of just using the `d0` register that aliases `q0` for get_low. (https://godbolt.org/g/8sfYoS). If you try to avoid the vget_high (which can't optimize away on AArch64) by using `vaddl_high_u8`, then `vget_low` compiles to a store q0 / reload d0. (And if you add the low half first, then it also has to reload q0 again.) Jake's right: ARM compilers are still unusably dumb with intrinsics, unlike x86 where intrinsics pretty reliably compile to decent code. – Peter Cordes Dec 11 '17 at 20:24
  • @Jake'Alquimista'LEE: IDK why your IDE doesn't show `vaddl_high_u8`, but it's a real intrinsic. If you want the compiler to emit `uaddl2`, you should use it instead of `vget_high_u8` / `vaddl_u8`. It's documented in ARM's [PDF intrinsic guides at least](http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf), and [this one](http://infocenter.arm.com/help/topic/com.arm.doc.ihi0053c/IHI0053C_acle_2_0.pdf). GCC provides it in `arm_neon.h` for AArch64 (see godbolt link in previous comment). – Peter Cordes Dec 11 '17 at 20:30
  • @PeterCordes I really hope that x86 compilers are good enough with AVX intrinsics since x86 assembly isn't as easy as ARM. And no, even the most recent version of Android Studio doesn't know `vaddl_high_u8`. And here's another point against intrinsics: the alleged portability is non-existent. Writing static libraries in assembly is the most reliable and portable way, and likely the least head-ache as well. – Jake 'Alquimista' LEE Dec 11 '17 at 20:46
  • @PeterCordes BTW, could we message chat some time? How does it work on StackOverflow? – Jake 'Alquimista' LEE Dec 11 '17 at 20:49
  • Yes, gcc and clang for x86 are never this stupid when you use stuff like `_mm256_castsi256_si128` (the AVX equivalent of `vget_low`). It always compiles to zero instructions; they just use the XMM register with the same number as the YMM register it aliases. They're *far* from perfect and do some dumb stuff, though. e.g. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68923 re: missed optimizations for the clumsy workarounds you have to use due to design limitations in SSE intrinsics. (such as https://stackoverflow.com/q/39318496/224132) – Peter Cordes Dec 11 '17 at 20:53
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/160967/discussion-between-peter-cordes-and-jake-alquimista-lee). – Peter Cordes Dec 11 '17 at 20:54
  • Oh, lots of discussion has happened. Bad luck for me, I am in a different time zone. @PeterCordes: Thanks for that godbolt website. It is awesome. – Abid Rahman K Dec 12 '17 at 05:49
  • @AbidRahmanK: see https://stackoverflow.com/questions/38552116/how-to-remove-noise-from-gcc-clang-assembly-output, especially the link to Matt Godbolt's CppCon2017 talk: [“What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”](https://youtu.be/bSkpMdDe4g4) – Peter Cordes Dec 12 '17 at 05:53
  • @PeterCordes: Wow, I just found that video and bookmarked it last night to watch later. Nice coincidence. Thanks. – Abid Rahman K Dec 12 '17 at 05:54
  • Can you also execute code on the godbolt site? Or is it compile-only? – Abid Rahman K Dec 12 '17 at 05:55
  • @AbidRahmanK: compile only, at least for now. There are some other online IDE sites that let you run code. (If Matt ever does add execution support, it will probably only be for x86, unless he sets up `qemu` or something to execute non-x86 binaries.) – Peter Cordes Dec 12 '17 at 05:57
  • The GCC implementation of `vget_low_u8 (x)` is not efficient. We're implementing it as `vreinterpret_u8 (vcreate_u64 (vgetq_lane_u64 (vreinterpretq_u64 (x))))` That's a few levels of indirection for the compiler to see through for what should be a simple bit extract. But what really confuses GCC is that you're planning to use the wider register again. That stops it from eliminating the obviously redundant `dup` operation. Please feel free to raise a bug report against GCC (though GCC 5 is no longer supported)! – James Greenhalgh Dec 13 '17 at 16:06
  • @JamesGreenhalgh That's my point. Why look for workarounds if you can give orders directly? And who guarantees that my code written in intrinsics will be fine on the next version of the particular toolchain? Why should I waste time writing in intrinsics and checking the disassembly every time to see whether the compiler messed anything up? All in all, writing NEON routines in intrinsics is pure nonsense. – Jake 'Alquimista' LEE Dec 13 '17 at 16:33
  • What do you do with your carefully tuned inline assembly when you move to a new microarchitecture, which doesn't have a pipeline structure like Cortex-A8 or Cortex-A9? This is one of the things we mean by portable. If the compiler could be trusted to generate efficient code, it could also allow you an easier route from "efficient for Cortex-A8" to "efficient for all Armv7-A/Armv8-A implementations" by recompiling with a new `-mtune=` value. Bugs like this don't help the reputation of intrinsics, but we (compiler writers) need to have them in our bug trackers if we're ever to start fixing them! – James Greenhalgh Dec 13 '17 at 16:42
  • @JamesGreenhalgh I don't use inline assembly. And codes I've written in assembly always mopped the floor with the intrinsics generated ones regardless of the compile options. And when it comes to NEON scheduling, I don't see how my A8 optimized codes can be better scheduled for A9 or A15. You may be right about the integer core, but you will see no difference in NEON instructions no matter how you change the compile options. – Jake 'Alquimista' LEE Dec 13 '17 at 17:41
  • @JamesGreenhalgh I suggest you write a simple 4x4 matrix multiplication (float, complex) on `aarch32` and 8x8 on `aarch64`, both in intrinsics and assembly. Then you will realize that intrinsics are beyond any hope regardless of all the fancy compile options. – Jake 'Alquimista' LEE Dec 13 '17 at 17:47
  • That's not true, even for your simple code here turning off scheduling, or changing the `-mcpu` options modifies code generation; see https://godbolt.org/g/1BTsAz . Scheduling of Advanced SIMD instructions is as important as scheduling on the integer side. For some intrinsics (which are implemented as inline assembly blocks in `arm_neon.h`) I agree, GCC won't touch them. For most other intrinsics, GCC will schedule according to the model it has of your `-mcpu` value. If you are seeing bad code generation, please raise it on bugzilla - that's the only way we (the GCC developers) will see it. – James Greenhalgh Dec 13 '17 at 17:57
  • @JamesGreenhalgh I should be in bed now, but you made me curious. Yes, you are right, the generated machine codes are different depending on the -mcpu option. The questions are however, you don't use this option on big.LITTLE configurations, and more importantly, I don't think that these differences will bring any measurable gain in the end: they all look messed up to me. Slightly less dumb doesn't translate to smart. I also did my part, and wrote 4x4 matrix multiplication (float, complex) in intrinsics, assembly. – Jake 'Alquimista' LEE Dec 13 '17 at 21:56
  • @JamesGreenhalgh https://godbolt.org/g/CLPbeq The URL gets too long if I include the asm code I wrote, hence I had to remove it. I also compiled the intrinsics function with GCC 7.2.1 that deals vget_low/high halfway decently. Still, there are way too many junk instruction such as `vmov` for no particular reason. No, NEON intrinsics are still unusable even on the most recent GCC version 7.2.1. The moment the register bank spills, it's game over. You can convince yourself by clicking the link above. – Jake 'Alquimista' LEE Dec 13 '17 at 22:09
  • And just as I expected, the generated `aarch64` code compiled with GCC 7.2.1 is rather decent, much better than ARMv7 version. But I told you, it should have been a 8x8 multiplication on 64 bit due to the larger register bank on `aarch64` for a meaningful test. – Jake 'Alquimista' LEE Dec 13 '17 at 22:13
  • I agree that the AArch64 code is unusual with all the `dup` instructions, rather than using by element forms of `fmla` - that has a big impact on performance (that's a GCC bug I've tried to fix in the past). Your matrix multiply shows a big difference when scheduling for different cores. On big.LITTLE configurations we schedule for the little core. We expect the big core's Out-Of-Order execution capability to mean that scheduling is less important for it. I would hope that (in general) AArch64 intrinsics performance is OK. Really good assembler programmers will still beat the compiler. – James Greenhalgh Dec 14 '17 at 09:29
  • @JamesGreenhalgh I've been performing benchmarks on some codes with different architecture setting, but the results have all been within error range. The truth is that the difference in scheduling manages to hide some latencies here and there, at the cost of other interlocks elsewhere. I really question the usefulness of NEON intrinsics in the first place. They are nothing else than more or less direct translation from the instruction set, thus not easier than the instructions themselves. And more importantly, NEON assembly is so much easier than dealing with the integer core. – Jake 'Alquimista' LEE Dec 14 '17 at 10:04
  • @JamesGreenhalgh If you look at some recent questions here, you will notice that some people misunderstood `vld2` and `vld4` as 2x or 4x `vld1`. This wouldn't have happened if they weren't learning NEON by intrinsics. NEON intrinsics is more confusing to beginners, and less convenient and efficient than assembly to experienced ones. And it isn't as portable as some people might think. What's the point in using intrinsics? – Jake 'Alquimista' LEE Dec 14 '17 at 10:09