Alignment requirements for uint8x16_t being loaded from byte array?

Question

We have an assert firing under Debug builds that checks for alignment. The assert is for a byte array that's loaded into a uint8x16_t using vld1q_u8. While the assert fires, we have not observed a SIG_BUS.

Here's the use in code:

const byte* input = ...;
...

assert(IsAlignedOn(input, GetAlignmentOf(uint8x16_t));
uint64x2_t message = vreinterpretq_u64_u8(vld1q_u8(input));

I also tried with the following, and the assert fires for the alignment of uint8_t*:

assert(IsAlignedOn(input, GetAlignmentOf(uint8_t*));
uint64x2_t message = vreinterpretq_u64_u8(vld1q_u8(input));

What are the alignment requirements for the byte array when loading it into a uint8x16_t with vld1q_u8?

In the above code, input is a function paramter. IsAlignedOn checks the alignment of its two arguments, ensuring the first is aligned to at least the second. GetAlignmentOf is an abstraction that retrieves the alignment for a type or variable.

uint8x16_t and uint64x2_t are 128-bit ARM NEON vector datatypes that are expected to be placed in a Q register. vld1q_u8 is a NEON pseudo instruction that is expected to be compiled into VLD1.8 instruction. vreinterpretq_u64_u8 is an NEON pseudo instruction that eases use of the datatypes.

@Olaf - I'm not sure you are correct. They are intrinsics, [which are a C language extension](http://gcc.gnu.org/onlinedocs/gcc/ARM-C-Language-Extensions-_0028ACLE_0029.html). The cited GCC doc refers to the ARM documents, so you should have both references if you want to read about them. — jww, May 28 '16 at 18:43
Please provide a reference to where the C standard allows a syntax like `GetAlignmentOf`! Re your edit: provide a [mcve] with the declaration of the variable `uint8x16_t`. And the alignment of a byte array is defined to be `1` by the standard. — too honest for this site, May 28 '16 at 18:45
Why not use the standard `_Alignof` resp `_Alignas` operators? — too honest for this site, May 28 '16 at 18:50
@Olaf - our sources are used to build for multiple platforms and multiple compilers. The compilers include GCC, Clang, MSVC. The platforms include Linux, Windows Phone and Windows Store. `GetAlignmentOf` is just an abstraction. (Many folks don't realize Microsoft compilers consume ARM intrinsics). — jww, May 28 '16 at 18:55
This is almost entirely compiler-specific, as it depends on exactly how they implement the intrinsic types and whether or not they want to add the alignment hint to the underlying instructions. From what I've seen, GCC never emits the hint even when alignment is guaranteed; Clang tends to do so wherever it can; no idea about MSVC. As to whether any of them implement the vector types properly or just typedef them to something like `struct {long long[2]}` (with resulting overly-strict alignment) I've never looked. — Notlikethat, May 28 '16 at 19:16
I gave you a hint already. no need to be rude. `_Alignas` etc. are standard C. Just use a modern compiler which is not stuck with a 27 year old version of the standard. — too honest for this site, May 28 '16 at 21:18
@Olaf - `_Alignof` and `_Alignas` are C11 extensions. Its not as simple as *"Just use a modern compiler"*. Our governance dictates we support compilers dating back to Visual Studio .Net 2002 and GCC 3.2. We don't subscribe to the "lets abandon 6th month old software" development model pioneered by companies like Apple and Microsoft, or warez like Browsers. — jww, May 28 '16 at 23:02
Hang on, is `IsAlignedOn()` checking the address _of_ `input`, or the address _pointed to by_ `input`? Without seeing a definition it's not clear whether the code here is even doing the right thing. (but if it is actually a wrapper for checking the _value_ of a pointer than it's a hideously ambiguous name) — Notlikethat, May 28 '16 at 23:27
They are not extensions, but part of the only valid version of standard C. And sticking to old rubbish is not the same as ensuring high coding quality. I somewhat doubt these old compilers actually do support the modern features you ask for. Anyway, you did ask about C, which implies standard C, thus C11. Anything else is **not** standard C — too honest for this site, May 28 '16 at 23:28
@Olaf - Unfortunately, we don't share the same views. Also, I tagged with C, and not C11. You removed the C tag, and the C11 tag was *never* present. You seem to be the only person claiming C11 here (even after we we gave you the broad support matrix). — jww, May 29 '16 at 19:14
1) Sorry for removing the C tag, that was wrong. 2) The C tag implies standard C, which currently **is** C11. (See the info). If you use something outdated, use the appropriate tag. 3) "We"? Pluralis majestatis? — too honest for this site, May 29 '16 at 19:42
@Olaf - "We" are the [Crypto++ project](http://www.cryptopp.com/). If you'd like us to abandon past compiler support, then you should raise an issue on our bug tracker or mailing list. I'll continue to tag with the generic C and C++ tags until I have a specific question about C89, C99, C11, C++03, C++11, etc. At that time, I will place the specialized tag. — jww, May 29 '16 at 22:24
As you refuse to tag correctly for pre-C99, complaining about getting tips with C11 features is somewhat strange then. — too honest for this site, May 29 '16 at 23:35
@Olaf - We've told you the versions of compilers we support. Your unwillingness to accept it does not matter to me or the project in the least bit. What would you like to argue about next? — jww, May 30 '16 at 00:02

score 5 · Answer 1 · answered May 28 '16 at 20:09

5

The natural alignment of a VLD1.8 instruction, loading 16 bytes to a Quad register, is a byte. This means that even if unaligned transfers are not permitted, this instruction cannot fault.

So it looks like this specific assertion is not correct.

answered May 28 '16 at 20:09

Dric512

3,525
1
20
27

Although `VLD1.8 ..., [Rn:64]` can certainly fault even under the normal unaligned access model. – Notlikethat May 28 '16 at 20:23
@Dric512 - Is it a `byte` or a `byte*`? I think the difference is 1 and 4. At this point, about all I know is its not a `uint8x16_t` because I'm not seeing a `SIG_BUS` for lack of 16-byte alignments. I checked in a [patch to back-off the assert to an `uint8_t*`](http://github.com/weidai11/cryptopp/commit/b86f3fef8716436705b2963baea350beebb1d790), so I should have some results soon from [our test script](http://github.com/weidai11/cryptopp/blob/arm-neon/cryptest.sh). – jww May 28 '16 at 21:01
Normally one uses such optimisations to speed up the code. Unaligned accesses actually do the opisite on many platforms. E.g. they might be broken into accesses for 1/2/4/... byte-wide accesses. – too honest for this site May 28 '16 at 21:17
@Dric512 - OK, so its not the `byte*` or `uint8_t*` either. The assert is still firing; but its not `SIG_BUS`'ing, either. Also, this is showing up on ARMv8 and GCC 4.9, so the alignment could be 8. – jww May 28 '16 at 23:07
@jww. Instead of an actual `assert`, maybe test with something that lets you print out more debug info at that point. e.g. print the actual address you were going to load from. Or just trigger the assert with a debugger attached. Then you wouldn't have to guess what the alignment was, and moreover you could see which function was passing in unaligned pointers. – Peter Cordes May 29 '16 at 03:26
@PeterCordes - Good suggestion, thanks. I think part of the problem is the rules mildly change with respect to traditional alignment when working MMX or NEON coprocessors. I know GCC will fixup data that is traditionally aligned but unaligned for MMX or NEON. – jww May 29 '16 at 22:43
Unless you specifically add alignment enforcement to the instruction (as with Notlikethat's comment), NEON load/store instructions are unaligned, with performance issues maybe on some implementations. Pointers are typically unaligned. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473m/dom1359731171041.html. It seems like this macro is incorrect. After doing some Googling, I found some source code from crypto (http://btc.yt/lxr/satoshi/source/cryptopp/misc.h?v=0.3.20.01_closest#0334) that looks like it defines this macro, and it is completely incorrect for most platforms. – Peter M May 31 '16 at 22:07
`uint8x16_t` is mostly equivalent to `byte[16]`, fitting a Quad register. @Peter M: Even if unaligned transfers are not permitted, this will not abort (Unless the form `Rn:64` is used, as described by @NotLikethat) – Dric512 Jun 01 '16 at 19:22

score 5 · Answer 2 · answered May 31 '16 at 06:07

When writing direct assembler (either inline or in external files) you can choose whether you want to specify the alignment (e.g. vld1.8 {q0}, [r0, :64]) or leave it out (e.g. vld1.8 {q0}, [r0]). If it isn't specified, it doesn't require any specific alignment at all, as Dric512 says.

When using vld1q_u8 via intrinsics, you don't ever actually specify the alignment, so as far as I know, the compiler doesn't assume it, and produces the instruction without alignment specification. I'm not sure if some compilers can deduce some cases where alignment actually is guaranteed and use the alignment specifier in those cases. (Both gcc, clang and MSVC seem to produce vld1.8 without alignment specifiers in this particular case.)

Do note that this is only an issue on 32 bit arm; in AArch64, there's no alignment specifier to the ld1 instruction. But even there, alignment still obviously helps, you'll get worse performance if you use it with unaligned addresses.

Notlikethat · Answer 3 · 2016-06-02T07:29:21.193

Looking at this from the other end, here's an actual definition of that type from the point of view of one example compiler (Visual Studio 2015's arm_neon.h):

typedef union __declspec(intrin_type) _ADVSIMD_ALIGN(8) __n128
{
     unsigned __int64   n128_u64[2];
     unsigned __int32   n128_u32[4];
     unsigned __int16   n128_u16[8];
     unsigned __int8    n128_u8[16];
     __int64            n128_i64[2];
     __int32            n128_i32[4];
     __int16            n128_i16[8];
     __int8             n128_i8[16];
     float              n128_f32[4];

    struct
    {
        __n64  low64;
        __n64  high64;
    } DUMMYNEONSTRUCT;

} __n128;

...

typedef __n128   int8x16_t;

So, on Windows platforms at least, it's going to require no less than the alignment of an __int64 thanks to that union, and from the AAPCS that means 8 bytes (and even without a not-very-challenging guess at what _ADVSIMD_ALIGN(8) could possibly mean...)

It's even more straightforward than that, though, because it turns out said AAPCS does actually have the last word in this directly, via its definition of vector types in terms of containerized vectors (§4.1.2):

The content of a containerized vector is opaque to most of the procedure call standard: the only defined aspect of its layout is the mapping between the memory format (the way a fundamental type is stored in memory) and different classes of register at a procedure call interface.

In other words, at the ABI level a vector type is a vector type, regardless of what may or may not be in it, and both 64-bit and 128-bit containerized vectors require 8-byte alignment because the ABI says so (§4.1). Thus regardless of what the underlying instructions might be capable of, the Microsoft implementation isn't even being overly strict as I initially surmised, it's simply conformant. Eight shall be the number thou shalt align, and the number of the aligning shall be eight.

The argument to vld1q_u8(), on the other hand, is a uint8_t const *, whose pointed-to data has no alignment requirement, thus asserting that it meets 8-byte alignment can be expected to fail quite a lot.

Isn't this slightly orthogonal? This is about what alignment an int8x16_t has when stored somewhere, but in most cases, you'd want it to only stay in a NEON register. This doesn't affect the case that you can load data into it from any pointer pointing to an unaligned address using `vld1q_u8()`, which I believe was what the OP asked about. — mstorsjo, Jun 02 '16 at 05:46
@mstorsjo The other answers address the _direct_ question pretty well already. I thought it seemed worth also clarifying exactly why the code being asked about is faulty, although it seems I did leave the conclusion entirely implicit - fixed! — Notlikethat, Jun 02 '16 at 07:28
Notlikethat and mstorsjo - I think these are the controlling document from ARM: [VLDn and VSTn (single n-element structure to one lane)](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489f/CIHCADCI.html), [VLDn (single n-element structure to all lanes)](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489f/CIHCADCI.html) and [VLDn and VSTn (multiple n-element structures)](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489f/CIHCADCI.html). I could not re-find them when I wanted to very things (and I can't ask for a reference on SO). — jww, Jun 02 '16 at 23:36
The ABI is referring to the alignment of the vector types such as int8x8_t when they are accessed as such types. This occurs if you pass them as parameters to functions, or include them in struct definitions. In the case of vld1_s8, the argument is a normal C int8_t*, which has the normal C rules for its alignment — Charles Baylis, Jun 05 '16 at 00:35

Alignment requirements for uint8x16_t being loaded from byte array?

3 Answers3