4

When writing vectorized code, you sometimes want to perform memory-aligned operations.

So let's say I have an unsigned char[] that ends before a 16-byte boundary, but I want to load the entire 16-byte-aligned block at once (and then presumably mask off the data I didn't need).

What is the proper way to do this across Clang, MSVC, and GCC without triggering undefined behavior?

(Let's not assume any particular vector types or operations - I'm hoping for approaches that work equally well for unaligned __m256i as they do for unaligned unsigned int.)

This is something that's safely possible in asm, that libc functions like strlen already do. But I'm asking about how to do it on these 3 mainstream compilers. It's illegal at the language level, at least according to the ISO standard, but do they have extensions or define the behaviour of doing this?

user541686
  • 205,094
  • 128
  • 528
  • 886
  • note this is perfectly fine and not undefined at all *for the CPU*. In particular, if any of a 16-byte block is valid to read without crashing, then the whole block is. So you just have to trick the compiler. – user253751 Mar 21 '23 at 03:04
  • 1
    If the array has been defined (or allocated) with sufficient size, then there is no undefined behaviour according to the C++ standard. The way to achieve this would be to always define or allocate your arrays as a multiple of `16` bytes (e.g. for an integral type that is 2 bytes, the array size will be a multiple of `8`) and keep track of the "end" index. Then, to copy, calculate the amount to be copied by rounding up the index to the next multiple of 16 bytes. Obviously, your code needs to ensure that no "end" causes copying past the actual size. – Peter Mar 21 '23 at 03:24
  • 3
    Theoretically there can still be problems when crossing memory page boundaries. Say 8 bytes in one page and 8 in the next. If the second page is inaccessible you will trigger a protection fault at the CPU level. – Pepijn Kramer Mar 21 '23 at 03:37
  • @Peter that's beside the point of the question. This is about when don't have control over your input. – user541686 Mar 21 '23 at 04:32
  • 2
    @PepijnKramer: aligned reads for SIMD don't cross pages. They're N-byte reads for N-byte alignment. That's the case I'm dealing with here - the case where we know the only problem is language-level UB. – user541686 Mar 21 '23 at 04:33
  • @user541686 Point is, if you want to avoid undefined behaviour in the sense of the C++ standard, then it is essential that you validate input and enforce constraints. If you don't (presumably in the name of some performance measure) then you're relying on compiler-specific hacks (in your case, with multiple compilers and - if you go that far and/or your code is eventually long-lived enough - multiple compiler versions), processor (family) specific behaviours, or both. – Peter Mar 21 '23 at 09:16
  • Re: "Let's not assume any particular vector types or operations" -- the best solution likely depends on both. E.g., for `__m128d`, the remaining elements is at most 1 scalar. And for many operations, you can simply do overlapping loads/stores (if input==output, make sure to load the last input before storing the last-but-one output). And if the input is large enough, it might not even matter, if you handle the last few elements inefficiently. – chtz Mar 21 '23 at 09:22
  • It's probably C++ UB to read outside the buffer at all, unless you consider mainstream compilers like GCC/clang as defining the behaviour for intrinsics. Hmm, maybe this isn't an exact duplicate because you're asking for code to actually do this, not just discussion of what would be safe. One strategy is to check that you're not within 16B of the end of a page (perhaps with `((uintptr_t)p & -4096u) <= (4096u-16)`), and if so then just load from the start of your buffer, otherwise fully bail out of SIMD since x86 doesn't have good variable-count vector byte shifts except via a LUT for pshufb – Peter Cordes Mar 21 '23 at 11:06
  • @PeterCordes: yeah the closure is frustrating, this is not a duplicate. I already know it's unsafe at the language level but safe at the hardware level. I didn't need a discussion on that, that's what prompted me to ask in the first place. I specifically asked for solutions to MSVC/Clang/GCC because it's obviously nonstandard. Re: your solution, I'm pretty sure it's UB on those 3 compilers and not a workaround. (Let's leave the "bail out of SIMD" case out of the discussion btw, it derails the question.) – user541686 Mar 21 '23 at 13:13
  • @chtz: it may not matter, or it may. We're talking about the case where it may. Even the "duplicate" question (which isn't one) refers to cases where it matters. Assume it matters or I wouldn't have asked the question. – user541686 Mar 21 '23 at 13:15
  • @Peter: Yes, compiler specific solutions is the point of the question. – user541686 Mar 21 '23 at 13:15
  • Are you talking about the case where your `unsigned char[]` is aligned by 16? Because if it's not, loading from the start of it will require an unaligned load; then you need to check if that load crosses into the next page, unless you can check a length and find out that there is a valid element in the next page as well. (Not an implicit-length C string) – Peter Cordes Mar 21 '23 at 13:24
  • @Peter: Yes, I know. If the array head is unaligned then I can round down and do an aligned read from earlier. If the array tail is unaligned then I can round up and to an aligned read until later. So whether aligned or not it's completely beside the point of the question, which is about the C++ language and not the x86 machine - I'm well aware how to avoid crossing pages. I'm just trying to figure out how to do these extra reads without language level UB. I only mentioned the tail since the problem is the same as for the head. – user541686 Mar 21 '23 at 16:13
  • I think it's at least de-facto well defined behaviour with the mainstream compilers, although I haven't looked at `clang -fsanitize=undefined -O2` to see if it would insert checks for e.g. a small global struct (so definitely not just one element of an array). And I suspect a decent amount of real-world code does it in practice, although that hasn't stopped compiler devs from making changes that broke NULL-pointer checks or code that relied on signed-integer overflow wrapping. – Peter Cordes Mar 21 '23 at 16:25
  • I suspect optimizations based on array-overread UB might remove any guarantee of getting a certain value from some other object that happened to be nearby, but I expect wouldn't break the values you get from the object the pointer itself was derived from. But of course it's a valid concern that a compiler might notice UB and treat it as `__builtin_unreachable()`, or something like that. – Peter Cordes Mar 21 '23 at 16:25
  • If you want C++ conformance (that is, as far as vector intrinsics go - vector types are not conforming because they need to violate strict aliasing rules) then your only option is to ensure that the array size is always a multiple of the vector size. If this is not possible then you must explicitly write tail processing, where the number of tail elements is less than the vector size. Otherwise, you would be relying on your compiler behavior, down to the particular combination of the compiler version, options and surrounding code. For anything beyond a toy project, such code would be broken. – Andrey Semashev Mar 22 '23 at 16:08
  • @AndreySemashev: I'm not asking for full C++ conformance. I'm just asking for code that will reliably work on the 3 major compilers I've listed. They can use nonportable compiler-specific tricks as long as they don't kill the optimizer and as along as I can achieve the same effect on all 3. – user541686 Mar 22 '23 at 19:21
  • @user541686 You either want a conforming program, in which case it will work with any conforming compiler, or you're not writing C++ and use assembler. Because, as I said before, each compiler has multiple versions, and there are multiple options each of them supports, and there are a million other things that may affect code generation and break your code. There aren't compiler-specific extensions that allow loads/stores beyond the end of an array, if that's what you're asking. – Andrey Semashev Mar 23 '23 at 20:14

0 Answers0