
I'm trying to use AVX instructions on 64-bit Windows. I'm comfortable with the g++ compiler, so I've been using that; however, there is a big bug reported here, and only very rough solutions were presented here.

Basically, a `__m256` variable needs 32-byte alignment to work properly with AVX instructions, and it can't be given that alignment on the stack.

The solutions presented in the other Stack Overflow question I linked are really terrible, especially if you have performance in mind. One is a Python program that you would have to run every time you want to debug, which replaces the aligned instructions with their sub-optimal unaligned equivalents. The other is over-allocating and doing a bunch of costly, hacky pointer math in code to get proper alignment. With the pointer-math solution, I think there is still a chance of a segfault, because you can't control the allocation of r-values / temporaries.
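For reference, the over-allocation approach can be sketched roughly as follows. This is a minimal illustration, not code from the linked question: `align_up_32`, `AlignedFloats`, `alloc_aligned_floats`, and `free_aligned_floats` are hypothetical names made up for this example. It only covers heap buffers that you allocate yourself; it does nothing for `__m256` temporaries that the compiler spills to the stack.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: round a raw pointer up to the next 32-byte boundary.
inline void* align_up_32(void* p) {
    auto addr = reinterpret_cast<std::uintptr_t>(p);
    addr = (addr + 31) & ~static_cast<std::uintptr_t>(31);
    return reinterpret_cast<void*>(addr);
}

// Hypothetical owner of an over-allocated float buffer.
struct AlignedFloats {
    void*  raw;   // pointer returned by malloc; this is what must be freed
    float* data;  // 32-byte-aligned view into the same allocation
};

// Allocate `count` floats plus 31 spare bytes so that an aligned
// start address always fits inside the allocation.
inline AlignedFloats alloc_aligned_floats(std::size_t count) {
    void* raw = std::malloc(count * sizeof(float) + 31);
    float* data = raw ? static_cast<float*>(align_up_32(raw)) : nullptr;
    assert(data == nullptr ||
           reinterpret_cast<std::uintptr_t>(data) % 32 == 0);
    return AlignedFloats{raw, data};
}

// Free through the original pointer, never through the aligned one.
inline void free_aligned_floats(AlignedFloats& buf) {
    std::free(buf.raw);
    buf.raw = nullptr;
    buf.data = nullptr;
}
```

The extra cost is one add and one mask per allocation, plus up to 31 wasted bytes; the real limitation is that it only helps for memory you allocate explicitly.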

I'm looking for an easier and cheaper solution. I would prefer not to switch compilers, but I will if that's the best solution. However, my very limited understanding of the bug is that it is intrinsic to 64-bit Windows, so would switching compilers help, or do other compilers have the same issue?

Peter Cordes
Thomas
  • Doesn't MinGW-w64 have a 32-bit compilation option? – CinchBlue Jun 19 '15 at 01:02
  • An extra 32 bytes and some simple pointer math isn't costly when compared to anything you would need 256-bit AVX instructions for. – Ross Ridge Jun 19 '15 at 01:04
  • @VermillionAzure 64-bit is pretty important to my application – Thomas Jun 19 '15 at 01:05
  • @RossRidge: That's not really relevant to this question. The underlying problem is that it's not safe to use AVX instructions in mingw-w64, because it apparently can't align the stack to 32 bytes because it isn't supported by the Windows x64 ABI. Therefore, if you use a `__m256` type and the compiler has to spill it onto the stack, you can end up with segmentation faults because it will try to use aligned instructions to move it to/from the stack. It seems like one fix on the compiler side would be to use unaligned moves in this case, but I don't know how feasible that change would be. – Jason R Jun 19 '15 at 01:37
  • @JasonR You've apparently completely misunderstood what I wrote. – Ross Ridge Jun 19 '15 at 01:54
  • @JasonR Even if you wrapped `__m256` to get correct alignment with hacky code, the AVX intrinsics still return `__m256`, which means that if your code requires temporaries, there's always a chance a `__m256` temporary would spill out of registers onto the stack, and then the segfault will happen, right? So this isn't even a real solution. – Thomas Jun 19 '15 at 02:06
  • @RossRidge: I'm not sure what you meant, then. It sounds like you're advocating for some kind of manual implementation of the required alignment. Such hacks aren't really feasible in this case (as there are numerous manipulations of `__m256` instances, like temporaries, that only the compiler has control over), but if I'm misunderstanding your recommendation, perhaps you could clarify it. – Jason R Jun 19 '15 at 02:13
  • @Ragdoll: Exactly; there's no good solution to this problem achievable by just working around the issue in your source code. You would need some level of support at the compiler level to make this feasible. One potential solution would be for the compiler to emit unaligned move instructions when moving to/from the stack. That's essentially what the Python script that you linked does. Unfortunately, contemporary processors have a performance penalty for unaligned 256-bit moves (although 128-bit unaligned moves have been full-speed since the Nehalem architecture). – Jason R Jun 19 '15 at 02:16
  • I assume this is undesirable for other reasons, but one obvious workaround would be to pass the affected variables by reference rather than by value. – Harry Johnston Jun 19 '15 at 02:32
  • @JasonR I was only pointing out an inconsistency in the question. If he has a real use for AVX instructions, then the costs he was bitterly complaining about are insignificant. Indeed, pretty much those same costs will be paid by having the compiler align the stack automatically. – Ross Ridge Jun 19 '15 at 02:32
  • @RossRidge: Agreed. If there was a robust way to implement the manual alignment then it would certainly be a viable solution for any proper application of SIMD instructions. – Jason R Jun 19 '15 at 02:34
  • Both the Microsoft and Intel compilers manually align the stack at the start of each function call that uses AVX. Why GCC doesn't do this might be [related to exception handling](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412#c4). – Mysticial Jun 19 '15 at 14:00
  • @Mysticial Any idea where clang stands on the issue? – Thomas Jun 19 '15 at 19:00
  • @JasonR, when you say `That's not really relevant to this question. The underlying problem is that it's not safe to use AVX instructions in mingw-w64, because it apparently can't align the stack to 32 bytes because it isn't supported by the Windows x64 ABI.` do you mean AVX isn't available on Windows? Because it is. Also see Ross's answer: `Despite what Kai Tietz said in the bug report you linked, Microsoft's x64 ABI does allow a compiler to give variables a greater than 16-byte alignment on the stack.` – Royi Apr 08 '18 at 13:00
  • @Royi Coming from the same GCC bug, MSVC and ICC for Windows don't actually align the stack itself. Instead, they clobber an extra register that points to an aligned portion on the stack. (`r13` in the case of ICC.) All local variables (as well as spilled ymm/zmm values) that require >16-byte alignment are then placed in this section. This also has nothing to do with MSVC and ICC using unaligned load/stores. They do that for a completely different reason (they unconditionally use unaligned access for everything). – Mysticial Apr 09 '18 at 22:17
  • @Mysticial, I really think your comment should go here - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412 and here - https://github.com/Alexpux/MSYS2-packages/issues/1209 (Maybe also here - https://sourceforge.net/p/mingw-w64/mailman/message/34485783/). People trying to fix it could use your knowledge (I'm not an expert in those area). Thank You. – Royi Apr 10 '18 at 04:13

1 Answer


You can solve this problem by switching to Microsoft's 64-bit C/C++ compiler. The problem is not intrinsic to 64-bit Windows. Despite what Kai Tietz said in the bug report you linked, Microsoft's x64 ABI does allow a compiler to give variables a greater than 16-byte alignment on the stack.

Also, Cygwin's 64-bit version of GCC 4.9.2 can give variables 32-byte alignment on the stack.

Clang for Windows also produces working AVX executables, and it is a good choice because it optimizes well.
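If you want a quick sanity check of whichever compiler you end up trying, something like the sketch below may help. It is only a rough diagnostic under a few assumptions: `Vec8` is a made-up stand-in with the same 32-byte alignment requirement as `__m256` (so the file builds without AVX flags), and `is_aligned_32` and `stack_locals_aligned` are hypothetical helper names. A passing result shows that named over-aligned locals end up on 32-byte boundaries; it does not prove that every spilled `__m256` temporary will.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for __m256: eight floats that require 32-byte alignment.
struct alignas(32) Vec8 {
    float lane[8];
};

// True if the address is a multiple of 32.
inline bool is_aligned_32(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

// Checks a local Vec8 at several call depths. The volatile char is an
// odd-sized local meant to perturb the frame layout a little.
bool stack_locals_aligned(int depth) {
    volatile char pad = static_cast<char>(depth);
    Vec8 v{};
    v.lane[0] = static_cast<float>(pad);
    bool ok = is_aligned_32(&v);
    if (depth > 0) {
        ok = stack_locals_aligned(depth - 1) && ok;
    }
    return ok;
}
```

Call `stack_locals_aligned(4)` from `main` and print or assert the result; a compiler that honors over-aligned stack variables should return `true`.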

Peter Cordes
Ross Ridge