
I am parallelizing a certain dynamic programming problem using AVX2/SSE instructions.

In the main iteration of my calculation, I compute a column of a matrix where each cell is a structure of AVX2 registers (__m256i). I use values from the previous matrix column as input for calculating the current column. Columns can be big, so what I do is keep an array of structures (on the stack), where each structure has two __m256i elements.

Structure:

#include <immintrin.h>  // for __m256i

struct Cell {
  __m256i first;
  __m256i second;
};

And then I have an array like this: Cell prevColumn[N]. N will typically be a few hundred.
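To make the access pattern concrete, the main loop looks roughly like this (a sketch, using the Cell struct above; _mm256_add_epi8 just stands in for my real recurrence, which is more involved):

// One iteration: each cell of the current column is computed from the
// corresponding cell of the previous column.
void computeColumn(const Cell* prevColumn, Cell* currColumn, int N) {
  for (int i = 0; i < N; ++i) {
    currColumn[i].first  = _mm256_add_epi8(prevColumn[i].first,
                                           prevColumn[i].second);
    currColumn[i].second = prevColumn[i].first;
  }
}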

I know that __m256i basically represents an AVX2 register, so I am wondering how I should think about this array and how it behaves, since N is much larger than 16 (the number of AVX registers). Is it good practice to create such an array, or is there some better approach I should use when storing a lot of __m256i values that are going to be reused very soon?

Also, is there any alignment I should be doing with these structures? I read a lot about alignment, but I am still not sure how and when to do it exactly.
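For reference, this is the kind of thing I have seen suggested, though I am not sure whether it is needed here (a sketch; alignas is C++11, and _mm_malloc is only relevant if the array moves off the stack):

// A struct containing __m256i members is already given 32-byte
// alignment by the compiler, so this alignas(32) is redundant,
// but it makes the requirement explicit.
struct alignas(32) Cell {
  __m256i first;
  __m256i second;
};

// For a heap allocation, request the alignment manually:
void example(size_t N) {
  Cell* column = static_cast<Cell*>(_mm_malloc(N * sizeof(Cell), 32));
  // ... use column ...
  _mm_free(column);
}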

Martinsos
  • short answer: yes, you can create arrays, but the compiler will probably not optimize the array out into registers. – Cory Nelson May 10 '15 at 14:31
  • How big is N? Be aware of stack size limitations. – Paul R May 10 '15 at 16:19
  • @PaulR N will be about a few hundred! – Martinsos May 10 '15 at 21:55
  • @CoryNelson do you maybe have some idea how the compiler is going to handle it then? Does it mean that there will be a lot of loading/storing? Is there some better practice for this then? – Martinsos May 10 '15 at 21:56
  • OK - pretty small then - that should not be a problem on a desktop or server OS. – Paul R May 10 '15 at 21:56
  • Don't worry about the loading/storing - you only have 16 registers anyway - let the compiler and the L1 cache take care of everything for now. – Paul R May 10 '15 at 21:57
  • Instead of thinking of having an array of AVX/SSE values, think of it as having a [SoA or AoSoA](https://stackoverflow.com/questions/30022824/what-is-this-structure-called-simply-soa/30029176#30029176) which is SIMD friendly. – Z boson May 11 '15 at 08:50
  • @Zboson could you elaborate more on that? What I actually have is an array of structures, where each structure has two elements, both __m256i. I see in your answer that AoS is not good for SIMD, but I do not understand why? – Martinsos May 11 '15 at 08:57
  • Is this for 32-bit or 64-bit integers? – Z boson May 11 '15 at 09:11
  • @Zboson I am using the same algorithm for 8-bit, 16-bit and 32-bit integers (not mixed, of course); the precision is chosen at runtime, and I fit as many of them as I can in the register. So I am making a precision-parallelization tradeoff. But if I had to choose one to optimize for, I would pick 8-bit. I edited my question with information about the structure. – Martinsos May 11 '15 at 09:14
  • @Martinsos, `Cell` is a `SoA` when you store or read it (e.g. `typedef union __m256i { int8_t m256_i8[32]; int16_t m256_i16[16]; int32_t m256_i32[8]; } __m256i;`). Then in `Cell prevColumn [N]` the array `prevColumn` is an AoSoA. – Z boson May 11 '15 at 09:56
  • @Zboson Ok, thanks! So __m256i is actually an array, and that is what makes Cell a SoA. I am somewhat confused: I thought that __m256i is compiled into a register (is this totally wrong)? – Martinsos May 11 '15 at 14:14
  • It's either in a register or stored in memory. It's the same as any of the primitive data types, e.g. `int` is either stored in a register (e.g. rdx) or stored in memory. The compiler takes care of this. Your question is analogous to asking if it's okay to make an array of `int`s (there are also 16 scalar registers, just like 16 YMM registers). In memory you can think of `__m256i` as `typedef union __m256i { int8_t m256_i8[32]; int16_t m256_i16[16]; int32_t m256_i32[8]; } __m256i;` if you like. – Z boson May 12 '15 at 06:21
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/77582/discussion-between-martinsos-and-z-boson). – Martinsos May 12 '15 at 08:08

1 Answer


It's better to structure your code to do everything it can with a value before moving on. Small buffers that fit in L1 cache aren't going to hurt performance too badly, but don't fall back on buffering unless you need to.
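For example, fuse two passes so the intermediate value never leaves a register, instead of storing a whole column of intermediates to memory and re-reading it (a sketch; the adds stand in for whatever your real passes compute):

#include <immintrin.h>

// Unfused, this would be two loops with a __m256i tmp[] array between
// them; fused, the intermediate t stays in a register.
void fusedPasses(const __m256i* a, const __m256i* b, const __m256i* c,
                 __m256i* out, int n) {
  for (int i = 0; i < n; ++i) {
    __m256i t = _mm256_add_epi32(a[i], b[i]);  // would-be pass 1
    out[i]    = _mm256_add_epi32(t, c[i]);     // would-be pass 2
  }
}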

I think it's more typical to write your code with buffers of int [] type, rather than __m256i type, but I'm not sure. Either way works, and should get the compiler to generate efficient code. But the int [] way means less of the code has to differ between the SSE, AVX2, and AVX-512 versions. And having your data in an array with a type the debugger can format nicely might make things easier to examine.
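Something along these lines, where the buffers are plain int arrays and the vector type only appears inside the loop (a sketch, assuming 32-bit elements and n a multiple of 8):

#include <immintrin.h>

void addColumns(const int* prev, const int* other, int* curr, int n) {
  for (int i = 0; i < n; i += 8) {   // 8 ints per __m256i
    __m256i p = _mm256_loadu_si256((const __m256i*)&prev[i]);
    __m256i o = _mm256_loadu_si256((const __m256i*)&other[i]);
    _mm256_storeu_si256((__m256i*)&curr[i], _mm256_add_epi32(p, o));
  }
}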

As I understand it, the load/store intrinsics are partly there as a cast between __m256i and int [], since AVX doesn't fault on unaligned accesses, just slows down on cache-line boundary crossings. Assigning to / from an array of __m256i should work fine, and generate load/store instructions where needed, or else vector instructions with memory source operands (for more compact code and fewer fused-domain uops).
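i.e. plain assignment like this works, and the compiler can fold a load into a memory source operand instead of emitting a separate vmovdqa (a sketch):

#include <immintrin.h>

__m256i column[4];

void viaAssignment(__m256i v) {
  column[0] = v;          // compiles to a vmovdqa store
  __m256i x = column[1];  // compiles to a vmovdqa load
  // The load of column[2] can fold into the add:
  //   vpaddd ymm0, ymm1, [column+64]
  column[3] = _mm256_add_epi32(x, column[2]);
}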

Peter Cordes
  • Peter, when you say "AVX doesn't fault on unaligned", do you mean an aligned load instruction (vmovdqa) can also handle unaligned addresses? The Intel Software Developer's Manual says: "When the source or destination operand is a memory operand, the operand must be aligned on a 32-byte boundary or a general-protection exception (#GP) will be generated. To move integer data to and from unaligned memory locations, use the VMOVDQU instruction." However, I observed this effect in my code (aligned load instruction handling unaligned addresses fine), but didn't dare to exploit it. Any reference? – Ralf Jun 26 '15 at 14:41
  • I found a reference on my question in an Intel document on assembly language (by Kreitzer and Domeika): "Finally, for most instructions Intel® AVX lifts the restriction that vector loads and stores be aligned. Explicit “move aligned” instructions such as VMOVDQA still require addresses to be aligned on vector size boundaries. But other vector loads and stores can be unaligned." – Ralf Jun 26 '15 at 15:14
  • ... which leads to the following question: When the C compiler generates AVX instructions from intrinsics, it can probably take an explicit aligned load intrinsic like _mm256_load_si256() and turn this into a machine instruction (like an addition) directly using the address. This would entail the surprising behavior that suddenly data can be loaded from unaligned addresses. Is there any way to control this behavior (influence when the compiler generates vmovdqa and when it generates address arguments in subsequent instructions)? – Ralf Jun 26 '15 at 15:24
  • @Ralf: yup, folding `mm_load` intrinsics into memory operands for other insns leads to surprises, as you correctly guessed. e.g. http://stackoverflow.com/questions/30329235/avx-data-alignment-store-crash-storeu-load-loadu-doesnt. The solution is to always use the `loadu` intrinsic, so if the compiler wants to use a standalone load, it will use `movdqu` / `movups`. (Which has identical performance to `movdqa` when the address IS aligned, on CPUs new enough to support AVX.) – Peter Cordes Jun 27 '15 at 16:18