6

What is the "correct" (i.e., portable) way in LLVM to load data from memory into a SIMD vector?

Looking at the typical IR generated by LLVM's auto-vectorizer for an x86 target, it seems like the pattern is:

  • bitcast a pointer to the scalar type (e.g., double *) to the corresponding vector type (e.g., <4 x double>*),
  • load from the converted pointer while taking into account alignment considerations (i.e., don't use the natural alignment of the vector type, but the alignment of the corresponding scalar type).

In the case of AVX, this pattern maps nicely to SIMD intrinsics such as _mm256_loadu_pd() and friends. However, I have no idea if this strategy would also be correct for other ISAs (e.g., Neon, AltiVec).

I haven't been able to find info on the topic in the LLVM docs. Am I missing something obvious?

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
bluescarni
  • 3,937
  • 1
  • 22
  • 33
  • I think it depends on hardware: https://stackoverflow.com/a/45938112/126995 Not sure compilers are aware of these nuances. – Soonts Jul 26 '20 at 14:31

1 Answers1

1

Having spent some more time thinking about this, I believe that a portable solution may be the following:

  • load the scalar values one by one from memory in the usual (non-SIMD) way,
  • immediately build a vector with repeated insertelement instructions.

Similarly, in order to store the values in a SIMD vector to a memory location, extract the vector elements as scalars via the extractelement instruction and store them one by one.

In my experiments, the LLVM optimizer was always successful in recognising these patterns and fusing them into direct SIMD load/store instructions.

However, this strategy also results in a noticeable bloat in the size of the generated IR and subsequent degradation in compilation times. Hence, for the time being I'll stick to the direct bitcasting approach and perhaps implement this other approach as a fallback if the bitcasting method fails on specific setups.

bluescarni
  • 3,937
  • 1
  • 22
  • 33
  • I was thinking of looking into the CreateMaskedLoad() instruction in the IRBuilder for this purpose. Any idea if this would be worth looking into? Is this what the masked load/store instruction is meant for, or are they meant for something else? – Sanket_Diwale Jul 14 '21 at 23:05