Intel Intrinsic: Load interleaved float32

Question

My RAM contains the following interleaved data: float32 Real1, float32 Imag1, float32 Real2, float32 Imag2, ... , float32 Real4, float32 Imag4

I need to load Real1, Real2, Real3, Real4 into one `__m128` and Imag1, Imag2, Imag3, Imag4 into another `__m128`.

Is it possible?

Not very efficiently, e.g. you'd have to use `shufps` (`_mm_shuffle_ps`) twice to get the output you want from two input vectors. If possible, store your data deinterleaved, e.g. an array of real parts and an array of imaginary parts, not one array of `complex float` elements. See [Complex Mul and Div using sse Instructions](https://stackoverflow.com/q/3211346) for an example of the shuffling required for complex-multiply of a single vector to get the output back into that inconvenient format. — Peter Cordes, Apr 12 '23 at 06:00
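The two-shuffle approach described in this comment can be sketched roughly as follows. This is a minimal sketch, not code from the thread: the helper name `deinterleave4` is hypothetical, and it assumes `src` points at 8 readable floats laid out as Real1, Imag1, ..., Real4, Imag4 with no particular alignment.

```c
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_shuffle_ps */

/* Hypothetical helper (name is an assumption, not from the thread):
 * splits 4 interleaved complex floats into a vector of real parts
 * and a vector of imaginary parts using two loads and two shufps. */
static inline void deinterleave4(const float *src, __m128 *re, __m128 *im)
{
    __m128 lo = _mm_loadu_ps(src);      /* Real1 Imag1 Real2 Imag2 */
    __m128 hi = _mm_loadu_ps(src + 4);  /* Real3 Imag3 Real4 Imag4 */

    /* Take elements 0 and 2 of lo, then elements 0 and 2 of hi. */
    *re = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0));
    /* Take elements 1 and 3 of lo, then elements 1 and 3 of hi. */
    *im = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3, 1, 3, 1));
}
```

The unaligned load `_mm_loadu_ps` is used here only because the sketch makes no alignment assumption; with a 16-byte-aligned buffer, `_mm_load_ps` would also work.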
Hi Peter. Thank you very much. If I have to run an FIR filter (for example) on float32 complex data, will it be more efficient to store the input in 2 separate buffers, one real and one imaginary? Am I right? — Zvi Vered, Apr 12 '23 at 06:29
Yes, almost anything you want to do with a bunch of complex float data is more efficient with separate real[] and imag[] arrays, except maybe scattered access to only a small amount of your total data, where you'd actually benefit from the spatial locality of keeping them together. It's the same as with x[], y[], z[] geometry vectors instead of `struct{x,y,z}[]`, as discussed in https://deplinenoise.wordpress.com/2015/03/06/slides-simd-at-insomniac-games-gdc-2015/ (slides). In general, prefer SoA (struct of arrays) over AoS (array of structs). — Peter Cordes, Apr 12 '23 at 06:33
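To illustrate why the separate real[] and imag[] layout is convenient, here is a minimal sketch of an element-wise complex multiply on SoA data, the kind of operation an FIR inner loop is built from. The function name `cmul_soa` and its parameters are hypothetical, and the sketch assumes `n` is a multiple of 4. Note that no shuffles are needed: every load already yields 4 real parts or 4 imaginary parts.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: loads, stores, mul/add/sub */

/* Hypothetical helper (name and signature are assumptions):
 * out[k] = a[k] * b[k] for complex values stored as separate
 * real and imaginary arrays. Assumes n is a multiple of 4. */
static void cmul_soa(const float *ar, const float *ai,
                     const float *br, const float *bi,
                     float *outr, float *outi, size_t n)
{
    for (size_t k = 0; k < n; k += 4) {
        __m128 xr = _mm_loadu_ps(ar + k);
        __m128 xi = _mm_loadu_ps(ai + k);
        __m128 yr = _mm_loadu_ps(br + k);
        __m128 yi = _mm_loadu_ps(bi + k);

        /* (xr + i*xi) * (yr + i*yi) = (xr*yr - xi*yi) + i*(xr*yi + xi*yr) */
        __m128 re = _mm_sub_ps(_mm_mul_ps(xr, yr), _mm_mul_ps(xi, yi));
        __m128 im = _mm_add_ps(_mm_mul_ps(xr, yi), _mm_mul_ps(xi, yr));

        _mm_storeu_ps(outr + k, re);
        _mm_storeu_ps(outi + k, im);
    }
}
```

With interleaved input, the same loop would need extra shuffles on the way in and again on the way out to restore the interleaved format.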
I just finished a signal-processing project on ARM (Cortex-A53). That architecture has intrinsics for vectors up to 128 bits wide, including ones that load interleaved data. I wonder why Intel has no such mnemonics. — Zvi Vered, Apr 12 '23 at 07:49
I've read that ARM `ld2` loads for interleaved data aren't as efficient, at least on some cores. But also, one uop on Intel microarchitectures can only ever write a single register, vs. ARM having multiple instructions that write 2 registers (like some shuffles, load-pair, and ld2). Since Intel designed their vector instruction sets after the P6 microarchitecture (which their current CPUs are still somewhat based on), it makes sense they'd mostly introduce instructions that can run as a single uop, thus no two-register loads. — Peter Cordes, Apr 12 '23 at 08:04
Just wanted to note that for questions like this, I think the best option is usually to just look at what clang does with `__builtin_shufflevector`. For example: https://godbolt.org/z/cG6o9z67x. You can see that the compiler generates shufps instructions (and the values it uses), then if you need to you can head over to the [Intel Intrinsics Guide](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html) and search for the instruction to figure out the intrinsic name. — nemequ, Apr 12 '23 at 22:21
