I use C and I want to apply some AVX2 code on 4 doubles. The operations are like these (per double):

  1. Access the "second 4 bytes" of the double as an int32 (something like this: `((union { double a; int32_t b[2]; }) {.a = XXX}).b[1]`, where `XXX` is the input double)
  2. Subtract an int32 constant c from our int32
  3. Cast the int32 to a double
  4. Multiply the double with some number z
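
For reference, the four per-element steps can be written as a scalar function. This is only a minimal sketch, not code from the question: the name `step_scalar` is hypothetical, and it assumes 64-bit IEEE 754 doubles on a little-endian target, so that `b[1]` in the union is the upper 32 bits of the double. It uses `memcpy` instead of the union, which is another well-defined way to reinterpret the bits in C.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar reference for the four steps above.
   Assumption: little-endian target with 64-bit IEEE 754 doubles,
   so the "second 4 bytes" are the upper 32 bits of the double. */
static double step_scalar(double in, int32_t c, double z)
{
    uint64_t bits;
    memcpy(&bits, &in, sizeof bits);            /* reinterpret the double's bits */

    uint32_t hi = (uint32_t)(bits >> 32);       /* step 1: upper 32 bits */

    /* step 2: subtract with wraparound, like a 32-bit SIMD subtract would;
       the conversion back to int32_t is implementation-defined for large
       values but behaves as two's complement on common compilers */
    int32_t diff = (int32_t)(hi - (uint32_t)c);

    return (double)diff * z;                    /* steps 3 and 4 */
}
```

For example, `1.0` has the upper word `0x3FF00000`, so `step_scalar(1.0, 0, 1.0)` yields 1072693248.0. A function like this is mostly useful as a baseline to check a vectorized version against.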

I've tried to implement that by doing this:

  1. Read input vector in (unaligned doubles) and cast it to an int32 vector
  2. Load the constant c to a vector
  3. Subtract c from the int32 vector
  4. Convert int32 values to double: I was not able to do that :(
  5. Multiply doubles: Not done, but that should be trivial

My current code is roughly this:

    __m256i x = _mm256_castpd_si256(_mm256_loadu_pd(in)); // reinterpret the bits; no conversion happens
    const __m256i c = _mm256_set1_epi32(1234);
    __m256i y = _mm256_sub_epi32(x, c); // we only care about every second value of our array; maybe that can be made more efficient?

    // tried to shuffle values so that the important int32 values are at the beginning. Maybe then casting can be done? 
    //__m256i z = _mm256_shuffle_epi32(y, _MM256_SHUFFLE(0, 2, 4, 6, 1, 3, 5, 7));

Maybe someone has an idea how I can convert the four int32 values to four double values? Also, if you know a magic instruction that can improve other parts, please let me know.
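
One possible way to complete step 4 is sketched here, under stated assumptions rather than as a definitive implementation. `_mm256_cvtepi32_pd` takes a `__m128i` with four int32 values and returns four doubles, so the four upper halves first have to be packed into the low 128 bits; `_mm256_permutevar8x32_epi32` (`vpermd`) can do that across lanes. The function name `step_avx2` and its parameters are hypothetical, a little-endian x86 target is assumed, and the `target("avx2")` attribute is a GCC/Clang extension used so the function builds without `-mavx2`.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: out[i] = (double)(upper32(in[i]) - c) * z for 4 doubles.
   Assumption: GCC or Clang on x86-64; the CPU must support AVX2 at run time. */
static __attribute__((target("avx2")))
void step_avx2(const double *in, double *out, int32_t c, double z)
{
    /* Load 4 unaligned doubles and reinterpret their bits as 8 x int32. */
    __m256i x = _mm256_castpd_si256(_mm256_loadu_pd(in));

    /* Move the odd 32-bit elements (1, 3, 5, 7), i.e. the upper halves of
       the doubles, into the low 128 bits. The upper four indices are unused. */
    const __m256i idx = _mm256_setr_epi32(1, 3, 5, 7, 0, 0, 0, 0);
    __m128i hi = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(x, idx));

    /* Subtract the constant on just the four values that matter. */
    hi = _mm_sub_epi32(hi, _mm_set1_epi32(c));

    /* Convert 4 x int32 to 4 x double, then scale by z. */
    __m256d d = _mm256_cvtepi32_pd(hi);
    _mm256_storeu_pd(out, _mm256_mul_pd(d, _mm256_set1_pd(z)));
}
```

Doing the subtraction after the shuffle means it runs on a 128-bit vector with no wasted lanes. Whether this is faster than other orderings depends on the target CPU and would need measuring.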

Thanks a lot

Kevin Meier
  • I think you want [`_mm256_cvtepi32_pd`](https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-vector-extensions/intrinsics-for-conversion-operations-2/mm256-cvtepi32-pd.html) – Nate Eldredge Mar 05 '23 at 14:23
  • "Access the "second 4 bytes" of the double as an int32" is ok if done via a union, as first shown, but it produces undefined behavior if done via casting a `double *` to an integer pointer, as you seem to be saying you actually do. – John Bollinger Mar 05 '23 at 14:39
  • Why do you want to reinterpret the low four bytes as an `int32` rather than as an unsigned integer? Or is it the high four bytes you want? – Eric Postpischil Mar 05 '23 at 14:44
  • If it is the low four bytes you want, you might consider setting the high four bytes to 0x43300000 (the high bytes of the `double` value 2^52, which makes the ULP 1). If the low 32 bits are the unsigned integer `x`, setting the high bits to that forms the `double` value 2^52+`x`. Then do a `double` subtract of the `double` made with 0x43300000 in the high bytes and `c` in the low bytes, which forms the value 2^52+`c`. The result of that subtraction is (2^52+`x`)−(2^52+`c`) = `x`−`c`. But that takes `x` and `c` as unsigned, not signed (although the result is signed). – Eric Postpischil Mar 05 '23 at 14:56
  • Are `c` and `z` known at compile time? Or at least the same for all values? (If your input values are always positive, it sounds like you want to calculate a `log` approximation.) – chtz Mar 05 '23 at 18:08
  • @chtz yes, yes, and yes. I did not mention the approximation itself, because it's not really relevant. – Kevin Meier Mar 05 '23 at 18:10
  • @JohnBollinger: `_mm256_loadu_pd` and `_mm256_loadu_si256` [are strict-aliasing safe](https://stackoverflow.com/q/52112605/224132), despite the `pd` intrinsic taking a `double*`. But I don't see any mention of pointer-casting in the question, only reinterpreting vectors like `_mm256_castpd_si256`, to set up for `_mm256_srli_epi64(v, 32)`. (Or to get the 4x `int32_t` elements into the low 128 bits, where `_mm256_cvtepi32_pd(__m128i)` needs them, probably `_mm256_permutevar8x32_epi32` (vpermd) or `_mm256_permutevar8x32_ps` (vpermps).) – Peter Cordes Mar 05 '23 at 20:25
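
Two of the suggestions in these comments, the 64-bit right shift by 32 and the 0x43300000 trick, can be combined so that neither a lane-crossing shuffle nor an int-to-double conversion is needed. The following is a rough sketch of that idea, not something posted in the thread. It assumes non-negative inputs (sign bit clear), so the upper 32 bits read the same as signed or unsigned, and it assumes `x`−`c` does not need 32-bit wraparound. The name `step_avx2_magic` is hypothetical, and `target("avx2")` is a GCC/Clang extension.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: out[i] = (double)(upper32(in[i]) - c) * z for 4 doubles.
   Assumptions: inputs are non-negative, GCC or Clang on x86-64, and the CPU
   supports AVX2 at run time. */
static __attribute__((target("avx2")))
void step_avx2_magic(const double *in, double *out, int32_t c, double z)
{
    __m256i x = _mm256_castpd_si256(_mm256_loadu_pd(in));

    /* Shift each 64-bit element right by 32: the upper half of every double
       lands in the low 32 bits, and the high 32 bits become zero. */
    __m256i hi = _mm256_srli_epi64(x, 32);

    /* OR in the high word 0x43300000: each element now holds the bit
       pattern of the double value 2^52 + hi. */
    __m256i magic = _mm256_set1_epi64x(0x4330000000000000LL);
    __m256d biased = _mm256_castsi256_pd(_mm256_or_si256(hi, magic));

    /* Subtract 2^52 + c as a double; both values are exact, so the
       difference is exactly hi - c. */
    __m256d bias = _mm256_set1_pd(4503599627370496.0 + (double)c);
    __m256d d = _mm256_sub_pd(biased, bias);

    _mm256_storeu_pd(out, _mm256_mul_pd(d, _mm256_set1_pd(z)));
}
```

Since `c` and `z` are the same for all values, the `bias` and `z` vectors are loop-invariant and can be hoisted out of any surrounding loop. Whether this beats a `vpermd` plus `_mm256_cvtepi32_pd` sequence would have to be benchmarked on the target CPU.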

0 Answers