
I recently followed a blog post on finding the smallest element of an array using 128-bit vector instructions. That version ran fine, until I decided to write the same thing for the 256-bit instruction set.

The code is as follows:

#include<iostream>
#include<random>
//#include <Eigen/Dense>
#include <immintrin.h>
#include <cstdlib>
#include <vector>


float min128_sse(float *a, int n) {
    float res; 
    
    __m128 *simdVector = (__m128*) a;
    __m128 maxval = _mm_set1_ps(UINT32_MAX);

    for (int i = 0; i < n / 4; i++) {
        maxval = _mm_min_ps(maxval, simdVector[i]);
    }

    
    maxval = _mm_min_ps(maxval, _mm_shuffle_ps(maxval, maxval, 0x93));
    

    _mm_store_ss(&res, maxval);

    return res;
}


float min256_sse(float *a, int n) {
    float res;

    __m256* simdVector = (__m256*) a;

    __m256 minVal = _mm256_set1_ps(UINT32_MAX);

    for (int i = 0; i < n / 8; i++) {
        minVal = _mm256_min_ps(minVal, simdVector[i]);
    }


    minVal = _mm256_min_ps(minVal, _mm256_shuffle_ps(minVal, minVal, 0x93));
    
   res = minVal[0];

   std::cout<<res<<std::endl;

    return res;
}


int main()
{

std::vector<float> givenVector{1.0, 2.0, 3.0, 4.0, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, -1, -2, -3, -4, -5, -6, -7, -8};

std::cout<<min128_sse(givenVector.data(), givenVector.size())<<std::endl;
std::cout<<min256_sse(givenVector.data(), givenVector.size())<<std::endl;

}

The code hits a segmentation fault on the following line:

minVal = _mm256_min_ps(minVal, simdVector[i])

From my basic understanding of SIMD instructions, _mm256_min_ps operates on 256 bits at once, as opposed to 128 bits for _mm_min_ps. Since the 128-bit version does not hit a segmentation fault, I expected the 256-bit version not to hit one either; the only change I thought was required was the bound of the for loop.
However, the segmentation fault occurs even at i=0. I suppose there is a gap in my understanding of SIMD instructions. Can someone please point it out?


Can someone also explain why the segmentation fault occurs?

TIA

Peter Cordes
Atharva Dubey
  • Thank you for replying, @Someprogrammerdude. I must admit I came across `__m128*` for the first time and simply extrapolated it to `__m256*`. However, you pointed this out for `__m256` and not `__m128`. Does it not give undefined behavior in the latter, and if so, why? – Atharva Dubey Oct 12 '21 at 07:25
  • @Someprogrammerdude: Nope. `__m256` is a `may_alias` type; you can point a `__m256*` at *anything*, including `int32_t`. [Is \`reinterpret\_cast\`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?](https://stackoverflow.com/q/52112605). But normally you'd use `_mm256_load_ps()` which takes a `float*` arg. – Peter Cordes Oct 12 '21 at 07:26
  • Of course 256-bit `load` requires 32-byte alignment, since you didn't use `loadu`. Also, it takes more than one shuffle to find the min horizontally in one `__m256` vector. See [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/a/35270026) / [How to sum \_\_m256 horizontally?](https://stackoverflow.com/q/13219146) for efficient shuffles. (Use `min` instead of `add`) – Peter Cordes Oct 12 '21 at 07:27
  • `__m128 maxval = _mm_set1_ps(UINT32_MAX);` doesn't make much sense as an initializer for this. `(float)0xFFFFFFFF` rounds up by one to `4294967296.0f`, which is much smaller than `FLT_MAX` or `std::numeric_limits<float>::max()`. Also, your var name is backwards; you're finding the min, not max. You're also assuming that `n` is a multiple of 4 (or 8), or that your array has padding with `FLT_MAX` or some other safe element up to a multiple of the vector width. – Peter Cordes Oct 12 '21 at 07:32
  • I am aware of that; it is just that during testing I knew the values would not be greater than `UINT32_MAX`. I could not remember what is defined as the maximum value of float, so I went with `UINT_MAX` as it was off the top of my head. I was working on finding the maximum first and then changed it to find the min. – Atharva Dubey Oct 12 '21 at 07:35
  • Live demo showing that `__m256` requires stricter alignment than what `vector` provides: https://godbolt.org/z/a17M8KT4d. Note that with `__m128` you are fine. – Daniel Langr Oct 12 '21 at 07:42
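
Pulling the comments above together, here is a minimal sketch of a 256-bit version that should not depend on the alignment of the input. It is not from the original post: the name `min256_avx` is hypothetical, and the `__attribute__((target("avx")))` line assumes GCC or Clang (with another compiler, drop it and enable AVX through the compiler flags instead). It loads with `_mm256_loadu_ps`, starts from `std::numeric_limits<float>::max()`, reduces the eight lanes with more than one shuffle, and finishes leftover elements with a scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <immintrin.h>
#include <limits>

// Hypothetical helper, not from the original post.
// Assumption: GCC or Clang. The target attribute enables AVX for this one
// function; the CPU must still support AVX at run time.
__attribute__((target("avx")))
float min256_avx(const float* a, std::size_t n) {
    assert(a != nullptr || n == 0);

    // Start from the largest finite float so any real element replaces it.
    // For n == 0 this value is returned unchanged.
    __m256 vmin = _mm256_set1_ps(std::numeric_limits<float>::max());

    // Unaligned loads: no 32-byte alignment requirement on `a`.
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        vmin = _mm256_min_ps(vmin, _mm256_loadu_ps(a + i));
    }

    // Horizontal reduction: 8 lanes -> 4 -> 2 -> 1.
    __m128 lo = _mm256_castps256_ps128(vmin);       // lanes 0..3
    __m128 hi = _mm256_extractf128_ps(vmin, 1);     // lanes 4..7
    __m128 m  = _mm_min_ps(lo, hi);                 // 4 candidates
    m = _mm_min_ps(m, _mm_movehl_ps(m, m));         // lanes 0,1 hold 2 candidates
    m = _mm_min_ss(m, _mm_shuffle_ps(m, m, 0x55));  // lane 0 holds the min
    float res = _mm_cvtss_f32(m);

    // Scalar tail for the last n % 8 elements.
    for (; i < n; ++i) {
        if (a[i] < res) res = a[i];
    }
    return res;
}
```

With the 24-element vector from the question, `min256_avx(givenVector.data(), givenVector.size())` should return -8. If the data can be guaranteed to be 32-byte aligned (for example through an aligned allocator), `_mm256_load_ps` could be used in place of `_mm256_loadu_ps`.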
