I read Intel's documentation for _mm_blendv_ps but couldn't quite understand what the function really does, so I wrote the following code:

    __m128 a = { 18.0,4.0,19.0,21.0 };
    __m128 b = { 67.0,92.0,888.0,47.0 };
    __m128 mask = { 1.0,0.0,0.0,1.0 };

    __m128 result = _mm_blendv_ps(a, b, mask);
    cout << "Result is: " << result[0] << " " << result[1] << " " << result[2] << " " << result[3] << endl;

But I get the error "No operator [] matches these operands". Why can't I access result? Isn't result a vector of 32-bit floats?

How can I access its elements? Also, what will this print (what does blendv do)?

Peter Cordes
Coffee
  • Is it a subscriptable vector? Consider [this answer](https://stackoverflow.com/questions/12624466/get-member-of-m128-by-index). – tadman Jan 15 '23 at 23:35
  • See [print a \_\_m128i variable](https://stackoverflow.com/a/46752535) for the printing part, except of course using a `alignas(16) float tmp[4]` array instead of `uint32_t [4]`. – Peter Cordes Jan 16 '23 at 03:38
  • Welcome to SO! In the future please do not mix multiple questions into one -- read the [tour] and [ask]. – chtz Jan 16 '23 at 09:59

1 Answer

Blendv uses the most significant bit (the sign bit) of each 32-bit mask element to select, lane by lane, between the two inputs. It's equivalent to this code:

    // Pseudocode: [i] indexes the 32-bit lanes, and the & is applied to the
    // raw bit pattern of mask[i], not to its float value.
    __m128 _mm_blendv_ps(__m128 false_result, __m128 true_result, __m128 mask) {
       __m128 r;
       r[0] = (mask[0] & 0x80000000) ? true_result[0] : false_result[0];
       r[1] = (mask[1] & 0x80000000) ? true_result[1] : false_result[1];
       r[2] = (mask[2] & 0x80000000) ? true_result[2] : false_result[2];
       r[3] = (mask[3] & 0x80000000) ? true_result[3] : false_result[3];
       return r;
    }
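
To make the selection rule concrete, here is a minimal scalar sketch of the same logic that compiles without any SSE support. The name `blendv_model` and the use of `std::array` are illustrative assumptions, not part of the intrinsics API; it only models which input each lane comes from.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical scalar model of _mm_blendv_ps (illustration only).
// For each lane, the sign bit of the mask picks true_result or false_result.
std::array<float, 4> blendv_model(const std::array<float, 4>& false_result,
                                  const std::array<float, 4>& true_result,
                                  const std::array<float, 4>& mask) {
    std::array<float, 4> r{};
    for (int i = 0; i < 4; ++i) {
        std::uint32_t bits;
        // Read the raw bit pattern of the float; memcpy avoids aliasing problems.
        std::memcpy(&bits, &mask[i], sizeof bits);
        r[i] = (bits & 0x80000000u) ? true_result[i] : false_result[i];
    }
    return r;
}
```

With the vectors from the question, `blendv_model(a, b, {1.0f, 0.0f, 0.0f, 1.0f})` returns `a` unchanged (18, 4, 19, 21), because none of those mask values has its sign bit set.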

I actually tend to wrap this, because the argument order differs from the usual if (cmp) { true } else { false } pattern:

    __m128 select(__m128 mask, __m128 true_result, __m128 false_result) {
       return _mm_blendv_ps(false_result, true_result, mask);
    }

Typically you would use this to perform if (a < b) {} else {} type operations, e.g.

    // if (a < b) {return true_result;} else {return false_result;}
    __m128 select_if_lt(__m128 a, __m128 b, __m128 true_result, __m128 false_result) {
       return select(_mm_cmplt_ps(a, b), true_result, false_result);
    }

    // if (a >= b) {return true_result;} else {return false_result;}
    __m128 select_if_ge(__m128 a, __m128 b, __m128 true_result, __m128 false_result) {
       return select(_mm_cmpge_ps(a, b), true_result, false_result);
    }
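
As a worked example of the compare-then-select pattern, here is a rough sketch of a lane-wise minimum. The names `select_lanes` and `lane_min` are hypothetical. The `#if` fallback is an assumption added so the sketch also builds when SSE4.1 is not enabled; it is only equivalent to blendv when every mask lane is all-ones or all-zeros, which is what the SSE compare intrinsics produce.

```cpp
#include <immintrin.h>

// Hypothetical wrapper mirroring the select() idea above.
inline __m128 select_lanes(__m128 mask, __m128 true_result, __m128 false_result) {
#if defined(__SSE4_1__) || defined(__AVX__)
    // SSE4.1 path: blendv looks only at the sign bit of each mask lane.
    return _mm_blendv_ps(false_result, true_result, mask);
#else
    // SSE fallback (assumption): correct only for all-ones / all-zeros lanes.
    return _mm_or_ps(_mm_and_ps(mask, true_result),
                     _mm_andnot_ps(mask, false_result));
#endif
}

// if (a < b) take a, else take b: a lane-wise minimum.
inline __m128 lane_min(__m128 a, __m128 b) {
    return select_lanes(_mm_cmplt_ps(a, b), a, b);
}
```

In real code `_mm_min_ps` does this directly; the minimum is used here only because the result is easy to check by hand.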

In the code you posted above:

    __m128 mask = { 1.0,0.0,0.0,1.0 };

The highest bit (the sign bit) of 1.0 is actually zero, so this mask selects nothing from b. You'd want a negative number in there to make the mask work, e.g.

    // it doesn't matter which negative number you use, 
    // it just requires the sign bit to be set. -0.0f works!
    __m128 mask = { -0.0f,0.0,0.0,-0.0f };
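
One way to check which lanes a mask will select is `_mm_movemask_ps`, which packs the four sign bits into the low bits of an int. The sketch below assumes SSE2 is available; `sign_bits` and `make_lane_mask` are hypothetical helper names. Building the mask from integer all-ones lanes avoids relying on how the compiler treats `-0.0f`.

```cpp
#include <emmintrin.h>  // SSE2

// Bit i of the result is the sign bit of lane i: the bits blendv looks at.
inline int sign_bits(__m128 v) {
    return _mm_movemask_ps(v);
}

// Hypothetical helper: all-ones lanes where the flag is true, zero elsewhere.
inline __m128 make_lane_mask(bool l0, bool l1, bool l2, bool l3) {
    return _mm_castsi128_ps(_mm_setr_epi32(l0 ? -1 : 0, l1 ? -1 : 0,
                                           l2 ? -1 : 0, l3 ? -1 : 0));
}
```

For the original mask, `sign_bits(_mm_setr_ps(1.0f, 0.0f, 0.0f, 1.0f))` is 0, so no lane is taken from the second operand.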

The nice thing about looking only at the sign bit is that you are able to perform certain if/else operations without needing to use a comparison instruction, e.g.

    // if (a < 0) {return true_result;} else {return false_result;}
    __m128 select_if_negative(__m128 a, __m128 true_result, __m128 false_result) {
        return select(a, true_result, false_result);
    }

Beware, though, that you will get a false positive for -0.0f, which may or may not be important to you.
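
The difference can be seen by comparing the sign-bit mask with a real comparison against zero. This is a small sketch with hypothetical helper names; it uses `_mm_movemask_ps` only to make the per-lane decisions visible as an integer.

```cpp
#include <xmmintrin.h>  // SSE

// Lanes treated as "negative" when the value itself is used as the blend mask.
inline int negative_by_sign_bit(__m128 a) {
    return _mm_movemask_ps(a);
}

// Lanes that are strictly less than zero according to an actual compare.
inline int negative_by_compare(__m128 a) {
    return _mm_movemask_ps(_mm_cmplt_ps(a, _mm_setzero_ps()));
}
```

For the input (-1.0f, -0.0f, 0.0f, 2.0f), the sign-bit version flags lanes 0 and 1, while the comparison flags only lane 0. If -0.0f matters, passing `_mm_cmplt_ps(a, _mm_setzero_ps())` as the mask gives the strict behaviour at the cost of one compare.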

As for accessing the contents of an __m128, indexing it directly typically isn't cross platform (some compilers overload the array operator, some specify .x/.y etc., some have internal union member variables). So, if you want to access the contents portably, you have two options:

  1. As correctly pointed out by Peter, don't use _mm_extract_ps; use _mm_cvtss_f32 with a shuffle:

        std::ostream& operator<<(std::ostream& os, const __m128& v) {
            os << "("
               << _mm_cvtss_f32(v) << ", "                                              // element 0
               << _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1))) << ", "  // element 1
               << _mm_cvtss_f32(_mm_unpackhi_ps(v, v)) << ", "                          // element 2
               << _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3))) << ")";  // element 3
            return os;
        }
  2. Use _mm_storeu_ps to copy the vector into a float array:

        std::ostream& operator<<(std::ostream& os, const __m128& v) {
            float f[4];
            _mm_storeu_ps(f, v);  // unaligned store, so f needs no special alignment
            os << "("
               << f[0] << ", "
               << f[1] << ", "
               << f[2] << ", "
               << f[3] << ")";
            return os;
        }
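
If you would rather keep `result[i]`-style indexing in the calling code, the store approach can be wrapped in a small helper that returns a `std::array`. This is only a sketch; `to_array` is a hypothetical name, and `_mm_setr_ps` is used in the usage comment because brace-initializing an `__m128` is not something the intrinsics API itself defines.

```cpp
#include <array>
#include <xmmintrin.h>  // SSE

// Hypothetical helper: copy the four lanes into an indexable std::array.
inline std::array<float, 4> to_array(__m128 v) {
    std::array<float, 4> out;
    _mm_storeu_ps(out.data(), v);  // unaligned store works for any float buffer
    return out;
}

// Usage sketch:
//   __m128 a = _mm_setr_ps(18.0f, 4.0f, 19.0f, 21.0f);  // lane 0 first
//   auto r = to_array(a);
//   std::cout << r[0] << " " << r[1] << " " << r[2] << " " << r[3] << "\n";
```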

However you do it though, accessing elements of the XMM register will always incur a cost (well, apart from [0]), so the general rule is to try to avoid doing this as much as possible!

robthebloke
  • `_mm_extract_ps` is not the intrinsic you want; it returns the FP bit-pattern as an `int`, because it's the intrinsic for `extractps r/m32, xmm, imm`. See [Intel SSE: Why does \`\_mm\_extract\_ps\` return \`int\` instead of \`float\`?](https://stackoverflow.com/q/5526658) - you want `_mm_cvtss_f32(_mm_shuffle_ps(b, b, _MM_SHUFFLE(0, 0, 0, 2)));` to get element `2`. Or use `_mm_storeu_ps` and let the compiler optimize it into shuffles. – Peter Cordes Jan 17 '23 at 01:24
  • Ooops. Brain fog (shows how often I access vector elements!). Updated post (although unpackhi may be a better option here? Saves on a byte in the resulting machine code, and has the same latency/throughput on most CPU's afaik?). – robthebloke Jan 17 '23 at 02:10
  • Yes, correct. `movhlps` also saves a byte, and if we have a dead register to move into, can avoid a `movaps` register copy as well. `clang` optimizes shuffles for us so there's no point bothering with this, but for other compilers yeah using intrinsics for optimal asm shuffles is good. `movhlps` is also faster than `unpckhps` on really old CPUs with only 64-bit shuffle units (K8, and first-gen Core 2, Conroe), as fast as `movhpd` but smaller. [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/q/6996764) shows using it. – Peter Cordes Jan 17 '23 at 02:16
  • @robthebloke In the first code block, did you mean to use `mask` instead of `cmp` (`cmp` is not declared, while `mask` is unused)? – njuffa Jan 17 '23 at 02:17
  • This was for printing with `cout`, so I didn't bother worrying about performance; given that you might just optimize for code-size and go with `pshufd` to copy-and-shuffle. Actually, for `cout` specifically, where each output is a separate function call, it makes *much* more sense to save to an array, since x86-64 System V has no call-preserved XMM registers; it would have to reload the vector between calls to `ostream::operator<<` if it wanted to actually shuffle. Windows x64 does have some call-preserved XMM, too many in fact. – Peter Cordes Jan 17 '23 at 02:19