Blendv uses the highest bit (the sign bit) of each mask element to select between two results. It's equivalent to this code:
__m128 _mm_blendv_ps(__m128 false_result, __m128 true_result, __m128 mask) {
__m128 r;
r[0] = (mask[0] & 0x80000000) ? true_result[0] : false_result[0];
r[1] = (mask[1] & 0x80000000) ? true_result[1] : false_result[1];
r[2] = (mask[2] & 0x80000000) ? true_result[2] : false_result[2];
r[3] = (mask[3] & 0x80000000) ? true_result[3] : false_result[3];
return r;
}
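To make the sign-bit rule concrete, here is a minimal sketch that reproduces the same selection using only SSE2. The names blendv_ps_sse2 and lane are hypothetical helpers invented for this illustration, not part of any library, and this is not necessarily how the hardware instruction is implemented.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: emulates the selection rule of _mm_blendv_ps with SSE2.
inline __m128 blendv_ps_sse2(__m128 false_result, __m128 true_result, __m128 mask) {
    // An arithmetic shift right by 31 copies each lane's sign bit into all 32 bits,
    // so every lane of m is either all ones or all zeros.
    const __m128 m = _mm_castsi128_ps(_mm_srai_epi32(_mm_castps_si128(mask), 31));
    // (m & true_result) | (~m & false_result)
    return _mm_or_ps(_mm_and_ps(m, true_result), _mm_andnot_ps(m, false_result));
}

// Hypothetical helper: reads one lane by storing the register to memory.
inline float lane(__m128 v, int i) {
    float f[4];
    _mm_storeu_ps(f, v);
    return f[i];
}
```

With mask = _mm_setr_ps(-0.0f, 0.0f, 1.0f, -5.0f), lanes 0 and 3 come from true_result and lanes 1 and 2 from false_result, because only the sign bit of each mask element matters.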
I actually tend to wrap this, because the argument order differs from the usual if (cmp) { true } else { false } pattern:
__m128 select(__m128 mask, __m128 true_result, __m128 false_result) {
return _mm_blendv_ps(false_result, true_result, mask);
}
Typically you would use this to perform if (a < b) {} else {} type operations, e.g.
// if (a < b) {return true_result;} else {return false_result;}
__m128 select_if_lt(__m128 a, __m128 b, __m128 true_result, __m128 false_result) {
return select(_mm_cmplt_ps(a, b), true_result, false_result);
}
// if (a >= b) {return true_result;} else {return false_result;}
__m128 select_if_ge(__m128 a, __m128 b, __m128 true_result, __m128 false_result) {
return select(_mm_cmpge_ps(a, b), true_result, false_result);
}
In the code you posted above:
__m128 mask = { 1.0,0.0,0.0,1.0 };
The highest bit of 1.0 is actually zero, so you'd want a negative number in there to make the mask work, e.g.
// it doesn't matter which negative number you use,
// it just requires the sign bit to be set. -0.0f works!
__m128 mask = { -0.0f,0.0,0.0,-0.0f };
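One way to check which lanes a given mask would select is to read the sign bits directly with _mm_movemask_ps. The sketch below assumes the brace initializers above put their first value in lane 0 (as _mm_setr_ps does); sign_bits is a hypothetical helper name used only for illustration.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: packs the four sign bits of v into an int.
// Bit i of the result is the sign bit of lane i, i.e. the bit blendv inspects.
inline int sign_bits(__m128 v) {
    return _mm_movemask_ps(v);
}

// sign_bits(_mm_setr_ps( 1.0f, 0.0f, 0.0f,  1.0f)) == 0  (no lane selects true_result)
// sign_bits(_mm_setr_ps(-0.0f, 0.0f, 0.0f, -0.0f)) == 9  (binary 1001: lanes 0 and 3)
```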
The nice thing about looking only at the sign bit is that you are able to perform certain if/else operations without needing to use a comparison instruction, e.g.
// if (a < 0) {return true_result;} else {return false_result;}
__m128 select_if_negative(__m128 a, __m128 true_result, __m128 false_result) {
return select(a, true_result, false_result);
}
Beware though: you will get a false positive for -0.0f, which may or may not be important to you.
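The -0.0f case can be seen by comparing the sign-bit test against a real comparison with zero. This is only a sketch; both function names are hypothetical and exist purely to show the difference.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Lanes that a sign-bit-only test treats as "negative" (bit i = lane i).
inline int negative_by_sign_bit(__m128 a) {
    return _mm_movemask_ps(a);
}

// Lanes where a < 0 according to an actual floating-point comparison.
inline int negative_by_compare(__m128 a) {
    return _mm_movemask_ps(_mm_cmplt_ps(a, _mm_setzero_ps()));
}

// For a = _mm_setr_ps(-1.0f, -0.0f, 0.0f, 2.0f):
//   negative_by_sign_bit(a) == 3  (lanes 0 and 1: -0.0f counts as negative)
//   negative_by_compare(a)  == 1  (lane 0 only: -0.0f < 0 is false)
```

If that difference matters, the version that goes through _mm_cmplt_ps is the safer choice, at the cost of one extra instruction.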
As for accessing the contents of an __m128, this typically isn't portable across compilers (some overload the array operators, some expose .x/.y etc., some have internal union members). So, if you want a cross-platform way to access the contents, you have two options:
- As correctly pointed out by Peter, don't use _mm_extract_ps; use _mm_cvtss_f32 with a shuffle.
std::ostream& operator << (std::ostream& os, const __m128& v) {
os << "(" <<
_mm_cvtss_f32(v) << ", " <<
_mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1))) << ", " <<
_mm_cvtss_f32(_mm_unpackhi_ps(v, v)) << ", " <<
_mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3))) << ")";
return os;
}
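- Use _mm_storeu_ps (or _mm_store_ps if the destination is 16-byte aligned) to write the register out to a float array.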
std::ostream& operator << (std::ostream& os, const __m128& v) {
float f[4];
_mm_storeu_ps(f, v);
os << "(" <<
f[0] << ", " <<
f[1] << ", " <<
f[2] << ", " <<
f[3] << ")";
return os;
}
However you do it, accessing individual elements of an XMM register will always incur a cost (well, apart from [0]), so the general rule is to avoid doing this as much as possible!