
I want to convert an array of unsigned short numbers to float using SSE. Let's say

__m128i xVal;     // Has 8 16-bit unsigned integers
__m128 y1, y2;    // 2 xmm registers for 8 float values

I want the first 4 uint16 values in y1 and the next 4 in y2. Which SSE intrinsic should I use?

Krishnaraj

2 Answers


You first need to unpack your vector of 8 x 16-bit unsigned shorts into two vectors of 32-bit unsigned ints, then convert each of these vectors to float:

__m128i xlo = _mm_unpacklo_epi16(x, _mm_set1_epi16(0));  // zero-extend low 4 uint16 to int32
__m128i xhi = _mm_unpackhi_epi16(x, _mm_set1_epi16(0));  // zero-extend high 4 uint16 to int32
__m128 ylo = _mm_cvtepi32_ps(xlo);                       // convert low 4 int32 to float
__m128 yhi = _mm_cvtepi32_ps(xhi);                       // convert high 4 int32 to float
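
For context, here is a minimal sketch of how the snippet above could be wrapped into a loop over a plain array. The function name u16_to_float, the unaligned loads/stores, and the multiple-of-8 length assumption are mine, not part of the answer:

#include <emmintrin.h>   // SSE2 intrinsics
#include <stddef.h>
#include <stdint.h>

// Convert n uint16_t values to float, 8 per iteration (n is assumed to be a multiple of 8).
static void u16_to_float(const uint16_t *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i += 8)
    {
        __m128i x   = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i xlo = _mm_unpacklo_epi16(x, _mm_set1_epi16(0));  // low 4 uint16 -> int32
        __m128i xhi = _mm_unpackhi_epi16(x, _mm_set1_epi16(0));  // high 4 uint16 -> int32
        _mm_storeu_ps(dst + i,     _mm_cvtepi32_ps(xlo));
        _mm_storeu_ps(dst + i + 4, _mm_cvtepi32_ps(xhi));
    }
}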
Paul R
  • That's what I would do as well, except that I'd use one _mm_setzero_si128() instead of two _mm_set1_epi16(0) calls. – Magnus Jun 15 '13 at 21:16
  • @Magnus: I think you'll find that the generated code is the same either way, at least with most decent compilers. – Paul R Jun 15 '13 at 22:32
  • @PaulR Hi Paul. The optimizers tend to do something, just not the thing I would've done :-) In this case I found that at least MSVC folded _mm_set1_epi16(0) into a 16 byte constant which it loads using movdqa. It actually produces two constants with two separate movdqa instructions. – Magnus Jun 20 '13 at 12:15
  • I find that MSVC is a bit unreliable when it comes to SSE code generation/optimisation - sometimes it does OK and other times it fails miserably. gcc, ICC and clang all tend to be more consistent/reliable. – Paul R Jun 20 '13 at 14:30
  • @PaulR There have been cases where I've been caught out by different compilers, which have made me quite pessimistic, but indeed they do often manage to do the right thing. – Magnus Jun 20 '13 at 17:08
  • @PaulR, how could one convert a `__m128i` holding 16 `unsigned char` elements into 4 `__m128` vectors? Thank you. – Royi Oct 23 '17 at 20:47
  • @Royi: see [this answer](https://stackoverflow.com/a/19492053/253056) (and just ignore the final conversion to float). – Paul R Oct 23 '17 at 21:47
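
Following up on the last comment above: a sketch, not from the linked answer, of that two-stage widening for 16 unsigned chars (bytes to uint16 to int32 to float), using the single _mm_setzero_si128() that Magnus mentioned. The function name u8x16_to_float is my own:

#include <emmintrin.h>   // SSE2 intrinsics

// Convert the 16 unsigned chars held in v into 16 floats written to out[0..15].
static void u8x16_to_float(__m128i v, float out[16])
{
    __m128i zero = _mm_setzero_si128();
    __m128i w_lo = _mm_unpacklo_epi8(v, zero);   // bytes 0-7  -> 8 x uint16
    __m128i w_hi = _mm_unpackhi_epi8(v, zero);   // bytes 8-15 -> 8 x uint16

    _mm_storeu_ps(out +  0, _mm_cvtepi32_ps(_mm_unpacklo_epi16(w_lo, zero)));  // bytes 0-3
    _mm_storeu_ps(out +  4, _mm_cvtepi32_ps(_mm_unpackhi_epi16(w_lo, zero)));  // bytes 4-7
    _mm_storeu_ps(out +  8, _mm_cvtepi32_ps(_mm_unpacklo_epi16(w_hi, zero)));  // bytes 8-11
    _mm_storeu_ps(out + 12, _mm_cvtepi32_ps(_mm_unpackhi_epi16(w_hi, zero)));  // bytes 12-15
}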

I would suggest using a slightly different version of Paul R's approach:

static const __m128i magicInt = _mm_set1_epi16(0x4B00);    // high halfword of the bit pattern of 2^23
static const __m128 magicFloat = _mm_set1_ps(8388608.0f);  // 8388608.0f == 2^23

__m128i xlo = _mm_unpacklo_epi16(x, magicInt);               // low 4 uint16 -> 0x4B00xxxx
__m128i xhi = _mm_unpackhi_epi16(x, magicInt);               // high 4 uint16 -> 0x4B00xxxx
__m128 ylo = _mm_sub_ps(_mm_castsi128_ps(xlo), magicFloat);  // reinterpret as float, subtract 2^23
__m128 yhi = _mm_sub_ps(_mm_castsi128_ps(xhi), magicFloat);

At the assembly level, the only difference from Paul R's version is the use of _mm_sub_ps (the SUBPS instruction) instead of _mm_cvtepi32_ps (the CVTDQ2PS instruction). _mm_sub_ps is never slower than _mm_cvtepi32_ps, and is actually faster on older and low-power CPUs (read: Intel Atom and AMD Bobcat).
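
For readers wondering why the magic numbers work: 0x4B000000 is the bit pattern of 8388608.0f (2^23), and at that magnitude the spacing between adjacent floats is exactly 1.0, so placing a 16-bit value x in the low mantissa bits yields the float 8388608.0 + x; subtracting the constant then recovers x exactly. Below is a scalar illustration of the same bit trick (my addition, not part of the answer):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint16_t x = 12345;
    uint32_t bits = 0x4B000000u | x;  // what unpacking against 0x4B00 produces for one lane
    float f;
    memcpy(&f, &bits, sizeof f);      // bitwise reinterpret, the scalar analogue of _mm_castsi128_ps
    printf("%f\n", f - 8388608.0f);   // prints 12345.000000
    return 0;
}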

Marat Dukhan
  • I'm not entirely convinced this is better though. You take a 1-2 cycle latency hit for moving data from SSE-int to SSE-FP. Then you need two extra registers (or loads) for the two constants. This trick is more commonly used for double-precision. – Mysticial Feb 07 '12 at 00:36
  • CVTDQ2PS also suffers from the SSE-INT to SSE-FP transition penalty. Increased register pressure could be a problem, but it is highly dependent on surrounding code. – Marat Dukhan Feb 07 '12 at 00:42