
I want to convert an array of unsigned short numbers to float using SSE. Let's say

__m128i xVal;     // Has 8 16-bit unsigned integers
__m128 y1, y2;    // 2 xmm registers for 8 float values

I want the first 4 uint16 values in y1 and the next 4 in y2. Which SSE intrinsic should I use?

Krishnaraj

2 Answers


You first need to unpack your vector of 8 x 16-bit unsigned shorts into two vectors of 32-bit unsigned ints, then convert each of these vectors to float:

__m128i xlo = _mm_unpacklo_epi16(x, _mm_set1_epi16(0));  // zero-extend low 4 uint16 to int32
__m128i xhi = _mm_unpackhi_epi16(x, _mm_set1_epi16(0));  // zero-extend high 4 uint16 to int32
__m128 ylo = _mm_cvtepi32_ps(xlo);                       // convert low 4 int32 to float
__m128 yhi = _mm_cvtepi32_ps(xhi);                       // convert high 4 int32 to float
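
For context, here is a minimal sketch of how the snippet above could be wrapped into a loop over a plain array. The function name u16_to_float, the unaligned loads/stores, and the multiple-of-8 length assumption are mine, not part of the answer:

#include <emmintrin.h>   // SSE2 intrinsics
#include <stddef.h>
#include <stdint.h>

// Convert n uint16_t values to float, 8 per iteration (n is assumed to be a multiple of 8).
static void u16_to_float(const uint16_t *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i += 8)
    {
        __m128i x   = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i xlo = _mm_unpacklo_epi16(x, _mm_set1_epi16(0));  // low 4 uint16 -> int32
        __m128i xhi = _mm_unpackhi_epi16(x, _mm_set1_epi16(0));  // high 4 uint16 -> int32
        _mm_storeu_ps(dst + i,     _mm_cvtepi32_ps(xlo));
        _mm_storeu_ps(dst + i + 4, _mm_cvtepi32_ps(xhi));
    }
}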
Paul R
  • That's what I would do as well, except that I'd use one _mm_setzero_si128() instead of two _mm_set1_epi16(0) calls. – Magnus Jun 15 '13 at 21:16
  • @Magnus: I think you'll find that the generated code is the same either way, at least with most decent compilers. – Paul R Jun 15 '13 at 22:32
  • @PaulR Hi Paul. The optimizers tend to do something, just not the thing I would've done :-) In this case I found that at least MSVC folded _mm_set1_epi16(0) into a 16 byte constant which it loads using movdqa. It actually produces two constants with two separate movdqa instructions. – Magnus Jun 20 '13 at 12:15
  • I find that MSVC is a bit unreliable when it comes to SSE code generation/optimisation - sometimes it does OK and other times it fails miserably. gcc, ICC and clang all tend to be more consistent/reliable. – Paul R Jun 20 '13 at 14:30
  • @PaulR There have been cases where I've been caught out by different compilers, which have made me quite pessimistic, but indeed they do often manage to do the right thing. – Magnus Jun 20 '13 at 17:08
  • @PaulR, how could one convert a `__m128i` holding 16 `unsigned char` elements into 4 `__m128` vectors? Thank you. – Royi Oct 23 '17 at 20:47
  • @Royi: see [this answer](https://stackoverflow.com/a/19492053/253056) (and just ignore the final conversion to float). – Paul R Oct 23 '17 at 21:47
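
Following up on the last comment above: a sketch, not from the linked answer, of that two-stage widening for 16 unsigned chars (bytes to uint16 to int32 to float), using the single _mm_setzero_si128() that Magnus mentioned. The function name u8x16_to_float is my own:

#include <emmintrin.h>   // SSE2 intrinsics

// Convert the 16 unsigned chars held in v into 16 floats written to out[0..15].
static void u8x16_to_float(__m128i v, float out[16])
{
    __m128i zero = _mm_setzero_si128();
    __m128i w_lo = _mm_unpacklo_epi8(v, zero);   // bytes 0-7  -> 8 x uint16
    __m128i w_hi = _mm_unpackhi_epi8(v, zero);   // bytes 8-15 -> 8 x uint16

    _mm_storeu_ps(out +  0, _mm_cvtepi32_ps(_mm_unpacklo_epi16(w_lo, zero)));  // bytes 0-3
    _mm_storeu_ps(out +  4, _mm_cvtepi32_ps(_mm_unpackhi_epi16(w_lo, zero)));  // bytes 4-7
    _mm_storeu_ps(out +  8, _mm_cvtepi32_ps(_mm_unpacklo_epi16(w_hi, zero)));  // bytes 8-11
    _mm_storeu_ps(out + 12, _mm_cvtepi32_ps(_mm_unpackhi_epi16(w_hi, zero)));  // bytes 12-15
}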

I would suggest using a slightly different version of Paul R's approach:

static const __m128i magicInt = _mm_set1_epi16(0x4B00);    // high halfword of the bit pattern of 2^23
static const __m128 magicFloat = _mm_set1_ps(8388608.0f);  // 8388608.0f == 2^23

__m128i xlo = _mm_unpacklo_epi16(x, magicInt);               // low 4 uint16 -> 0x4B00xxxx
__m128i xhi = _mm_unpackhi_epi16(x, magicInt);               // high 4 uint16 -> 0x4B00xxxx
__m128 ylo = _mm_sub_ps(_mm_castsi128_ps(xlo), magicFloat);  // reinterpret as float, subtract 2^23
__m128 yhi = _mm_sub_ps(_mm_castsi128_ps(xhi), magicFloat);

At the assembly level, the only difference from Paul R's version is the use of _mm_sub_ps (the SUBPS instruction) instead of _mm_cvtepi32_ps (the CVTDQ2PS instruction). _mm_sub_ps is never slower than _mm_cvtepi32_ps, and is actually faster on older and low-power CPUs (read: Intel Atom and AMD Bobcat).
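
For readers wondering why the magic numbers work: 0x4B000000 is the bit pattern of 8388608.0f (2^23), and at that magnitude the spacing between adjacent floats is exactly 1.0, so placing a 16-bit value x in the low mantissa bits yields the float 8388608.0 + x; subtracting the constant then recovers x exactly. Below is a scalar illustration of the same bit trick (my addition, not part of the answer):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint16_t x = 12345;
    uint32_t bits = 0x4B000000u | x;  // what unpacking against 0x4B00 produces for one lane
    float f;
    memcpy(&f, &bits, sizeof f);      // bitwise reinterpret, the scalar analogue of _mm_castsi128_ps
    printf("%f\n", f - 8388608.0f);   // prints 12345.000000
    return 0;
}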

Marat Dukhan
  • I'm not entirely convinced this is better though. You take a 1-2 cycle latency hit for moving data from SSE-int to SSE-FP. Then you need two extra registers (or loads) for the two constants. This trick is more commonly used for double-precision. – Mysticial Feb 07 '12 at 00:36
  • CVTDQ2PS also suffers from the SSE-INT to SSE-FP transition penalty. Increased register pressure could be a problem, but it is highly dependent on surrounding code. – Marat Dukhan Feb 07 '12 at 00:42