Converting bytes to floats using SIMD

Question

I have an array of bytes that I want to convert to floats so I can do some arithmetic on them. I have an idea, but I'm not sure whether it can work. It may not be perfect, but if it can be made to work, that would be enough for me, since I'm just starting to learn.

I tried to unpack the low data twice: first interleaving it with a vector of zeros in xmm1 to convert bytes to shorts, then doing the same again to convert shorts to dwords. Then I convert the result to float. In my head it should work, but looking at the debugger I'm not sure. The hexadecimal values are the ones I expect, but the values shown are always the maximum short. Why does this happen?

PUNPCKLBW  xmm1, [rbx]
movups   xmm2,xmm1
xorps xmm1,xmm1
PUNPCKLBW  xmm1,xmm2
CVTDQ2PS xmm1,xmm1
movups   [vecReal],xmm1

Don't post pictures of text; copy/paste your debugger's output. That image is basically illegible and looks more like a weird \hrule divider until you look closely enough to realize there's text in the blue. — Peter Cordes, Jan 10 '22 at 14:28
You could optimize away the `movups` register-copy by simply zeroing XMM2 instead. Are you sure you want to unpack bytes to words again, after already unpacking to words? May not matter if it's just zeros, but seems less intuitive than `punpcklwd` would be, if you can't just use SSE4.1 — Peter Cordes, Jan 10 '22 at 14:31
Related: [Loading 8 chars from memory into an \_\_m256 variable as packed single precision floats](https://stackoverflow.com/q/34279513) shows the SSE4.1 way (using an AVX2 version of it.) [Move single byte from memory to xmm register as float](https://stackoverflow.com/q/51582310) shows the SSE4.1 way as an afterthought. [SSE intrinsics: Convert 32-bit floats to UNSIGNED 8-bit integers](https://stackoverflow.com/q/29856006) shows the reverse direction, float packing back to byte. — Peter Cordes, Jan 10 '22 at 14:36
[How to load a pixel struct into an SSE register?](https://stackoverflow.com/q/12121640) uses intrinsics to convert bytes to uint32_t, which of course is input ready for `cvtdq2ps`. [8-bit FFT for CPU architectures?](https://stackoverflow.com/q/16066107) uses `punpcklbw` / `punpcklwd` with the same zeroed register. (And only does a 4-byte load instead of 16-byte.) — Peter Cordes, Jan 10 '22 at 14:36

Aki Suihkonen · Answer 1 · 2022-01-10T15:30:07.717

1

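What is happening is that the first operand of the unpack instruction (the destination) supplies the low byte of each 16-bit result, while the second operand (the source) supplies the high byte. With the zeroed register as the first operand, as in your code, each data byte ends up in the high half of its word.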

// lo = 01 02 03 ff xx xx xx xx
// hi = 00 00 00 00 xx xx xx xx
PUNPCKLBW lo, hi
// lo = 01 00 02 00 03 00 ff 00 ... == 0x0001 0x0002 0x0003 0x00ff

// Alternatively, starting from the same inputs with the operands swapped:
PUNPCKLBW hi, lo
// hi = 00 01 00 02 00 03 00 ff ... == 0x0100 0x0200 0x0300 0xff00

What you probably want is the first order of operands: data in the destination, zeros in the source.
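To make the operand order concrete, here is a minimal scalar model of `punpcklbw` in plain C. It is only a sketch for reasoning about where each byte lands, not a way to do the real conversion; the helper names `punpcklbw_model` and `word_at` are made up for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of SSE2 `punpcklbw dst, src` (hypothetical helper).
   Interleaves the low 8 bytes of dst and src: bytes from dst land at
   even offsets (the low byte of each little-endian 16-bit word), bytes
   from src land at odd offsets (the high byte). */
static void punpcklbw_model(uint8_t dst[16], const uint8_t src[16])
{
    uint8_t out[16];
    for (int i = 0; i < 8; i++) {
        out[2 * i]     = dst[i];  /* low byte of word i  */
        out[2 * i + 1] = src[i];  /* high byte of word i */
    }
    memcpy(dst, out, sizeof out);
}

/* Read 16-bit word i from the register image, little-endian as on x86
   (hypothetical helper). */
static uint16_t word_at(const uint8_t reg[16], int i)
{
    return (uint16_t)(reg[2 * i] | (reg[2 * i + 1] << 8));
}
```

With `data` holding the bytes `01 02 03 ff` and `zero` all zeros, `punpcklbw_model(data, zero)` leaves the words `0x0001 0x0002 0x0003 0x00ff` in `data`, whereas `punpcklbw_model(zero, data)` leaves `0x0100 0x0200 0x0300 0xff00` in `zero`.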

With SSE4.1 there's also `pmovzxbd` (the `_mm_cvtepu8_epi32` intrinsic), which zero-extends 4 `uint8_t` values to four 32-bit integers (an `__m128i`) in a single instruction.
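As a rough sketch of the SSE4.1 route using C intrinsics, assuming GCC or Clang on an x86 CPU with SSE4.1: `bytes4_to_floats` is a hypothetical helper name, and the `target` attribute is a compiler extension that could be dropped if you build with `-msse4.1` instead.

```c
#include <stdint.h>
#include <string.h>
#include <smmintrin.h>  /* SSE4.1 intrinsics (also pulls in SSE2) */

/* Convert 4 unsigned bytes at src to 4 floats at dst (hypothetical helper).
   Assumes GCC/Clang: the target attribute enables SSE4.1 for this function
   only. */
__attribute__((target("sse4.1")))
static void bytes4_to_floats(const uint8_t *src, float *dst)
{
    int32_t packed;
    memcpy(&packed, src, sizeof packed);         /* plain 4-byte load */
    __m128i bytes  = _mm_cvtsi32_si128(packed);  /* movd: 4 bytes into the low dword */
    __m128i dwords = _mm_cvtepu8_epi32(bytes);   /* pmovzxbd: zero-extend u8 -> i32 */
    __m128  floats = _mm_cvtepi32_ps(dwords);    /* cvtdq2ps: i32 -> float */
    _mm_storeu_ps(dst, floats);                  /* movups: unaligned store */
}
```

In assembly this corresponds roughly to `pmovzxbd xmm0, [rbx]` (the instruction accepts a 32-bit memory operand) followed by `cvtdq2ps xmm0, xmm0`.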

edited Jan 10 '22 at 15:30

answered Jan 10 '22 at 12:19

Aki Suihkonen

This is a pure asm question; the instruction is `pmovzxbd` for the `_mm_cvtepu8_epi32` intrinsic. – Peter Cordes Jan 10 '22 at 14:29