How to transfer single float from memory into four floats in XMM?

Question

The following code is meant to divide each packed single-precision floating-point value in xmm0 by 4:

quarter dd 0.25
...

movups  xmm1, [quarter]
mulps   xmm0, xmm1

However, it does not do what I want, because movups loads a full 16 bytes starting at [quarter], so only the lowest element holds 0.25 and the other three hold whatever follows it in memory:

(gdb) p $xmm1
$2 = {v4_float = {0.25, 0.00200051093, 7.8472714e-44, 8.40779079e-45}

The obvious workaround would be to declare quarter as a four-element array. However, I am curious whether there is a preferred way to load just the first element and replicate it across the register. For instance:

movss   xmm1, [quarter]
; some magic kung-fu
mulps   xmm0, xmm1

Edit:

Thanks to the comments below, I ended up with shufps:

movss   xmm1, [quarter]
shufps  xmm1, xmm1, 0     ; broadcast the least significant element
mulps   xmm0, xmm1
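
For comparison, here is a minimal sketch of the same load-and-shuffle sequence written with C SSE intrinsics. It assumes an x86 compiler that provides `<xmmintrin.h>`. The names `quarter`, `scale_by_scalar`, and `quarter4` are made up for illustration and are not part of the original code.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical stand-in for the `quarter dd 0.25` constant above. */
static const float quarter = 0.25f;

/* Multiply four packed floats by one scalar read from memory.
   _mm_load_ss corresponds to movss (upper three lanes are zeroed);
   _mm_shuffle_ps with selector 0 corresponds to shufps xmm, xmm, 0. */
static __m128 scale_by_scalar(__m128 v, const float *scalar)
{
    __m128 s = _mm_load_ss(scalar);                     /* {x, 0, 0, 0} */
    s = _mm_shuffle_ps(s, s, _MM_SHUFFLE(0, 0, 0, 0));  /* {x, x, x, x} */
    return _mm_mul_ps(v, s);
}

/* Illustrative wrapper: scale four floats in place (unaligned access). */
static void quarter4(float *data)
{
    __m128 v = _mm_loadu_ps(data);
    _mm_storeu_ps(data, scale_by_scalar(v, &quarter));
}
```

Calling `quarter4` on an array of four floats should leave each element multiplied by 0.25. The explicit shuffle is spelled out only to mirror the assembly; a compiler is free to pick a different instruction sequence.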

If you have AVX, you can use the broadcast instruction (`vbroadcastss`). If not, you can load one element into the low 32 bits with `movd`, then shuffle with mask 0 to set all elements to the value of XMM[31-0] — Yan Zhou, Dec 11 '16 at 23:01
In NASM I usually just write `times 4 dd 0.25`. It makes the data section larger but saves one instruction. Even with AVX, `movups` saves 1 cycle compared to the broadcast, I believe. Of course, it depends on whether you are optimizing for speed or code size. In my case, it simply does not make a noticeable difference and I like the convenience. — Yan Zhou, Dec 11 '16 at 23:03
@YanZhou In what way is `movd` preferable to the type-correct `movss`? On some CPUs it might matter. — Iwillnotexist Idonotexist, Dec 11 '16 at 23:07
@IwillnotexistIdonotexist I don't think `movd` is preferable to `movss`. It was merely the first one I thought of. They do have different latency on some CPUs. What I was trying to point out was the use of a shuffle as a poor man's broadcast in SSE — Yan Zhou, Dec 11 '16 at 23:13
@IwillnotexistIdonotexist Generally, type correctness only matters for floating-point vs. integer types; anything else shouldn't matter for performance. — fuz, Dec 11 '16 at 23:29
@fuz: yes exactly. `movd` is an integer-domain load, so it has extra latency as the input to an FP multiply or shuffle on some CPUs (Nehalem). Data types don't matter for stores on any known CPUs, though, so some compilers can and do use `movups` / `movaps` for stores to save a byte of code vs. `movdqa` for loads. (But most compilers still use `movapd` instead of `movaps`, even though no existing hardware cares about that). — Peter Cordes, Dec 13 '16 at 02:10
Also, yes, load + shufps is how compilers implement `_mm_set1_ps(something_from_memory)`. Most compilers choose to expand constants so they can be used directly from memory, though. (16B-aligned and already broadcast). If you're just doing it once ahead of a loop, keeping your constants compact with broadcast-loads, PMOVZX, and similar stuff like that might be worth it (especially if your constants don't all fit in a single cache line). Or even generate the bit-patterns on the fly if they're simple; see Agner Fog's Optimizing Assembly guide, and http://stackoverflow.com/q/35085059/224132 — Peter Cordes, Dec 13 '16 at 02:33
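
The two options discussed in the comments (broadcasting a compact scalar at run time versus keeping a pre-expanded, 16-byte-aligned constant) can be sketched with C intrinsics as follows. This assumes a C11 compiler with `<xmmintrin.h>`. The names `quarter_scalar`, `quarter_x4`, `scale_broadcast`, and `scale_expanded` are hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Option 1: compact 4-byte constant, broadcast at run time. */
static const float quarter_scalar = 0.25f;

/* Option 2: constant already expanded to four lanes and 16-byte aligned,
   comparable to an aligned `times 4 dd 0.25` in NASM. */
static _Alignas(16) const float quarter_x4[4] = {0.25f, 0.25f, 0.25f, 0.25f};

/* _mm_set1_ps asks the compiler to replicate one float across all lanes. */
static __m128 scale_broadcast(__m128 v)
{
    return _mm_mul_ps(v, _mm_set1_ps(quarter_scalar));
}

/* _mm_load_ps is an aligned 16-byte load; the compiler may also fold it
   into a memory operand of the multiply. */
static __m128 scale_expanded(__m128 v)
{
    return _mm_mul_ps(v, _mm_load_ps(quarter_x4));
}
```

Both functions should produce the same result. Which one is cheaper depends on the surrounding code and the target CPU, as the comments above point out.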
