7

I've been trying my hand at optimising some code using Microsoft's SSE intrinsics. One of the biggest problems when optimising my code is the LHS (load-hit-store) that happens whenever I want to use a constant. There seems to be some info on generating certain constants (here and here - section 13.4), but it's all assembly (which I would rather avoid).

The problem is that when I try to implement the same thing with intrinsics, MSVC complains about incompatible types. Does anyone know of any equivalent tricks using intrinsics?

Example - Generate {1.0,1.0,1.0,1.0}

//pcmpeqw xmm0,xmm0 
__m128 t = _mm_cmpeq_epi16( t, t );

//pslld xmm0,25 
_mm_slli_epi32(t, 25);

//psrld xmm0,2
return _mm_srli_epi32(t, 2);

This generates a bunch of errors about incompatible types (__m128 vs __m128i). I'm pretty new to this, so I'm pretty sure I'm missing something obvious. Can anyone help?

tl;dr - How do I generate an __m128 vector filled with a single-precision float constant using MS intrinsics?

Thanks for reading :)

JBeFat
  • What makes you think you need to do this ? Typically constants are loaded only once, prior to a computational loop, so the relative cost of a memory access is negligible. – Paul R Jul 03 '11 at 21:33
  • I have several constants, all of which are used within a loop which unfortunately already seems to use all 8 xmm registers. In VTune I get a very high CPI at the point where some of these constants are used. I figured that if I could reduce the number of constants I'm loading, and generate some instead, that might reduce the cost, as one would hide the cost of the other. Also, weirdly, using the register keyword on one of the constants helped quite a bit (even though that just resulted in some other value being pushed out of the xmm regs instead). – JBeFat Jul 04 '11 at 16:29
  • 4
    Use x86-64 if you can - that way you get 16 XMM registers. Also note that even if you get one or more cache misses the first time these constants are loaded this should get amortised over a large number of iterations where the constants will subsequently be in L1 cache. (Unless of course you only have a small number of loop iterations ?) – Paul R Jul 04 '11 at 18:00
  • Note that some compilers will generate a pxor instruction to zero `t` before use, even though you're *trying* to use it uninitialized. Depending on the compiler, you might have better luck starting with `_mm_set1_epi16(-1)`, since compilers know how to do that with pcmpeq. There's also `_mm_undefined_si128()`, which exists for exactly this kind of thing, but not all compilers support it. e.g. clang-3.5 doesn't, but clang-3.8 does. – Peter Cordes Jan 26 '16 at 12:43

2 Answers

5

Try _mm_set_ps, _mm_set_ps1 or _mm_set1_ps.
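As a minimal sketch of what these intrinsics look like in use (the helper names `ones_ps` and `ramp_ps` are made up for illustration, not taken from the thread): `_mm_set1_ps` broadcasts one float to all four lanes, while `_mm_set_ps` takes its arguments from the highest lane down to the lowest. The compiler decides how to materialise the value, and for a constant like this it will typically be a load from memory.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_set1_ps, _mm_set_ps */

/* Hypothetical helper: {1.0, 1.0, 1.0, 1.0} */
static inline __m128 ones_ps(void)
{
    return _mm_set1_ps(1.0f);
}

/* Hypothetical helper: arguments run from lane 3 down to lane 0,
   so lane 0 (lowest address when stored) holds 1.0f. */
static inline __m128 ramp_ps(void)
{
    return _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
}
```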

Paul R
Ben Voigt
  • `0x1a11 251 movaps xmm6, xmmword ptr [0x414890]` `0x1a18 251 xorps xmm5, xmm5` Hi, thanks for taking the time to answer :) As you can see from the (badly formatted, sorry) listing above, `_mm_set_ps` doesn't really help me, as it still uses movaps to load the constants from somewhere in memory. What I'd like is to use existing methods for generating constants directly within the xmm registers. – JBeFat Jul 03 '11 at 21:18
  • @JBeFat: Have you tried simply casting the result? Those tricks are using integer instructions to create floating-point values, so I'm not surprised that the compiler complains about a type mismatch. – Ben Voigt Jul 03 '11 at 21:36
  • 1
    Also note that there's no LHS store with `_mm_set_ps`, since the FPU isn't involved. – Ben Voigt Jul 03 '11 at 21:38
  • `MOVAPS` is actually as fast or faster nowadays on most CPUs, when reading from a warm cache. Also, int/float conversions can come with some hard to predict extra latencies on some (mostly AMD) processors, too. And lastly, the code is just abysmal. `_mm_set_ps` is descriptive and unambiguous. Some weird sequence of bit hacks will make you wonder what the hell you intended to do there, if you read your code 5 years from now. – Damon Jul 04 '11 at 09:57
  • @Damon: There's no conversion here, just the equivalent of `reinterpret_cast`, for the type-checker's benefit (compile time only). And an appropriately named inline function can take care of the readability problem. – Ben Voigt Jul 04 '11 at 13:33
  • 2
    @Ben Voigt: Yes, and that is the problem. Quoting [The microarchitecture of Intel, AMD and VIA CPUs](http://www.agner.org/optimize/microarchitecture.pdf): _"The XMM registers have some tag bits that are used for remembering whether floating point values are normal, denormal or zero. These tag bits have to be set __when the output of an integer instruction is used as input for a single or double precision floating point instruction__. This causes a so-called reformatting delay."_ – Damon Jul 04 '11 at 14:10
  • @Damon: Didn't know about that. That would make the constant-generation tricks useful only for generating integer constants. Anyway, constant generation is realistically a VERY small part of any calculation, so I'd definitely use the `_mm_set_ps` family of intrinsics, which apparently translate to `MOVAPS`. – Ben Voigt Jul 04 '11 at 14:28
  • Thanks for the advice guys! I'll do some more profiling tonight and see what results I get. – JBeFat Jul 04 '11 at 16:23
3

Simply cast __m128i to __m128 using _mm_castsi128_ps. Also, the second line should be

t = _mm_slli_epi32(t, 25);
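Putting both fixes together, the question's snippet might look roughly like the sketch below. The function name `generate_ones_ps` is hypothetical, and the all-ones start uses `_mm_set1_epi32(-1)` (in the spirit of the comment suggesting `_mm_set1_epi16(-1)`) instead of comparing an uninitialised variable with itself. The cast is a compile-time reinterpretation only and emits no instruction. An optimising compiler may still fold the whole sequence into a constant load, so it is worth checking the generated assembly.

```c
#include <emmintrin.h>  /* SSE2: __m128i, integer shifts, _mm_castsi128_ps */

/* Hypothetical helper: builds {1.0, 1.0, 1.0, 1.0} from integer ops. */
static inline __m128 generate_ones_ps(void)
{
    /* All bits set in every 32-bit lane: 0xFFFFFFFF. */
    __m128i t = _mm_set1_epi32(-1);

    /* pslld 25: each lane becomes 0xFE000000. */
    t = _mm_slli_epi32(t, 25);

    /* psrld 2: each lane becomes 0x3F800000, the bit pattern of 1.0f. */
    t = _mm_srli_epi32(t, 2);

    /* Reinterpret the integer vector as four floats (no conversion). */
    return _mm_castsi128_ps(t);
}
```

Keeping the intermediate value as `__m128i` and casting only at the end is what avoids the `__m128` vs `__m128i` type errors.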
Norbert P.