
I have written some vector methods that do simple math, either in place or into a separate destination buffer, and the in-place variants all share the same performance penalty.

The simplest can be boiled down to something like these:

void scale(float* dst, const float* src, int count, float factor)
{
    __m128 factorV = _mm_set1_ps(factor);

    for(int i = 0; i < count; i+= 4)
    {
        __m128 in = _mm_load_ps(src);
        in = _mm_mul_ps(in, factorV);
        _mm_store_ps(dst, in);

        dst += 4;
        src += 4;
    }
}

testing code:

for(int i = 0; i < 1000000; i++)
{
    scale(alignedMemPtrDst, alignedMemPtrSrc, 256, randomFloatAbsRange1);
}

When testing, i.e. repeatedly running this function on the SAME buffers, I found that if dst and src are the same, the speed is unchanged. If they are different, it's about a factor of 70 faster, with most of the cycles burned on the write (i.e. _mm_store_ps).

Interestingly, the same behaviour does not hold for addition: += works nicely, only *= is a problem.

--

This has been answered in the comments. It's denormals during artificial testing.

Eike
    A factor of 70? Did you compile with optimization disabled or something? It smells like you're bottlenecking on store-forwarding latency somehow, instead of multiply throughput. That could explain a factor of ~7 or so, but not 70. `addps` and `mulps` are not very different in performance on Intel hardware (https://agner.org/optimize/), so there's something weird going on. What compiler/options/hardware, and what does the resulting asm look like? – Peter Cordes Nov 27 '18 at 20:51
  • 2
    We need a [mcve]. What is `count`? Are your arrays aligned? – Alan Birtles Nov 27 '18 at 21:19
  • 1
    @AlanBirtles: The code is using `_mm_store_ps`, not `storeu`, so it would fault on unaligned unless the compiler uses `movups` anyway. That is a possibility for ICC and MSVC though. – Peter Cordes Nov 27 '18 at 21:23
  • 1
    Does your `factor` produce a subnormal result? non-zero but smaller than `FLT_MIN`? If there's a loop outside this that loops over the same block in-place repeatedly, numbers could get small enough to cause FP assists. (It doesn't take extra time to produce a +Inf or NaN result, but it does for gradual underflow to subnormal. That's one reason `-ffast-math` sets DAZ/FTZ - flush-to-zero on underflow.) – Peter Cordes Nov 27 '18 at 21:25
  • 1
    @PeterCordes While thinking about the minimal complete verifiable example (I did NOT expect people to actually run my code), I found that indeed this code produces denormals when repeatedly multiplying a buffer with something abs-smaller than 1.f. I'm used to finding denormals in states (e.g. IIR-filters), but did not think of this (which is a testing artefact). It all makes sense now (inplace, *= but not += ..) Thanks a bunch!! If you copy your comment to an answer I can accept it. – Eike Nov 28 '18 at 07:46
  • 1
    This is why we ask for a [mcve], when you've been staring at some code for ages and can't see what's wrong with it the issue is often outside that code, when you only post the code you think has the issue we can't see the problem code. – Alan Birtles Nov 28 '18 at 08:09
  • @AlanBirtles yeah, thanks. I know this as well from asking questions in real life: when formulating the question so that somebody else can understand it (i.e. has all the context), normally I have my answer already. It was late yesterday :( – Eike Nov 28 '18 at 08:29

1 Answer


Does your factor produce a subnormal result? Non-zero but smaller than FLT_MIN? If there's a loop outside this that loops over the same block in-place repeatedly, numbers could get small enough to require slow FP assists.

(Turns out, yes that was the problem for the OP).

Repeated in-place multiply makes the numbers smaller and smaller with a factor below 1.0. Copy-and-scale to a different buffer uses the same inputs every time.

It doesn't take extra time to produce a +/-Inf or NaN result, but gradual underflow to subnormal does, on Intel CPUs at least. That's one reason -ffast-math sets DAZ/FTZ (flush-to-zero on underflow).


I think I've read that AMD doesn't have FP-assist microcoded handling of subnormals, but Intel does.

There's a performance counter on Intel CPUs for fp_assist.any which counts when a subnormal result requires extra microcode uops to handle the special case. (I think it's quite intrusive for the front-end and out-of-order exec. It's definitely slow, though.)
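On Linux you can read that counter directly with `perf` (a sketch; the event name is Intel-specific and `./scale_bench` is a hypothetical name for the test binary):

```shell
# Count FP assists while running the benchmark; a count in the millions
# for the in-place version confirms the subnormal explanation.
perf stat -e fp_assist.any ./scale_bench
```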


Why denormalized floats are so much slower than other floats, from hardware architecture viewpoint?

Why is icc generating weird assembly for a simple main? (shows how ICC sets FTZ/DAZ at the start of main, with its default fast-math setting.)

Peter Cordes