
Based on the answers and comments on this question, I wrote a performance test with gcc 4.9.2 (MinGW64) to estimate which way of doing multiple integer divisions is faster, as follows:

#include <emmintrin.h>  // SSE2

static unsigned short x[8] = {0, 55, 2, 62003, 786, 5555, 123, 32111};  // Dividend

__attribute__((noinline)) static void test_div_x86(unsigned i){
    for(; i; --i)
        x[0] /= i,
        x[1] /= i,
        x[2] /= i,
        x[3] /= i,
        x[4] /= i,
        x[5] /= i,
        x[6] /= i,
        x[7] /= i;
}

__attribute__((noinline)) static void test_div_sse(unsigned i){
    for(; i; --i){
        __m128i xmm0 = _mm_loadu_si128((const __m128i*)x);
        __m128 xmm1 = _mm_set1_ps(i);
        _mm_storeu_si128(
            (__m128i*)x,
            _mm_packs_epi32(
                _mm_cvtps_epi32(
                    _mm_div_ps(
                        _mm_cvtepi32_ps(_mm_unpacklo_epi16(xmm0, _mm_setzero_si128())),
                        xmm1
                    )
                ),
                _mm_cvtps_epi32(
                    _mm_div_ps(
                        _mm_cvtepi32_ps(_mm_unpackhi_epi16(xmm0, _mm_setzero_si128())),
                        xmm1
                    )
                )
            )
        );
    }
}

int main(){
    const unsigned runs = 40000000; // Choose a big number, so the compiler doesn't dare to unroll loops and optimize with constants
    test_div_x86(runs),
    test_div_sse(runs);
    return 0;
}

The results, measured with GNU gprof; build commands and timings below.

/*
gcc -O? -msse2 -pg -o test.o -c test.c
g++ -o test test.o -pg
test
gprof test.exe gmon.out
-----------------------------------
        test_div_sse(unsigned int)      test_div_x86(unsigned int)
-O0     2.26s                           1.10s
-O1     1.41s                           1.07s
-O2     0.95s                           1.09s
-O3     0.77s                           1.07s
*/

Now I'm confused why the x86 test barely gets optimized while the SSE test becomes faster, despite the expensive conversions to and from floating point. Furthermore, I'd like to know how much the results depend on compilers and architectures.

To summarize: which is faster in the end, dividing one by one or taking the floating-point detour?

Youka
  • It seems `unsigned short x[8]` gets modified in the first test. So the second test starts with a different value. (And all `x` values seem to go to `0` quickly.) – AlexD Jul 22 '15 at 23:43
  • The operation duration shouldn't be influenced by the parameters, and the compiler shouldn't detect when `x` is 0 anywhere. It makes no difference when I set local dividends, so `x` doesn't matter. – Youka Jul 23 '15 at 01:26
  • You can do even better: Instead of dividing, you can multiply with the modular inverse of the dividend. It's what compilers will do with divisions by compile-time constants, and you can do it at runtime too, with a bit of work. – EOF Jul 23 '15 at 13:59
  • @Youka "_The operation duration shouldn't be influenced by the parameters_" I would say they do, at least in a general case. E.g. one of Intel's guides says: "_The throughput of “DIV/IDIV r64” varies with the number of significant digits in the input RDX:RAX_". I'm not saying that it is important in your case though. – AlexD Jul 23 '15 at 21:44

1 Answer


Dividing all elements of a vector by the same scalar can be done with integer multiply and shift. libdivide (C/C++, zlib license) provides some inline functions to do this for scalars (e.g. int), and for dividing vectors by scalars. Also see SSE integer division? (as you mention in your question) for a similar technique giving approximate results. It's more efficient if the same scalar will be applied to lots of vectors. libdivide doesn't say anything about the results being inexact, but I haven't investigated.
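
A minimal scalar sketch of that multiply-and-shift idea (hypothetical helper names, not libdivide's actual API): precompute a fixed-point reciprocal m = floor(2^32 / d) + 1 once per runtime divisor, after which each division is one multiply and one shift. This "round-up" reciprocal gives the exact quotient for every 16-bit dividend and any 16-bit divisor >= 1.

```c
#include <stdint.h>

/* Precompute the fixed-point reciprocal once per runtime divisor d.
 * m = floor(2^32 / d) + 1 can need 33 bits, so store it in a uint64_t. */
static inline uint64_t recip32(uint16_t d) {
    return (UINT64_C(1) << 32) / d + 1;
}

/* Each division is now one 64-bit multiply and one shift, no div. */
static inline uint16_t div_by_recip(uint16_t x, uint64_t m) {
    return (uint16_t)((x * m) >> 32);
}
```

As noted above, this only pays off when the same divisor is reused across many dividends, since computing `m` itself costs a division.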

re: your code: You have to be careful about checking what the compiler actually produces, when giving it a trivial loop like that. e.g. is it actually loading/storing back to RAM every iteration? Or is it keeping variables live in registers, and only storing at the end?

Your benchmark is skewed in favour of the integer-division loop, because the vector divider isn't kept 100% occupied in the vector loop, but the integer divider is kept 100% occupied in the int loop. (These paragraphs were added after the discussion in comments. The previous answer didn't explain as much about keeping the dividers fed, and dependency chains.)

You only have a single dependency chain in your vector loop, so the vector divider sits idle for several cycles every iteration after producing the 2nd result, while the chain of convert fp->si, pack, unpack, convert si->fp happens. You've set things up so your throughput is limited by the length of the entire loop-carried dependency chain, rather than the throughput of the FP dividers. If the data each iteration were independent (or there were at least several independent values, like the 8 array elements you have for the int loop), then the unpack/convert and convert/pack of one set of values would overlap with the divps execution time for another vector. The vector divider is only partially pipelined, but everything else is fully pipelined.
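
To illustrate with an assumed example (not the questioner's benchmark): when the elements are independent, two divps chains can be in flight per iteration, so the loads/stores and the divide of one vector overlap with those of another instead of serializing.

```c
#include <emmintrin.h>  // SSE2
#include <stddef.h>

/* Divide n independent floats by the same scalar. Iteration k+1 never
 * needs iteration k's result, so the out-of-order core can keep the
 * partially-pipelined divider fed; two divps per body add even more
 * independent work. */
static void div_independent(float *a, size_t n, float d) {
    __m128 vd = _mm_set1_ps(d);
    size_t k = 0;
    for (; k + 8 <= n; k += 8) {  /* two independent dependency chains */
        __m128 lo = _mm_div_ps(_mm_loadu_ps(a + k),     vd);
        __m128 hi = _mm_div_ps(_mm_loadu_ps(a + k + 4), vd);
        _mm_storeu_ps(a + k,     lo);
        _mm_storeu_ps(a + k + 4, hi);
    }
    for (; k < n; ++k)            /* scalar tail */
        a[k] /= d;
}
```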

This is the difference between throughput and latency, and why it matters for a pipelined out-of-order execution CPU.

Other stuff in your code:

You have __m128 xmm1 = _mm_set1_ps(i); in the inner loop. _set1 with an arg that isn't a compile-time constant is usually at least 2 instructions: movd and pshufd. And in this case, an int-to-float conversion, too. Keeping a float-vector version of your loop counter, which you increment by adding a vector of 1.0, would be better. (Although this probably isn't throwing off your speed test any further, because this excess computation can overlap with other stuff.)
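
A sketch of that idea (a hypothetical reduction loop, just to show the pattern): convert and broadcast the counter once before the loop, then update the float vector with a single subps per iteration instead of re-running movd / convert / pshufd inside it.

```c
#include <emmintrin.h>  // SSE2

/* Sums 4 * (1/i + 1/(i-1) + ... + 1/1) lane-wise; the point is the
 * counter handling, not the math. */
static float sum_of_reciprocals(unsigned i) {
    __m128 vdiv = _mm_set1_ps((float)i);  /* int->float + broadcast, once */
    __m128 vone = _mm_set1_ps(1.0f);
    __m128 acc  = _mm_setzero_ps();
    for (; i; --i) {
        acc  = _mm_add_ps(acc, _mm_div_ps(vone, vdiv));
        vdiv = _mm_sub_ps(vdiv, vone);    /* counter stays in the FP domain */
    }
    float out[4];
    _mm_storeu_ps(out, acc);
    return out[0];                        /* all four lanes are identical */
}
```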

Unpacking with zero works fine. SSE4.1 __m128i _mm_cvtepi16_epi32 (__m128i a) is another way. pmovsxwd is the same speed, but doesn't need a zeroed register.

If you're going to convert to FP for divide, have you considered just keeping your data as FP for a while? Depends on your algorithm how you need rounding to happen.

performance on recent Intel CPUs

divps (packed single float) is 10-13 cycle latency, with a throughput of one per 7 cycles, on recent Intel designs. div / idiv r16 ((unsigned) integer divide in GP reg) is 23-26 cycle latency, with one per 9 or 8 cycle throughput. div is 11 uops, so it even gets in the way of other things issuing / executing for some of the time it's going through the pipeline. (divps is a single uop.) So, Intel CPUs are not really designed to be fast at integer division, but make an effort for FP division.

So just for the division alone, a single integer division is slower than a vector FP division. You're going to come out ahead even with the conversion to/from float, and the unpack/pack.

If you can do the other integer ops in vector regs, that would be ideal. Otherwise you have to get the integers into / out of vector regs. If the ints are in RAM, a vector load is fine. If you're generating them one at a time, PINSRW is an option, but it's possible that just storing to memory to set up for a vector load would be a faster way to load a full vector. Similar for getting the data back out, with PEXTRW or by storing to RAM. If you want the values in GP registers, skip the pack after converting back to int, and just MOVD / PEXTRD from whichever of the two vector regs your value is in. insert/extract instructions take two uops on Intel, which means they take up two "slots", compared to most instructions taking only one fused-domain uop.
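
A small illustration of those transfer instructions (SSE2 only; the helper is hypothetical): movd moves a GP integer into an xmm register, and pextrw pulls a 16-bit lane back out.

```c
#include <emmintrin.h>  // SSE2

static int roundtrip_lane(unsigned short v) {
    __m128i x = _mm_cvtsi32_si128(v);  /* movd: GP register -> xmm lane 0   */
    x = _mm_shufflelo_epi16(x, 0);     /* pshuflw: broadcast word 0 to 0..3 */
    return _mm_extract_epi16(x, 3);    /* pextrw: word 3 -> GP register,
                                          zero-extended to int              */
}
```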

Your timing results, showing that the scalar code barely improves with compiler optimizations, come about because the CPU can overlap the verbose non-optimized load/store instructions for other elements while the divide unit is the bottleneck. The vector loop, on the other hand, has only one or two dependency chains, with every iteration dependent on the previous one, so extra instructions that add latency can't be overlapped with anything. Testing with -O0 is pretty much never useful.

Peter Cordes
  • `_mm_cvtepi16_epi32` needs SSE4.1 but I want to support older processors too, so it's not an option. In my case, I can't hold the data in FP format because I get them as pixels (_unsigned char_) and process them in a modular function. The trick with multiply and shift is good with constants, but my values are variable. Your second sentence was very useful, because indeed with `-O3` `x` gets converted to FP just once and stays in `xmm` registers until the end, and the loop counter becomes floating point too - the expensive repeated conversions fell away. – Youka Jul 23 '15 at 21:48
  • Seems I need better test code for stating performance results. I thought `-O0` gives a good general hint. But considering that with `-O3` **div_sse** is just a bit faster than **div_x86**, which does without all these conversions... – Youka Jul 23 '15 at 21:54
  • `div_sse` still converts from int to float and back, doesn't it? I could see a compiler skipping the store/load in the loop, but not the convert. Also note that you're testing division in a loop where that's ALL that can happen. In practice, there will be other stuff that can overlap with the dependency chain of the div, right? – Peter Cordes Jul 24 '15 at 01:05
  • I googled and found an integer division of vector / scalar, even when the scalar isn't a compile-time constant. I didn't check the details, i.e. whether it's an approximation. I updated my answer (new first paragraph.) – Peter Cordes Jul 24 '15 at 01:10
  • The link you mention is the same one my question is based on (see its first words). This method doesn't work for me because it's just an approximation, and calculating the magic numbers is expensive, so it's only worthwhile with massive reuse of the same divisor. – Youka Jul 24 '15 at 01:29
  • Oops, sorry. I'll leave it in my answer, for the benefit of other people that get here from google. Maybe their problems will benefit from it. – Peter Cordes Jul 24 '15 at 01:35
  • I checked the [assembly](http://pastebin.com/Z4YZTdLA) at `-O3` and it's quite similar to the C code; it just converted the **while** loop to **do...while** and reuses the register with zeros. I really thought the float conversion happens just once, but I was wrong. It really seems right that the detour through floating-point division is faster... just how much `-O0` got wrong o_O – Youka Jul 24 '15 at 01:40
  • `-O0` probably stored temporaries to RAM and back at every step. – Peter Cordes Jul 24 '15 at 02:02
  • I thought of a better way to explain what I was trying to say: If you need to divide many independent values (more than 8), vectors are absolutely the way to go. All the unpack / convert / re-convert / pack instructions are fully pipelined (one result per cycle). They should be able to keep the non-pipelined FP divider occupied, if they don't have to wait for the output of the previous FP divide before the next cycle of convert/pack / unpack/convert can execute. This is the diff between throughput and latency. The FP divider produces 4 results faster than the integer divider produces 1. – Peter Cordes Jul 24 '15 at 02:09
  • The SSE C functions of GCC are just very simple implemented and need a high optimization level to become what they intended to be. `-O0` produced a terrible mess with much memory movement and mixed instructions. It's all as you wrote, Peter: the conversion and packing has no weight compared to the slow speed of `div`. Thanks for your detailed explanations! – Youka Jul 24 '15 at 02:23
  • It's worth pointing out (maybe you already did) that if the divisors are constants known at compile time, the compiler will already convert the operations to a multiplication and a shift (look at the assembly). It probably won't do this for the iterators of a loop, though, even if the range is known at compile time, but this could be done at run-time by hand. I have done something like that before. – Z boson Jul 24 '15 at 13:15
  • I improved my answer some, to incorporate some stuff I explained in comments – Peter Cordes Jul 24 '15 at 19:44