
Why is the cmath library so slow in terms of rounding (round, ceil, floor, trunc)?

We are talking about a factor of 10 compared to SSE (roundsd, cvtsd2si) or the good old FPU (FIST(P)), the latter being a bit slower (20-25%) but getting closer with rising clock frequency.

I've read an article by L de Soras, and his description is quite clear. The immediate operand of roundsd/roundpd allows selecting any of the possible rounding schemes. Checking the disassembly of round, I could not find any LDMXCSR instruction, just CVTTSS2SI (scalar conversion to int with truncation).

So why does such frequently needed functionality take ten times as long?
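
A claim like "factor of 10" is easier to discuss next to the measurement that produced it. The following is only a minimal timing sketch, not the harness used for the numbers above: `make_input`, `time_ns` and `print_comparison` are hypothetical names, and the results depend entirely on compiler, optimization flags and CPU.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical input generator: n doubles starting at -1000 in steps of 0.37.
std::vector<double> make_input(std::size_t n) {
    std::vector<double> v(n);
    for (std::size_t i = 0; i < n; ++i) {
        v[i] = 0.37 * static_cast<double>(i) - 1000.0;
    }
    return v;
}

// Applies f to every element and returns the elapsed nanoseconds.
// The sum is written to `sink` so the loop cannot be optimized away entirely.
template <typename F>
long long time_ns(const std::vector<double>& data, F f, double& sink) {
    const auto start = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (double x : data) {
        sum += f(x);
    }
    const auto stop = std::chrono::steady_clock::now();
    sink = sum;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// Prints one line per candidate function.
void print_comparison(std::size_t n) {
    const std::vector<double> data = make_input(n);
    double sink = 0.0;
    const long long t_round = time_ns(data, [](double x) { return std::round(x); }, sink);
    std::printf("std::round     %lld ns (sum %.0f)\n", t_round, sink);
    const long long t_near = time_ns(data, [](double x) { return std::nearbyint(x); }, sink);
    std::printf("std::nearbyint %lld ns (sum %.0f)\n", t_near, sink);
    const long long t_trunc = time_ns(data, [](double x) { return std::trunc(x); }, sink);
    std::printf("std::trunc     %lld ns (sum %.0f)\n", t_trunc, sink);
}
```

Calling `print_comparison(10000000)` from `main` prints one timing per function. Repeating the run several times and comparing builds with and without SSE4.1 enabled would be needed before drawing conclusions.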

Michael Petch
Lois
  • 1
    One slowdown candidate: "The round functions round their argument to the nearest integer value in floating-point format, rounding halfway cases away from zero, regardless of the current rounding direction." implies `round()` must save away rounding mode, change to halfway, round, then restore round mode. Perhaps all in an atomic fashion. Compare to `rint(), nearbyint()`, which use current rounding mode. – chux - Reinstate Monica May 05 '20 at 17:25
  • 2
    `roundsd` is SSE4.1 so a library can't use it without doing runtime CPU dispatching. For `round()` specifically, `roundsd` doesn't implement the rounding mode that `round()` specifies. `round()` is often not the rounding function you're looking for: [Merit of inline-ASM rounding via putting float into int variable](https://stackoverflow.com/a/37624488) / [round() for float in C++](https://stackoverflow.com/a/47347224) – Peter Cordes May 05 '20 at 17:40
  • 3
    *"why the library"* - not everyone uses your version of your compiler (assuming you're talking `std::round` etc). These sorts of questions should mention a *specific* library implementation. – Tony Delroy May 05 '20 at 17:44
  • Thanks, @Tony. It is just with MSVC. – Lois May 05 '20 at 18:29
  • This is an FPU lround(): `FLDZ FLD QWORD PTR [RCX] FLD ST FABS FLD ST FISTP QWORD PTR tmp FILD QWORD PTR tmp MOV RDX, QWORD PTR tmp FSUBP FLD QWORD PTR halfway ; 0.5 FCOMPP FSTSW AX SAHF JNE NotHalfway INC RDX NotHalfway: FCOMPP FSTSW AX SAHF JNB Positive NEG RDX Positive: MOV RAX, RDX` It's 4-5 times faster than lround, but also c. 3 times slower than SSE4 rounding. What I describe here is an amateur hobby effort; what is the real deal with those standard libraries? I'll go on with round() now. – Lois May 06 '20 at 01:31
  • As expected, a substitute for `round()` such as `void round(double * in, double * out);` is much faster than a routine returning an int: twice as fast. The FPU is only 30% slower than SSE calls and 5 times faster than round(). `FLD QWORD PTR [RCX] FLD ST FABS FLD ST FRNDINT FLD ST FXCH ST(2) FSUBRP FLD QWORD PTR halfway ; 0.5 FCOMPP FSTSW AX SAHF JNE NotHalfway FLD1 FADDP ST(1),ST NotHalfway: FXCH FLDZ FCOMPP FSTSW AX SAHF JB Positive FCHS Positive: FSTP QWORD PTR [RDX] RET` Please keep in mind that the library uses SSE routines on my boxes. – Lois May 06 '20 at 03:38
  • To finish my investigations: there is no "current" rounding mode. The machinery (FPU, SSE...) works independently. SSE is fastest with truncation mode selected by the instruction itself: `MOVAPD XMM1,XMM0 ADDSD XMM0, half ROUNDSD XMM0,XMM0,RndToZ COMISD XMM1,zero JA Positive SUBSD XMM0,one Positive:` The FPU has no independent rounding control and has to mask/restore the rounding mode. SSE : FPU : cmath (with SSE instructions) show relative run times of 1 : 1.5-2 : 5-10. The FPU without control-word protection, using FIST/FILD in place of FRNDINT, is just slightly faster than SSE and runs anywhere. Casts are expensive. Even `trunc()` is 40% faster: `double y = trunc(x+0.5); return (x<0.0?y-1:y);` So, why so slow? – Lois May 07 '20 at 21:22
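
The distinction drawn in the comments by chux and Peter Cordes, that `round()` ignores the current rounding direction and breaks ties away from zero while `nearbyint()` follows the current mode, can be observed directly. This is a minimal sketch, assuming the platform defines `FE_TOWARDZERO`, `FE_UPWARD` and `FE_TONEAREST`; `call_in_mode` is a hypothetical helper, and the `volatile` temporaries are there only to keep the compiler from folding or moving the call across the mode switch.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Hypothetical helper: evaluates f(x) while `mode` is the active rounding
// direction, then restores whatever mode was active before.
template <typename F>
double call_in_mode(int mode, double x, F f) {
    const int saved = std::fegetround();
    const int rc = std::fesetround(mode);
    assert(rc == 0);  // the requested mode must be supported
    (void)rc;
    volatile double in = x;       // runtime read, prevents constant folding
    volatile double out = f(in);  // store completes before the mode is restored
    std::fesetround(saved);
    return out;
}

// Follows the active rounding direction.
double nearbyint_in_mode(int mode, double x) {
    return call_in_mode(mode, x, [](double v) { return std::nearbyint(v); });
}

// Specified to round halfway cases away from zero in every mode.
double round_in_mode(int mode, double x) {
    return call_in_mode(mode, x, [](double v) { return std::round(v); });
}
```

With these helpers, `nearbyint_in_mode(FE_TOWARDZERO, 2.5)` yields 2 while `round_in_mode(FE_TOWARDZERO, 2.5)` yields 3, and under `FE_TONEAREST` the tie 2.5 goes to 2 for `nearbyint` (ties to even) but to 3 for `round`. The sketch shows only the observable behaviour; it says nothing about how a particular library implements `round()` internally.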

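The `trunc()`-based expression in the last comment is worth checking against `std::round` before timing it. As quoted, it returns -1 for inputs strictly between -0.5 and 0, where `round()` returns (negative) zero. The sketch below keeps the quoted expression as `quoted_round` and adds a hypothetical variant, `trunc_round`, that truncates first and then inspects the discarded fraction; both names are illustrative, and no claim is made about relative speed.

```cpp
#include <cmath>

// The expression exactly as quoted in the comment.
double quoted_round(double x) {
    double y = std::trunc(x + 0.5);
    return (x < 0.0 ? y - 1 : y);
}

// Hypothetical alternative: truncate toward zero, then step one unit away
// from zero when the discarded fraction is at least one half.
double trunc_round(double x) {
    const double t = std::trunc(x);
    if (std::fabs(x - t) >= 0.5) {
        return t + std::copysign(1.0, x);
    }
    return t;  // also covers NaN and infinities, where the comparison is false
}
```

Because `x - t` is computed exactly for finite doubles, `trunc_round` also avoids the case where adding 0.5 first rounds up, such as the largest double below 0.5. An SSE4.1 version could follow the same shape with `ROUNDSD` in truncation mode followed by a compare and a conditional add.
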
0 Answers