Why is the cmath library so slow in terms of rounding (round, ceil, floor, trunc)?
We are talking about a factor of 10 compared to SSE (roundsd, cvtsd2si) or the good old FPU (FIST(P)), the latter being a bit slower (20-25%) but getting closer with rising clock frequency.
I've read an article by L. de Soras, and his description is quite clear. The immediate parameter of rounds(p)d allows selecting any of the rounding modes. Checking the disassembly of round, I could not detect any LDMXCSR instruction, just CVTTSS2SI (scalar conversion to int with truncation).
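To illustrate what I mean by the immediate parameter, here is a minimal sketch using the `_mm_round_sd` intrinsic. It assumes SSE4.1 is enabled at compile time (e.g. `-msse4.1` on GCC/Clang); otherwise it falls back to the cmath calls. The helper names are my own and not part of any library.

```cpp
#include <cmath>
#if defined(__SSE4_1__)
#include <smmintrin.h>  // SSE4.1 intrinsics: _mm_round_sd
#endif

// Hypothetical helpers: each one passes a different immediate to roundsd,
// so no MXCSR change (LDMXCSR) is needed to pick the rounding mode.
inline double floor_roundsd(double x) {
#if defined(__SSE4_1__)
    __m128d v = _mm_set_sd(x);
    return _mm_cvtsd_f64(
        _mm_round_sd(v, v, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC));
#else
    return std::floor(x);  // fallback when SSE4.1 is not enabled
#endif
}

inline double ceil_roundsd(double x) {
#if defined(__SSE4_1__)
    __m128d v = _mm_set_sd(x);
    return _mm_cvtsd_f64(
        _mm_round_sd(v, v, _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC));
#else
    return std::ceil(x);
#endif
}

inline double trunc_roundsd(double x) {
#if defined(__SSE4_1__)
    __m128d v = _mm_set_sd(x);
    return _mm_cvtsd_f64(
        _mm_round_sd(v, v, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC));
#else
    return std::trunc(x);
#endif
}

// Round to nearest, ties to even. Note that this is NOT the same as
// std::round, which rounds halfway cases away from zero.
inline double nearest_even_roundsd(double x) {
#if defined(__SSE4_1__)
    __m128d v = _mm_set_sd(x);
    return _mm_cvtsd_f64(
        _mm_round_sd(v, v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC));
#else
    return std::nearbyint(x);  // ties to even under the default rounding mode
#endif
}
```

For example, `floor_roundsd(-2.5)` gives -3.0 and `trunc_roundsd(-2.5)` gives -2.0. The nearest-even variant maps 2.5 to 2.0, whereas `std::round(2.5)` returns 3.0, so it is not a drop-in replacement for `round`.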
So why does such frequently needed functionality take roughly ten times as long?