Super-fast rounding function (PBC)

Question

I really need very fast round() function in C - it is necessary for Monte Carlo particle modeling: at every step you need to wrap coordinates into periodic box to compute volume interactions : for example

for(int i=0; i < 3; i++)
{
    coor.x[i] = a.XReal.x[i]-b.XReal.x[i];
    coor.x[i] = coor.x[i] - SIZE[i]*round(coor.x[i]/SIZE[i]); //PBC
}

I've come across some asm hacking with it, but i don't understand asm at all:) something like this

inline int float2int2(float flt)
{
  int intgr;

  __asm__ __volatile__ ("fld %1; fistp %0;" : "=m" (intgr) : "m" (flt));

  return intgr;
}

With fixed boundaries, without round() it works faster. So, maybe someone knows a better way?..

The asm you've found is all but useless: It's architecture specific (x86) and uses the old x87 fpu (slow, horrible). If you're on x86-64, you're much better of with sse-instructions, which any sane compiler will automatically emit. — EOF, Apr 08 '16 at 11:12
thanks! I forgot to mention - nearest-int rounding is needed. like 0.8 -> 1, -2.7 -> -3 — Rsmmnt, Apr 08 '16 at 11:14
Have you tried how the standard function (`remainder()`) that does this compares performance-wise? — EOF, Apr 08 '16 at 11:24
`round()` rounds "to nearest, ties away from zero", the likely faster `rint()` rounds "to nearest, ties to even". If you can tolerate the difference, try `rint()`. Performance wise I would be more concerned about the floating-point division in this code. What are typical values of SIZE[i], and are these compile-time constants? — njuffa, Apr 08 '16 at 17:31
This question has a very useful hack that works on a limited range of inputs: http://stackoverflow.com/questions/17035464/a-fast-method-to-round-a-double-to-a-32-bit-int-explained — tmyklebu, Apr 09 '16 at 15:06

nwellnhof · Answer 1 · 2016-04-08T22:44:55.303

4

First of all, you can get some gains by using the right compiler options. With GCC and a modern Intel CPU for example, you should try:

-march=nehalem -fno-trapping-math

Then the problem with round is that it uses a specific rounding mode, which is slow on most platforms. nearbyint (or rint) should always be faster:

coor.x[i] = coor.x[i] - SIZE[i] * nearbyint(coor.x[i] / SIZE[i])

Have a look a the generated assembly.

You should also think about vectorizing your code.

edited Apr 08 '16 at 22:44

answered Apr 08 '16 at 12:49

nwellnhof

32,319
7
89
113

Wouldn't `-ffast-math` be a better choice? It includes `-funsafe-math-optimizations` as well as a few other options. – David Wohlferd Apr 08 '16 at 21:32
1

@DavidWohlferd `-ffast-math` doesn't make rounding any faster. Actually, it's enough to use `-fno-trapping-math` to make GCC generate `roundsd` instructions. I edited my answer. – nwellnhof Apr 08 '16 at 22:47
1

Also, clang emits `roundsd` for `nearbyint` even without `-fno-trapping-math`. This seems to be correct since `nearbyint` doesn't raise floating point exceptions. – nwellnhof Apr 08 '16 at 22:56

score 2 · Answer 2 · edited May 23 '17 at 12:13

Instead of looking for just fast rounding, ideally you want the whole process of range-reduction into the periodic box to be fast. As @EOF accurately pointed out in a comment, you could use a C99 standard function like remainderf() or fmodf().

coor.x[i] -= SIZE[i]*round(coor.x[i]/SIZE[i]);
// same as
coor.x[i] = remainderf(coor.x[i], SIZE[i]);

fmodf(3) rounds towards zero, remainderf(3) rounds towards nearest.

The remainder() function computes the remainder of dividing x by y. The return value is x-n*y, where n is the value x / y, rounded to the nearest integer. If the absolute value of x-n*y is 0.5, n is chosen to be even.

Compilers / libraries have several different strategies for implementing these. With -ffast-math, gcc 5.3 for x86-64 inlines a remainder(x,y) implementation that transfers the values from SSE registers to x87 registers, and runs FPREM1 (partial remainder) in a loop until it sets a flag indicating that the result is correct. (One execution of FPREM1 can reduce the exponent by at most 63).

clang always emits a call to the library function, either the normal remainder entry point, or __remainder_finite with -ffast-math.

The GNU libm definition uses mostly integer operations, AFAICT from the disassembly and the C source. On a recent Intel CPU with fast hardware divide, it might be slower than your div, round, mul version.

So you have three options:

div, round, mul, sub, with fast rounding (use nearbyint(), it apparently has the least ugly semantics so it can inline to roundsd / roundss most easily). This way can vectorize, and do all three coordinates at once. May need to do it manually, to find something that won't fault for the 4th element. On Intel Haswell with 128b vectors: 5 uops. single-precision: divps(10-13c latency, one per 7c throughput), roundps(2 uops, 6c latency, one per 2c throughput), mulps(5c latency, one per 0.5c throughput), subps(3c latency, one per 1c throughput). Some of these compete with each other for execution ports. Total latency: 27c. Probable throughput, maybe something like one per 7c (totally bottlenecked by divps)
gcc's inlined x87 FPREM1. (probably only needs to run one iteration, so on Haswell: 41 uops, 27c latency, one per 17c throughput, plus some overhead for getting data between xmm and x87 regs. Can't vectorize.
glibc's mostly-integer implementation: no idea, probably worse than either of the other two, on modern x86 CPUs. But, probably significantly higher accuracy than the manual div/round/mul/sub.

Bottom line, if this is a speed issue, you should definitely look into vectorizing with SSE/AVX to do all three coordinates of a point in one vector. Or, a coordinate of four points at once, or whatever is convenient. Ideally you can make use of all 4 (or 8 with AVX) single-precision elements of the vector ALUs. (or 2 / 4 for double-precision).

Even scalar, I think your current code with nearbyint() is going to be the fastest choice, but you can easily go three times faster than that with vectors.

Super-fast rounding function (PBC)

2 Answers2

So you have three options: