36

I wrote some code recently (ISO/ANSI C), and was surprised at the poor performance it achieved. Long story short, it turned out that the culprit was the floor() function. Not only it was slow, but it did not vectorize (with Intel compiler, aka ICL).

Here are some benchmarks for performing floor for all cells in a 2D matrix:

VC:  0.10
ICL: 0.20

Compare that to a simple cast:

VC:  0.04
ICL: 0.04

How can floor() be that much slower than a simple cast?! It does essentially the same thing (apart for negative numbers). 2nd question: Does someone know of a super-fast floor() implementation?

PS: Here is the loop that I was benchmarking:

void Floor(float *matA, int *intA, const int height, const int width, const int width_aligned)
{
    float *rowA=NULL;
    int   *intRowA=NULL;
    int   row, col;

    for(row=0 ; row<height ; ++row){
        rowA = matA + row*width_aligned;
        intRowA = intA + row*width_aligned;
#pragma ivdep
        for(col=0 ; col<width; ++col){
            /*intRowA[col] = floor(rowA[col]);*/
            intRowA[col] = (int)(rowA[col]);
        }
    }
}
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847

8 Answers8

45

A couple of things make floor slower than a cast and prevent vectorization.

The most important one:

floor can modify the global state. If you pass a value that is too huge to be represented as an integer in float format, the errno variable gets set to EDOM. Special handling for NaNs is done as well. All this behavior is for applications that want to detect the overflow case and handle the situation somehow (don't ask me how).

Detecting these problematic conditions is not simple and makes up more than 90% of the execution time of floor. The actual rounding is cheap and could be inlined/vectorized. Also It's a lot of code, so inlining the whole floor-function would make your program run slower.

Some compilers have special compiler flags that allow the compiler to optimize away some of the rarely used c-standard rules. For example GCC can be told that you're not interested in errno at all. To do so pass -fno-math-errno or -ffast-math. ICC and VC may have similar compiler flags.

Btw - You can roll your own floor-function using simple casts. You just have to handle the negative and positive cases differently. That may be a lot faster if you don't need the special handling of overflows and NaNs.

Jason Aller
  • 3,541
  • 28
  • 38
  • 38
Nils Pipenbrinck
  • 83,631
  • 31
  • 151
  • 221
21

If you are going to convert the result of the floor() operation to an int, and if you aren't worried about overflow, then the following code is much faster than (int)floor(x):

inline int int_floor(double x)
{
  int i = (int)x; /* truncate */
  return i - ( i > x ); /* convert trunc to floor */
}
Yu Hao
  • 119,891
  • 44
  • 235
  • 294
dgobbi
  • 491
  • 4
  • 6
  • 2
    you should probably use `static inline` instead of `inline` if you want to put this into a header file - see http://stackoverflow.com/a/10245969/48015 – Christoph Mar 23 '13 at 19:37
13

Branch-less Floor and Ceiling (better utilize the pipiline) no error check

int f(double x)
{
    return (int) x - (x < (int) x); // as dgobbi above, needs less than for floor
}

int c(double x)
{
    return (int) x + (x > (int) x);
}

or using floor

int c(double x)
{
    return -(f(-x));
}
PolarBear2015
  • 745
  • 6
  • 14
4

The actual fastest implementation for a large array on modern x86 CPUs would be

  • change the MXCSR FP rounding mode to round towards -Infinity (aka floor). In C, this should be possible with fenv stuff, or _mm_getcsr / _mm_setcsr.
  • loop over the array doing _mm_cvtps_epi32 on SIMD vectors, converting 4 floats to 32-bit integer using the current rounding mode. (And storing the result vectors to the destination.)

    cvtps2dq xmm0, [rdi] is a single micro-fused uop on any Intel or AMD CPU since K10 or Core 2. (https://agner.org/optimize/) Same for the 256-bit AVX version, with YMM vectors.

  • restore the current rounding mode to the normal IEEE default mode, using the original value of the MXCSR. (round-to-nearest, with even as a tiebreak)

This allows loading + converting + storing 1 SIMD vector of results per clock cycle, just as fast as with truncation. (SSE2 has a special FP->int conversion instruction for truncation, exactly because it's very commonly needed by C compilers. In the bad old days with x87, even (int)x required changing the x87 rounding mode to truncation and then back. cvttps2dq for packed float->int with truncation (note the extra t in the mnemonic). Or for scalar, going from XMM to integer registers, cvttss2si or cvttsd2si for scalar double to scalar integer.

With some loop unrolling and/or good optimization, this should be possible without bottlenecking on the front-end, just 1-per-clock store throughput assuming no cache-miss bottlenecks. (And on Intel before Skylake, also bottlenecked on 1-per-clock packed-conversion throughput.) i.e. 16, 32, or 64 bytes per cycle, using SSE2, AVX, or AVX512.


Without changing the current rounding mode, you need SSE4.1 roundps to round a float to the nearest integer float using your choice of rounding modes. Or you could use one of the tricks shows in other answers that work for floats with small enough magnitude to fit in a signed 32-bit integer, since that's your ultimate destination format anyway.)


(With the right compiler options, like -fno-math-errno, and the right -march or -msse4 options, compilers can inline floor using roundps, or the scalar and/or double-precision equivalent, e.g. roundsd xmm1, xmm0, 1, but this costs 2 uops and has 1 per 2 clock throughput on Haswell for scalar or vectors. Actually, gcc8.2 will inline roundsd for floor even without any fast-math options, as you can see on the Godbolt compiler explorer. But that's with -march=haswell. It's unfortunately not baseline for x86-64, so you need to enable it if your machine supports it.)

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 1
    +1. Sidenote: Somehow icc doesn't seem to know that `vcvtps2dq` depends on the value of the MXCSR control and status register. In [this example](https://gcc.godbolt.org/z/qyC6VB) the order of `x=_mm_cvtps_epi32(y);` and `_MM_SET_ROUNDING_MODE(_MM_ROUND_NEAREST);` has been exchanged by icc. – wim Nov 07 '18 at 14:01
  • 1
    @wim: Yeah I wondered if that would be a problem. I should add something about `#pragma STDC FENV_ACCESS ON`, if that works for any actual compilers. ([Does FENV\_ACCESS pragma exist in C++11 and higher?](https://stackoverflow.com/q/36218367)). And/or try ICC compile options like [`-fp-model strict`](https://software.intel.com/en-us/node/522979) to tell it that you modify the FP rounding mode. (The ICC default is `-fp-model fast=1`.) – Peter Cordes Nov 07 '18 at 14:42
2

An actually branchless version that requires a single conversion between floating point and integer domains would shift the value x to all positive or all negative range, then cast/truncate and shift it back.

long fast_floor(double x)
{
    const unsigned long offset = ~(ULONG_MAX >> 1);
    return (long)((unsigned long)(x + offset) - offset);
}

long fast_ceil(double x) {
    const unsigned long offset = ~(ULONG_MAX >> 1);
    return (long)((unsigned long)(x - offset) + offset );
}

As pointed in the comments, this implementation relies on the temporary value x +- offset not overflowing.

On 64-bit platforms, the original code using int64_t intermediate value will result in three instruction kernel, the same available for int32_t reduced range floor/ceil, where |x| < 0x40000000 --

inline int floor_x64(double x) {
   return (int)((int64_t)(x + 0x80000000UL) - 0x80000000LL);
}
inline int floor_x86_reduced_range(double x) {
   return (int)(x + 0x40000000) - 0x40000000;
}
Aki Suihkonen
  • 19,144
  • 1
  • 36
  • 57
  • does this depend on `long` being wider than `int` for correctness with the full range of `int` results? That's not the case on many 32-bit platforms, and on x86-64 Windows (a LLP64 ABI where int and long are both 32-bit). So maybe you should use `long long`. But still a nice idea. – Peter Cordes Nov 02 '18 at 06:28
  • Yes (that is long int being wider than int), but I think this can be mitigated by casting to unsigned int. – Aki Suihkonen Nov 02 '18 at 06:53
  • `double` -> `unsigned long` is somewhat slow on x86. https://godbolt.org/z/1UqaQw. x86-64 doesn't have an instruction for that until AVX512, only for `double` -> signed integer. On 32-bit x86 where `unsigned long` is a 32-bit type, x87 `fistp` can do FP -> 64-bit signed integer, and you can use the low half of that as `unsigned int`. But truncation requires either SSE3 `fisttp` or changing the rounding mode. SSE2 can't do truncation to a 32-bit unsigned integer or 64-bit signed integer either. The other answers are probably more efficient. – Peter Cordes Nov 02 '18 at 07:21
2

Yes, floor() is extremely slow on all platforms since it has to implement a lot of behaviour from the IEEE fp spec. You can't really use it in inner loops.

I sometimes use a macro to approximate floor():

#define PSEUDO_FLOOR( V ) ((V) >= 0 ? (int)(V) : (int)((V) - 1))

It does not behave exactly as floor(): for example, floor(-1) == -1 but PSEUDO_FLOOR(-1) == -2, but it's close enough for most uses.

Yu Hao
  • 119,891
  • 44
  • 235
  • 294
jrgc
  • 150
  • 1
-1
  1. They do not do the same thing. floor() is a function. Therefore, using it incurs a function call, allocating a stack frame, copying of parameters and retrieving the result. Casting is not a function call, so it uses faster mechanisms (I believe that it may use registers to process the values).
  2. Probably floor() is already optimized.
  3. Can you squeeze more performance out of your algorithm? Maybe switching rows and columns may help? Can you cache common values? Are all your compiler's optimizations on? Can you switch an operating system? a compiler? Jon Bentley's Programming Pearls has a great review of possible optimizations.
Yuval F
  • 20,565
  • 5
  • 44
  • 69
  • 2
    Never assume that the standard libraries are optimized. They are almost always extremely slow. You can sometimes get major speed gains by using your own custom code. – Joe Feb 26 '15 at 19:21
  • floor() is a function, but it's commonly used enough for compilers to treat it as a builtin, like memcpy or sqrt and inline it if they want to. e.g. GCC `-O2` for x86-64 inlines it even when it takes multiple instructions, without SSE4.1 for `roundss` / `roundps` (https://godbolt.org/z/5jdTvcx7x). But yeah, without SSE4.1 it's a lot slower than fp->int with truncation, which has faster HW support. – Peter Cordes Mar 24 '21 at 02:29
-3

Fast double round

double round(double x)
{
    return double((x>=0.5)?(int(x)+1):int(x));
}

Terminal log

test custom_1 8.3837

test native_1 18.4989

test custom_2 8.36333

test native_2 18.5001

test custom_3 8.37316

test native_3 18.5012


Test

void test(char* name, double (*f)(double))
{
    int it = std::numeric_limits<int>::max();

    clock_t begin = clock();

    for(int i=0; i<it; i++)
    {
        f(double(i)/1000.0);
    }
    clock_t end = clock();

    cout << "test " << name << " " << double(end - begin) / CLOCKS_PER_SEC << endl;

}

int main(int argc, char **argv)
{

    test("custom_1",round);
    test("native_1",std::round);
    test("custom_2",round);
    test("native_2",std::round);
    test("custom_3",round);
    test("native_3",std::round);
    return 0;
}

Result

Type casting and using your brain is ~3 times faster than using native functions.

Jan Cajthaml
  • 403
  • 3
  • 13
  • Your `round()` function doesn't work. You need to either use a floating-point modulo to check whether the fractional part is greater than 0.5, or you could use the old `(int) (double_value + 0.5)` trick to perform rounding. – Jason R Feb 03 '14 at 14:15
  • For FP->int with round-to-nearest, see https://stackoverflow.com/a/47347224/224132. – Peter Cordes Jan 09 '18 at 04:21