I often see code that converts ints to doubles, back to ints, and to doubles once again (sometimes for good reasons, sometimes not), and it just occurred to me that this seems like a "hidden" cost in my program. Let's assume the conversion method is truncation.
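
For example, a contrived sketch of the round-tripping pattern (the function and what it computes are made up purely for illustration):

// Contrived illustration: one int -> double and one double -> int per call.
// The cast back truncates toward zero, so round_trip(-5) yields -2
// (from -2.5), not -3.
int round_trip(int x)
{
    double scaled = x * 0.5;          // int converted to double
    return static_cast<int>(scaled);  // double -> int by truncation
}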

So, just how expensive is it? I'm sure it varies depending on hardware, so let's assume a newish Intel processor (Haswell, if you like, though I'll take anything). Some metrics I'd be interested in (though a good answer needn't have all of them):

  1. # of generated instructions
  2. # of cycles used
  3. Relative cost compared to basic arithmetic operations

I would also assume that the impact of a slow conversion would be felt most acutely in power usage rather than execution speed, given the difference between how many computations we can perform each second and how much data can actually arrive at the CPU each second.

– Mark
  • This isn't very meaningful to discuss without a specific system in mind. For starters, some systems don't even have an FPU. – Lundin Apr 29 '16 at 06:20
  • [floating point conversions and performance](http://stackoverflow.com/q/12920700/995714), [How to speed up floating-point to integer number conversion?](http://stackoverflow.com/q/429632/995714), [What is the fastest way to convert float to int on x86](http://stackoverflow.com/q/78619/995714), [Does typecasting consume extra CPU cycles](http://stackoverflow.com/q/16539412/995714) – phuclv Apr 29 '16 at 06:46
  • @Lundin: What processors does C++/cli support hmm? – Joshua Sep 26 '18 at 14:37
  • @Joshua You edit in a tag which wasn't there at the time that comment was made, several years ago, then post a snarky comment about it? Okay... Clarifications by the OP need to be in the question. We shouldn't need to read answers to understand the question. – Lundin Sep 27 '18 at 06:23
  • @Lundin: I added micro-optimization. C++/cli was already there. – Joshua Sep 27 '18 at 13:31
  • @Joshua Not according to the edit history: https://stackoverflow.com/posts/28668348/revisions – Lundin Sep 27 '18 at 13:35
  • @Lundin: That's gotta be a merged edit. There's no scenario in which I'd add a c++-11 tag. – Joshua Sep 27 '18 at 14:05
  • @Joshua You didn't. You incorrectly removed the c++ tag, added the c++-cli tag and added the micro-optimization tag. The C++11 tag was there from the OP. – Lundin Sep 27 '18 at 14:07

2 Answers

Here's what I could dig up myself, for x86-64 doing FP math with SSE2 (not legacy x87 where changing the rounding mode for C++'s truncation semantics was expensive):

  1. When I take a look at the generated assembly from clang and gcc, the cast from double to int boils down to one instruction: cvttsd2si (convert with truncation, scalar double to signed integer).

    From int to double it's cvtsi2sd. (cvtsi2sdl is the AT&T syntax for cvtsi2sd with 32-bit integer operand-size.)

    With auto-vectorization of int-to-double, we get cvtdq2pd (packed int32 -> packed double).

    So I suppose the question becomes: what is the cost of those? (A small compilable example showing both scalar instructions appears after this list.)

  2. These instructions each cost approximately the same as an FP addsd plus a movq xmm, r64 (fp <- integer) or movq r64, xmm (integer <- fp), because they decode to 2 uops which run on the same ports, on mainstream Intel CPUs (Sandybridge/Haswell/Skylake).

    The Intel® 64 and IA-32 Architectures Optimization Reference Manual says that the latency of the cvttsd2si instruction is 5 cycles (see Appendix C-16). cvtsi2sd, depending on your architecture, has latency varying from 1 on Silvermont to more like 7-16 on several other architectures.

    Agner Fog's instruction tables have more accurate/sensible numbers, like 5-cycle latency for cvtsi2sd on Silvermont (with one per 2 clocks throughput), or 4c latency on Haswell, with one per clock throughput (if you avoid the false dependency created by merging with the old upper half of the destination register, e.g. by zeroing it first, like gcc usually does with pxor xmm0,xmm0).

    SIMD packed-float to packed-int is great; it's a single uop. But conversions involving packed-double additionally require a shuffle to change the element size. SIMD float/double<->int64_t doesn't exist until AVX512, but can be done manually with limited range.

    Intel's manual defines latency as: "The number of clock cycles that are required for the execution core to complete the execution of all of the μops that form an instruction." But a more useful definition is the number of clocks from an input being ready until the output becomes ready. Throughput is more important than latency if there's enough parallelism for out-of-order execution to do its job: What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?.

  3. The same Intel manual says that an integer add instruction has 1 cycle of latency and an integer imul has 3 (Appendix C-27). FP addsd and mulsd run at 2 per clock throughput, with 4 cycle latency, on Skylake. Same for the SIMD versions, and for FMA, with 128 or 256-bit vectors.

    On Haswell, addsd / addpd is only 1 per clock throughput, but 3 cycle latency thanks to a dedicated FP-add unit.
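
As a concrete illustration of point 1, here's a minimal pair of functions and (approximately) what gcc/clang emit for them at -O2 on x86-64; exact registers, and whether the dependency-breaking pxor appears, vary by compiler and version:

double int_to_double(int x) { return x; }
// typical x86-64 -O2 output:
//     pxor      xmm0, xmm0      ; gcc: break the false dependency on xmm0
//     cvtsi2sd  xmm0, edi       ; int -> double
//     ret

int double_to_int(double d) { return static_cast<int>(d); }
//     cvttsd2si eax, xmm0       ; double -> int, truncating toward zero
//     ret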

So, the answer boils down to:

1) It's hardware optimized, and the compiler leverages the hardware machinery.

2) It costs only a bit more than a multiply does in terms of the # of cycles in one direction, and a highly variable amount in the other (depending on your architecture). Its cost is neither free nor absurd, but probably warrants more attention given how easy it is to write code that incurs the cost in a non-obvious way. (A rough way to measure it yourself is sketched below.)
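
A rough way to see the latency side of this on your own machine (a sketch, not a rigorous benchmark: the loop-carried chain also contains a multiply and an add, so subtract a baseline loop without the conversions if you want conversion-only numbers; compile with -O2):

#include <chrono>
#include <cstdio>

int main()
{
    const long iters = 100000000;
    int v = 1;
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i) {
        double d = v;                         // int -> double
        v = static_cast<int>(d * 0.999) + 1;  // double -> int (truncating)
    }
    auto t1 = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> dt = t1 - t0;
    // Each iteration's input depends on the previous iteration's output,
    // so this measures the latency of the chain, not its throughput.
    std::printf("%.2f ns per iteration (v = %d)\n", dt.count() / iters, v);
}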

– Mark

  • For clarity: Agner Fog's awesome manual "Instruction Tables" reports that on Haswell, the _integer_ register-register `add` has latency = 1, reciprocal throughput = 0.25; integer register-register `mul/imul` 64x64-bit has lat = 3, 1/thru = 1; floating-point register-register `addss/ps/sd/pd` has lat = 3, 1/thru = 1; floating-point register-register `mulss/ps/sd/pd` has lat = 5, 1/thru = 0.5; and the various `cvt*` conversions between 32-bit and 64-bit integers and floating-point values for the most part have lat = 3-4 and 1/thru = 1. – Iwillnotexist Idonotexist Feb 23 '15 at 07:27
  • @IwillnotexistIdonotexist - Thorough :). Much obliged! – Mark Feb 23 '15 at 07:29
  • Yes, float/double <-> int conversion costs about as much as FP addition, and in fact runs on the same execution units. (https://agner.org/optimize/). SIMD float<->int is efficient, but SIMD double<->int needs a shuffle uop as well, and SIMD float/double<->int64_t doesn't exist until AVX512. – Peter Cordes Sep 26 '18 at 14:52
  • Ended up making a major edit to correct this answer. There were some serious gaps in the picture you came up with based on digging into manuals back when you wrote this. I maybe should have just written my own answer, so let me know if you want to roll this back; I can put what I wrote in a separate answer. – Peter Cordes Sep 26 '18 at 15:24
  • Thanks to everyone involved for this fantastic answer. This has been super useful! – iestyn Oct 20 '21 at 22:24

Of course this kind of question depends on the exact hardware and even on the mode.

On x86, on my i7 used in 32-bit mode with default options (gcc -m32 -O3), the conversion from int to double is quite fast; the opposite direction is much slower, because the C standard mandates an absurd rule (truncation of the decimals).

This way of rounding is bad both for math and for hardware: it requires the FPU to switch to this special rounding mode, perform the truncation, and then switch back to a sane way of rounding.

If you need speed, doing the float->int conversion with the plain fistp instruction is faster, and also much better for computation results (it rounds to nearest instead of truncating), but it requires some inline assembly:

inline int my_int(double x)
{
  int r;
  asm ("fldl %1\n"    /* push x onto the x87 stack */
       "fistpl %0\n"  /* store as int32 using the current rounding mode
                         (round-to-nearest by default), then pop */
       :"=m"(r)
       :"m"(x));
  return r;
}

This is more than 6 times faster than the naive x = (int)y; conversion (and doesn't have a bias toward 0).

The very same processor, when used in 64-bit mode, however, has no speed problem, and using the fistp code actually makes it run somewhat slower (in 64-bit mode the compiler does scalar FP math with SSE2, whose cvttsd2si instruction truncates directly, with no rounding-mode switch).

Apparently the hardware guys gave up and implemented the bad rounding algorithm directly in hardware (so badly rounding code can now run fast).
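
If you'd rather avoid inline asm, the comments below suggest std::lrint / std::rint; a minimal sketch (with optimization, and often -fno-math-errno, compilers can inline it to a single cvtsd2si, which rounds to nearest instead of truncating):

#include <cmath>

// Round-to-nearest double -> int without inline asm. Unlike a plain cast,
// this does not truncate: std::lrint uses the current FP rounding mode,
// which defaults to round-to-nearest-even.
inline int my_round(double x)
{
    return static_cast<int>(std::lrint(x));
}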

– 6502
  • On what platform did you come to the conclusion that it is 6x faster? A year or two back, I got involved with a similar question, where someone asked why the code in your answer was better, and my immediate response was "how do you know it's better", and it very much turns out that if you have an SSE-capable processor (so for x86, something introduced since about 2000), then it's faster to NOT use this trick, but just let the compiler generate the "right" instruction. I'll see if I can find my answer, but have to go to work now, so will do later. – Mats Petersson Feb 23 '15 at 09:43
  • @MatsPetersson: this was tested on an i7 but compiling `-m32`; the problem is not present (actually it's faster to use the naive conversion) when compiling 64-bit code. – 6502 Feb 23 '15 at 11:52
  • What if you use `-m32 -msse2`? – Mats Petersson Feb 23 '15 at 21:55
  • @MatsPetersson: You'd need to use `-m32 -msse2 -mfpmath=sse` to actually *use* SSE2 for scalar FP math. Or `-m32 -msse3` for [`fisttp` (truncating conversion without changing the rounding mode)](http://felixcloutier.com/x86/FISTTP.html). gcc targeting x86-64 defaults to `-mfpmath=sse`, of course, but 32-bit still defaults to x87, mostly only using SSE when auto-vectorizing, IIRC. – Peter Cordes Sep 26 '18 at 14:44
  • Or convert to integer with round-to-nearest, if you can find a function that your compiler fully inlines with `fistp` or `cvtsd2si`, like `(int)rint(x)` or `lrint(x)`, maybe with `-fno-math-errno`. [round() for float in C++](https://stackoverflow.com/a/47347224) – Peter Cordes Sep 26 '18 at 14:44
  • This inline asm is very inefficient. You don't need the input to be in memory; you can ask for it to be on the top of the x87 stack, and use appropriate constraints to tell the compiler you pop it. And truncation isn't "bad" per se, it's just different (and arguably was a poor choice for C's default behaviour, but that bridge was crossed *long* ago). Give it a name like my_round to make it clear that it rounds instead of truncates. See also [round() for float in C++](https://stackoverflow.com/posts/comments/81806400). – Peter Cordes Sep 26 '18 at 14:50