5

I need to efficiently add constants to, or multiply constants into, a result of type double in a loop to prevent underflow. For example, with an int, multiplying by a power of 2 is fast because the compiler can use a bit shift. Is there a form of constant that makes double addition and multiplication efficient?

Edit: It seems that not many understand my question; apologies for my sloppiness. I will add some code. If a is an int, this (multiplying by a power of 2) will be more efficient

int a = 1;
for(...)
    for(...)
        a *= somefunction() * 1024;

than when 1024 is replaced with, say, 1023. I am not sure what is best if we want to add to an int, but that is not my interest. I am interested in the case where a is a double. What forms of constants (e.g., powers of 2) can we add to and multiply with a double efficiently? The constant is arbitrary; it just needs to be large enough to prevent underflow.

This is probably not restricted to C and C++ only, but I do not know of a more appropriate tag.

ggg
  • Would probably help to have a code example. BTW your assumption that bit shift on an integer is faster than multiplication by a power of two is incorrect. If you write `x*4` the compiler will convert that to `x<<2` automatically. – jahhaj Aug 04 '12 at 06:53
  • Sounds like you are doing premature optimization; it is often not a good idea. It is better to let the compiler handle that for you. "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" – Donald Knuth. – AndersK Aug 04 '12 at 06:55
  • @jahhaj I know. I was trying to say that multiplying by 2 or 4 is faster than by 3. – ggg Aug 04 '12 at 06:57
  • @AndersK, not sure that such multiplication on reals is actually optimized any further. There are a lot of libc functions that let you split/merge IEEE 754 numbers, and most of their use cases are "you want to multiply this double by this arbitrary power of 2". – Joel Falcou Aug 04 '12 at 07:00
  • @ggg, if you have such constants in your code, modern compilers are usually capable of doing this kind of optimization themselves. Did you check the assembler that your compiler produces, e.g. with `-S`? Did you upgrade your compiler to the most recent version? Did you ensure that you have the optimal optimization options for your platform, e.g. `-O3 -march=native`? Did you benchmark your code? – Jens Gustedt Aug 04 '12 at 07:01
  • Given the level of the question in general, this is almost certainly a case of **premature optimization**. Concentrate on getting the code *correct*. Only when it passes all tests (do write tests that can be reproduced exactly), consider whether it's too slow for the general user. – Cheers and hth. - Alf Aug 04 '12 at 07:19
  • @Cheersandhth.-Alf This will make the code no less readable (only changing a constant), and improves the performance. – ggg Aug 04 '12 at 07:28
  • Assuming the compiler will optimize something is a terrible thing to do. If you want to make sure the compiler will bit shift, then do this in your code. – amb Aug 04 '12 at 07:31
  • @ggg: Multiply by a constant power of two. That will be more efficient for doubles as well. – jahhaj Aug 04 '12 at 08:14
  • @jahhaj what about addition? you are answering the question right? Write this as an answer. – ggg Aug 04 '12 at 08:26
  • @ggg: Ignore previous comment, it was written in haste. I think the true answer depends on your architecture. So you should try different methods and time them. – jahhaj Aug 04 '12 at 09:49
  • @amb - Check this example of what a compiler actually will do, and stop worrying about bit shifts. http://stackoverflow.com/a/11639305/597607 – Bo Persson Aug 04 '12 at 10:20
  • @BoPersson *"A compiler"* does not mean *"every compiler"*. For example compare a compiler that is used for embedded (so the code needs to be extra optimized for speed and memory) to some other one that is used for desktop (where there is plenty of speed and memory). There are **huge** differences. Also, premature optimization is the root of all evil. I don't think these days (with cheap and powerful hardware) anyone should optimize every single thing, unless it is really necessary. – amb Aug 04 '12 at 10:41

6 Answers

4

On most modern processors, simply multiplying by a power of two (e.g., x *= 0x1p10; to multiply by 2^10 or x *= 0x1p-10; to divide by 2^10) will be fast and error-free (unless the result is large enough to overflow or small enough to underflow).

There are some processors with “early outs” for some floating-point operations. That is, they complete the instruction more quickly when certain bits are zero or meet other criteria. However, floating-point addition, subtraction, and multiplication commonly execute in about four CPU cycles, so they are fairly fast even without early outs. Additionally, most modern processors execute several instructions at a time, so other work proceeds while a multiplication is occurring, and they are pipelined, so, commonly, one multiplication can be started (and one finish) in each CPU cycle. (Sometimes more.)

Multiplying by powers of two has no rounding error because the significand (fraction portion of the value) does not change, so the new significand is exactly representable. (Except that, when multiplying by a value less than 1, bits of the significand can be pushed below the limit of the floating-point type, causing underflow. For the common IEEE 754 double format, this does not occur until the value is less than 0x1p-1022.)

Do not use division for scaling (or for reversing the effects of prior scaling). Instead, multiply by the inverse. (To remove a previous scaling of 0x1p57, multiply by 0x1p-57.) This is because division instructions are slow on most modern processors. E.g., 30 cycles is not unusual.

Eric Postpischil
2

First get your double into a union and select the "mantissa" and "exponent" parts. Then shift only the "exponent" part. Look up the IEEE 754 floating-point standard. Don't forget the sign bit and the implicit leading mantissa bit.

union int_add_to_double
{
double this_is_your_double_precision_float;
struct your_bit_representation_of_double
    {
    unsigned long long mantissa_bits:52; //fraction; the leading 1 bit is implicit
    unsigned long long exponent_bits:11; //biased exponent; add here to scale by powers of 2
    unsigned long long sign_bit:1;       //take negative or positive
    }dont_forget_struct_name;            //field order assumes a little-endian layout
}and_a_union_name;
huseyin tugrul buyukisik
2

Floating-point addition and multiplication typically take a few cycles in modern processors.

Perhaps you should step back and think about what the algorithm is doing. In your example you have a doubly nested loop... that means "somefunction()" might get called many, many times. The common representation of double is IEEE 754, which uses 11 bits for the exponent and 52 bits for the mantissa (really 53, because except for zero there is an implied '1'). What this means is you can represent numbers to 53 bits of precision in a range from very tiny to very large numbers: the binary "floating point" can move 1024 (2^10) places to the left or right of the number "1.0". If "somefunction()" is called a thousand times and it always returns a number less than or equal to 0.5, you underflow (every time you multiply by 0.5 you cut your number "a" in half, meaning you move the binary floating point to the left).

On x86 you can tell the processor to "flush denormals to zero" by setting a bit in a control register. There is no portable programming interface for doing this; with gcc you use

#include <xmmintrin.h> // provides _MM_SET_FLUSH_ZERO_MODE

_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

Telling the processor to flush denormals to zero will make your code run faster, as the processor doesn't try to represent numbers that are beyond (smaller than) normals (sub-normals, or denormals). It seems you are trying to maintain precision in the face of an algorithm that produces sub-normals (which forces loss of precision). How best to handle this depends on whether you control "somefunction()" or not. If you do have control of that function, then you might "normalize" the value it returns to something in the range

0.5 <= X <= 2.0

In other words, return values centered around 1.0 and keep track separately of the power of 2 by which you need to multiply the final answer to scale it properly.

amdn
1

If you're using SSE, adding constants directly to the exponent field is a legitimate trick (in FPU code it's quite terrible) - it typically has double the throughput and a 4 times better latency (except on processors that have a float->int and/or int->float penalty). But since you're just doing this to prevent denormals, why not turn on FTZ (flush to zero) and DAZ (denormals are zero)?

harold
0

You can use the standard frexp/ldexp functions that break IEEE 754 values up into their components:

http://www.cplusplus.com/reference/clibrary/cmath/frexp/

http://www.cplusplus.com/reference/clibrary/cmath/ldexp/

Here is a simple code sample:

#include <cmath>
#include <iostream>

int main ()
{
  double value = 5.4321;
  int exponent;

  double significand = frexp (value , &exponent);
  double result = ldexp (significand , exponent+1);

  std::cout << value << " -> " << result  << "\n";
  return 0;
}

A live run can be seen at: http://ideone.com/r3GBy

Joel Falcou
  • Thanks. This didn't answer my question (I asked for the forms of constants), but it has achieved my goals. – ggg Aug 04 '12 at 07:34
  • frexp and ldexp are likely to take longer than a simple floating-point multiplication, as they must perform integer operations on a floating-point encoding. This involves a number of tests and arithmetic operations and, on some modern processors, requires moving data between floating-point and integer units inside the processor. At best, if the compiler recognizes the exponent is a constant, it might optimize away the call, replacing it with a floating-point multiplication. For this question, a simple multiplication suffices. Just multiply directly by an exact power of two, e.g., 0x1p512. – Eric Postpischil Aug 04 '12 at 10:27
  • @EricPostpischil I completely agree; my concern was more to point to these functions than to have the OP rely on union tricks as stated above. Now, of course, the multiplication is far more than enough. – Joel Falcou Aug 04 '12 at 10:34
0

On a gigahertz processor you may save 1 or 2 nanoseconds by optimizing this way (shift vs. arithmetic). However, the time it takes to load and store from memory is on the order of 100 ns, and to disk is 10 ms. Worrying about the arithmetic operations is pointless compared to optimizing cache usage and disk activity. It will never make a difference in any real production program.

Just to prevent misunderstanding, I'm not saying the difference is small so don't worry about it; I'm saying that it's zero. You can't write a simple program where the difference in ALU time is not completely overlapped with the time the CPU has stalled waiting for memory or I/O.

stark