
In our code base we have a lot of operations of the form j*ω*X, where j is the imaginary unit, ω is real and X is complex. Many of the loops look like this:

#include <complex>
#include <vector>

typedef std::vector<std::complex<double> > cvector;

void mult_jomega(cvector &vec, double omega){
    std::complex<double> jomega(0.0, omega);
    for (auto &x : vec){
        x*=jomega;
    }
}

However, we exploit the fact that the real part of jomega is zero: since j*ω*(a+b*j) = -ω*b + j*ω*a, the multiplication can be written out by hand:

void mult_jomega_smart(cvector &vec, double omega){
    for (auto &x : vec){
        x={-omega*x.imag(), omega*x.real()};
    }
}

At the beginning, I was dismissive of this "smart" version, because

  1. it is harder to understand.
  2. the probability of bugs is higher.
  3. "the compiler will optimize it anyway".

However, as some performance regressions have shown, the third argument doesn't hold water. When comparing the two functions (see the listings further below), the smart version consistently outperforms the original, with -O2 as well as with -O3:

size    orig(musec)   smart(musec)  speedup
10      0.039928      0.0117551     3.39665
100     0.328564      0.0861379     3.81439
500     1.62269       0.417475      3.8869
1000    3.33012       0.760515      4.37877
2000    6.46696       1.56048       4.14422
10000   32.2827       9.2361        3.49528
100000  326.828       115.158       2.8381
500000  1660.43       850.415       1.95249

The smart version is about 4 times faster on my machine (gcc-5.4); only as the task becomes more and more memory-bound with increasing array size does the speed-up drop to a factor of 2.

My question is: what prevents the compiler from optimizing the less smart but more readable version? After all, the compiler can see that the real part of jomega is zero. Is it possible to help the compiler with some additional compile flags?

NB: The speedup also exists for other compilers:

compiler      speedup
g++-5.4          4
g++-7.2          4
clang++-3.8      2  [the original version is 2 times faster than with gcc]

Listings:

mult.cpp (kept in a separate translation unit to prevent inlining):

#include <complex>
#include <vector>

typedef std::vector<std::complex<double> > cvector;
void mult_jomega(cvector &vec, double omega){
    std::complex<double> jomega(0.0, omega);
    for (auto &x : vec){
        x*=jomega;
    }
}

void mult_jomega_smart(cvector &vec, double omega){
    for (auto &x : vec){
        x={-omega*x.imag(), omega*x.real()};
    }
}

main.cpp:

#include <chrono>
#include <complex>
#include <vector>
#include <iostream>

typedef std::vector<std::complex<double> > cvector;
void mult_jomega(cvector &vec, double omega);
void mult_jomega2(cvector &vec, double omega);
void mult_jomega_smart(cvector &vec, double omega);


const size_t N=100000;   //10**5
const double OMEGA=1.0;//use 1, so nothing changes -> no problems with inf & Co

void compare_results(const cvector &vec){
   cvector m=vec;
   cvector m_smart=vec;
   mult_jomega(m, 5.0);
   mult_jomega_smart(m_smart,5.0);
   std::cout<<m[0]<<" vs "<<m_smart[0]<<"\n";
   std::cout<< (m==m_smart ? "equal!" : "not equal!")<<"\n";

}

void test(size_t vector_size){

     cvector vec(vector_size, std::complex<double>{1.0, 1.0});

     //compare results, trigger if in doubt
     //compare_results(vec);


     //warm_up, just in case:
     for(size_t i=0;i<N;i++)
        mult_jomega(vec, OMEGA);

     //test mult_jomega:
     auto begin = std::chrono::high_resolution_clock::now();
     for(size_t i=0;i<N;i++)
        mult_jomega(vec, OMEGA);
     auto end = std::chrono::high_resolution_clock::now();
     auto time_jomega=std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count()/1e3;


     //test mult_jomega_smart:
     begin = std::chrono::high_resolution_clock::now();
     for(size_t i=0;i<N;i++)
        mult_jomega_smart(vec, OMEGA);
     end = std::chrono::high_resolution_clock::now();
     auto time_jomega_smart=std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count()/1e3;

     double speedup=time_jomega/time_jomega_smart;
     std::cout<<vector_size<<"\t"<<time_jomega/N<<"\t"<<time_jomega_smart/N<<"\t"<<speedup<<"\n";
}


int main(){
   std::cout<<"N\tmult_jomega(musec)\tmult_jomega_smart(musec)\tspeedup\n";    
   for(const auto &size : std::vector<size_t>{10,100,500,1000,2000,10000,100000,500000})
        test(size);          
}

Building & running:

g++ main.cpp mult.cpp -O3 -std=c++11 -o mult_test
./mult_test
ead

  • `jomega` is not `const`. Additionally the `operator *` most likely takes a `const &` and is allowed to `const_cast` the `const` away and change `jomega`, so it's not that certain that `jomega`'s real part is `0.0`. See if adding `const` to `jomega` does anything. – nwp Mar 09 '18 at 08:16
  • Adding `const` doesn't do much. – Aziz Mar 09 '18 at 08:24
  • @nwp I don't think that declaring something as `const` helps the compiler much to optimize your code. `0.0` is a literal, and constant propagation should happen anyway. There was an old GotW stating that `const` doesn't do anything for optimization (`constexpr` of course does): http://www.gotw.ca/gotw/081.htm. – Jens Mar 09 '18 at 09:24
  • @Jens `const` not doing anything for optimization is generally only true for adding `const` (say, a function taking a `const &` or a `&` doesn't really matter), but a top-level `const` does actually mean `const` because casting it away is UB. That said obviously I was wrong and it didn't do much here either. – nwp Mar 09 '18 at 09:40

3 Answers


Compiling with the flag -ffast-math brings the straightforward version up to the speed of the smart one:

N       mult_jomega(musec)      mult_jomega_smart(musec)        speedup
10      0.00860809              0.00818644                      1.05151
100     0.0706683               0.0693907                       1.01841
500     0.29569                 0.297323                        0.994509
1000    0.582059                0.57622                         1.01013
2000    1.30809                 1.24758                         1.0485
10000   7.37559                 7.4854                          0.98533
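For reference, this is the question's build line with the flag added:

g++ main.cpp mult.cpp -O3 -ffast-math -std=c++11 -o mult_test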

Edit: More specifically, it's the -funsafe-math-optimizations compiler flag. According to the documentation, this flag is used to

allow optimizations for floating-point arithmetic that (a) assume that arguments and results are valid and (b) may violate IEEE or ANSI standards.

Edit 2: Even more specifically, it is the -fno-signed-zeros option, which:

Allows optimizations for floating-point arithmetic that ignore the signedness of zero. IEEE arithmetic specifies the behavior of distinct +0.0 and -0.0 values, which then prohibits simplification of expressions such as x+0.0 or 0.0*x (even with -ffinite-math-only). This option implies that the sign of a zero result isn’t significant.
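The following minimal snippet shows the IEEE behavior the quote refers to (this is standard floating-point semantics, nothing compiler-specific):

#include <cmath>
#include <cstdio>

int main(){
    double x = -1.0;
    std::printf("%g\n", 0.0 * x);                 // prints -0: 0.0*x is not always +0.0
    std::printf("%d\n", std::signbit(0.0 * x));   // prints 1: the zero inherits the sign of x
    std::printf("%g\n", -0.0 + 0.0);              // prints 0: x+0.0 does not preserve -0.0
}

This is why, without -fno-signed-zeros, the compiler may not fold 0.0*x to 0.0, nor x+0.0 to x.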

Aziz

  • `-fno-signed-zeros` is a really nice find. Since that is enough I would encourage to only use that flag and not `-ffast-math` unless extensive tests show that it doesn't break anything. – nwp Mar 09 '18 at 08:39
  • Do I understand correctly that one problem among others is the case `omega=0.0`, which could lead to differently signed zeros in the result, e.g. the sign of the real part of the answer depending on the real and imaginary parts of the input? – ead Mar 09 '18 at 08:50
  • @ead: correct. The compiler cannot optimize the multiplication of zero without that flag because the result of `x*0.0` can be one of two "different" numbers: `0.0` or `-0.0`. The flag breaks this IEEE 754 standard rule and tells the compiler to treat zeros as unsigned. Hence, `x*0.0` is always `0.0`, and the compiler can optimize from there. – Aziz Mar 09 '18 at 08:56
  • @ead Thinking about this problem, it seems that the root issue is an asymmetry in the treatment of arithmetic operations involving pure real and complex numbers versus those involving pure imaginary and complex numbers. For float/complex operations, the imaginary part of the float is conceptually a zero without signedness; for imaginary/complex operations, the zero of the real part of the imaginary is considered to have signedness. So I do believe you have pointed out a loophole in common complex implementations: they should implement pure imaginary numbers as an independent type. – Oliv Mar 09 '18 at 09:17
  • @Oliv I think this is a great insight. Probably will roll out my own `pure_imaginary` implementation in order to have both: readability and performance (no matter which compiler flags) – ead Mar 09 '18 at 10:31
  • Funnily, `-fno-signed-zeros` doesn't improve performance for my version of clang, but it does for g++. `clang++` really needs `-ffast-math`... – ead Mar 09 '18 at 10:33
  • @ead: clang doesn't have the `-fno-signed-zeros` flag – Aziz Mar 09 '18 at 11:01
  • @Aziz I played around with the example a little bit in the godbolt compiler explorer: https://godbolt.org/g/gwFkEQ. It seems that gcc 7.3 does not inline the call to `*=` in the loop, at least to my limited understanding of the assembler code, where I see `callq 400740 <__muldc3@plt>`. In addition, `-fno-signed-zeros` seems not sufficient to optimize the multiplication because it does not cover `nan`: `0.0 * nan == nan`, so the compiler cannot remove the multiplication by assuming the result is always 0. The same is true for `[+-]inf`. You can see that in the second loop. – Jens Mar 09 '18 at 15:12

I did some more investigation concerning the compiler options mentioned in Aziz's answer with the godbolt compiler explorer. The example code implements three versions of the inner loop:

  1. The mult_jomega example.
  2. A hand-written version of the same loop, where the call to operator*= has been replaced with the explicit computation.
  3. The mult_jomega_smart example.

The code from godbolt:

// 1. mult_jomega
std::complex<double> const jomega(0.0, omega);
for (auto &x : v){
    x*=jomega;
}

// 2. hand-written mult_jomega
for (auto &x : v3){
    double x1 = x.real() * jomega.real();
    double x2 = x.imag() * jomega.imag();
    double x3 = x.real() * jomega.imag();
    double x4 = x.imag() * jomega.real();
    x = { x1 - x2 , x3 + x4};
}

// 3. mult_jomega_smart
for (auto &x : v2){
    x={-omega*x.imag(), omega*x.real()};
}

Inspecting the assembler code for the three loops:

mult_jomega

 cmp %r13,%r12
 je 4008ac <main+0x10c>
 mov %r12,%rbx
 nopl 0x0(%rax)
 pxor %xmm0,%xmm0
 add $0x10,%rbx
 movsd -0x8(%rbx),%xmm3
 movsd -0x10(%rbx),%xmm2
 movsd 0x8(%rsp),%xmm1
 callq 400740 <__muldc3@plt>
 movsd %xmm0,-0x10(%rbx)
 movsd %xmm1,-0x8(%rbx)
 cmp %rbx,%r13
 jne 400880 <main+0xe0>

hand-written multiplication

 cmp %rdx,%rdi
 je 40090c <main+0x16c>
 pxor %xmm3,%xmm3
 mov %rdi,%rax
 movsd 0x8(%rsp),%xmm5
 nopl 0x0(%rax,%rax,1)
 movsd (%rax),%xmm0
 movapd %xmm5,%xmm4
 movsd 0x8(%rax),%xmm1
 add $0x10,%rax
 movapd %xmm0,%xmm2
 mulsd %xmm5,%xmm0
 mulsd %xmm1,%xmm4
 mulsd %xmm3,%xmm2
 mulsd %xmm3,%xmm1
 subsd %xmm4,%xmm2
 addsd %xmm1,%xmm0
 movsd %xmm2,-0x10(%rax)
 movsd %xmm0,-0x8(%rax)
 cmp %rax,%rdx
 jne 4008d0 <main+0x130>

mult_jomega_smart

cmp %rcx,%rdx
 je 400957 <main+0x1b7>
 movsd 0x8(%rsp),%xmm2
 mov %rcx,%rax
 xorpd 0x514(%rip),%xmm2 # 400e40 <_IO_stdin_used+0x10>
 nopl 0x0(%rax)
 add $0x10,%rax
 movsd 0x8(%rsp),%xmm0
 movsd -0x8(%rax),%xmm1
 mulsd -0x10(%rax),%xmm0
 mulsd %xmm2,%xmm1
 movsd %xmm1,-0x10(%rax)
 movsd %xmm0,-0x8(%rax)
 cmp %rax,%rdx
 jne 400930 <main+0x190>

My understanding of assembler code is quite limited, but I see

  • operator*= does not get inlined in mult_jomega
  • x1 and x4 get computed, although they are always 0.0 because jomega.real()==0.0

I don't know why operator*= does not get inlined. The source code is straightforward and contains only three lines.

The computation of x1 and x4 can be explained once you consider that 0.0 * x == 0.0 does not always hold for values of type double. In addition to the signed zeros mentioned in the other answer, there are the special values nan and inf, for which x * 0.0 == 0.0 does not hold either.
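For illustration, a small check of both special cases (plain IEEE semantics):

#include <cstdio>
#include <limits>

int main(){
    const double inf  = std::numeric_limits<double>::infinity();
    const double qnan = std::numeric_limits<double>::quiet_NaN();
    std::printf("%g\n", 0.0 * inf);    // prints nan, not 0
    std::printf("%g\n", 0.0 * qnan);   // prints nan, not 0
}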

If compiled with -fno-signed-zeros and -ffinite-math-only, the optimizations apply and the computations of x1 and x4 are removed.

Jens

My mistake, as pointed out by the other answers, was to assume that the pure imaginary j*ω behaves the same as the complex 0.0+j*ω. It doesn't, in the same way that the pure real 1.0 doesn't behave the same as the complex 1.0+0.0j, e.g. (live with gcc):

1.0*(inf+0.0j)        = inf + 0.0j
(1.0+0.0j)*(inf+0.0j) = inf - nanj

which is due to the fact that complex multiplication is more involved than the school formula suggests.
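The difference above can be reproduced with a few lines (output as observed with gcc; the exact sign of the nan may vary):

#include <complex>
#include <iostream>
#include <limits>

int main(){
    const double inf = std::numeric_limits<double>::infinity();
    const std::complex<double> z(inf, 0.0);
    std::cout << 1.0 * z << "\n";                            // (inf,0): componentwise, pure-real path
    std::cout << std::complex<double>(1.0, 0.0) * z << "\n"; // (inf,-nan): full complex multiplication
}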

Basically, there is an asymmetry in how complex numbers are handled by C++ compilers: there are pure real numbers (i.e. double), but no pure imaginary numbers.

The C++ standard doesn't define how complex-number multiplication must be performed. Most compilers fall back to their C99 implementation. C99 is the first C standard that defines, in Annex G, how complex-number operations could be performed. However, as noted in G.1.1, this support is optional:

G.1.1 ... Although these specifications have been carefully designed, there is little existing practice to validate the design decisions. Therefore, these specifications are not normative, but should be viewed more as recommended practice...

C99 also defines the pure imaginary data types float _Imaginary, double _Imaginary and long double _Imaginary (G.2), which is exactly what we need for j*ω. G.5.1 defines the multiplication semantics for pure imaginary numbers as

xj*(v+wj) = -xw+(xv)j
(v+wj)*xj = -wx+(vx)j

i.e. the school formula is good enough (unlike for the multiplication of two complex numbers).

The problem is that, as of this writing, none of the common compilers (gcc-9.2, clang-9.0) supports the _Imaginary types, as they are optional.

Thus my solution was to implement a pure_imaginary type and to overload the operators following G.5.1.
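A minimal sketch of this approach (the type and all names are my own, not from any library):

#include <complex>
#include <vector>

struct pure_imaginary {
    double im;   // the value im*j
};

// G.5.1: xj*(v+wj) = -xw + (xv)j -- the school formula, no special cases needed
inline std::complex<double> operator*(pure_imaginary x, const std::complex<double> &z){
    return {-x.im * z.imag(), x.im * z.real()};
}

// G.5.1: (v+wj)*xj = -wx + (vx)j
inline std::complex<double> operator*(const std::complex<double> &z, pure_imaginary x){
    return {-z.imag() * x.im, z.real() * x.im};
}

// The readable loop now performs exactly the computation of mult_jomega_smart,
// independent of any compiler flags:
void mult_jomega_readable(std::vector<std::complex<double> > &vec, double omega){
    const pure_imaginary jomega{omega};
    for (auto &x : vec){
        x = jomega * x;
    }
}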

ead