Fast and precise implementation of pow(complex, 2)

Question

I have a code in which I perform many operations of the form

double z_sq = pow(abs(std::complex<double> z), 2);

to calculate |z|² of a complex number z, where performance and precision are my major concerns. I read that the pow() function of C++ converts int exponents to double and thus introduces numerical errors apart from also having worse performance than x*x in general (discussed here for real numbers). Similarly, the std::pow() method of <complex> converts the exponent to std::complex<double> unnecessarily.

My first question is how to implement the complex square Re²+Im² or zz* with best-possible performance and precision. I thought about doing something like

double complex_square1(std::complex<double> z) {
    double abs_z = abs(z);
    return abs_z * abs_z;
}

or (which gives a type error, but would be the most straight-forward way, I can think of)

double complex_square2(std::complex<double> z) {
    return z * conj(z);
}

My second question is, whether for many (10^14) such operations, the function calls can decrease performance noticably. So would it be better in the case of complex_square1 to write abs(z) * abs(z) into the .cpp file instead of defining a function for it? Are there any other better ways to do it?

Your description is not consistent. In some places, you're concerned about calculating the magnitude of a complex number, in others the magnitude squared. If calculating the latter, simply return `z.real*z.real + z.imag*z.imag`. In both your approaches, your code creates an additional tempoirary object which (although `std::complex` is not a huge structure) is unlikely to be more efficient than simply doing the calculation based on the real and imaginary parts of `z`. — Peter, Oct 22 '21 at 22:11
Im actually not concerned about the magnitude of z - I only calculated the absolute to obtain the square of it this way. But your suggestion gives a much better performance than what I had in mind, so thanks a lot!! I will post a script with which I compare the performance of the fastest ways I found to the std::pow() method, for anyone interested — Azure27, Oct 22 '21 at 22:54
Pretty sure [`std::norm(z)`](https://en.cppreference.com/w/cpp/numeric/complex/norm) is the canonical solution to this question. — Patrick Roberts, Oct 22 '21 at 23:05
Haha wow, how could I not know about this... Thanks, I added your suggestion to the comparison and found thats it about 4 times slower than calling complex_square1(z) — Azure27, Oct 22 '21 at 23:13

score 2 · Accepted Answer · answered Oct 22 '21 at 23:00

I think it would be hard to beat just taking the sum of the squares of the imaginary and real parts.

Below I measure it being about 5x faster than actually calculating the magnitude and squaring it:

#include <complex>
#include <chrono>
#include <random>
#include <vector>
#include <iostream>

double square_of_magnitude_1( std::complex<double> z) {
    auto magnitude = std::abs(z);
    return magnitude * magnitude;
}

double square_of_magnitude_2( std::complex<double> z) {
    return z.imag() * z.imag() + z.real() * z.real();
}


volatile double v; // assign to volatile so calls do not get optimized away.
std::random_device rd;
std::mt19937 e2(rd());
std::uniform_real_distribution<double> dist(0, 10);

int main() {
    using std::chrono::high_resolution_clock;
    using std::chrono::duration_cast;
    using std::chrono::duration;
    using std::chrono::microseconds;

    std::vector<std::complex<double>> numbers(10000000);
    std::generate(numbers.begin(), numbers.end(), []() {return std::complex<double>(dist(e2), dist(e2)); });

    auto t1 = high_resolution_clock::now();
    for (const auto& z : numbers) {
        v = square_of_magnitude_1(z);
    }
    auto t2 = high_resolution_clock::now();
    std::cout << "square_of_magnitude_1 => " << duration_cast<microseconds>(t2 - t1).count() << "\n";


    t1 = high_resolution_clock::now();
    for (const auto& z : numbers) {
        v = square_of_magnitude_2(z);
    }
    t2 = high_resolution_clock::now();
    std::cout << "square_of_magnitude_2 => " << duration_cast<microseconds>(t2 - t1).count() << "\n";

}

A typical run of the above yields

square_of_magnitude_1 => 54948
square_of_magnitude_2 => 9730

Thanks a lot! This coincides well with my performance checks, which I have posted below. Also I think your code is missing an "#include " for std::generate to work — Azure27, Oct 22 '21 at 23:38
[`std::norm(z)`](https://en.cppreference.com/w/cpp/numeric/complex/norm) should be approximately as fast as your hand-written approach, as observed [here](https://godbolt.org/z/baErcd6M1). — Patrick Roberts, Oct 22 '21 at 23:51

Azure27 · Answer 2 · 2021-10-24T13:35:14.030

Based on Peter's answer (-thanks again), which is the fastest of the suggested methods, I supplement my comparison code for anyone interested:

#include <complex>
#include <chrono>

using complex = std::complex<double>;

double complex_square1(complex z) {
    return z.real()*z.real() + z.imag()*z.imag();
}


double complex_square2(complex z) {
    double abs_z = abs(z);
    return abs_z * abs_z;
}


complex z = complex(0.5, 0.5);

unsigned int N = (int)1e7;
double z_sq;

int main() {
    auto start = std::chrono::high_resolution_clock::now();

    for (size_t i = 0; i < N; ++i) {
        z_sq = z.real()*z.real() + z.imag()*z.imag();       // fastest
        // z_sq = complex_square1(z);                       // slower by a factor 1.6 WITHOUT OPTIMIZATION
        // z_sq = complex_square2(z);                       // slower by a factor 5.5
        // z_sq = pow(abs(z), 2);                           // slower by a factor 6
        // z_sq = norm(z);                                  // slower by a factor 5
    }

    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();

    printf("Duration: %d ms\n", duration);
    printf("z_sq = %f", z_sq);
}

This also answers my second question: The function calls become indeed noticable when compiling without optimization, but only by a fractional increase in computation time of about 60%, measured on my Windows machine and using the GCC compiler.

With optimization, the two first lines in the benchmark are equally fast as pointed out by Patrick Roberts.

`(z*z).real()` does not compute the same result as summing the squares of real and imaginary parts (or as `pow(abs(z), 2)`) — Peter, Oct 22 '21 at 23:07
Oh yeah, that explains why the other comment got deleted.. I will fix that, cheers — Azure27, Oct 22 '21 at 23:16
Based on your results I'm guessing you're running these benchmarks without optimizations. The first commented line should be equally fast. Check out the results [here](https://godbolt.org/z/baErcd6M1). — Patrick Roberts, Oct 22 '21 at 23:47

score 0 · Answer 3 · answered Oct 25 '21 at 09:42

Based on the answer of @Azure27, I have rearranged the code into a solution that uses Google Benchmark.

#include <benchmark/benchmark.h>
#include <complex>
#include <chrono>

using complex = std::complex<double>;

double complex_square1(complex z) {
    return z.real()*z.real() + z.imag()*z.imag();
}


double complex_square2(complex z) {
    const double abs_z = abs(z);
    return abs_z * abs_z;
}


complex z{0.5, 0.5};
unsigned int N{(int)1e7};
volatile double z_sq;

int ExplicitSquareCalculation() 
{
    for (size_t i = 0; i < N; ++i) {
        z_sq = z.real()*z.real() + z.imag()*z.imag();       // fastest
    }

    // printf("z_sq = %f", z_sq);
    return 0;
}

static void BM_ExplicitSquareCalculation(benchmark::State& state) {
  for (auto _ : state) {
    ExplicitSquareCalculation();
  }
}

int ComplexSquare1() 
{
    for (size_t i = 0; i < N; ++i) {
        z_sq = complex_square1(z);                       // slower by a factor 1.6 WITHOUT OPTIMIZATION
    }

    // printf("z_sq = %f", z_sq);
    return 0;
}

static void BM_ComplexSquare1(benchmark::State& state) {
  for (auto _ : state) {
    ComplexSquare1();
  }
}

int ComplexSquare2() 
{
    for (size_t i = 0; i < N; ++i) {
        z_sq = complex_square2(z);                       // slower by a factor 5.5
    }

    // printf("z_sq = %f", z_sq);
    return 0;
}

static void BM_ComplexSquare2(benchmark::State& state) {
  for (auto _ : state) {
    ComplexSquare2();
  }
}

int SquareOfAbsolute() 
{
    for (size_t i = 0; i < N; ++i) {
        z_sq = pow(abs(z), 2);                           // slower by a factor 6
    }

    // printf("z_sq = %f", z_sq);
    return 0;
}

static void BM_SquareOfAbsolute(benchmark::State& state) {
  for (auto _ : state) {
    SquareOfAbsolute();
  }
}

int StdNormalize() 
{
    for (size_t i = 0; i < N; ++i) {
        z_sq = norm(z);                                  // slower by a factor 5
    }

    // printf("z_sq = %f", z_sq);
    return 0;
}

static void BM_StdNormalize(benchmark::State& state) {
  for (auto _ : state) {
    StdNormalize();
  }
}

BENCHMARK(BM_ExplicitSquareCalculation);
BENCHMARK(BM_ComplexSquare1);
BENCHMARK(BM_ComplexSquare2);
BENCHMARK(BM_SquareOfAbsolute);
BENCHMARK(BM_StdNormalize);

BENCHMARK_MAIN();

On Compiler Explorer, I am getting weird results. YMMV.

Fast and precise implementation of pow(complex, 2)

3 Answers3