16

I want to calculate the absolute values of the elements of a complex array in C or C++. The easiest way would be

for(int i = 0; i < N; i++)
{
    b[i] = cabs(a[i]);
}

But for large vectors that will be slow. Is there a way to speed that up (by using parallelization, for example)? Language can be either C or C++.

arc_lupus
  • You can look at http://stackoverflow.com/questions/23200049/optimizing-assembly-generated-by-microsoft-visual-studio-compiler?rq=1, which shows how to optimize such operations for cache use. – Paul Ogilvie Nov 10 '15 at 10:37
  • Parallelization or offloading the computation to a GPU might help; it depends on the size of the inputs. A manual SIMD implementation probably will as well, especially if a fast approximate square root might suffice. Also, are you sure that you really need the square root, and that your next calculation might not settle for directly using the sums of the squares, say for comparison in magnitude? – doynax Nov 10 '15 at 10:39
  • You may want to take a look at [this](http://stackoverflow.com/questions/31915239/how-do-i-compute-absolute-value-of-vector). – Rajdeep Paul Nov 10 '15 at 10:53
  • @doynax: I need the exact value, that is the problem. – arc_lupus Nov 10 '15 at 11:05
  • @arc_lupus: I'm afraid that the complex absolute calculation is inherently inexact. Perhaps you can look to a polar representation or evaluating your calculations symbolically somehow. Are you certain that you can't settle for the 15-odd digits of precision offered by IEEE-754 double-precision floating-point arithmetic? – doynax Nov 10 '15 at 11:38
  • @doynax: Of course, that should be enough, but my answer was rather aimed towards your suggestion with the sum of squares. – arc_lupus Nov 10 '15 at 11:42
  • @doynax a GPU won't help here unless the array is already on the GPU and you are doing more computations there. Otherwise, the transfer times will dominate. – Davidmh Nov 10 '15 at 12:59
  • What values of N are you interested in? Your operation is memory-bandwidth bound for large N. For large N the best solution is not to do this operation. You need to find a way to do this operation in chunks along with other computationally intensive operations. – Z boson Nov 11 '15 at 08:11
  • @Zboson: My N is in the range of 2^14 to 2^20 – arc_lupus Nov 11 '15 at 09:50
  • What hardware are you interested in? What CPU? What OS and what compiler? Is this for double or float? – Z boson Nov 11 '15 at 09:54
  • @Zboson: Hardware: Currently i7 Sandy Bridge, soon i7 Skylake, GPU possible, but later. OS: Unix (Ubuntu 14.04, Arch) with gcc 4.8/gcc 5. Preferably double, but float also possible. – arc_lupus Nov 11 '15 at 10:28
  • Actually, your N is not as large as I thought. That should fall between the L2 and L3 cache, so you will probably see some benefit from parallelizing with threads. – Z boson Nov 11 '15 at 11:23
  • @Zboson: 2^14 complex doubles are ~300 kByte, so that should easily fit into the L2 cache (6 MByte). – arc_lupus Nov 11 '15 at 11:43
  • @arc_lupus, assuming you have an L3 cache, the L2 cache is only 256 KB. But when you split 300 KB between several cores it will easily fit in the L2 cache of each core. But for 2^20 it will be between L2 and L3, as I said. – Z boson Nov 11 '15 at 13:54

6 Answers

14

Given that all loop iterations are independent, you can use the following code for parallelization:

#pragma omp parallel for
for(int i = 0; i < N; i++)
{
    b[i] = cabs(a[i]);
}

Of course, to use this you must enable OpenMP support when compiling your code (the -fopenmp flag for GCC/Clang, /openmp for MSVC, or the corresponding project option).
You can find several examples of OpenMP usage in the OpenMP wiki.
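
For reference, a minimal self-contained version might look like this (a sketch; the file name, array size, and test data are illustrative, and the compile command assumes GCC):

// sketch: compile with gcc -O2 -fopenmp cabs_omp.c -lm
#include <complex.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int N = 1 << 20;                 // example size
    double complex *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    for (int i = 0; i < N; i++)
        a[i] = 3.0 - 4.0 * I;              // |a[i]| == 5

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        b[i] = cabs(a[i]);

    printf("%f\n", b[N - 1]);              // use the result
    free(a);
    free(b);
    return 0;
}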

alexeykuzmin0
  • Also, you can use the SIMD options of OpenMP, see http://stackoverflow.com/questions/14674049/parallel-for-vs-omp-simd-when-to-use-each. I expanded a bit on that in another answer. – Andrew Henle Nov 10 '15 at 11:42
5

Or use Concurrency::parallel_for (from Microsoft's Parallel Patterns Library, so MSVC only) like this:

#include <ppl.h>

Concurrency::parallel_for(0, N, [&a, &b](int i)
{
    b[i] = cabs(a[i]);
});
Jerome
5

Use vector operations.

If you have glibc 2.22 (pretty recent), you can use the SIMD capabilities of OpenMP 4.0 to operate on vectors/arrays.

Libmvec is a vector math library added in Glibc 2.22.

Vector math library was added to support SIMD constructs of OpenMP4.0 (#2.8 in http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf) by adding vector implementations of vector math functions.

Vector math functions are vector variants of corresponding scalar math operations implemented using SIMD ISA extensions (e.g. SSE or AVX for x86_64). They take packed vector arguments, perform the operation on each element of the packed vector argument, and return a packed vector result. Using vector math functions is faster than repeatedly calling the scalar math routines.

Also, see Parallel for vs omp simd: when to use each?

If you're running on Solaris, you can explicitly use vhypot() from the math vector library libmvec.so to operate on a vector of complex numbers to obtain the absolute value of each:

Description

These functions evaluate the function hypot(x, y) for an entire vector of values at once. ...

The source code for libmvec can be found at http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libmvec/ and the vhypot() code specifically at http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libmvec/common/__vhypot.c. I don't recall whether Sun Microsystems ever provided a Linux version of libmvec.so.
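
On Linux, a related approach is to write the loop so the compiler can vectorize it itself, using OpenMP 4.0's #pragma omp simd over split real/imaginary arrays. Here is a sketch (assuming GCC with -O2 -fopenmp-simd -fno-math-errno, so that sqrt() can compile down to hardware SQRT instructions; note that, unlike hypot(), the naive sqrt(x*x + y*y) can overflow or underflow for extreme inputs):

#include <math.h>

// b[i] = |re[i] + I*im[i]| over split real/imaginary arrays
void cabs_arrays(const double *re, const double *im, double *b, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; i++)
        b[i] = sqrt(re[i] * re[i] + im[i] * im[i]);
}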

Andrew Henle
5

Using #pragma simd (even with -Ofast), or relying on the compiler's auto-vectorization, is one more example of why it's a bad idea to blindly expect your compiler to implement SIMD efficiently. In order to use SIMD efficiently for this, you need to use an array of structs of arrays. For example, for single-precision floats with a SIMD width of 4, you could use:

//struct of arrays of four complex numbers
struct c4 {
    float x[4];  // real values of four complex numbers 
    float y[4];  // imaginary values of four complex numbers
};

Here is code showing how you could do this with SSE for the x86 instruction set.

#include <stdio.h>
#include <x86intrin.h>
#define N 10

struct c4{
    float x[4];
    float y[4];
};

static inline void cabs_soa4(struct c4 *a, float *b) {
    __m128 x4 = _mm_loadu_ps(a->x);
    __m128 y4 = _mm_loadu_ps(a->y);
    __m128 b4 = _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(x4,x4), _mm_mul_ps(y4,y4)));
    _mm_storeu_ps(b, b4);
}  

int main(void)
{
    int n4 = ((N+3)&-4)/4;  //choose next multiple of 4 and divide by 4
    printf("%d\n", n4);
    struct c4  a[n4];  //array of struct of arrays
    for(int i=0; i<n4; i++) {
        for(int j=0; j<4; j++) { a[i].x[j] = 1, a[i].y[j] = -1;}
    }
    float b[4*n4];
    for(int i=0; i<n4; i++) {
        cabs_soa4(&a[i], &b[4*i]);
    }
    for(int i = 0; i<N; i++) printf("%.2f ", b[i]); puts("");
}

It may help to unroll the loop a few times. In any case, all this is moot for large N because the operation is memory-bandwidth bound. For large N (meaning when the memory usage is much larger than the last-level cache), although #pragma omp parallel may help some, the best solution is not to do this as a separate pass over the whole array. Instead, do the operation in chunks that fit in the lowest-level cache, along with other computationally intensive operations. I mean something like this:

for(int i = 0; i < nchunks; i++) {
    for(int j = 0; j < chunk_size; j++) {
        b[i*chunk_size+j] = cabs(a[i*chunk_size+j]);
    }
    foo(&b[i*chunk_size]); // foo is computationally intensive.
}

I did not implement an array of structs of arrays here, but it should be easy to adjust the code for that; see the sketch below.
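
For illustration, combining the SoA kernel above with chunking might look like this (CHUNK and foo are placeholders):

// process CHUNK structs (4*CHUNK floats) per iteration so the freshly
// written part of b is still in cache when foo() consumes it
for (int i = 0; i < n4; i += CHUNK) {
    int end = i + CHUNK < n4 ? i + CHUNK : n4;
    for (int k = i; k < end; k++)
        cabs_soa4(&a[k], &b[4*k]);
    foo(&b[4*i]);  // foo is computationally intensive
}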

Z boson
  • Thanks for the idea with chunking, but I need the whole array for processing it later (applied fft), thus it is not possible for me to do that. – arc_lupus Nov 11 '15 at 10:32
4

If you are using a modern compiler (GCC 5, for example), you can use Cilk Plus, which gives you a nice array notation, automatic use of SIMD instructions, and parallelisation.

So, if you want to run them in parallel you would do:

#include <cilk/cilk.h>

cilk_for(int i = 0; i < N; i++)
{
    b[i] = cabs(a[i]);
}

or if you want to test SIMD:

#pragma simd
for(int i = 0; i < N; i++)
{
    b[i] = cabs(a[i]);
}

But, the nicest part of Cilk is that you can just do:

b[:] = cabs(a[:]);

In this case, the compiler and the runtime environment will decide at which level it should be SIMDed and what should be parallelised (the optimal way is applying SIMD on large-ish chunks in parallel). Since this is decided by a work scheduler at runtime, Intel claims it is capable of providing near-optimal scheduling and should be able to make optimal use of the cache.
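
For completeness, a self-contained cilk_for version might look like this (a sketch, assuming GCC 5 built with Cilk Plus support; the file name, array size, and test data are illustrative):

// sketch: compile with gcc-5 -O2 -std=c99 -fcilkplus cabs_cilk.c -lcilkrts -lm
#include <cilk/cilk.h>
#include <complex.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int N = 1 << 16;                 // example size
    double complex *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    for (int i = 0; i < N; i++)
        a[i] = 3.0 - 4.0 * I;              // |a[i]| == 5

    cilk_for (int i = 0; i < N; i++)
        b[i] = cabs(a[i]);

    printf("%f\n", b[0]);
    free(a);
    free(b);
    return 0;
}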

Davidmh
  • Two questions: How does the compiler know the size of the arrays if the arrays are malloc'ed in the last example? And how can I enable the #pragma? – arc_lupus Nov 10 '15 at 14:32
  • @arc_lupus the parallelism is ultimately determined by a runtime scheduler. The optimal scheduling depends not only on the size of the array, but also on how fast your `cabs` function is (pretty fast in this case, potentially very slow for arbitrary cases, where you would like individual parallelism). – Davidmh Nov 10 '15 at 14:41
  • @arc_lupus `#include <cilk/cilk.h>` – Davidmh Nov 10 '15 at 14:42
2

Also, you can use std::future and std::async (they are part of C++11); maybe it's a clearer way of achieving what you want to do:

#include <future>

...

int main()
{
    ...

    // Create async calculations
    std::future<void> *futures = new std::future<void>[N];
    for (int i = 0; i < N; ++i)
    {
        futures[i] = std::async([&a, &b, i]
        {
            b[i] = std::sqrt(a[i]);
        });
    }
    // Wait for calculation of all async procedures
    for (int i = 0; i < N; ++i)
    {
        futures[i].get();
    }
    delete[] futures;  // release the futures (avoids leaking the new[] array)

    ...

    return 0;
}

IdeOne live code

We first create asynchronous procedures and then wait until everything is calculated.
Here I use sqrt instead of cabs because I just don't know what cabs is; I'm sure it doesn't matter.
Also, maybe you'll find this link useful: cplusplus.com
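
Note that launching one std::async task per element has a lot of scheduling overhead for such cheap per-element work. A variant that splits the array into one chunk per hardware thread might scale better; here is a sketch (the function name and chunking scheme are illustrative):

#include <algorithm>
#include <complex>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

// b[i] = |a[i]|, computed with one async task per chunk of elements
void cabs_chunked(const std::vector<std::complex<double>>& a,
                  std::vector<double>& b)
{
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t chunk = (a.size() + nthreads - 1) / nthreads;
    std::vector<std::future<void>> futures;
    for (std::size_t begin = 0; begin < a.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, a.size());
        futures.push_back(std::async(std::launch::async, [&a, &b, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                b[i] = std::abs(a[i]);  // std::abs is the C++ analogue of cabs
        }));
    }
    for (auto& f : futures)
        f.get();  // wait for all chunks
}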

alexeykuzmin0
  • `cabs` is the complex `abs` defined in the C99 standard. Furthermore: isn't sqrt quite slow compared to other methods? – arc_lupus Nov 10 '15 at 11:28