
I'm working on a problem where I want to stack time series recorded at different locations and extract the coherent signal. The heavy lifting is done in C, with a Python wrapper providing a friendlier interface. I've reached the point where I am satisfied with the theoretical correctness of the algorithm and would like to optimise it as much as possible. I understand enough C to write something that works and is parallelised with OpenMP, but not much beyond that.

Optimisation matters because I am dealing with large datasets: up to 200 time series to stack, sampling rates up to 1000 Hz, and months to years of recordings. Processing can run into days or weeks on reasonable computational facilities. I run this stacking step on chunks of the continuous time series so as not to swamp memory.

I have a few questions:

  1. Is there anything obvious I am missing that will help (optimisation through the compiler, streamlining the algorithm)?

  2. The most significant gain made so far was with the optimisation flag -Ofast - I have read around and just wanted to understand a bit more why this is faster and whether it is 'safe' for my purposes?

  3. Where (beyond trawling through SO) should I look to learn more about this sort of problem? I have other problems in my research that I would like to tackle by using C!

Algorithm

I stack the time series from each location continuously through time in a 3-D gridded volume. Once the full stack is finished for a given cell, I exponentiate the result and normalise by the number of contributing time series.

#define MAX(a,b) (((a)>(b))?(a):(b))

EXPORT void migrate(double *sigPt, int32_t *indPt, double *mapPt, int32_t fsmp, int32_t lsmp, int32_t nsamp, int32_t nstation, int32_t avail, int64_t ncell, int64_t threads)
{
    double  *stnPt, *stkPt, *eStkPt;
    int32_t *ttpPt;
    int32_t ttp;
    int32_t tm, st;
    int64_t cell;

    /* Each grid cell is independent, so the outer loop parallelises cleanly.
       Note: stnPt must be in the private clause, otherwise it is shared
       between threads and causes a data race. */
    #pragma omp parallel for private(cell,stkPt,eStkPt,ttpPt,stnPt,st,ttp,tm) num_threads(threads)
    for (cell=0; cell<ncell; cell++)
    {
        stkPt  = &mapPt[cell * (int64_t) nsamp];    /* stack buffer for this cell */
        eStkPt = &mapPt[cell * (int64_t) nsamp];    /* aliases stkPt: exponentiation is done in place */
        ttpPt  = &indPt[cell * (int64_t) nstation]; /* per-station sample offsets */
        for(st=0; st<nstation; st++)
        {
            ttp   = MAX(0,ttpPt[st]);               /* clamp negative offsets to zero */
            stnPt = &sigPt[st*(fsmp + lsmp + nsamp) + ttp + fsmp];
            for(tm=0; tm<nsamp; tm++)
            {
                stkPt[tm] += stnPt[tm];             /* accumulate this station's trace */
            }
        }
        for(tm=0; tm<nsamp; tm++)
        {
            eStkPt[tm] = exp(stkPt[tm] / avail);    /* exponentiate and normalise */
        }
    }
}

I am currently compiling with:

gcc -shared -fPIC -std=gnu99 ./source.c -fopenmp -Ofast -lm -o ./output

I have read:

Profiling python C extensions

What GCC optimization flags and techniques are safe across CPUs?

among others. Apologies if I am repeating a question/my question is poorly defined.

hemmelig
  • It may be helpful to make those pointers local to the parallel for and dispense with the private clause. You would want to ensure that the inner loop is SIMD-optimised and perhaps use a SIMD library for exp(). – tim18 Apr 21 '20 at 12:50
  • Ok, I will look into that. My naive attempt to use simd with #pragma omp simd on the exponential step appears to have slowed things down so I'll go do some reading. Thanks. – hemmelig Apr 21 '20 at 13:20

1 Answer


Is there anything obvious I am missing that will help (optimisation through the compiler, streamlining the algorithm)?

The proposed code looks quite good (GCC should be able to vectorize it, and the parallelization seems fine). That said, here are some suggestions to improve performance:

  • exp(stkPt[tm] / avail) can be rewritten as the probably faster expression pow(availFactor, stkPt[tm]), with availFactor a constant defined outside the loop and set to exp(1.0 / avail). As proposed by @tim18, you should also check that this loop is vectorized, because exponentials are generally slow to compute.
  • You can try the compiler flag -march=native to make the code a bit faster at the expense of a less portable binary (if you do not want to lose portability to older processors, you can try -mtune=native, which is generally not as good as -march=native). Depending on your processor, this option can enable GCC to use AVX and FMA instructions (available on relatively recent processors), which should speed up your code. You can alternatively enable such features manually with -mavx and -mfma.
  • You can try to adapt your algorithm so that the hot loop containing stkPt[tm] += stnPt[tm] works mainly on in-cache data. This point is very important: this hottest part of the algorithm is probably memory-bound. A good starting point is tiling (for example, working on 2 or 4 stations at a time).
  • Consider working with single-precision floating-point values rather than double precision if the resulting accuracy is good enough.

Do not forget to check the results, since these optimizations can affect the accuracy of your code!

The most significant gain made so far was with the optimisation flag -Ofast - I have read around and just wanted to understand a bit more why this is faster and whether it is 'safe' for my purposes?

Using -Ofast is unsafe. Indeed, this option enables -ffast-math, which in turn enables further options like -funsafe-math-optimizations and -ffinite-math-only. Thus, your code must not deal with NaN or infinity values, and the accuracy of computations can be much lower. Whether that is a problem depends on your expectations. Note that this option helps GCC speed up the exponential computations (thanks to vectorization).
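Because -Ofast implies -ffinite-math-only, any NaN or infinity must be removed before the hot loop ever sees it. Here is a hedged sketch of an input sanitiser (a hypothetical helper, not part of the original wrapper) that clips samples and zeroes non-finite values; note that this function itself must be compiled without -ffast-math, otherwise the isfinite() check may be optimised away:

```c
#include <math.h>

/* Sketch: clip samples to [lo, hi] and zero out NaN/Inf before stacking,
   so the -Ofast-compiled hot loop never sees non-finite values. */
void sanitise(double *sig, long n, double lo, double hi)
{
    for (long i = 0; i < n; i++) {
        double v = sig[i];
        if (!isfinite(v)) v = 0.0;  /* drop NaN/Inf contributions */
        else if (v < lo)  v = lo;   /* clip below */
        else if (v > hi)  v = hi;   /* clip above */
        sig[i] = v;
    }
}
```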

Jérôme Richard
  • Great, ok - this certainly gives me some jumping-off points. I looked into vectorizing that inner loop, but if I'm honest I just don't think I have a solid enough grasp of C, SIMD, etc. to do so. There appeared to be some vectorized libraries that would perform the exponentiation (e.g. MKL), but I was unable to figure out how to include them in my code. Will it matter if the item I tile over (such as nstations) is not divisible by 2/4 etc.? I sanitise the arrays I am stacking by clipping between sensible values, and the nature of the problem should also exclude NaN values on the way in. – hemmelig Apr 24 '20 at 19:23
  • For SIMD, you can use `#pragma omp simd` to help the compiler vectorize loops (it is just a hint). The Intel compiler is relatively good at this (including for math functions). The Intel VML library provides useful vectorized math functions, but I do not know whether ICC, VML, or even the MKL can do that without a loss of accuracy. Actually, vectorizing math calls manually and efficiently with strict IEEE 754 compliance is very difficult. So I advise you to first check whether using `-Ofast` is actually an issue (because I think GCC already does quite a good job here). – Jérôme Richard Apr 24 '20 at 20:08
  • For the tiling, you can have two loops: one that iterates over `nstations` with a stride of 4 (up to `(nstations/4)*4`, excluded) and another that does the remaining work. For trivial cases `#pragma omp simd` can do that for you. What are typical values for `nstation` and `nsamp`? If they are big, I advise you to focus on the tiling optimization first. – Jérôme Richard Apr 24 '20 at 20:19
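The stride-4-plus-remainder structure described in the comment above can be sketched as follows. This is a simplified illustration that ignores the per-station ttp/fsmp offsets of the original code, and the function name is illustrative:

```c
#include <stddef.h>

/* Sketch: accumulate stations in blocks of 4 so each pass over stk reuses
   cached stack values; a scalar remainder loop handles nstation % 4 != 0. */
void stack_tiled(const double *sig, double *stk, int nstation, int nsamp)
{
    int st;
    for (st = 0; st + 4 <= nstation; st += 4) {        /* main tiled loop */
        const double *s0 = &sig[(size_t)(st + 0) * nsamp];
        const double *s1 = &sig[(size_t)(st + 1) * nsamp];
        const double *s2 = &sig[(size_t)(st + 2) * nsamp];
        const double *s3 = &sig[(size_t)(st + 3) * nsamp];
        for (int tm = 0; tm < nsamp; tm++)
            stk[tm] += s0[tm] + s1[tm] + s2[tm] + s3[tm];
    }
    for (; st < nstation; st++) {                      /* remainder loop */
        const double *s = &sig[(size_t)st * nsamp];
        for (int tm = 0; tm < nsamp; tm++)
            stk[tm] += s[tm];
    }
}
```

The remainder loop makes the station count independent of the tile width, which answers the divisibility question: a non-multiple of 4 simply leaves one to three stations for the scalar tail.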