I'm working on a problem where I want to stack time series recorded at different locations and extract the coherent signal. The heavy lifting is done in C, with a Python wrapper to provide a more friendly interface. I've reached the point where I am satisfied with the theoretical correctness of the algorithm and would like to optimise it as much as possible. I understand enough C to write something that works and is parallelised with openMP, but not much beyond that.
Optimisation of the problem is important as I am dealing with large datasets - up to 200 time series to stack, sampling rates up to 1000Hz, order of months to years of recordings. Processing can run into the days to weeks with reasonable computational facilities. I am running this stacking step on chunks of the continuous time series as to not swamp memory.
I have a few questions:
Is there anything obvious I am missing that will help (optimisation through the compiler, streamlining the algorithm)?
The most significant gain made so far was with the optimisation flag -Ofast - I have read around and just wanted to understand a bit more why this is faster and whether it is 'safe' for my purposes?
Where (beyond trawling through SO) should I look to learn more about this sort of problem? I have other problems in my research that I would like to tackle by using C!
Algorithm
I am stacking the time series from each location continuously through time in a 3-D gridded volume. Once the full stack is finished for a given cell I need to exponentiate the result and normalise by the number of contributing time series.
#define MAX(a,b) (((a)>(b))?(a):(b))
EXPORT void migrate(double *sigPt, int32_t *indPt, double *mapPt, int32_t fsmp, int32_t lsmp, int32_t nsamp, int32_t nstation, int32_t avail, int64_t ncell, int64_t threads)
{
double *stnPt, *stkPt, *eStkPt;
int32_t *ttpPt;
int32_t ttp;
int32_t to, tm, st;
int64_t cell;
#pragma omp parallel for private(cell,stkPt,eStkPt,ttpPt,st,ttp,tm) num_threads(threads)
for (cell=0; cell<ncell; cell++)
{
stkPt = &mapPt[cell * (int64_t) nsamp];
eStkPt = &mapPt[cell * (int64_t) nsamp];
ttpPt = &indPt[cell * (int64_t) nstation];
for(st=0; st<nstation; st++)
{
ttp = MAX(0,ttpPt[st]);
stnPt = &sigPt[st*(fsmp + lsmp + nsamp) + ttp + fsmp];
for(tm=0; tm<nsamp; tm++)
{
stkPt[tm] += stnPt[tm];
}
}
for(tm=0; tm<nsamp; tm++)
{
eStkPt[tm] = exp(stkPt[tm] / avail);
}
}
}
I am currently compiling with:
gcc -shared -fPIC -std=gnu99 ./source.c -fopenmp -Ofast -lm -o ./output
I have read:
What GCC optimization flags and techniques are safe across CPUs?
among others. Apologies if I am repeating a question/my question is poorly defined.