1

I have an implementation of a digital bessel filter which is a performance bottleneck for my program. I would like to parallelize the main loop with OpenMP.

The function takes an input array paddedsignal and arrays of filter coefficients dcof and ccof and produces a temporary array temp which is filtered. The main loop of this process looks like this:

    for (i=order; i<end; i++)
    {
        temp[i] = ccof[0]*paddedsignal[i];
        for (p=1; p<=order; p++)
        {
            temp[i] += ccof[p]*paddedsignal[i-p] - dcof[p]*temp[i-p];
        }
    }

In particular, note that the value of temp[i] depends on previous values of temp[i-p]. This confounds a simple #pragma omp for directive, since the value of temp[i] near the boundaries between sections of the array which would be handled by different threads has race condition problems. Basically if thread one takes indices between 0-99 and thread 2 takes 100-199, the value at 100 will be wrong since it will be computed before the value at 99, on which it depends.

Is there a way to salvage this situation? I am despairing as to what I can do to parallelize this since the fact that filtered values depend on neighboring values makes it inherently a serial calculation. Is there any way around this?

KBriggs
  • 1,220
  • 2
  • 18
  • 43
  • 1
    It's a serial calculation. The only way to make it go faster is to get a faster CPU. An alternative might be to use a different sort of filter that is more amenable to parallel execution. Or if the data to be filtered can be batched somehow you have different batches being filtered by different threads. – bazza Feb 10 '17 at 21:29
  • I think batch-filtering is going to be the way to go – KBriggs Feb 10 '17 at 23:21
  • 1
    The inner loop is maybe like a [prefix sum like problem](http://stackoverflow.com/a/35840595/2542702)? – Z boson Feb 13 '17 at 08:44
  • Interesting, it does look like there are "parallels" – KBriggs Jun 16 '17 at 15:24

2 Answers2

1

You basically cannot parallelize the outer loop due to the dependency you correctly identify.

You could parallelize the inner loop with a reduction. But that would only be helpful for large order and even then probably just lead to false sharing because the data is shuffled around. One might concieve a complicated data laout to interleave signal and temp among threads, which would require sort-of a manual work sharing construct. I doubt it would be worth it anyway, since the reduction will add overhead cost.

It is easier to parallelize filters that have a finite impulse response, if that is an alternative for you. You could also segment the whole signal into slightly overlapping segments and compute the filter independently, but there will be some computational error at the borders of those chunks.

Zulan
  • 21,896
  • 6
  • 49
  • 109
  • That's more or less my conclusion. I will have to look elsewhere for a more natural place to parallelize. The code already implements a chunk-filtering strategy which wraps this function, so that's probably a more natural place to do it. – KBriggs Feb 10 '17 at 23:20
0

Maybe you are looking for the #pragma omp for ordered. It executes the code in the same order in which it would be executed sequentially

But keep in mind:

  • can only be used in the dynamic extent of a for directive
  • very "expensive" (it's likely that your speedup will not be as good as you might think)

For more information see: omp ordered

QuickSort
  • 494
  • 5
  • 14
  • What's the point of forcing serial execution in a multi-threaded way? Where would the speedup come from at all? – KBriggs Feb 10 '17 at 21:20
  • to be honest, I have never used that directive as I am using multithreading only when I have independend work to do (unlike your example). But I can remember that it was in some case useful for sequential I/O of a partners project. If you want more information about the ordered clause: http://stackoverflow.com/questions/13224155/how-does-the-omp-ordered-clause-work – QuickSort Feb 10 '17 at 21:33
  • 1
    `ordered` only makes some sense if you have significant parts of the iteration that are not ordered. For this filter, you would have to order almost the entire loop body, essentially seriallizing everything. – Zulan Feb 10 '17 at 22:52