
I'm trying to improve the performance of a routine with OpenMP (parallel for) and SSE intrinsics:

void Tester::ProcessParallel()//ProcessParallel is member of Tester class
{
    //Initialize
    auto OutMapLen      = this->_OutMapLen;
    auto KernelBatchLen = this->_KernelBatchLen;
    auto OutMapHeig     = this->_OutMapHeig;
    auto OutMapWid      = this->_OutMapWid;
    auto InpMapWid      = this->_InpMapWid;
    auto NumInputMaps   = this->_NumInputMaps;
    auto InpMapLen      = this->_InpMapLen;
    auto KernelLen      = this->_KernelLen;
    auto KernelHeig     = this->_KernelHeig;
    auto KernelWid      = this->_KernelWid;
    auto input_local    = this->input;
    auto output_local   = this->output;
    auto weights_local  = this->weights;
    auto biases_local   = this->biases;
    auto klim           = this->_klim;

    #pragma omp parallel for firstprivate(OutMapLen,KernelBatchLen,OutMapHeig,OutMapWid,InpMapWid,NumInputMaps,InpMapLen,KernelLen,KernelHeig,KernelWid,input_local,output_local,weights_local,biases_local,klim)
    for(auto i=0; i<_NumOutMaps; ++i)
    {   
        auto output_map   = output_local  + i*OutMapLen;
        auto kernel_batch = weights_local + i*KernelBatchLen;
        auto bias = biases_local + i;
        for(auto j=0; j<OutMapHeig; ++j)
        {
            auto output_map_row = output_map + j*OutMapWid;
            auto inp_row_idx = j*InpMapWid;
            for(auto k=0; k<OutMapWid; ++k)
            {
                auto output_nn = output_map_row + k;
                *output_nn     = *bias;
                auto inp_cursor_idx = inp_row_idx + k;
                for(int _i=0; _i<NumInputMaps; ++_i)
                {
                    auto input_cursor = input_local + _i*InpMapLen + inp_cursor_idx;
                    auto kernel = kernel_batch + _i*KernelLen;
                    for(int _j=0; _j<KernelHeig; ++_j)
                    {
                        auto kernel_row_idx  = _j*KernelWid;
                        auto inp_row_cur_idx = _j*InpMapWid;
                        int _k=0;
                        for(; _k<klim; _k+=4)//unroll and vectorize
                        {
                            float buf;
                            __m128 wgt = _mm_loadu_ps(kernel+kernel_row_idx+_k);
                            __m128 inp = _mm_loadu_ps(input_cursor+inp_row_cur_idx+_k);
                            __m128 prd = _mm_dp_ps(wgt, inp, 0xf1);
                            _mm_store_ss(&buf, prd);
                            *output_nn += buf;
                        }
                        for(; _k<KernelWid; ++_k)//residual loop
                            *output_nn += *(kernel+kernel_row_idx+_k) * *(input_cursor+inp_row_cur_idx+_k);
                    }
                }
            }
        }
    }
}

Pure unrolling and SSE vectorization (without OpenMP) of the innermost loop improves total performance by about 1.3x, which is a pretty nice result. However, pure OpenMP parallelization (without unrolling/vectorization) of the outer loop gives only about a 2.1x gain on a 4-core/8-thread processor (Core i7 2600K). With both SSE vectorization and the OpenMP parallel for combined, the gain is 2.3-2.7x. How can I boost the effect of the OpenMP parallelization in the code above?

Interesting: if I replace the "klim" variable - the bound of the unrolled last loop - with a scalar constant, say 4, the total performance gain rises to 3.5x.

gorill
  • Use the aligned version of the load intrinsic? You can also accumulate in an xmm register instead of output_nn and store it to memory at the end. – huseyin tugrul buyukisik Jul 26 '13 at 11:02
  • I thought about it, but I don't understand how to implement pointer alignment in the nested loops. – gorill Jul 26 '13 at 11:07
  • 1
    False sharing between the output_nn writes? – Marc Glisse Jul 26 '13 at 11:07
  • 1
    Selecting an address which is a multiple of 16 for the starting point of buffer/kernel should let you use _mm_load_ps. Also using a single accumulator for whole nested loop, is prone to numerical instability. You should accumulate in several cells such as an xmm register's elements. – huseyin tugrul buyukisik Jul 26 '13 at 11:08
  • When vectorizing what looks like a dot product, it is usual to avoid horizontal operations in the loop; you only need one at the end (and your compiler should be able to vectorize the code without you doing it manually). – Marc Glisse Jul 26 '13 at 11:13
  • input, output, weights, and biases are already 16-byte aligned using _mm_malloc. But when I try to use _mm_load_ps instead of _mm_loadu_ps, it crashes at runtime with a segmentation fault. – gorill Jul 26 '13 at 11:18
  • What about `KernelWid`? Is it a multiple of 16? Also `input_cursor` depends directly on the value of `k` which increments by one, giving you unaligned addresses. – Hristo Iliev Jul 26 '13 at 16:33
  • No, KernelWid is not always a multiple of 16. Are there any tricks to 16-align pointers on the fly without a drop in performance? For example, how can I align the input_cursor pointer inside the loop? – gorill Jul 26 '13 at 17:41
  • 1
    The trick is to keep all data aligned in memory, usually at the expense of using more memory than necessary. But one should do it carefully - sometimes the gain from having all data aligned, e.g. saving 2 cycles per load operation, could be negated by the increased amount of data loaded (as memory is transferred in units of cache lines). – Hristo Iliev Jul 26 '13 at 19:54
  • @redrum, thanks for the advice. But could you please explain more - why is _mm_dp_ps not efficient? – gorill Aug 12 '13 at 21:20
  • 1
    [see this link](http://stackoverflow.com/questions/14967969/efficient-4x4-matrix-vector-multiplication-with-sse-horizontal-add-and-dot-prod) – Z boson Aug 13 '13 at 07:54
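
For illustration, here is a minimal sketch of the accumulation approach the comments above suggest: keep the running sum in an XMM register and do a single horizontal reduction after the loop, instead of issuing _mm_dp_ps and a scalar add on every iteration. The function name and arguments are hypothetical placeholders, not the question's actual variables; it assumes SSE3 is available (the question already uses SSE4.1).

#include <immintrin.h>

// Hypothetical helper: dot product of two float arrays of length n.
// Assumes n is a multiple of 4 here; a residual loop would handle the rest.
static float dot_sse(const float* a, const float* b, int n)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4)
    {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb)); // accumulate vertically
    }
    // One horizontal reduction at the end instead of one per iteration
    __m128 shuf = _mm_movehdup_ps(acc);   // duplicate odd elements
    __m128 sums = _mm_add_ps(acc, shuf);  // pairwise sums in lanes 0 and 2
    shuf = _mm_movehl_ps(shuf, sums);     // move lane 2 down to lane 0
    sums = _mm_add_ss(sums, shuf);        // final scalar sum in lane 0
    return _mm_cvtss_f32(sums);
}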

1 Answer


Vectorisation and threading do not work orthogonally (with respect to speeding up the calculations) in most cases, i.e. their speed-ups do not necessarily add up. What's worse is that this happens mostly in cases like yours, where data is processed in a streaming fashion. The reason is simple - finite memory bandwidth. A very simple measure of whether this is the case is the so-called computational intensity (CI), defined as the amount of data processing (usually in FLOPs) performed per byte of input data.

In your case you load two XMM registers, i.e. 32 bytes of data in total, and then perform one dot product operation. Suppose your code runs on a 2 GHz Sandy Bridge CPU. Although DPPS takes a full 12 cycles to complete on SNB, the CPU is able to overlap several such instructions and retire one every 2 cycles. Therefore, at 2 GHz each core could perform 1 billion dot products per second in a tight loop, which would require 32 GB/s of memory bandwidth to keep the loop busy. The actual bandwidth needed in your case is less, since there are other instructions in the loop, but the main idea remains: the processing rate of the loop is limited by the amount of data that the memory is able to feed to the core.

As long as all the data fits into the last-level cache (LLC), performance will more or less scale with the number of threads, since the LLC usually provides fairly high bandwidth (e.g. 300 GB/s on Xeon 7500's as stated here). This is no longer the case once the data grows big enough not to fit into the cache, as main memory usually provides an order of magnitude less bandwidth per memory controller. In the latter case all cores have to share the limited memory speed and, once it is saturated, adding more threads will not increase the speed-up. Only adding more bandwidth, e.g. a system with several CPU sockets, would result in increased processing speed.
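
To make the back-of-the-envelope arithmetic explicit, here is the same estimate as a tiny standalone program; the 2 GHz clock, the 2-cycle DPPS throughput and the 32 bytes per dot product are the figures assumed above, not measured values:

#include <cstdio>

int main()
{
    const double clock_hz        = 2.0e9; // assumed 2 GHz Sandy Bridge core
    const double cycles_per_dpps = 2.0;   // DPPS throughput: ~1 retired every 2 cycles
    const double bytes_per_dpps  = 32.0;  // two 16-byte XMM loads per dot product

    const double dpps_per_sec   = clock_hz / cycles_per_dpps;            // ~1e9 per core
    const double gbytes_per_sec = dpps_per_sec * bytes_per_dpps / 1.0e9; // ~32 GB/s

    std::printf("dot products per second per core: %.2e\n", dpps_per_sec);
    std::printf("bandwidth needed per core: %.0f GB/s\n", gbytes_per_sec);
    return 0;
}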

There is a theoretical model, called the Roofline model, that captures this in a more formal way. You can see some explanations and applications of the model in this presentation.

The bottom line is: both vectorisation and multiprocessing (e.g. threading) increase the performance but also increase the memory pressure. As long as the memory bandwidth is not saturated, both result in increased processing rate. Once the memory becomes the bottleneck, performance does not increase any more. There are even cases when multithreaded performance drops because of the additional pressure put by vectorisation.

Possibly an optimisation hint: the store to *output_nn might not get optimised into a register, since output_nn ultimately points into shared data. Therefore you might try something like:

for(auto k=0; k<OutMapWid; ++k)
{
    auto output_nn = output_map_row + k;
    auto _output_nn = *bias;
    auto inp_cursor_idx = inp_row_idx + k;
    for(int _i=0; _i<NumInputMaps; ++_i)
    {
        ...
        for(int _j=0; _j<KernelHeig; ++_j)
        {
            ...
            for(; _k<klim; _k+=4)//unroll and vectorize
            {
                ...
                _output_nn += buf;
            }
            for(; _k<KernelWid; ++_k)//residual loop
                _output_nn += *(kernel+kernel_row_idx+_k) * *(input_cursor+inp_row_cur_idx+_k);
        }
    }
    *output_nn = _output_nn;
}

But I guess your compiler is smart enough to figure it out by itself. Anyway, this would only matter in the single-threaded case. Once you are in the saturated memory bandwidth region, no such optimisations would matter.

Hristo Iliev