Compiling + benchmarking with optimization disabled is your main problem. Un-optimized code with intrinsics is typically very slow. It's not useful to compare intrinsics vs. scalar with optimization disabled; gcc -O0 usually hurts intrinsics more.
Once you stop wasting your time with unoptimized code, you'll want to let gcc -O3 optimize the scalar loop into 3 memset operations, rather than interleaving 3 streams of stores in a loop.
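For example, a minimal sketch (hypothetical function and array names): gcc -O3 enables -ftree-loop-distribute-patterns, so a loop like this can usually be distributed into 3 separate memset calls, one sequential stream per array:

    #include <stddef.h>

    /* One loop interleaving stores to 3 arrays.  At -O3, gcc can typically
       distribute this into 3 separate memset calls instead of keeping the
       interleaved stores. */
    void zero3(double *a, double *b, double *c, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            a[i] = 0.0;
            b[i] = 0.0;
            c[i] = 0.0;
        }
    }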
Compilers (and optimized libc memset implementations) are good at optimizing memory zeroing, and can probably do a better job than your simple manual vectorization. It's probably not too bad, though, especially if your arrays aren't already hot in L1d cache. (If they were, then using only 128-bit stores is much slower than 256-bit or 512-bit vectors on CPUs with wide data paths.)
I'm not sure what the best way is to multi-thread with OpenMP while still letting the compiler optimize to memset. It might not be terrible to let omp parallel for parallelize code that stores 3 streams from each thread, as long as each thread is working on contiguous chunks of each array. Especially if code that later updates the same arrays will be distributed the same way, so each thread is working on the part of the arrays that it zeroed earlier, which is maybe still hot in L1d or L2 cache.
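A minimal sketch of that, assuming double arrays and a static schedule (so each thread gets one contiguous chunk of the index range, and therefore contiguous chunks of all three arrays):

    #include <stddef.h>

    /* Each thread stores 3 streams, but into contiguous chunks of a, b and c
       determined by the static schedule.  Later loops over the same arrays with
       the same schedule touch the same chunks, which may still be hot in cache. */
    void zero3_omp(double *a, double *b, double *c, ptrdiff_t n)
    {
        #pragma omp parallel for schedule(static)
        for (ptrdiff_t i = 0; i < n; i++) {
            a[i] = 0.0;
            b[i] = 0.0;
            c[i] = 0.0;
        }
    }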
Of course, if you can avoid it, do the zeroing on the fly as part of another loop that has some useful computation. If your next step is going to be a[i] += something, then instead optimize that to a[i] = something for the first pass through the array so you don't need to zero it first.
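A hedged sketch of that idea (acc, f and npasses are made-up names standing in for whatever the real computation is):

    #include <stddef.h>

    double f(size_t i, int pass);   /* placeholder for the real per-element work */

    void accumulate(double *acc, size_t n, int npasses)
    {
        /* First pass: plain assignment, so acc[] never needs a separate zeroing pass. */
        for (size_t i = 0; i < n; i++)
            acc[i] = f(i, 0);

        /* Remaining passes accumulate as usual. */
        for (int pass = 1; pass < npasses; pass++)
            for (size_t i = 0; i < n; i++)
                acc[i] += f(i, pass);
    }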
See Enhanced REP MOVSB for memcpy for lots of x86 memory performance details, especially the "latency bound platforms" section which explains why single-threaded memory bandwidth (to L3/DRAM) is worse on a big Xeon than a quad-core desktop, even though aggregate bandwidth from multiple threads can be much higher when you have enough threads to saturate the quad-channel memory.
For store performance (ignoring cache locality for later work) I think it's best to have each thread working on (a chunk of) 1 array; a single sequential stream is probably the best way to max out single-threaded store bandwidth. But caches are associative, and 3 streams is few enough that it won't usually cause conflict misses before you write a full line. Desktop Skylake's L2 cache is only 4-way associative, though.
There are DRAM page-switching effects from storing multiple streams, and 1 stream per thread is probably better than 3 streams per thread. So if you do want each thread zeroing a chunk of 3 separate arrays, ideally you'd want each thread to memset its whole chunk of the first array, then its whole chunk of the 2nd array, rather than interleaving 3 arrays.
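Something along these lines (a rough sketch with assumed names): each thread computes its own contiguous chunk of the index range and memsets its whole chunk of one array before moving on to the next, so it only ever writes one sequential stream at a time.

    #include <stddef.h>
    #include <string.h>
    #include <omp.h>

    void zero3_chunked(double *a, double *b, double *c, size_t n)
    {
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            int nthreads = omp_get_num_threads();
            size_t chunk = (n + nthreads - 1) / nthreads;   /* round up */
            size_t start = (size_t)tid * chunk;
            if (start > n) start = n;
            size_t end = start + chunk;
            if (end > n) end = n;

            /* One sequential stream per array, not 3 interleaved streams. */
            memset(a + start, 0, (end - start) * sizeof(double));
            memset(b + start, 0, (end - start) * sizeof(double));
            memset(c + start, 0, (end - start) * sizeof(double));
        }
    }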