Compiling + benchmarking with optimization disabled is your main problem. Un-optimized code with intrinsics is typically very slow. It's not useful to compare intrinsics vs. scalar with optimization disabled; gcc -O0 usually hurts intrinsics more.
Once you stop wasting your time with unoptimized code, you'll want to let gcc -O3 optimize the scalar loop into 3 memset operations, rather than interleaving 3 streams of stores in a loop.
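For example, a minimal sketch (hypothetical function and array names): gcc -O3 enables -ftree-loop-distribute-patterns, so a loop like this can usually be distributed into 3 separate memset calls, one sequential stream per array:

    #include <stddef.h>

    /* One loop interleaving stores to 3 arrays.  At -O3, gcc can typically
       distribute this into 3 separate memset calls instead of keeping the
       interleaved stores. */
    void zero3(double *a, double *b, double *c, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            a[i] = 0.0;
            b[i] = 0.0;
            c[i] = 0.0;
        }
    }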
Compilers (and optimized libc memset implementations) are good at optimizing memory zeroing, and can probably do a better job than your simple manual vectorization. It's probably not too bad, though, especially if your arrays aren't already hot in L1d cache. (If they were, then using only 128-bit stores is much slower than 256-bit or 512-bit vectors on CPUs with wide data paths.)
I'm not sure what the best way is to multi-thread with OpenMP while still letting the compiler optimize to memset. It might not be terrible to let omp parallel for parallelize code that stores 3 streams from each thread, as long as each thread is working on contiguous chunks of each array. Especially if code that later updates the same arrays will be distributed the same way, so each thread is working on the part of the arrays that it zeroed earlier, which is maybe still hot in L1d or L2 cache.
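A minimal sketch of that, assuming double arrays and a static schedule (so each thread gets one contiguous chunk of the index range, and therefore contiguous chunks of all three arrays):

    #include <stddef.h>

    /* Each thread stores 3 streams, but into contiguous chunks of a, b and c
       determined by the static schedule.  Later loops over the same arrays with
       the same schedule touch the same chunks, which may still be hot in cache. */
    void zero3_omp(double *a, double *b, double *c, ptrdiff_t n)
    {
        #pragma omp parallel for schedule(static)
        for (ptrdiff_t i = 0; i < n; i++) {
            a[i] = 0.0;
            b[i] = 0.0;
            c[i] = 0.0;
        }
    }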
Of course, if you can avoid it, do the zeroing on the fly as part of another loop that has some useful computation. If your next step is going to be a[i] += something, then instead optimize that to a[i] = something for the first pass through the array so you don't need to zero it first.
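A hedged sketch of that idea (acc, f and npasses are made-up names standing in for whatever the real computation is):

    #include <stddef.h>

    double f(size_t i, int pass);   /* placeholder for the real per-element work */

    void accumulate(double *acc, size_t n, int npasses)
    {
        /* First pass: plain assignment, so acc[] never needs a separate zeroing pass. */
        for (size_t i = 0; i < n; i++)
            acc[i] = f(i, 0);

        /* Remaining passes accumulate as usual. */
        for (int pass = 1; pass < npasses; pass++)
            for (size_t i = 0; i < n; i++)
                acc[i] += f(i, pass);
    }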
See Enhanced REP MOVSB for memcpy for lots of x86 memory performance details, especially the "latency bound platforms" section which explains why single-threaded memory bandwidth (to L3/DRAM) is worse on a big Xeon than a quad-core desktop, even though aggregate bandwidth from multiple threads can be much higher when you have enough threads to saturate the quad-channel memory.
For store performance (ignoring cache locality for later work) I think it's best to have each thread working on (a chunk of) 1 array; a single sequential stream is probably the best way to max out single-threaded store bandwidth. But caches are associative, and 3 streams is few enough that it won't usually cause conflict misses before you write a full line. Desktop Skylake's L2 cache is only 4-way associative, though.
There are DRAM page-switching effects from storing multiple streams, and 1 stream per thread is probably better than 3 streams per thread. So if you do want each thread zeroing a chunk of 3 separate arrays, ideally you'd want each thread to memset its whole chunk of the first array, then its whole chunk of the 2nd array, rather than interleaving 3 arrays.
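Something along these lines (a rough sketch with assumed names): each thread computes its own contiguous chunk of the index range and memsets its whole chunk of one array before moving on to the next, so it only ever writes one sequential stream at a time.

    #include <stddef.h>
    #include <string.h>
    #include <omp.h>

    void zero3_chunked(double *a, double *b, double *c, size_t n)
    {
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            int nthreads = omp_get_num_threads();
            size_t chunk = (n + nthreads - 1) / nthreads;   /* round up */
            size_t start = (size_t)tid * chunk;
            if (start > n) start = n;
            size_t end = start + chunk;
            if (end > n) end = n;

            /* One sequential stream per array, not 3 interleaved streams. */
            memset(a + start, 0, (end - start) * sizeof(double));
            memset(b + start, 0, (end - start) * sizeof(double));
            memset(c + start, 0, (end - start) * sizeof(double));
        }
    }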