12

The task is very simple: writing a sequence of integers to memory:

Original code:

for (size_t i=0; i<1000*1000*1000; ++i)
{
   data[i]=i;
}

Parallelized code:

    size_t stepsize=len/N;                      // elements per thread

#pragma omp parallel num_threads(N)
    {
        int threadIdx=omp_get_thread_num();

        size_t istart=stepsize*threadIdx;       // start of this thread's chunk
        size_t iend=threadIdx==N-1?len:istart+stepsize;
#pragma simd
        for (size_t i=istart; i<iend; ++i)
            data[i]=i;
    }

The performance sucks: it takes 1.6 s to write 1G uint64 values (about 5 GB/s). With simple parallelization (OpenMP parallel) of the above code the speed increases a bit, but performance is still poor: 1.4 s with 4 threads and 1.35 s with 6 threads on an i7-3970.

The theoretical memory bandwidth of my rig (i7-3970 / 64 GB DDR3-1600) is 51.2 GB/s (quad channel x 1600 MT/s x 8 bytes per channel). For the above example, the achieved memory bandwidth is only about 1/10 of the theoretical bandwidth, even though the application is pretty much memory-bandwidth-bound.

Anyone know how to improve the code?

I have written a lot of memory-bound code on GPUs, and it is pretty easy for a GPU to use nearly all of its device memory bandwidth (e.g. 85%+ of the theoretical bandwidth).

EDIT:

The code is compiled with Intel ICC 13.1 to a 64-bit binary, with maximum optimization (-O3), the AVX code path enabled, and auto-vectorization on.

UPDATE:

I tried all the code below (thanks to Paul R); nothing special happened. I believe the compiler is fully capable of this kind of SIMD/vectorization optimization on its own.

As for why I want to fill memory with these numbers, well, long story short:

It is part of a high-performance heterogeneous computing algorithm. The device side is so efficient that the multi-GPU setup is fast enough that the performance bottleneck turns out to be the CPU writing several sequences of numbers to memory.

Of course, knowing that the CPU is bad at filling numbers (in contrast, the GPU can fill a sequence of numbers at a speed very close to the theoretical bandwidth of its global memory: 238 GB/s out of 288 GB/s on GK110, versus a pathetic 5 GB/s out of 51.2 GB/s on the CPU), I could change my algorithm a bit, but what makes me wonder is why the CPU is so bad at filling a sequence of numbers here.

As for the memory bandwidth of my rig, I believe the figure (51.2 GB/s) is about right: based on my memcpy() test, the achieved bandwidth is 80%+ of the theoretical bandwidth (>40 GB/s).
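A minimal sketch of such a memcpy() bandwidth test (the buffer size and repeat count here are illustrative, not the exact values used):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>                               /* only for omp_get_wtime() */

int main(void)
{
    size_t len = (size_t)1 << 30;              /* 1 GiB per buffer - illustrative */
    char *src = (char *)malloc(len);
    char *dst = (char *)malloc(len);
    memset(src, 1, len);                       /* touch the pages so they are mapped */
    memset(dst, 0, len);

    int repeats = 10;
    double t0 = omp_get_wtime();
    for (int r = 0; r < repeats; ++r)
        memcpy(dst, src, len);
    double t = omp_get_wtime() - t0;

    /* each memcpy reads len bytes and writes len bytes */
    printf("%.2f GB/s\n", 2.0 * len * repeats / t / 1e9);

    free(src);
    free(dst);
    return 0;
}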

user2188453
    Did you try optimizing the code? E.g. use -O3 if you are using `gcc`? – Mats Petersson Aug 23 '13 at 13:04
    At a minimum, you should try unrolling it. *Armchair optimizing*. – unwind Aug 23 '13 at 13:09
  • Try doing 2 writes per iteration. Maybe 4, 8, 16. – user123 Aug 23 '13 at 13:11
    @unwind Mohammed That's the kind of thing compilers do. If the assembly code indicates the compiler did a bad job at this, then fine, but in dubio pro compiler ;-) OP, could you show the resulting assembly? –  Aug 23 '13 at 13:11
    @delnan Most probably. Time to call Mysticial~ – user123 Aug 23 '13 at 13:13
    The elephant in the room is of course: Why do you (think you) need to have memory filled with an increasing integer sequence?! – sehe Aug 23 '13 at 13:21
    From where do you get the theoretical bandwidth? – Joni Aug 23 '13 at 13:26
  • @Joni I think he's referencing [this](http://ark.intel.com/products/70845) – user123 Aug 23 '13 at 13:30
    In that case they would be looking at the wrong reference; memory bandwidth is further limited by the memory and bus speed – Joni Aug 23 '13 at 13:34
    The bus speed isn't the actual problem. It's a multi-channel design, and modern buses are seriously fast. – MSalters Aug 23 '13 at 14:39
  • How much of this is page faulting into your working set? Try timing `for (size_t j=0; j<100; j++) { for (size_t i=0; i<10000*1000; ++i) {data[i]=i;} }`. 80 MB is enough to flush the L3 cache, but small enough to fit in your working set. – TerryE Aug 23 '13 at 17:56
  • Can you post your actual benchmarking code ? I'm suspicious of the poor results and am wondering if there is some kind of artefact in the benchmarking method. – Paul R Aug 24 '13 at 08:54
    The fastest way to do that is to not do it! (Seriously, just compute it later when you would have done a read) – Flexo Aug 24 '13 at 16:31

2 Answers

12

Assuming this is x86, and that you are not already saturating your available DRAM bandwidth, you can try using SSE2 or AVX2 non-temporal (streaming) stores to write 2 or 4 elements at a time:

SSE2:

#include "emmintrin.h"

const __m128i v2 = _mm_set1_epi64x(2);
__m128i v = _mm_set_epi64x(1, 0);

for (size_t i=0; i<1000*1000*1000; i += 2)
{
    _mm_stream_si128((__m128i *)&data[i], v);
    v = _mm_add_epi64(v, v2);
}

AVX2:

#include "immintrin.h"

const __m256i v4 = _mm256_set1_epi64x(4);
__m256i v = _mm256_set_epi64x(3, 2, 1, 0);

for (size_t i=0; i<1000*1000*1000; i += 4)
{
    _mm256_stream_si256((__m256i *)&data[i], v);
    v = _mm256_add_epi64(v, v4);
}

Note that data needs to be suitably aligned (16 byte or 32 byte boundary).
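One way to get suitably aligned storage is with the _mm_malloc/_mm_free helpers that ship with the intrinsics headers (aligned_alloc or posix_memalign would work too); a sketch:

#include <stdint.h>
#include <mm_malloc.h>   /* _mm_malloc / _mm_free (also pulled in by the intrinsics headers) */

/* 32-byte alignment satisfies both the SSE2 (16-byte) and AVX2 (32-byte) loops */
uint64_t *data = (uint64_t *)_mm_malloc(1000ULL*1000*1000 * sizeof(uint64_t), 32);
if (data == NULL)
{
    /* handle allocation failure */
}
/* ... fill and use data ... */
_mm_free(data);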

AVX2 is only available on Intel Haswell and later, but SSE2 is pretty much universal these days.
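If you need a single binary that runs on both, you can select the implementation at run time. A sketch using GCC/clang's __builtin_cpu_supports (MSVC would need __cpuid instead); fill_sequence_sse and fill_sequence_avx are just the two loops above wrapped in hypothetical functions:

#include <stdint.h>
#include <stddef.h>

void fill_sequence_sse(uint64_t *data, size_t n);   /* the SSE2 loop above */
void fill_sequence_avx(uint64_t *data, size_t n);   /* the AVX2 loop above */

void fill_sequence(uint64_t *data, size_t n)
{
    if (__builtin_cpu_supports("avx2"))             /* GCC/clang builtin */
        fill_sequence_avx(data, n);
    else
        fill_sequence_sse(data, n);
}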


FWIW I put together a test harness with a scalar loop and the above SSE and AVX loops, compiled it with clang, and tested it on a Haswell MacBook Air (1600 MHz LPDDR3 DRAM). I got the following results:

# sequence_scalar: t = 0.870903 s = 8.76033 GB / s
# sequence_SSE: t = 0.429768 s = 17.7524 GB / s
# sequence_AVX: t = 0.431182 s = 17.6941 GB / s

I also tried it on a Linux desktop PC with a 3.6 GHz Haswell, compiling with gcc 4.7.2, and got the following:

# sequence_scalar: t = 0.816692 s = 9.34183 GB / s
# sequence_SSE: t = 0.39286 s = 19.4201 GB / s
# sequence_AVX: t = 0.392545 s = 19.4357 GB / s

So it looks like the SIMD implementations give a 2x or more improvement over 64-bit scalar code (although 256-bit SIMD doesn't seem to give any improvement over 128-bit SIMD), and that typical throughput should be a lot faster than 5 GB/s.

My guess is that there is something wrong with the OP's system or benchmarking code which is resulting in an apparently reduced throughput.
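A minimal sketch of the kind of harness used (not the exact code; the timer and buffer size are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <omp.h>                        /* only for omp_get_wtime() */

#define LEN 1000000000ULL               /* number of uint64_t elements */

/* scalar reference; the SSE2/AVX2 loops above are timed the same way */
static void sequence_scalar(uint64_t *data, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        data[i] = i;
}

int main(void)
{
    uint64_t *data = (uint64_t *)malloc(LEN * sizeof(uint64_t));

    double t0 = omp_get_wtime();
    sequence_scalar(data, LEN);
    double t = omp_get_wtime() - t0;

    printf("# sequence_scalar: t = %g s = %g GB / s\n",
           t, LEN * sizeof(uint64_t) / t / 1e9);

    free(data);
    return 0;
}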

Paul R
    Have you... benchmarked it to see whether it is _actually_ faster? – sehe Aug 23 '13 at 13:19
    That is left as an exercise for the reader, and of course it will depend on various factors. But since the DRAM bandwidth is allegedly nowhere near saturated in the OP's case I would expect a modest improvement. – Paul R Aug 23 '13 at 13:22
  • My compiler reports errors for your code, and to be honest, I think a modern compiler with -O3 is more than capable of this level of SIMD optimization. – user0002128 Aug 24 '13 at 06:28
  • @H2CO3: it's arguably a poor design decision on the part of whoever originally implemented SSE intrinsics, but it means you're nearly always stuck with casts when working with SSE/AVX loads/stores - you either need to cast pointers to scalar data to pointers to SIMD vectors, or *vice versa*, but there's no escaping it. – Paul R Aug 24 '13 at 08:13
  • @user0002128: if you let me know the errors for your particular compiler I can try to fix them. As for auto-vectorization - I doubt that even ICC will vectorize this, as it doesn't fit into any of the standard auto-vectorization models, but there's one easy way to find out... – Paul R Aug 24 '13 at 08:14
    @PaulR What about `memcpy()`? –  Aug 24 '13 at 08:18
  • @H2CO3: um, what about memcpy() ? – Paul R Aug 24 '13 at 08:26
    @PaulR I often hear that `memcpy(dst, src, size)` is **the** tool for avoiding the UB caused by `*dst = *(T *)src`. –  Aug 24 '13 at 08:29
  • @H2CO3: yes, that's often true in the general case, but for performance-critical code the rules tend to be very different. Bear in mind that in this case we have a loop which executes in around 2 - 3 clock cycles per iteration - adding any kind of function call or memory copying here would totally wipe out the performance gain from using SIMD. – Paul R Aug 24 '13 at 08:38
    @PaulR Last time I checked, `-O2` inlined `memcpy()` completely. –  Aug 24 '13 at 08:41
  • @H2CO3: sure, but even inlined, you'll have additional instruction cycles and additional memory moves - the code as it stands is just two instructions and it uses two SIMD registers for the vector variables. Adding even the most efficient inline memcpy for the sake of programming style would give a performance hit of one order of magnitude. As I said, this is fine in the general case, but for performance-critical code you have to break many of the usual rules. – Paul R Aug 24 '13 at 08:53
  • @PaulR I understand that. And it's sad. :( –  Aug 24 '13 at 08:54
  • @H2CO3: writing high performance code is a bit like being down in the engine room of a ship, getting your hands dirty, etc - whereas general coding is like being up on the bridge in your starched and pressed captain's uniform. The engine room doesn't suit everybody, and likewise for the bridge. ;-) – Paul R Aug 24 '13 at 09:43
  • @PaulR Yeah :) It's not that I don't have a little bit of a pragmatist (as opposed to a language lawyer) in me... but so many people talking about UB on Stack Overflow have discouraged me from believing that things like this actually work at all... –  Aug 24 '13 at 09:48
    @H2CO3: I know - all the language lawyers and pedants belong up on the bridge - they are not allowed in the engine room. ;-) – Paul R Aug 24 '13 at 10:02
  • @user0002128: I don't really do much work on Windows and so I can't easily test it, but so long as whatever compiler you use does a reasonable job of code generation I wouldn't expect the underlying hardware to behave any differently. – Paul R Aug 26 '13 at 07:58
    @H2CO3: For the record: a `(__m128i *)` cast is always safe because it's defined as being allowed to alias anything, like `char*`. For example, gcc defines `__m128i` with `__attribute__((may_alias))`. Yes this would be a real issue if compiler devs hadn't taken care of it, and no `memcpy` wouldn't be a viable solution. To get NT stores with intrinsics, you *need* to pointer-cast and use `_mm_stream_si128` for the store to the output buffer. Otherwise sure, an inlined + optimized away `memcpy` is the right idiom to express an unaligned load/store of an `int`, or type-punning. – Peter Cordes Jun 07 '18 at 09:59
    And BTW, Paul: the key thing here is the NT store, not just vectorization. (Related: [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231) for more about mem bandwidth and NT stores vs. regular stores; and single-core bandwidth not saturating DRAM on modern Intel CPUs, especially big Xeons but even desktop / laptop). IDK about back in 2013, but current compilers certainly know how to autovectorize this pattern the same way you did manually. I think they have for a while. – Peter Cordes Jun 07 '18 at 10:02
5

Is there any reason why you would expect all of data[] to be in powered-up RAM pages?

The DDR3 prefetcher will correctly predict most accesses, but the frequent x86-64 page boundaries might be an issue. You're writing to virtual memory, so at each page boundary there's a potential misprediction of the prefetcher. You can greatly reduce this by using large pages (e.g. MEM_LARGE_PAGES on Windows).
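A sketch of what that allocation might look like on Windows (large pages also require the SeLockMemoryPrivilege / "Lock pages in memory" right; on Linux the rough equivalent is mmap with MAP_HUGETLB or madvise(MADV_HUGEPAGE)):

#include <windows.h>
#include <stdint.h>

/* Sketch: back the buffer with large pages (typically 2 MB on x86-64). */
static uint64_t *alloc_large_pages(size_t nelems)
{
    SIZE_T large = GetLargePageMinimum();
    /* the allocation size must be a multiple of the large-page size */
    SIZE_T bytes = (nelems * sizeof(uint64_t) + large - 1) / large * large;

    return (uint64_t *)VirtualAlloc(NULL, bytes,
                                    MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                                    PAGE_READWRITE);
    /* returns NULL if large pages are unavailable - fall back to malloc/VirtualAlloc */
}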

MSalters
  • +1, the OP's Sandybridge-E doesn't have next-page prefetching. Ivybridge and later do, which helps some. IDK why using all 6 cores wasn't saturating memory BW, though. Maybe pagefaults + TLB misses explains it. A single core can't saturate memory bandwidth on recent Intel chips, so you need multiple cores running efficiently: [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231) and [Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?](https://stackoverflow.com/q/39260020). – Peter Cordes Jun 07 '18 at 10:05