The task is very simple: write a sequence of integer values to memory:
Original code:
for (size_t i = 0; i < 1000*1000*1000; ++i)
{
    data[i] = i;   // data is a buffer of 1G uint64 elements
}
Parallelized code:
// len = total element count (1000*1000*1000), N = number of threads; needs <omp.h>
size_t stepsize = len / N;
#pragma omp parallel num_threads(N)
{
    int threadIdx = omp_get_thread_num();
    size_t istart = stepsize * threadIdx;
    size_t iend   = threadIdx == N-1 ? len : istart + stepsize;
    #pragma simd   // ICC-specific vectorization hint
    for (size_t i = istart; i < iend; ++i)
        data[i] = i;
}
The performance sucks: it takes 1.6 sec to write 1G uint64 values (8GB of data, i.e. about 5GB per sec). With the simple OpenMP parallelization of the code above the speed increases a bit, but performance still sucks: 1.4 sec with 4 threads and 1.35 sec with 6 threads on an i7 3970.
The theoretical memory bandwidth of my rig (i7 3970 / 64GB DDR3-1600) is 51.2 GB/sec, so for the above example the achieved memory bandwidth is only about 1/10 of the theoretical one, even though the application is pretty much memory-bandwidth-bound.
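The 5GB/sec figure is just bytes written divided by elapsed time for the fill loop; a minimal sketch of that kind of measurement (not my exact harness, and the buffer setup here is an assumption) would be:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    const size_t len = 1000ULL*1000*1000;
    std::vector<uint64_t> data(len);   // zero-initialized, so the pages are already committed before timing

    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < len; ++i)
        data[i] = i;
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    double gb  = len * sizeof(uint64_t) / 1e9;   // 8GB written
    std::printf("%.2f sec, %.2f GB/sec\n", sec, gb / sec);
    return 0;
}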
Anyone know how to improve the code?
I have written a lot of memory-bound code on the GPU, where it is pretty easy to take full advantage of the GPU's device memory bandwidth (e.g. 85%+ of the theoretical bandwidth).
EDIT:
The code is compiled with Intel ICC 13.1 to a 64-bit binary, with maximum optimization (-O3), the AVX code path enabled, and auto-vectorization on.
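The compile line is essentially of this form (quoted from memory, so the exact flags and file names are placeholders):

icc -O3 -xAVX -openmp main.cpp -o fill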
UPDATE:
I tried all the code suggested below (thanks to Paul R); nothing special happened. I believe the compiler is already fully capable of doing this kind of SIMD/vectorization optimization on its own.
As for why I want to fill these numbers in memory in the first place, well, long story short:
It is part of a high-performance heterogeneous computing algorithm. On the device side the algorithm is so efficient that the multi-GPU setup is fast enough that the bottleneck turns out to be the CPU writing several sequences of numbers to memory.
Of course, knowing that the CPU sucks at filling numbers (whereas the GPU can fill a sequence of numbers at a speed very close to the theoretical bandwidth of its global memory: 238GB/sec out of 288GB/sec on a GK110, versus a pathetic 5GB/sec out of 51.2GB/sec on the CPU), I could change my algorithm a bit, but what makes me wonder is why the CPU sucks so badly at filling a sequence of numbers here.
As for the memory bandwidth of my rig, I believe the 51.2GB/sec figure is about correct: in my memcpy() test, the achieved bandwidth is 80%+ of the theoretical bandwidth (>40GB/sec).
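That memcpy() test is roughly of this shape (an illustrative sketch with an assumed buffer size, not the exact code I ran):

#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

int main()
{
    const size_t bytes = 1ULL << 30;                  // 1 GiB per buffer (assumed size)
    std::vector<char> src(bytes, 1), dst(bytes, 2);   // both buffers touched up front

    auto t0 = std::chrono::steady_clock::now();
    std::memcpy(dst.data(), src.data(), bytes);
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    // memcpy both reads and writes 'bytes', so count traffic in both directions
    std::printf("%.2f GB/sec\n", 2.0 * bytes / 1e9 / sec);
    return 0;
}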