
I'm going to improve OpenCL kernel performance and want to clarify how memory transactions work and which memory access pattern is really better (and why). The kernel is fed with vectors of 8 integers defined as an array, int v[8], which means the entire vector must be loaded into GPRs before any computation is done. So I believe the bottleneck of this code is the initial data load.

First, let me go over some theory basics.

The target HW is a Radeon RX 480/580, which has a 256-bit GDDR5 memory bus on which a burst read/write transaction has a granularity of 8 words; hence, one memory transaction reads 2048 bits, or 256 bytes. That, I believe, is what CL_DEVICE_MEM_BASE_ADDR_ALIGN refers to:

Alignment (bits) of base address: 2048.
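
For reference, that value can be read on the host side roughly like this (a minimal sketch; `device` is assumed to be a valid cl_device_id obtained earlier via clGetDeviceIDs):

    #include <stdio.h>
    #include <CL/cl.h>

    /* Minimal sketch: query the base address alignment of a device. */
    static void print_base_addr_align(cl_device_id device)
    {
        cl_uint align_bits = 0;
        clGetDeviceInfo(device, CL_DEVICE_MEM_BASE_ADDR_ALIGN,
                        sizeof(align_bits), &align_bits, NULL);
        /* The value is reported in bits: 2048 bits = 256 bytes. */
        printf("Alignment (bits) of base address: %u\n", align_bits);
    }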

Thus, my first question: what is the physical meaning of the 128-byte cache line? Does it keep the portion of data fetched by a single burst read but not actually requested? What happens with the rest if we requested, say, 32 or 64 bytes, so that the leftover exceeds the cache line size? (I suppose it will just be discarded - but then which part: head, tail...?)

Now back to my kernel. I think that cache does not play a significant role in my case, because one burst reads 64 integers, so one memory transaction can theoretically feed 8 work items at once, there is no extra data to read, and memory access is always coalesced.

But still, I can place my data with two different access patterns (a stripped-down kernel sketch for both follows the list):

  1. contiguous

    a[i] = v[get_global_id(0) * 8 + i];

(which is actually performed as)

    *(int8*)a = *(int8*)v;

  2. interleaved

    a[i] = v[get_global_id(0) + i * get_global_size(0)];

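To make the two patterns concrete, here is a stripped-down sketch of both kernels (simplified: the real kernels do more work on a[], and the trailing store to `out` is only there so the loads are not optimized away):

    // Layout 1: vectors stored contiguously, work item j owns elements [8*j, 8*j + 8)
    __kernel void load_contiguous(__global const int *v, __global int *out)
    {
        int a[8];
        for (int i = 0; i < 8; ++i)
            a[i] = v[get_global_id(0) * 8 + i];   // equivalent to vload8(get_global_id(0), v)
        // ... computation on a[] ...
        out[get_global_id(0)] = a[0];
    }

    // Layout 2: vectors interleaved, element i of vector j at position j + i * global_size
    __kernel void load_interleaved(__global const int *v, __global int *out)
    {
        int a[8];
        for (int i = 0; i < 8; ++i)
            a[i] = v[get_global_id(0) + i * get_global_size(0)];
        // ... computation on a[] ...
        out[get_global_id(0)] = a[0];
    }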
I expect that in my case contiguous would be faster because, as said above, one memory transaction can completely feed 8 work items with data. However, I do not know how the scheduler in the compute unit physically works: does it need all data to be ready for all SIMD lanes, or is just the first portion for 4 parallel SIMD elements enough? Nevertheless, I suppose it is smart enough to fully supply at least one CU with data first, since CUs may execute command flows independently. In the second case, by contrast, we need to perform 8 * global_size / 64 transactions to get the complete vectors.

So, my second question: is my assumption right?

Now, the practice.

Actually, I split the entire task into two kernels, because one part has less register pressure than the other and can therefore employ more work items. So first I played with the pattern in which the data is stored in the transition between the kernels (using vload8/vstore8 or casting to int8 gives the same result), and the result was somewhat strange: the kernel that reads data in the contiguous way works about 10% faster (both in CodeXL and by OS time measuring), but the kernel that stores data contiguously performs surprisingly slower. The overall time for the two kernels is then roughly the same. In my thoughts, both should behave the same way - either both slower or both faster - but these inverse results seemed unexplainable.
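
For the record, by "vload8/vstore8 or casting to int8" I mean forms like these (a trivial standalone example, not my actual kernels):

    __kernel void copy_vec8(__global const int *src, __global int *dst)
    {
        size_t gid = get_global_id(0);

        // Two equivalent ways to read one 8-int vector:
        int8 x = vload8(gid, src);
        // int8 x = *((__global const int8 *)src + gid);

        // Two equivalent ways to write it back:
        vstore8(x, gid, dst);
        // *((__global int8 *)dst + gid) = x;
    }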

And my third question is: can anyone explain such a result? Or maybe I am doing something wrong? (Or completely wrong?)

2 Answers


Well, this does not really answer all my questions, but some information found in the vastness of the internet puts things together in a clearer way, at least for me (unlike the above-mentioned AMD Optimization Guide, which seems unclear and sometimes confusing):

«the hardware performs some coalescing, but it's complicated...
memory accesses in a warp do not necessarily have to be contiguous, but it does matter how many 32 byte global memory segments (and 128 byte l1 cache segments) they fall into. the memory controller can load 1, 2 or 4 of those 32 byte segments in a single transaction, but that's read through the cache in 128 byte cache lines.
thus, if every lane in a warp loads a random word in a 128 byte range, then there is no penalty; it's 1 transaction and the reading is at full efficiency. but, if every lane in a warp loads 4 bytes with a stride of 128 bytes, then this is very bad: 4096 bytes are loaded but only 128 are used, resulting in ~3% efficiency.»
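
Translated into OpenCL terms, the bad case from that quote would look something like this (just an illustration, not my kernel; 32 ints = a 128-byte stride):

    // Each work item reads one int with a 128-byte stride, so every lane hits a
    // different 128-byte cache line and only 4 of every 128 fetched bytes are used.
    __kernel void strided_read(__global const int *data, __global int *out)
    {
        out[get_global_id(0)] = data[get_global_id(0) * 32];
    }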

So, for my case it does not really matter how the data is read/stored, as long as it is always contiguous, but the order in which the parts of the vectors are loaded may affect the subsequent command flow (re)scheduling by the compiler.
I can also imagine that the newer GCN architecture can do cached/coalesced writes, which is why my results differ from those suggested by that Optimization Guide.


Have a look at chapter 2.1 of the AMD OpenCL Optimization Guide. It focuses mostly on older-generation cards, but the GCN architecture has not completely changed, so it should still apply to your device (Polaris).

In general, AMD cards have multiple memory controllers, across which memory requests are distributed every clock cycle. If you, for example, access your values in column-major instead of row-major order, your performance will be worse because the requests are sent to the same memory controller. (By column-major I mean that a column of your matrix is accessed together by all the work items executed in the current clock cycle; this is what you refer to as coalesced vs. interleaved.) If you access one row of elements (meaning coalesced access) in a single clock cycle (meaning all work items access values within the same row), those requests should be distributed to different memory controllers rather than the same one.
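
As a rough illustration (a made-up example with a matrix `m` of width `w` and height `h`, one work item per column or per row respectively):

    // "Row" case: at a given loop step i, all work items read elements of the
    // same row, so neighbouring work items touch neighbouring addresses and the
    // requests spread over different memory controllers.
    __kernel void read_rows_together(__global const int *m, __global int *out,
                                     const int w, const int h)
    {
        int acc = 0;
        for (int i = 0; i < h; ++i)
            acc += m[i * w + get_global_id(0)];
        out[get_global_id(0)] = acc;
    }

    // "Column" case: at a given step i, all work items read elements of the same
    // column; neighbouring work items are w elements apart, so the requests tend
    // to land on the same memory controller.
    __kernel void read_columns_together(__global const int *m, __global int *out,
                                        const int w, const int h)
    {
        int acc = 0;
        for (int i = 0; i < w; ++i)
            acc += m[get_global_id(0) * w + i];
        out[get_global_id(0)] = acc;
    }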

Regarding alignment and cache line sizes, I wonder whether this really helps improve the performance. If I were in your situation, I would try to see whether I can optimize the algorithm itself, or whether I access the values often enough that it would make sense to copy them to local memory. But then again, it is hard to tell without any knowledge of what your kernels execute.
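
If the values are accessed repeatedly, a sketch of the local-memory variant could look like this (assuming a work-group size of 64 and 8 ints per work item; names are made up):

    __kernel void process(__global const int *src, __global int *dst)
    {
        __local int tile[64 * 8];   // one 8-int vector per work item of a 64-wide work-group

        // Cooperative, coalesced copy of the work-group's data into LDS.
        event_t e = async_work_group_copy(tile,
                                          src + get_group_id(0) * 64 * 8,
                                          64 * 8, 0);
        wait_group_events(1, &e);

        __local const int *v = tile + get_local_id(0) * 8;
        // ... repeated accesses to v[0..7] now hit LDS instead of global memory ...
        dst[get_global_id(0)] = v[0];
    }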

Best Regards,

Michael

  • Thank you for the answer. However, I am not talking about coalesced vs. interleaved access. Maybe my writing is not so clear, but the access is always coalesced - the difference is only in reading the data vector-wise vs. element-wise. I corrected the question to clarify a bit. – qpdb Oct 06 '17 at 17:45
  • @qpdb The thing you called contiguous is contiguous from the kernel's point of view and interleaved from the memory's point of view at a given cycle, so reading the first element of every work item can cache the remaining data. But on writing there isn't this behavior, so it becomes slower. The thing you called "interleaved" is actually contiguous at a given cycle for the memory, because the read/write scheduler (or whatever part combines reads/writes) can serve n work items that uniformly read/write over a large series of neighboring elements. – huseyin tugrul buyukisik Oct 06 '17 at 17:52
  • BTW, thanks again for pointing to the documentation. I learned from there: "Southern Island devices do not support coalesced writes; however, continuous addresses within work-groups provide maximum performance." This information looks doubly strange, since my experiments give an absolutely different result. Or do I understand the whole concept of "coalescing" entirely backwards? – qpdb Oct 06 '17 at 17:59
  • @huseyin tugrul buyukisik Why do you think it is interleaved in memory? I read it (sequentially) from the host, and it is contiguous. – qpdb Oct 06 '17 at 18:04
  • I meant the GPU's own memory, unless you are using USE_HOST_PTR or ALLOC_HOST_PTR for the buffers. – huseyin tugrul buyukisik Oct 06 '17 at 18:19
  • I use the same buffer that I initially allocated from the host side with CL_MEM_ALLOC_HOST_PTR – qpdb Oct 06 '17 at 18:53
  • Maybe the device then needs to be even more strict about access patterns when data is passing through PCIe? – huseyin tugrul buyukisik Oct 06 '17 at 19:16
  • That's what I am trying to understand. The host can put data in any pattern, but for the moment I am experimenting with patterns on the GPU side only, and it gives me quite ambiguous results. – qpdb Oct 06 '17 at 19:23
  • Maybe you could try a device-side buffer using only CL_MEM_READ_WRITE, without any host_ptr or alloc_host_ptr, and isolate this to device only or host only by comparing with this question's benchmarks? – huseyin tugrul buyukisik Oct 06 '17 at 19:25
  • I split the buffer in two, the second created with CL_MEM_READ_WRITE, but it does not change much, in either performance or ISA code. – qpdb Oct 11 '17 at 14:20