Memory coalescing in global writes

Question

In CUDA devices, is coalescing in global memory writes as important as coalescing in global memory reads? If yes, how can it be explained? Also are there differences between early generations of CUDA devices and most recent ones regarding this issue?

The problem of coalescence is extensively discussed in the CUDA C Programming Guide (Section 5.3.2) and the CUDA C Best Practice Guide (Section 9.2.1). The two guides cover also the problem of coalescence for different architectures. To avoid replicating the material, it would be more constructive, if you take a look at those documents and post the points that are obscure and require clarifications. — Vitality, Nov 25 '13 at 10:17

Robert Crovella · Accepted Answer · 2019-07-09T14:04:46.733

Coalesced writes (or lack thereof) can affect performance, just as coalesced reads (or lack thereof) can.

A coalesced read occurs when a read request triggered by a warp instruction, e.g.:

int i = my_int_data[threadIdx.x+blockDim.x*blockIdx.x];

can be satisified by a single read transaction in the memory controller (which is essentially saying all the individual thread reads are coming from a single cache line.)

A coalesced write occurs when a write request triggered by a warp instruction, e.g.:

my_int_data[threadIdx.x+blockDim.x*blockIdx.x] = i;

can be satisfied by a single write transaction in the memory controller.

For the above examples I have shown, there are no differences generationally.

But there are other types of reads or writes that could coalesce (i.e. collapse to a single memory controller transaction) in later devices but not in earlier devices. One example is a "broadcast read":

int i = my_int_data[0];

In the above example, all threads read from the same global location. In newer devices, such a read would be "broadcast" to all threads in a single transaction. In some earlier devices, this would result in a serialized servicing of threads. Such an example probably has no corollary in writes, because multiple threads writing to a single location gives undefined behavior. However a "scrambled" write may coalesce on newer devices but not older:

my_int_data[(threadIdx.x+5)%32] = i;

Note that all the writes above are unique (within the warp) and belonging to an individual cache line, but they do not satisfy the coalescing requirements on 1.0 or 1.1 devices, but should on newer devices.

If you read the global memory access description for devices of cc 1.0 and 1.1, and compare to later devices, you will see some of the requirements for coalescing on earlier devices that have been relaxed on later devices.

Thanks. Can you please explain more how the cache is involved in case of a write? You have pointed out in a coalesced read transaction, "all the individual thread reads are coming from a single cache line." So in case of a write, non-coalesced writes occupy several L2 cache lines, right? — Farzad, Nov 25 '13 at 18:07
Yes, a non-coalesced memory transaction spans more than one cache line, whether read or write. The cache iself is not at issue here. The cacheline is a fundamental quantum of interchange enforced by the memory controller. — Robert Crovella, Nov 25 '13 at 18:09

score 1 · Answer 2 · answered Nov 25 '13 at 17:42

1

We did this experiment in a course I conducted. Coalescing turned out to be moderately more important in writes than in reads perhaps because the L1 and L2 caches store some of the unused data for later use.

answered Nov 25 '13 at 17:42

Levi Barnes

357
3
12

Memory coalescing in global writes

2 Answers2