Why are CUDA memory allocations aligned to 256 bytes?

Question

According to the question "cuda alignment 256bytes seriously?", CUDA memory allocations are guaranteed to be aligned to at least 256 bytes.

Why is that the case? 256 bytes is much larger than any numeric data type. It might be the size of a vector, but GPUs do not require loads and stores to be aligned to the size of the whole vector. Indeed, they go so far as to support gather/scatter, where every individual element may be placed at any memory address that is a multiple of the element size.

What purpose does the 256-byte alignment serve?

As indicated in the question you linked, CUDA devices have a texture alignment requirement (and also a surface alignment requirement, among others). When I run `deviceQuery` on my T4, it reports a texture alignment requirement of 512 bytes. So one reason to provide this kind of allocation granularity is to support those kinds of needs. If you're asking "what is it about the texture system that requires an alignment of 512 bytes?" I won't be able to answer that. However, textures sometimes have spatial caching behavior, and a spatial cache could expect an alignment larger than that of a single data type. – Robert Crovella, Nov 17 '20 at 15:25

einpoklum · Accepted Answer · 2020-11-18T07:53:21.393


> Why is that the case? 256 bytes is much larger than any numeric data type.

Well, I'm sure there are multiple reasons (e.g. it's easier to manage fewer, larger allocations), but about your specific point: don't think about a single value of a numeric data type; think about a full warp's worth. If `sizeof(float)` is 4, then a warp's worth of floats is 32 * 4 = 128 bytes. And if it's a `double` or a 64-bit integer, then you get 32 * 8 = 256 bytes.
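The arithmetic above can be checked on the host with a few lines of C++. This is only a sketch: the names `kWarpSize` and `warp_footprint_bytes` are made up for illustration, and the warp size of 32 is an assumption that holds for NVIDIA GPUs to date (device code can query it via `warpSize`).

```cpp
#include <cstddef>

// Assumption: 32 threads per warp (hypothetical constant, not a CUDA API name).
constexpr std::size_t kWarpSize = 32;

// Bytes covered when each thread of a warp accesses one element
// and the elements are laid out contiguously.
constexpr std::size_t warp_footprint_bytes(std::size_t element_size) {
    return kWarpSize * element_size;
}

// 4-byte elements (e.g. float) cover 128 bytes per warp.
static_assert(warp_footprint_bytes(4) == 128, "32 * 4 = 128");
// 8-byte elements (e.g. double, 64-bit integers) cover 256 bytes per warp.
static_assert(warp_footprint_bytes(8) == 256, "32 * 8 = 256");
```

So a 256-byte boundary is the smallest alignment at which a warp's worth of 8-byte elements starts on a boundary of its own size.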

Note: It is not necessary for warps to make such coalesced reads of multiple values from memory. A single thread can read a single unaligned byte and that will work. But performance will suffer if the read pattern is not coalesced into reads of contiguous, aligned chunks (typically of 128 bytes or 32 bytes); see also:

In CUDA, what is memory coalescing, and how is it achieved?
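To make the cost of a misaligned warp read concrete, here is a rough host-side model in C++. It only counts how many aligned chunks a contiguous byte range overlaps. The function name `segments_touched` is hypothetical, and the 128-byte and 32-byte chunk sizes are the ones mentioned above. Real hardware behavior depends on the architecture and cache configuration, so treat this as an illustration rather than a performance model.

```cpp
#include <cstdint>

// Number of aligned segments of size `segment` overlapped by the byte range
// [base, base + bytes). Simplified model: one memory transaction per segment.
constexpr std::uint64_t segments_touched(std::uint64_t base,
                                         std::uint64_t bytes,
                                         std::uint64_t segment) {
    return bytes == 0
               ? 0
               : (base + bytes - 1) / segment - base / segment + 1;
}

// A warp reading 32 floats (128 bytes) from a 128-byte-aligned address
// stays within a single 128-byte segment.
static_assert(segments_touched(0, 128, 128) == 1, "aligned: one segment");
// The same read shifted by one float straddles two segments.
static_assert(segments_touched(4, 128, 128) == 2, "misaligned: two segments");
// With 32-byte segments, the aligned read needs 4 and the shifted one 5.
static_assert(segments_touched(0, 128, 32) == 4, "aligned: four sectors");
static_assert(segments_touched(4, 128, 32) == 5, "misaligned: five sectors");
```

Because allocations start on a 256-byte boundary, a warp that reads from the start of a buffer always falls into the aligned case in this model, for both chunk sizes.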

edited Nov 18 '20 at 07:53

answered Nov 17 '20 at 14:59

einpoklum


Right, but it's not like a warp is required to be stored at an address aligned to the size of the whole warp, is it? I thought the alignment requirement was only the size of the individual elements? – rwallace Nov 17 '20 at 15:51

No, that's not required, but it is very much encouraged, and there's a performance penalty if a full warp reads from memory that's not 128-byte aligned (or 32-byte aligned, it depends on the circumstances). The alignment requirement for a valid read is indeed the size of the element, but nobody said that was going to be fast :-( ... Also - why does this bother you? Remember, you shouldn't be making a lot of allocations anyway. They're slow and costly. – einpoklum Nov 17 '20 at 19:02
Aha! I did not realize that. It only bothered me because I didn't understand what was going on, so it didn't make sense. Okay, thanks! – rwallace Nov 18 '20 at 01:00
