According to the question "cuda alignment 256bytes seriously?", CUDA memory allocations are guaranteed to be aligned to at least 256 bytes.
Why is that the case? 256 bytes is much larger than any numeric data type. It might be the size of a vector, but GPUs do not require loads and stores to be aligned to the size of the whole vector. Indeed, they go so far as to support gather/scatter, where each individual element may be placed at any memory address that is a multiple of the element size.
What purpose does the 256-byte alignment serve?
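
For anyone who wants to check the alignment claim on their own setup, here is a minimal host-side sketch of how a pointer's alignment can be inspected. The helper names `is_aligned` and `address_alignment` are my own (hypothetical, not part of any CUDA API), and the code only examines the numeric address, so it says nothing about why the allocator picks 256 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if the address held in `ptr` is a multiple of
// `alignment` bytes. `alignment` must be a nonzero power of two.
inline bool is_aligned(const void* ptr, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}

// Hypothetical helper: largest power of two that divides the address
// (i.e. the actual alignment of this particular pointer); 0 for a null pointer.
inline std::size_t address_alignment(const void* ptr) {
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(ptr);
    return static_cast<std::size_t>(addr & (~addr + 1));  // isolate lowest set bit
}
```

In a CUDA program, the pointer returned by `cudaMalloc` could be passed to `is_aligned(ptr, 256)`. I would expect that to hold given the guarantee quoted above, but I have not verified what `address_alignment` reports on any particular device or driver.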