Suppose I have an array of data, for example an array of 3D vectors of size N. Suppose that each iteration of my SYCL kernel is exclusively or primarily concerned with only one vector. Which of the following ways of breaking this into contiguous buffers is, as a general rule, more efficient--or does it matter?
I realize the targeted device affects this a lot, so let's assume it's a discrete GPU (i.e. the data really does have to be copied to a different memory chip, and the device doesn't have some crazy architecture like an FPGA). I'm mainly targeting a GTX 1080 via CUDA, but I expect the answer is similar when the code compiles to OpenCL or targets another modern GPU.
1. Create a separate buffer for each coordinate, e.g. `sycl::buffer<float> x, y, z;`, each of size N. This way, when accessing them, I can use the `sycl::id<1>` passed to my kernel lambda as the index with no math. (I suspect the compiler may be able to optimize this; see the sketch after this list.)
2. Create one packed buffer for all of them, e.g. `sycl::buffer<float> coords;` with size 3N. When accessing it with a `sycl::id<1>` called `i`, I then grab the x coordinate as `buffer_accessor[3*i]`, the y coordinate as `buffer_accessor[3*i+1]`, and the z coordinate as `buffer_accessor[3*i+2]`. (I don't know whether the compiler can optimize this, and I'm not sure whether alignment issues come into play.)
3. Create one unpacked buffer using a struct, e.g. `struct Coord { float x, y, z; }; sycl::buffer<Coord> coords;`.
   If the struct ends up padded to 16 bytes (as it would be with a vector type such as `sycl::float3`), this has the rather alarming cost of increasing memory usage, in this example by 33%, which will also increase the time required to copy the buffer to the device. But the tradeoff is that you can access the data without manipulating the `sycl::id<1>`, the runtime only has to deal with one buffer, and there shouldn't be any cache-line alignment inefficiencies on the device.
4. Use a two-dimensional buffer of size (N, 3) and iterate only over the range of the first dimension. This is a less flexible solution, and I don't see why I'd want to use multidimensional buffers when I'm not iterating over all the dimensions, unless there is a lot of optimization built in for this use case.
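For concreteness, here is a minimal sketch of what options (1)-(3) look like inside a kernel, using SYCL 2020 accessor spellings and a made-up kernel that just computes each vector's squared length; the function and variable names are illustrative assumptions, not my actual code:

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>

struct Coord { float x, y, z; };  // option (3)

// Hypothetical kernel: squared length of each vector, written three ways.
void squared_lengths(sycl::queue& q, std::size_t N,
                     sycl::buffer<float>& x,       // option (1): three buffers,
                     sycl::buffer<float>& y,       // each of size N
                     sycl::buffer<float>& z,
                     sycl::buffer<float>& packed,  // option (2): one buffer, size 3N
                     sycl::buffer<Coord>& coords,  // option (3): one buffer, size N
                     sycl::buffer<float>& out) {   // result, size N
  // Option (1): one buffer per coordinate ("structure of arrays");
  // the work-item id is the index, no arithmetic needed.
  q.submit([&](sycl::handler& cgh) {
    sycl::accessor ax{x, cgh, sycl::read_only};
    sycl::accessor ay{y, cgh, sycl::read_only};
    sycl::accessor az{z, cgh, sycl::read_only};
    sycl::accessor ao{out, cgh, sycl::write_only};
    cgh.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
      ao[i] = ax[i] * ax[i] + ay[i] * ay[i] + az[i] * az[i];
    });
  });

  // Option (2): one packed buffer, indices computed as 3*i + component.
  q.submit([&](sycl::handler& cgh) {
    sycl::accessor ap{packed, cgh, sycl::read_only};
    sycl::accessor ao{out, cgh, sycl::write_only};
    cgh.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
      const std::size_t b = 3 * i[0];
      ao[i] = ap[b] * ap[b] + ap[b + 1] * ap[b + 1] + ap[b + 2] * ap[b + 2];
    });
  });

  // Option (3): one buffer of structs ("array of structures").
  q.submit([&](sycl::handler& cgh) {
    sycl::accessor ac{coords, cgh, sycl::read_only};
    sycl::accessor ao{out, cgh, sycl::write_only};
    cgh.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
      const Coord c = ac[i];
      ao[i] = c.x * c.x + c.y * c.y + c.z * c.z;
    });
  });
}
```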
I cannot find any guidelines on data architecture to get an intuition for this sort of thing. Right now (4) seems silly, (3) involves unacceptable memory waste, and I'm using (2), but I'm wondering whether I shouldn't be using (1) instead to avoid the id manipulation and the accesses in `3*sizeof(float)`-wide chunks.
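In case the host side matters for the comparison: if the host data starts out interleaved (an assumption for this sketch), option (2) can wrap the existing storage directly, while option (1) needs a de-interleaving pass first (or three separate host arrays from the start). A rough sketch of what I mean, again with purely illustrative names and the `std::vector<Coord>` starting point being an assumption:

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>
#include <vector>

struct Coord { float x, y, z; };  // tightly packed: sizeof(Coord) == 3 * sizeof(float)

int main() {
  const std::size_t N = 1 << 20;
  std::vector<Coord> host(N);  // the data as it originally exists: interleaved x,y,z triples

  {
    // Option (2): the interleaved storage can be wrapped directly as 3N floats
    // (assuming Coord really has no padding), so no extra host-side copy is needed.
    sycl::buffer<float> packed{reinterpret_cast<float*>(host.data()),
                               sycl::range<1>(3 * N)};

    // Option (1): needs a de-interleaving pass (or three host arrays to begin with)
    // before the three buffers can be created.
    std::vector<float> hx(N), hy(N), hz(N);
    for (std::size_t i = 0; i < N; ++i) {
      hx[i] = host[i].x;
      hy[i] = host[i].y;
      hz[i] = host[i].z;
    }
    sycl::buffer<float> x{hx.data(), sycl::range<1>(N)};
    sycl::buffer<float> y{hy.data(), sycl::range<1>(N)};
    sycl::buffer<float> z{hz.data(), sycl::range<1>(N)};

    // ... submit kernels as in the sketch above; the host vectors must remain
    // valid for the lifetime of the buffers, hence the enclosing scope ...
  }
}
```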