
Suppose my kernel takes 4 (or 3, or 2) unrelated float or double args, or that I want to access 4 separate floats from global memory. Will this cause 4 separate global memory accesses? Is accessing a single vector of 4 floats or doubles faster than accessing 4 separate ones? If so, am I better off packing them into a single vector and then, say, using #defines to reference the individual members?

If this does increase the performance, do I have to do it myself, or might the compiler be smart enough to automatically convert 4 separate float reads into a single vector for me? Is this what "auto-vectorization" is? I've seen auto-vectorization mentioned in a few documents, without detailed explanation of exactly what it does, except that it seems to be an optional performance optimization for CPUs only, not GPUs.

Josh
    What do you mean by "unrelated"? The memory engine usually requests memory in segments of, say, 128 bytes (this number depends on the hardware), so if all 4 floats are within 128 bytes from each other (consecutive), you'll only get one memory transaction. This is called coalescent memory access, auto vectorization is a completely unrelated thing. – user703016 Dec 14 '14 at 18:19
  • By "unrelated", I mean conceptually unrelated - it's not semantically intuitive to store them as a single vector. They're not, say, 4 coordinates of a point in space, or color values for a pixel, or otherwise the sort of thing that vectors are intended for. – Josh Dec 14 '14 at 18:32
  • Is it enough for all the floats to be within 128 bytes from each other, or do they also have to be in the same 128-byte-aligned block? Can I guarantee appropriate storage by making them a single vector rather than 4 separate values? – Josh Dec 14 '14 at 18:36
  • 128 bytes was just an example. You may want to [read this question](http://stackoverflow.com/questions/17924705/structure-of-arrays-vs-array-of-structures-in-cuda). – user703016 Dec 14 '14 at 18:38
  • I've done a few searches about coalesced memory reads, and haven't found anything explaining when and how it might happen and when you can rely on it. If, when reading one value, it happens to be adjacent to a value that will be read later, is the compiler smart enough to notice this and compile the second access not to read global memory a second time? If the second read isn't accessing global memory, what *is* it accessing? Private memory? – Josh Dec 14 '14 at 18:39
  • Depending on the hardware microarchitecture you can use scalar kernels, or you can use float4/float8/float16 structures within a vector kernel with fewer threads. – huseyin tugrul buyukisik Dec 14 '14 at 20:27
  • It totally depends on your hardware. Some OpenCL-capable devices do not do any wide vector reads at all (and OpenCL driver reports this fact as preferred vector length = 1), some got some fairly wide load/store capabilities and powerful swizzles, and for the latter, compiler might decide to merge sequential reads into a single vector read (I'm aware of at least one implementation which does this). – SK-logic Dec 14 '14 at 22:04

1 Answer


Whether vectors help depends on the kernel itself. If you need all four values at the same time (for example, at the start of the kernel or at the start of a loop), it's better to pack them: the components of a vector are stored sequentially, so all of them are filled by a single read.

On the other hand, when you need only some of the values, you can speed up execution by reading only what you actually need.

Another case is reading the values one by one, with each read separated by some computation (i.e. giving the GPU time to fetch the data).

Basically, these fetched segments behave like a buffer. If you have enough instances, the number of reads is the same in the optimal case, and what really counts is how well each read is used.

The compiler often unpacks these structures anyway, so the only speedup is that all your variables are stored contiguously: one read fills them all, and the rest of the fetched segment serves another instance.

As an example, I will use a 128-bit wide bus and 4 floats (32 bits each). For a packed vector, one read fills one instance:

 128b / (4 * 32b) = 1 instance/read

For scalar data types, there are N reads (N = number of variables); each read fills the same variable across as many instances as fit into the fetched segment:

 128b / 32b = 4 instances/read

So in my example, if you have 4 instances there will always be at least 4 reads no matter what, and the only thing you can do is hide the fetch time behind computation, if that is even possible.

blind.wolf