Both declarations take up the same amount of space. Either method should be usable from a code-correctness standpoint.
half2 should probably be preferred, for at least two reasons (both are illustrated in the sketch below):

1. When doing a warp-wide load, one half2 per thread will be a more efficient load (or store) than one half per thread.
2. Some operations like addition and multiplication will only achieve full (i.e. highest) throughput on the architectures that support such operations (cc 5.3, and cc 6.1 and higher) when done on a half2 type (ignoring Volta tensor core operations for this discussion).
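As a minimal sketch of both points, each thread handles one half2: the load and store are single 32-bit transactions instead of two 16-bit ones, and __hadd2 from cuda_fp16.h adds both packed values in one operation. The kernel and its names here are hypothetical, not from the question:

#include <cuda_fp16.h>

__global__ void add_scalar(const half2 *in, half2 *out, half2 a, int n2)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n2) {                  // n2 = number of half2 elements
        half2 v = in[idx];           // one 32-bit vectorized load per thread
        out[idx] = __hadd2(v, a);    // packed add: both half values at once
    }
}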
In other respects, the decision about whether to use one or the other is much the same as for any other vector type, such as int vs. int2. I'm not going to try to give a complete summary of the motivations for using vector types here.
The underlying memory storage pattern for the two examples given would be the same, so in general a properly aligned pointer to half should be castable (say, on proper/alternating index boundaries) to half2, and likewise a half2 pointer should be safely castable to half.
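For example, a sketch of the half-to-half2 direction; the array name is made up here, and the __align__(4) qualifier guarantees the alignment that the half2 view requires:

__device__ __align__(4) half hbuf[32];        // even element count, aligned for half2
...
half2 *pv = reinterpret_cast<half2 *>(hbuf);  // cast at an even index boundary
half2 pair = pv[0];                           // reads hbuf[0] and hbuf[1] together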
Even if you choose to use half2, usage of half may be appropriate or unavoidable in certain situations, for example if you need to modify an individual quantity, or for certain CUBLAS function calls. In that case, at least with the most recent versions of CUDA, it is possible to convert half2 to half (and vice versa) as you would with any other vector type:
__device__ half2 array[16];
...
half2 myval = array[0];
half first = myval.x;
half second = myval.y;
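In the other direction, two half quantities can be packed back into a half2 using the __halves2half2 intrinsic from cuda_fp16.h (continuing the variable names above):

half2 repacked = __halves2half2(first, second);  // first -> low half, second -> high half
array[0] = repacked;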
and as already mentioned, a half2 pointer can be safely cast to half:
half2 *data = array+2;
half *hdata = reinterpret_cast<half *>(data);
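After that cast, the half view aliases the same storage, so (continuing the snippet above):

half h0 = hdata[0];  // the same 16-bit quantity as the low half (.x) of array[2]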
This somewhat related question may be of interest also.