With auto-vectorization enabled, this array initialization:
alignas(64)
const float a[16]={
b[i+0],b[i+1],b[i+2],b[i+3], // normal initialization
// self-referencing for duplicated data
a[0],a[1],a[2],a[3],
a[0],a[1],a[2],a[3],
a[0],a[1],a[2],a[3]
};
runs faster than this:
alignas(64)
const float a[16]={
b[i+0],b[i+1],b[i+2],b[i+3], // normal initialization
b[i+0],b[i+1],b[i+2],b[i+3], // normal initialization
b[i+0],b[i+1],b[i+2],b[i+3], // normal initialization
b[i+0],b[i+1],b[i+2],b[i+3], // normal initialization
};
Are there any caveats to using the first version? Could it cause bugs with some compilers?
When I try the same with a transposed version (where columns are duplicated instead), the compiler generates the same CPU instructions for both versions. Is this a side effect of the order in which the compiler initializes the elements, or is it a limitation of the CPU architecture (i.e. no efficient instruction for that pattern)?
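For reference, the transposed variant could look roughly like the sketch below. The function names, the `out` parameter, and the exact element order are assumptions made for illustration and are not the original code. The self-referencing form also assumes that the initializer-clauses of a braced list are evaluated in the order they are written (C++11 and later), which is worth verifying for the compiler and language standard in use.

```cpp
#include <cstddef>

// Hypothetical helpers: names and layout are assumptions, not code from the question.

// Transposed layout, self-referencing form: the first element of each group of
// four is loaded from b, and the other three refer back to that element.
// Assumes initializer-clauses are evaluated in order (C++11 and later).
void fill_transposed_selfref(const float* b, std::size_t i, float* out) {
    alignas(64) const float a[16] = {
        b[i + 0], a[0],  a[0],  a[0],
        b[i + 1], a[4],  a[4],  a[4],
        b[i + 2], a[8],  a[8],  a[8],
        b[i + 3], a[12], a[12], a[12]
    };
    for (std::size_t k = 0; k < 16; ++k) out[k] = a[k];
}

// Transposed layout, direct form: every element is read from b.
void fill_transposed_direct(const float* b, std::size_t i, float* out) {
    alignas(64) const float a[16] = {
        b[i + 0], b[i + 0], b[i + 0], b[i + 0],
        b[i + 1], b[i + 1], b[i + 1], b[i + 1],
        b[i + 2], b[i + 2], b[i + 2], b[i + 2],
        b[i + 3], b[i + 3], b[i + 3], b[i + 3]
    };
    for (std::size_t k = 0; k < 16; ++k) out[k] = a[k];
}
```

Both functions should fill `out` with the same values, so any difference between them would show up only in the generated instructions, not in the result.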