The title is in lack of a better name, and I am not sure I managed to explain myself clearly enough. I am looking for a way to access a "data type" via an index, but not force the compiler to keep it in an array. The problem occurs in writing a low-level code based on SSE/AVX intrinsics.
For ease of programming I would like to write code as the following, with fixed-length loops over "registers" (data type __m512
):
inline void load(__m512 *vector, const float *in)
{
for(int i=0; i<24; i++)
vector[i] = _mm512_load_ps((in + i*SIMD_WIDTH));
}
// similarely: inline add(...) and inline store(...)
void add(float *in_out, const float *in)
{
__m512 vector1[24];
__m512 vector2[24];
load(vector1, in_out);
load(vector2, in);
add(vector1, vector2);
store(in_out, vector1);
}
The fact that vector1
and vector2
are defined as arrays appears to be troublesome for the compiler (icc
in my case): it seems to be forced to make it "addressable", keeps it on the stack, and thus generates a lot of load
and store
instructions that I do not need. As far is I understand this is to allow for pointer arithmetics with vector1
or vector2
.
I would like the compiler to keep everything only in registers. I know that this is possible: if I write the code without the __m512
arrays but many individual __m512
variables the compiler generates better code.
Solutions:
I (unsuccessfully) tried using a class like the following (I had hoped that the switch-statement and everything else would get optimized out):
class Vector
{
inline __m512 & operator[](int i)
{
switch(i)
case 0: return component0;
// ...
case 23: return component23;
}
__m512 component0;
// ...
__m512 component23;
};
I have also considered macros, but could not find a good solution.
Any suggestions?
Thanks,
Simon
Following a comment in an answer below, here a more detailed example of what I want to do (this is still a simplification though):
inline void project(__m512 *projected_vector, __m512 *vector)
{
for(int i=0; i<3; i++)
projected_vector[i] = _mm512_add_ps(vector[i], vector[i+3]);
}
inline void matrix_multiply(__m512 *out, const float *matrix, __m512 *in)
{
for(int i=0; i<3; i++)
{
out[i] = _mm512_mul_ps( matrix[3*i+0], in[0]);
out[i] = _mm512_fmadd_ps(matrix[3*i+1], in[1], out[i]);
out[i] = _mm512_fmadd_ps(matrix[3*i+2], in[2], out[i]);
}
}
inline void reconstruct(__m512 *vector, __m512 *projected_vector)
{
for(int i=0; i<3; i++)
vector[i] = _mm512_add_ps(vector[i], projected_vector[i]);
for(int i=0; i<3; i++)
vector[i+3] = _mm512_sub_ps(vector[i], projected_vector[i]);
}
inline void hopping_term(float *in_out, const float *matrix_3x3, const float *in)
{
__m512 vector_in[6];
__m512 vector_out[6];
__m512 half_vector1[3];
__m512 half_vector2[3];
load(vector_in, in);
project(half_vector1, vector_in);
matrix_multiply(half_vector2, matrix_3x3, half_vector1);
load(vector_out, in_out);
reconstruct(vector_out, half_vector2);
store(in_out, vector_out);
}