I'm working on a convolution and, in particular, I'm trying to speed up its execution. To do so, I'm using a SIMD instruction that performs two multiplications at the same time: one result is placed in the upper 32 bits of a 64-bit variable and the other in the lower 32 bits. The problem is that the new code does not seem to behave like the old one.
The original code contains this for loop:
int32_t v32;
int16_t arr_2[1024];
int16_t data[96];
int32_t accu;
...
for(int j=0; j<INPUT_F; j++){
v32 = arr_2[l*OUT_F+j]*data[k*K*INPUT_F+(l-i+K/2)*INPUT_F+j];
accu += v32;
}
...
My question is: apart from the multiplication function, are the other operations in the new code below equivalent to the original, or am I doing something wrong?
uint64_t v64;
int16_t arr_2[1024];
int16_t data[96];
int32_t accu;
...
for(int j=0; j<INPUT_F/2; j++){
v64 = __mul(arr_2[l*OUT_F+2*j],data[k*K*INPUT_F+(l-i+K/2)*INPUT_F+2*j]); //use a simd instruction to perform mul between two consecutive values in the arrays.
accu += (int32_t)(v64 & 0xFFFFFFFF); //first value (lower 32 bits)
accu += (int32_t)((v64 >> 32) & 0xFFFFFFFF); //second value (upper 32 bits)
}
...
__mul() is defined as uint64_t __mul(uint32_t a, uint32_t b);
Even though the operands are declared as uint32_t, the instruction treats each operand as two packed int16_t values.
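
To make the intended behaviour concrete, here is a minimal portable model of a dual 16x16 multiply and of the loop built on it. It is only a sketch under assumptions: pack2, mul_model, dot_packed and dot_scalar are hypothetical names, and the lane order (product of the low half-words in the low 32 bits) is assumed rather than taken from the documentation of __mul.

```c
#include <stdint.h>

/* Hypothetical helper: pack two adjacent int16_t values into one uint32_t.
   Assumption: p[0] goes to the low 16 bits, p[1] to the high 16 bits. */
static uint32_t pack2(const int16_t *p)
{
    return (uint32_t)(uint16_t)p[0] | ((uint32_t)(uint16_t)p[1] << 16);
}

/* Hypothetical software model of the dual multiply: each 32-bit operand
   holds two signed 16-bit lanes; the two 32-bit products are returned
   side by side in a 64-bit word. The uint -> int16_t casts rely on the
   two's-complement wraparound that mainstream compilers provide. */
static uint64_t mul_model(uint32_t a, uint32_t b)
{
    int32_t lo = (int32_t)(int16_t)(a & 0xFFFFu) * (int16_t)(b & 0xFFFFu);
    int32_t hi = (int32_t)(int16_t)(a >> 16) * (int16_t)(b >> 16);
    return ((uint64_t)(uint32_t)hi << 32) | (uint32_t)lo;
}

/* Two products per iteration; n is assumed to be even. */
static int32_t dot_packed(const int16_t *x, const int16_t *y, int n)
{
    int32_t accu = 0;
    for (int j = 0; j < n / 2; j++) {
        uint64_t v64 = mul_model(pack2(&x[2 * j]), pack2(&y[2 * j]));
        accu += (int32_t)(uint32_t)(v64 & 0xFFFFFFFFu); /* low lane  */
        accu += (int32_t)(uint32_t)(v64 >> 32);         /* high lane */
    }
    return accu;
}

/* Scalar reference, one product per iteration. */
static int32_t dot_scalar(const int16_t *x, const int16_t *y, int n)
{
    int32_t accu = 0;
    for (int j = 0; j < n; j++)
        accu += (int32_t)x[j] * y[j];
    return accu;
}
```

In this model both operands are packed words. Passing a single array element instead would convert one int16_t to uint32_t, so the upper half-word would hold only sign-extension bits rather than the next element.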