/* Inner product. Accumulate in temporary */
void inner4(vec_ptr u, vec_ptr v, data_t *dest)
{
long i;
long length = vec_length(u);
data_t *udata = get_vec_start(u);
data_t *vdata = get_vec_start(v);
data_t sum = (data_t) 0;
for (i = 0; i < length; i++) {
sum = sum + udata[i] * vdata[i];
}
*dest = sum;
}
Write a version of the inner product procedure described in the above problem that uses 6 × 1a loop unrolling . For x86-64, our measurements of the unrolled version give a CPE of 1.07 for integer data but still 3.01 for both floating-point data.
My code for 6*1a version of loop unrolling
void inner4(vec_ptr u, vec_ptr v, data_t *dest){
long i;
long length = vec_length(u);
data_t *udata = get_vec_start(u);
data_t *vdata = get_vec_start(v);
long limit = length -5;
data_t sum = (data_t) 0;
for(i=0; i<limit; i+=6){
sum = sum +
((udata[ i ] * vdata[ i ]
+ udata[ i+1 ] * vdata[ i+1 ])
+ (udata[ i+2 ] * vdata[ i+2 ]
+ udata[ i+3 ] * vdata[ i+3 ]))
+ ((udata[ i+4 ] * vdata[ i+4 ])
+ udata[ i+5 ] * vdata[ i+5 ]);
}
for (i = 0; i < length; i++) {
sum = sum + udata[i] * vdata[i];
}
*dest = sum;
}
Question: Explain why any (scalar) version of an inner product procedure running on an Intel Core i7 Haswell processor cannot achieve a CPE less than 1.00.
Any idea how to solve the problem?