I'm reading What Every Programmer Should Know About Memory and trying to understand the example on page 97:

    #include <stdlib.h>
    #include <stdio.h>
    #include <emmintrin.h>

    #define N 1000
    double res[N][N] __attribute__ ((aligned (64)));
    double mul1[N][N] __attribute__ ((aligned (64)));
    double mul2[N][N] __attribute__ ((aligned (64)));

    /* CLS (cache line size in bytes) is not defined in this listing;
       the paper supplies it at compile time, e.g. -DCLS=64. */
    #define SM (CLS / sizeof (double))

    int
    main (void)
    {
        // ... Initialize mul1 and mul2
        int i, i2, j, j2, k, k2;
        double *restrict rres;
        double *restrict rmul1;
        double *restrict rmul2;

        for (i = 0; i < N; i += SM)
            for (j = 0; j < N; j += SM)
                for (k = 0; k < N; k += SM)
                    for (i2 = 0, rres = &res[i][j], rmul1 = &mul1[i][k]; i2 < SM;
                         ++i2, rres += N, rmul1 += N)
                    {
                        _mm_prefetch (&rmul1[8], _MM_HINT_NTA);
                        for (k2 = 0, rmul2 = &mul2[k][j]; k2 < SM; ++k2, rmul2 += N)
                        {
                            __m128d m1d = _mm_load_sd (&rmul1[k2]);
                            m1d = _mm_unpacklo_pd (m1d, m1d);
                            for (j2 = 0; j2 < SM; j2 += 2)
                            {
                                __m128d m2 = _mm_load_pd (&rmul2[j2]);
                                __m128d r2 = _mm_load_pd (&rres[j2]);
                                _mm_store_pd (&rres[j2],
                                              _mm_add_pd (_mm_mul_pd (m2, m1d), r2));
                            }
                        }
                    }

        // ... use res matrix
        return 0;
    }

I think I understand the non-vectorized example from p. 50, but in the vectorized version I can't understand this instruction: _mm_prefetch (&rmul1[8], _MM_HINT_NTA);. I looked through Intel's documentation and found that _mm_prefetch in this case marks an address as non-temporal data, so, as I understand it, the processor will not try to fetch it into the cache, saving some space for other data. What I don't understand is what the number 8 stands for. Why should rmul1 + 8 not be cached? I think it is somehow connected with the size of the __m128d (128-bit XMM) register divided by the size of a double (which equals 8), but I'm not sure. Even if that is the case, it is unclear why this intrinsic is needed here.
Can someone explain this to me?