
I'm reading What Every Programmer Should Know About Memory, and I'm trying to understand this example from page 97:

#include <stdlib.h>
#include <stdio.h>
#include <emmintrin.h>
#define N 1000
double res[N][N] __attribute__ ((aligned (64)));
double mul1[N][N] __attribute__ ((aligned (64)));
double mul2[N][N] __attribute__ ((aligned (64)));
#define SM (CLS / sizeof (double))
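/* CLS is not defined in this snippet; the paper supplies it on the compile
   command line (e.g. -DCLS=$(getconf LEVEL1_DCACHE_LINESIZE)), so SM is the
   number of doubles per cache line. */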
int
main (void)
{
  // ... Initialize mul1 and mul2
  int i, i2, j, j2, k, k2;
  double *restrict rres;
  double *restrict rmul1;
  double *restrict rmul2;
  for (i = 0; i < N; i += SM)
    for (j = 0; j < N; j += SM)
      for (k = 0; k < N; k += SM)
        for (i2 = 0, rres = &res[i][j], rmul1 = &mul1[i][k]; i2 < SM; ++i2, rres += N, rmul1 += N)
        {
          _mm_prefetch (&rmul1[8], _MM_HINT_NTA);
          for (k2 = 0, rmul2 = &mul2[k][j]; k2 < SM; ++k2, rmul2 += N)
          {
            __m128d m1d = _mm_load_sd (&rmul1[k2]);
            m1d = _mm_unpacklo_pd (m1d, m1d);
            for (j2 = 0; j2 < SM; j2 += 2)
            {
              __m128d m2 = _mm_load_pd (&rmul2[j2]);
              __m128d r2 = _mm_load_pd (&rres[j2]);
              _mm_store_pd (&rres[j2], _mm_add_pd (_mm_mul_pd (m2, m1d), r2));
            }
          }
        }

  // ... use res matrix
  return 0;
}

I think I understand the non-vectorized example from p. 50, but in the vectorized example I can't understand this instruction: `_mm_prefetch (&rmul1[8], _MM_HINT_NTA);`. I looked through Intel's documentation and found that `_mm_prefetch` here marks an address as non-temporal data, so the processor will not try to fetch it into the cache, saving space for other data. What I don't understand is what stands behind the number 8. Why should `rmul1 + 8` not be cached? I think it's somehow connected with the size of an `__m128d` (a 128-bit XMM register) divided by the size of a double (which equals 8), but I'm not sure. Even then it's unclear why such an intrinsic is required here.
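
For reference, here is the intrinsic in isolation as I understand it from the documentation (just a toy snippet; the pointer argument names a byte inside the cache line to be prefetched, and `_MM_HINT_NTA` is the non-temporal locality hint):

#include <xmmintrin.h>

static double buf[16];

int
main (void)
{
  /* Hint to the CPU: start loading the cache line that contains &buf[8],
     using the non-temporal (NTA) hint. This is only a hint and does not
     change what the program computes. */
  _mm_prefetch ((const char *) &buf[8], _MM_HINT_NTA);
  return 0;
}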

Can someone explain this to me?

  • IIRC (and I'm not sure, hence not an answer), marking NTA means it's for short term use, so shouldn't stay in the caches (see 6.3.2 in the article). It's asking for the data at `rmul1 + 8` (which I'm guessing is the next batch?) to be fetched, but not be kept in higher caches (as you aren't going to need it). – Hasturkun Jul 17 '22 at 15:58
  • It can't skip cache entirely because this is reading from a normal memory region (WB-cacheable, thus strongly-ordered). (Related: [Non-temporal loads and the hardware prefetcher, do they work together?](https://stackoverflow.com/q/32103968) - It's prefetching into L1d cache and one way of L3, on Nehalem and newer CPUs since they have inclusive L3 caches.) @Hasturkun It's just minimizing pollution, not preventing it from being cached. – Peter Cordes Jul 17 '22 at 16:25
  • `8` doubles is a full cache line (64 bytes). An `__m128d` is only 16 bytes wide, 2 doubles. You can't divide bits by bytes. So I think it's prefetching for the next iteration of the loop it's in (which does quite a bit of other work, so that may be far enough ahead to be useful.) – Peter Cordes Jul 17 '22 at 16:27
  • I thought about the cache line size, but why did the author not use the SM/CLS constants in this case? – 0e39bf7b Jul 17 '22 at 17:55
  • @PeterCordes you are right, I have an error in the calculations – 0e39bf7b Jul 18 '22 at 03:36
  • IDK, writing it as `1*SM` would make sense to me, to tune in terms of number of cache lines ahead. – Peter Cordes Jul 18 '22 at 03:48
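
Tying the comments together, here is a minimal sketch of the arithmetic they describe, assuming a 64-byte cache line (the value `getconf LEVEL1_DCACHE_LINESIZE` reports on typical x86 machines, and the `CLS` the paper defines at compile time):

#include <stdio.h>

/* Stand-in for the value normally passed as -DCLS=... at compile time. */
#define CLS 64
#define SM (CLS / sizeof (double))

int
main (void)
{
  /* One cache line holds SM = 8 doubles, so rmul1[0..SM-1] is the line the
     inner k2 loop is reading, and &rmul1[8] (== &rmul1[SM]) is the first
     double of the following cache line of mul1. */
  printf ("SM = %zu doubles per cache line\n", SM);
  printf ("&rmul1[8] is %zu bytes past rmul1, i.e. %zu cache line(s) ahead\n",
          8 * sizeof (double), 8 * sizeof (double) / CLS);
  return 0;
}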

0 Answers