
I have a large block of data to calculate:

static float source0[COUNT];
static float source1[COUNT];
static float result[COUNT];    /* result[i] = source0[i] * source1[i]; */

size_t i, s0, s1, r;

s0 = (size_t)source0;
s1 = (size_t)source1;
r  = (size_t)result;

They are all 32-byte aligned.
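(For reference, a minimal way to guarantee that alignment in C11 with GCC or Clang; this sketch is not from the original post:)

#include <stdalign.h>

/* alignas(32) makes the arrays suitable for aligned 256-bit AVX accesses. */
static alignas(32) float source0[COUNT];
static alignas(32) float source1[COUNT];
static alignas(32) float result[COUNT];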

The corresponding SSE code:

/* Intel-syntax inline asm: build with -masm=intel.
   i is a byte offset, so each iteration handles 16 bytes = 4 floats. */
for(i = 0; i < COUNT * sizeof(float); i += 16)
{
    __asm volatile
    (
        "movntdqa xmm0, [%0]\n\t"   /* NT load (SSE4.1, integer-typed) */
        "movntdqa xmm1, [%1]\n\t"
        "mulps xmm1, xmm0\n\t"
        "movntps [%2], xmm1"        /* NT store, bypasses the cache */
        : : "r"(s0 + i), "r"(s1 + i), "r"(r + i) : "xmm0", "xmm1", "memory"
    );
}
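
The same loop can also be written with intrinsics instead of inline asm, as one of the comments below suggests. A sketch under my own naming (`mul_sse` and the element count `count` are not from the original post), assuming SSE4.1 for the NT load:

#include <stddef.h>
#include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 */

void mul_sse(const float *s0, const float *s1, float *r, size_t count)
{
    for (size_t i = 0; i < count; i += 4)
    {
        /* NT loads are integer-typed, so reinterpret the float data. */
        __m128 a = _mm_castsi128_ps(_mm_stream_load_si128((__m128i *)(s0 + i)));
        __m128 b = _mm_castsi128_ps(_mm_stream_load_si128((__m128i *)(s1 + i)));
        _mm_stream_ps(r + i, _mm_mul_ps(a, b));    /* NT store */
    }
    _mm_sfence();   /* order the weakly-ordered NT stores */
}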

The corresponding AVX code:

for(i = 0; i < COUNT * sizeof(float); i += 32)   /* 32 bytes = 8 floats */
{
    __asm volatile
    (
        "vmovapd ymm0, [%0]\n\t"    /* plain aligned load */
        "vmovapd ymm1, [%1]\n\t"
        "vmulps ymm1, ymm1, ymm0\n\t"
        "vmovntps [%2], ymm1"       /* 256-bit NT store */
        : : "r"(s0 + i), "r"(s1 + i), "r"(r + i) : "ymm0", "ymm1", "memory"
    );
}
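
(An intrinsics sketch of the AVX loop, again with my own hypothetical names. A plain aligned load stands in for the missing 256-bit NT load, the point discussed below, while the 256-bit NT store, vmovntps ymm / _mm256_stream_ps, is available in AVX itself:)

#include <stddef.h>
#include <immintrin.h>   /* AVX */

void mul_avx(const float *s0, const float *s1, float *r, size_t count)
{
    for (size_t i = 0; i < count; i += 8)
    {
        /* No 256-bit NT load before AVX2, so use normal aligned loads... */
        __m256 a = _mm256_load_ps(s0 + i);
        __m256 b = _mm256_load_ps(s1 + i);
        /* ...but the 256-bit NT store is plain AVX. */
        _mm256_stream_ps(r + i, _mm256_mul_ps(a, b));
    }
    _mm_sfence();
}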

The result is that the AVX code always takes nearly the same time as the SSE code, although both are much faster than plain C code. I think the major reason is that "vmovapd" has no "NT" counterpart until the AVX2 extension, which causes too much d-cache pollution.

Is there any better way to exploit the power of AVX (not AVX2)?

Gary Yin

  • How big is `COUNT`? This looks entirely memory-bound. – Mysticial Sep 20 '15 at 03:20
  • The data block size is 32MB. – Gary Yin Sep 20 '15 at 03:30
  • Yeah, you're completely memory bound. See this: http://stackoverflow.com/a/18159503/922184 – Mysticial Sep 20 '15 at 03:33
  • Is it _absolutely necessary_ for you to store all multiplication results before proceeding? You could 1) do extra work with the data you stream (additions/masks/comparisons?), which would increase the cost of a loop iteration and thus reduce your memory bandwidth requirements, or 2) cut up your data into less-than-cache-sized chunks. – Iwillnotexist Idonotexist Sep 20 '15 at 03:42
  • Thanks, I see. The memory bandwidth is the reason. My hardware is a Core i5 2520M @ 2.5GHz (~3.2GHz) with 4GB of DDR3-1333. – Gary Yin Sep 20 '15 at 03:43
  • What IwillnotexistIdonotexist is talking about is called cache blocking, or loop tiling (sketched after these comments). It's a very valuable technique, and it means you should stop using movnt, because you want your data in cache. Also, the Intel manual says `VMOVNTPS ymm` is AVX, not AVX2. Did you try it? And why are you using integer and double loads when your data is float? It won't cause a bypass delay on Intel, but `movntps` has the shortest encoding (2 bytes shorter than `movntdqa`). Also, you could do this with intrinsics just as easily; then you wouldn't have to use `asm volatile` and hinder optimization. – Peter Cordes Sep 20 '15 at 06:55
  • @Peter Cordes: 1) You misunderstand me. I just don't want to use the d-cache, because the floating-point data is NOT temporal; it will not be used in the near future. 2) I said that "VMOVAPD" has no related "NT" version, and "VMOVNTDQA" does NOT support the 256-bit YMM registers in the AVX extension. So if I want to load floating-point data into a 256-bit YMM register, I can use "VMOVAPD". Although it's defined for doubles, it works for floats too (I don't care about floating-point exceptions when loading). – Gary Yin Sep 20 '15 at 13:08
  • @GaryYin: **1)** We're telling you to look at restructuring the code this loop is part of so it works in cache-sized blocks. *If* that's possible, you can see huge performance gains. If not, then yes, NT moves are appropriate. **2)** You don't get FP exceptions while loading with any instruction. You're right that there's no 256-bit NT load until AVX2, oops. I'm not sure your non-temporal *loads* are any different from normal loads on WB memory on Sandybridge. They're mostly useful for reading from (WC) video memory, getting the CPU to briefly cache a whole cache line. – Peter Cordes Sep 20 '15 at 15:01
  • The main reason NT stores are more important than NT loads is that they change the cache-coherency semantics. Being weakly ordered, the CPU doesn't have to read-for-ownership on a cache line as soon as you write to part of it. So what's going on with your code is that 128b SSE with NT stores is already sufficient to saturate your memory bandwidth. The fact that AVX runs at the *same* speed, not slower, is proof that NT loads aren't important. It takes about 8B of read or write per clock to saturate main memory. Your SSE code could do 2*128b / clock, or AVX 3*256b / 2 clocks on SnB. – Peter Cordes Sep 20 '15 at 15:07
  • Your operation is memory bound, but your implementation of it [is nevertheless not getting the maximum bandwidth](http://stackoverflow.com/questions/25179738/measuring-memory-bandwidth-from-the-dot-product-of-two-arrays). Put `#pragma omp parallel for` before your for loop and compile with `-O3 -fopenmp` (see the OpenMP sketch below). – Z boson Sep 25 '15 at 07:45

0 Answers