
The following loops transpose an integer matrix into another integer matrix. Interestingly, when I compiled it, GCC generated `movaps` instructions to store the results into the output matrix. Why does GCC do this?

data:

int __attribute__(( aligned(16))) t[N][M]  
  , __attribute__(( aligned(16))) c_tra[N][M];

loops:

/* row0..row3 and __t0..__t3 are __m128i; i and j are int */
for( i=0; i<N; i+=4){
    for(j=0; j<M; j+=4){

        /* load a 4x4 block of 32-bit ints, one row per vector */
        row0 = _mm_load_si128((__m128i *)&t[i][j]);
        row1 = _mm_load_si128((__m128i *)&t[i+1][j]);
        row2 = _mm_load_si128((__m128i *)&t[i+2][j]);
        row3 = _mm_load_si128((__m128i *)&t[i+3][j]);

        /* interleave 32-bit elements of adjacent rows */
        __t0 = _mm_unpacklo_epi32(row0, row1);
        __t1 = _mm_unpacklo_epi32(row2, row3);
        __t2 = _mm_unpackhi_epi32(row0, row1);
        __t3 = _mm_unpackhi_epi32(row2, row3);

        /* interleave 64-bit halves; transposed values back into row0..row3 */
        row0 = _mm_unpacklo_epi64(__t0, __t1);
        row1 = _mm_unpackhi_epi64(__t0, __t1);
        row2 = _mm_unpacklo_epi64(__t2, __t3);
        row3 = _mm_unpackhi_epi64(__t2, __t3);

        /* store the block transposed: columns j..j+3 become rows of c_tra */
        _mm_store_si128((__m128i *)&c_tra[j][i], row0);
        _mm_store_si128((__m128i *)&c_tra[j+1][i], row1);
        _mm_store_si128((__m128i *)&c_tra[j+2][i], row2);
        _mm_store_si128((__m128i *)&c_tra[j+3][i], row3);
    }
}
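
For completeness, here is a self-contained version that compiles and runs; the dimensions, fill pattern, and the check at the end are placeholders for illustration, not part of the original code:

    #include <stdio.h>
    #include <emmintrin.h>  /* SSE2 intrinsics used below */

    #define N 8  /* placeholder sizes: multiples of 4, and N == M, */
    #define M 8  /* since the stores index c_tra[j][i]             */

    static int t[N][M]     __attribute__((aligned(16)));
    static int c_tra[N][M] __attribute__((aligned(16)));

    int main(void)
    {
        __m128i row0, row1, row2, row3, __t0, __t1, __t2, __t3;
        int i, j;

        for (i = 0; i < N; i++)        /* fill with distinct values */
            for (j = 0; j < M; j++)
                t[i][j] = i * M + j;

        for (i = 0; i < N; i += 4) {
            for (j = 0; j < M; j += 4) {
                row0 = _mm_load_si128((__m128i *)&t[i][j]);
                row1 = _mm_load_si128((__m128i *)&t[i+1][j]);
                row2 = _mm_load_si128((__m128i *)&t[i+2][j]);
                row3 = _mm_load_si128((__m128i *)&t[i+3][j]);

                __t0 = _mm_unpacklo_epi32(row0, row1);
                __t1 = _mm_unpacklo_epi32(row2, row3);
                __t2 = _mm_unpackhi_epi32(row0, row1);
                __t3 = _mm_unpackhi_epi32(row2, row3);

                row0 = _mm_unpacklo_epi64(__t0, __t1);
                row1 = _mm_unpackhi_epi64(__t0, __t1);
                row2 = _mm_unpacklo_epi64(__t2, __t3);
                row3 = _mm_unpackhi_epi64(__t2, __t3);

                _mm_store_si128((__m128i *)&c_tra[j][i], row0);
                _mm_store_si128((__m128i *)&c_tra[j+1][i], row1);
                _mm_store_si128((__m128i *)&c_tra[j+2][i], row2);
                _mm_store_si128((__m128i *)&c_tra[j+3][i], row3);
            }
        }

        puts(c_tra[5][2] == t[2][5] ? "transposed" : "wrong");
        return 0;
    }

Built with the same flags as below (gcc -Wall -msse4.2 -O2), this should print "transposed".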

Generated assembly (inner loop):

.L39:
    lea rcx, [rsi+rdx]
    movdqa  xmm1, XMMWORD PTR [rdx]
    add rdx, 16
    add rax, 2048
    movdqa  xmm6, XMMWORD PTR [rcx+rdi]
    movdqa  xmm3, xmm1
    movdqa  xmm2, XMMWORD PTR [rcx+r9]
    punpckldq   xmm3, xmm6
    movdqa  xmm5, XMMWORD PTR [rcx+r10]
    movdqa  xmm4, xmm2
    punpckhdq   xmm1, xmm6
    punpckldq   xmm4, xmm5
    punpckhdq   xmm2, xmm5
    movdqa  xmm5, xmm3
    punpckhqdq  xmm3, xmm4
    punpcklqdq  xmm5, xmm4
    movdqa  xmm4, xmm1
    punpckhqdq  xmm1, xmm2
    punpcklqdq  xmm4, xmm2
    movaps  XMMWORD PTR [rax-2048], xmm5
    movaps  XMMWORD PTR [rax-1536], xmm3
    movaps  XMMWORD PTR [rax-1024], xmm4
    movaps  XMMWORD PTR [rax-512], xmm1
    cmp r11, rdx
    jne .L39

Compiler command: gcc -Wall -msse4.2 -masm=intel -O2 -c -S (Skylake, Linux Mint).

-mavx2 or -march=native generates the VEX encoding: vmovaps.

– Amiri

1 Answer


Functionally, those instructions are the same. I don't like to copy and paste other people's statements as my own, so here are a few links explaining it:

Difference between MOVDQA and MOVAPS x86 instructions?

https://software.intel.com/en-us/forums/intel-isa-extensions/topic/279587

http://masm32.com/board/index.php?topic=1138.0

https://www.gamedev.net/blog/615/entry-2250281-demystifying-sse-move-instructions/

Short version:

So for the most part, you should try to use the move instruction that corresponds with the operations you are going to use on those registers. However, there is an additional complication. Loads and stores to and from memory execute on a separate port from the integer and floating point units; thus instructions that load from memory into a register or store from a register into memory will experience the same delay regardless of the data type you attach to the move. Thus in this case, movaps, movapd, and movdqa will have the same delay no matter what data you use. Since movaps (and movups) is encoded in binary form with one less byte than the other two, it makes sense to use it for all reg-mem moves, regardless of the data type.
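
You can check this in isolation with a pair of one-line store functions (the function names are mine; the exact codegen depends on the GCC version, but -O2 should reproduce the behavior from the question):

    #include <emmintrin.h>

    /* movaps vs. movdqa store encodings (from the Intel SDM):
     *   movaps m128, xmm  ->  0F 29 /r      (3-byte opcode)
     *   movdqa m128, xmm  ->  66 0F 7F /r   (4 bytes: extra 0x66 prefix)
     * Same operation, one byte shorter, so GCC prefers movaps for stores.
     */
    void store_int(__m128i *dst, __m128i v) { _mm_store_si128(dst, v); }
    void store_float(float *dst, __m128 v)  { _mm_store_ps(dst, v); }

Compiling with gcc -O2 -S -masm=intel should show movaps XMMWORD PTR [rdi], xmm0 in both functions, even though the first one uses an integer intrinsic.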

So it is a GCC size optimization.

– Anty
    It's actually Intel and AMD recommended code generation practice. In fact, for modern CPUs Intel recommends you always use ``movups`` since aligned & unaligned loads have the same performance--aligned writes matter more. See [Intel](http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html) and [AMD](http://developer.amd.com/resources/developer-guides-manuals/) software optimization guides. – Chuck Walbourn Feb 15 '17 at 17:28
  • @ChuckWalbourn `movups` and `movaps` have only had the same performance since Nehalem. But even that's misleading, because a non-VEX `movups` load cannot be folded into an ALU instruction as a memory operand, so really only `vmovaps` is obsolete (see the sketch after these comments). So are you sure that's Intel's and AMD's recommendation? Maybe they mean to always use `vmovups` if your hardware supports it. – Z boson Feb 16 '17 at 08:54
  • @ChuckWalbourn I searched through the Intel manual you pointed to, but I did not find the recommendation you mentioned. What section are you referring to? I also searched for `vmovaps`, and it's shown several times in code, so even Intel still uses it. – Z boson Feb 16 '17 at 09:00
  • Intel 11.6.3. You can certainly use ``movaps`` when you know for sure it's aligned, but the point is that there's no longer a major perf penalty for unaligned loads like there used to be. `vmovaps` is just ``movaps`` using the AVX VEX prefix, and the AVX optimizations tend to focus on aligned memory operations. The other use of ``movaps`` is register-to-register moves, but I don't see many code samples of the non-VEX version of ``movaps`` being used for memory loads, while there is a lot of use of ``movups``. – Chuck Walbourn Feb 16 '17 at 17:32
  • The AMD64 manual hasn't been updated since 2005, so it still has the "use ``movaps`` instead of ``movups``" advice in section 9.4, but newer family guides point out that even when accessing aligned memory, ``movaps`` and `movups` have the same performance (AMD Family 16h, 2.5.2). Really that's the meat of it: it used to really matter that you used ``movaps`` for aligned memory pointers and ``movups`` for unaligned memory, but now ``movups`` is safer. It is the same speed when the memory is aligned, and won't throw an exception when the memory is unaligned. It makes the compiler codegen a little easier. – Chuck Walbourn Feb 16 '17 at 17:42
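
A minimal sketch of the folding point raised in the comments above (function names are mine, and the codegen notes assume non-AVX SSE; exact output varies by compiler version):

    #include <emmintrin.h>

    /* With non-VEX SSE, an aligned load can fold into the ALU op:
     *     paddd  xmm0, XMMWORD PTR [rdi]   ; memory operand must be 16B-aligned
     * An unaligned load cannot, so it stays a separate movdqu/movups:
     *     movdqu xmm1, XMMWORD PTR [rdi]
     *     paddd  xmm0, xmm1
     * With AVX (-mavx), vpaddd permits unaligned memory operands, which is
     * why only the *aligned* VEX moves are arguably obsolete.
     */
    __m128i add_aligned(const __m128i *p, __m128i v)
    {
        return _mm_add_epi32(v, _mm_load_si128(p));
    }

    __m128i add_unaligned(const void *p, __m128i v)
    {
        return _mm_add_epi32(v, _mm_loadu_si128((const __m128i *)p));
    }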