7

I'm trying to learn about vectorization, and more specifically how well compilers vectorize, by studying simple C code compiled with gcc at -O3. It is a personal journey towards being able to verify gcc -O3 performance on more complex computation. I understand that the conventional wisdom is that compilers are better than people at this, but I never take such wisdom for granted.

In my first simple test, though, I'm finding some of the choices gcc makes quite strange and, quite honestly, grossly negligent in terms of optimization. I'm willing to assume the compiler is being purposeful and knows something about the CPU (an Intel i5-2557M in this case) that I do not, but I need some confirmation from knowledgeable people.

My simple test code (segment) is:

int i;
float a[100];

for (i=0;i<100;i++) a[i]= (float) i*i;

The resulting assembly code (segment) that corresponds to the for-loop is as follows:

.L6:                        ; loop starts here
    movdqa  xmm0, xmm1      ; copy packed integers in xmm1 to xmm0
.L3:
    movdqa  xmm1, xmm0      ; wait, what!?  WHY!?  this is redundant.
    cvtdq2ps    xmm0, xmm0  ; convert integers to float
    add rax, 16             ; increment memory pointer for next iteration
    mulps   xmm0, xmm0      ; packed square of the floats in xmm0
    paddd   xmm1, xmm2      ; pack increment all integers by 4 
    movaps  XMMWORD PTR [rax-16], xmm0   ; store result 
    cmp rax, rdx            ; test loop termination
    jne .L6                 

I understand all the steps, and computationally all of it makes sense. What I don't understand, though, is why gcc chooses to keep, inside the iterative loop, a step that loads xmm1 from xmm0 right after xmm0 was loaded from xmm1, i.e.:

.L6:
        movdqa  xmm0, xmm1      ; loop starts here
.L3:
        movdqa  xmm1, xmm0      ; grrr!

This alone makes me question the sanity of the optimizer. Obviously, the extra MOVDQA does not disturb the data, but at face value it seems grossly negligent on the part of gcc.

Earlier in the assembly code (not shown), xmm0 and xmm2 are initialized to values meaningful for the vectorization, so obviously, at the onset of the loop, the code has to skip the first MOVDQA. But why doesn't gcc simply rearrange the code as shown below?

.L3:
        movdqa  xmm1, xmm0     ; initialize xmm1 PRIOR to loop
.L6:
        movdqa  xmm0, xmm1     ; loop starts here

Or even better, simply initialize xmm1 instead of xmm0 and dump the MOVDQA xmm1, xmm0 step altogether!

I am prepared to believe that the CPU is smart enough to skip the redundant step or something like that, but how can I trust gcc to fully optimize complex code if it can't even get this simple code right? Or can someone provide a sound explanation that would give me faith that gcc -O3 is good stuff?

Marcus Müller
codechimp
  • @Down voters: please comment why. – Stefan Feb 24 '15 at 21:48
  • Did you compile with optimizations turned on? On some optimization levels, the redundant move operation is eliminated. – Thomas Matthews Feb 24 '15 at 21:49
  • 1
    Are you sure that your code is faster than the compilers? Have you tried timing them? – Degustaf Feb 24 '15 at 21:49
  • 4
    @Stefan Not being one of the downvoters, I would guess that it is because this question verges on being a rant against gcc. – Degustaf Feb 24 '15 at 21:51
  • C compilers are generally considered worse than humans at vectorisation. That's why GCC offers intrinsics. – Tommy Feb 24 '15 at 22:25
  • @Tommy: My experience is that gcc and clang do this fairly well these days - some 8-10 years ago the code generation wasn't very clever at all, but now you have to either be a bit lucky or very clever [or cheat and do something different than the C code says] to beat the compiler (on platforms with wide support at least, e.g. ARM and x86; I'm not so sure about the more obscure processor architectures). – Mats Petersson Feb 24 '15 at 22:32
  • 2
    What version of gcc is being used? – Michael Burr Feb 24 '15 at 22:33
  • @MatsPetersson I was specifically thinking of my experiences building for ARM and contrasting raw compiler output with that achieved if you use Apple's Accelerate Framework. However Apple uses LLVM nowadays so I can no longer substantiate my prejudice. – Tommy Feb 24 '15 at 22:36
  • On modern x86's, `register-register` moves are often done by register-renaming, which means they take no micro-ops. While the move may be redundant, it's unlikely to significantly impact performance. On the other hand, sometimes instructions can have false dependencies on their output (`popcnt` on Intel micro-architectures, for example). If `cvtdq2ps` had a false output dependency on the target micro-architecture, the version with the move could even be faster in a loop. – EOF Feb 24 '15 at 22:49
  • gcc-4.8 seems to be the only version that generates this code. – Marc Glisse Feb 24 '15 at 22:58
  • 1
    .L3 seems to be a redundant label that might've been left over from some optimization pass (possibly removing the entry condition to the for-loop), so I'm not surprised that other stuff got left behind. After all, compilers aren't perfect. – Mysticial Feb 24 '15 at 23:50
  • @Mysticial there is a jump to .L3 just above, in the code that is not shown. That is where the loop is entered. – Marc Glisse Feb 25 '15 at 07:39
  • The version of gcc is 4.8.2. – codechimp Feb 25 '15 at 15:26
  • On the question of real speed improvement, I don't know if the code runs faster. What I'm trying to understand is whether the compiler's logic is sound, or whether the conventional wisdom about optimization by compilers vs. people is indeed valid. It goes into whether I choose Fortran or C for my ultimate goal, which is more complex and massively compute-intensive than this simple test. – codechimp Feb 25 '15 at 15:39

1 Answer

4

I'm not 100% sure, but it looks like your loop destroys xmm0 by converting it to float, so you have to keep the integer value in xmm1 and then copy it over to another register (in this case xmm0).

Whilst compilers are known to sometimes issue unnecessary instructions, I can't really see how this is the case in this instance.

If you want xmm0 (or xmm1) to remain integer, then don't cast the first i to float; cast the product instead. Perhaps what you wanted to do is:

 for (i=0;i<100;i++) 
    a[i]= (float)(i*i);

But on the other hand, gcc 4.9.2 doesn't seem to emit the redundant move:

g++ -S -O3 floop.cpp

.L2:
    cvtdq2ps    %xmm1, %xmm0
    mulps   %xmm0, %xmm0
    addq    $16, %rax
    paddd   %xmm2, %xmm1
    movaps  %xmm0, -16(%rax)
    cmpq    %rbp, %rax
    jne .L2

Nor does clang (3.7.0 from about 3 weeks ago):

 clang++ -S -O3 floop.cpp


    movdqa  .LCPI0_0(%rip), %xmm0   # xmm0 = [0,1,2,3]
    xorl    %eax, %eax
    .align  16, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
    movd    %eax, %xmm1
    pshufd  $0, %xmm1, %xmm1        # xmm1 = xmm1[0,0,0,0]
    paddd   %xmm0, %xmm1
    cvtdq2ps    %xmm1, %xmm1
    mulps   %xmm1, %xmm1
    movaps  %xmm1, (%rsp,%rax,4)
    addq    $4, %rax
    cmpq    $100, %rax
    jne .LBB0_1

Code that I have compiled:

extern int printf(const char *, ...);

int main()
{
    int i;
    float a[100];

    for (i=0;i<100;i++)
        a[i]= (float) i*i;

    for (i=0; i < 100; i++)
        printf("%f\n", a[i]);
}

(I added the printf to avoid the compiler getting rid of ALL the code)

Mats Petersson
  • But that's what actually happens. If you look at the assembly you can see that xmm0 is converted to float, squared, and saved. The question is why the compiler overwrites xmm1 after the loop jump. – Marandil Feb 24 '15 at 21:57
  • Ah, good point. So, it's just another case of "writing compilers is hard". If you fancy a challenge, I'd say you can have a go at finding where this happens in gcc, and propose a fix. – Mats Petersson Feb 24 '15 at 22:02
  • 2
    Or just upgrade to a more recent gcc, perhaps? – Mats Petersson Feb 24 '15 at 22:13
  • That is useful to know that the latest version of gcc is better. I did not notice that I was using an older version. And that the optimization is still being improved. – codechimp Feb 25 '15 at 15:41
  • I don't doubt that writing compilers is hard. That is why I'm actually inclined to use inline assembly where necessary. But there would be considerable learning on my part, not only the instruction set but also how the CPU executes optimized code. When I probed the internet on whether it was worth the trouble, the answer seemed to be "no". But I wanted to see for myself, and ended up with this query. – codechimp Feb 25 '15 at 15:49
  • On the question of type casting, I added the (float) to further improve the compiled code; otherwise, it was trying to do the (i*i) as more complex integer arithmetic, which almost erased any gain from vectorization. I was also disappointed that the compiler didn't simply treat i as FP, which would have simplified the code some (by eliminating the type conversion). But I was willing to cut it some slack on that because of the potential effect on precision. – codechimp Feb 25 '15 at 16:04
  • Yes, the compiler needs to perform the multiplication in integer unless you specifically ask it NOT to, since float is missing about 8 bits of "precision" relative to integer. If you have an SSE4.1-capable CPU, then there is PMULLD, but of course the compiler also needs to know to generate that instruction. – Mats Petersson Feb 25 '15 at 22:33
  • And I have just raised a bug on LLVM/Clang that the loop vectorizer doesn't generate PMULLD for the above loop when using `(float)(i*i)`. – Mats Petersson Feb 25 '15 at 23:27