
I'm trying to optimize my code down to the last possible cycle, and I'm wondering whether the loop counter type affects performance when it is used for array indexing.

I've done some experiments with the following program that just fills an array with 0:

#include <stdint.h>
#include <stdlib.h>
#include <emmintrin.h>

int main(int argc, char **argv)
{
  typedef int CounterType;
  //typedef int64_t CounterType;   // the other variant I tested

  CounterType N = atoi(argv[1]);
  uint8_t volatile dummy[N + 16];
  __m128i v = _mm_set1_epi8(0);
  for (int j = 0; j < 1000000; ++j)
  {
    #pragma nounroll
    for (CounterType i = 0; i <= N; i += CounterType(16))
    {
        _mm_storeu_si128((__m128i *)&dummy[i], v);
    }
  }
  return 0;
}
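To switch between the two counter types per build instead of editing the typedef, a preprocessor flag works; the USE_INT64_COUNTER macro below is only an illustration, not something the measurements depend on:

// drop-in replacement for the two typedef lines above
#ifdef USE_INT64_COUNTER
typedef int64_t CounterType;
#else
typedef int CounterType;
#endif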

Using different loop counter types (CounterType) and different compilers, I've recorded the assembly code of the inner loop and the performance, measured with hardware performance counters ("perf stat a.out 32768"). I'm running on a Xeon 5670.

GCC4.9, int

.L3:
movups  %xmm0, (%rax)
addq    $16, %rax
movl    %eax, %edx
subl    %esi, %edx
cmpl    %ecx, %edx
jle     .L3

 4,127,525,521      cycles                    #    2.934 GHz
12,304,723,292      instructions              #    2.98  insns per cycle

GCC4.9, int64

.L7:
movups  %xmm0, (%rcx,%rax)
addq    $16, %rax
cmpq    %rax, %rdx
jge     .L7
4,123,315,191      cycles                    #    2.934 GHz
8,206,745,195      instructions              #    1.99  insns per cycle

ICC11, int64

..B1.6:
movdqu    %xmm0, (%rdx,%rdi)
addq      $16, %rdx
incq      %rcx
cmpq      %rbx, %rcx
jb        ..B1.6        # Prob 82%                      #24.5
2,069,719,166      cycles                    #    2.934 GHz
5,130,061,268      instructions

(faster because of micro-op fusion?)

ICC11, int

..B1.6:                         # Preds ..B1.4 ..B1.6
 movdqu    %xmm0, (%rdx,%rbx)                            #29.38
 addq      $16, %rdx                                     #24.37
 cmpq      %rsi, %rdx                                    #24.34
 jle       ..B1.6        # Prob 82%                      #24.34
4,136,109,529      cycles                    #    2.934 GHz                
8,206,897,268      instructions    

ICC13, int & int64

movdqu    %xmm0, (%rdi,%rax)                            #29.38
addq      $16, %rdi                                     #24.37
cmpq      %rsi, %rdi                                    #24.34
jle       ..B1.7       
4,123,963,321      cycles                    #    2.934 GHz
8,206,083,789      instructions              #    1.99  insns per cycle

The data seems to suggest that int64 is faster. Maybe this is because it matches the pointer size and therefore avoids any conversions. But I'm not convinced of that conclusion. Another possibility might be that the compiler decided in some cases to do the loop comparison before the store to achieve more parallelism, at the cost of one extra instruction (since x86's two-operand instructions are destructive). But that would be incidental, and not fundamentally caused by the loop variable type.
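To make that concrete, here is roughly what the int64 inner loop reduces to once the index has been strength-reduced to a pointer (a hand-written illustration only, not compiler output; fill16 is a made-up name, and N is assumed to be non-negative):

#include <stdint.h>
#include <emmintrin.h>

static void fill16(uint8_t *dummy, int64_t N, __m128i v)
{
    uint8_t *p   = dummy;
    uint8_t *end = dummy + N;
    do {
        _mm_storeu_si128((__m128i *)p, v);  // movups %xmm0, (%rcx,%rax)
        p += 16;                            // addq   $16, %rax
    } while (p <= end);                     // cmpq   %rax, %rdx ; jge .L7
}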

Can someone explain this mystery (preferably someone knowledgeable about compiler transformations)?

There was also a claim in the CUDA C Best Practices Guide that signed loop counters are easier to generate code for than unsigned ones. That doesn't seem relevant here, because there are no multiplies in the inner loop for address calculation; that expression gets turned into an induction variable. But apparently CUDA prefers using a multiply-add to compute addresses, since MADD is one instruction just like an add, and it can cut register use by one.
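For reference, the kind of case that claim is about looks like this (my own minimal example in the spirit of the guide's, not copied from it):

// With a signed counter, overflow of i*stride is undefined behavior, so the
// compiler may strength-reduce the address to a pointer that simply advances
// by stride elements each iteration.
void store_signed(float *dst, int n, int stride)
{
    for (int i = 0; i < n; ++i)
        dst[i * stride] = 0.0f;
}

// With a 32-bit unsigned counter, i*stride must wrap modulo 2^32, and the
// compiler has to preserve that, which can block the same strength reduction.
void store_unsigned(float *dst, unsigned n, unsigned stride)
{
    for (unsigned i = 0; i < n; ++i)
        dst[i * stride] = 0.0f;
}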

– Yale Zhang

3 Answers


Yes, the loop variable type can affect efficiency.

Let me suggest an even better solution with GCC.

#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

void distance(uint8_t* dummy, size_t n, const __m128 v0)
{
    // Count a negative index up towards zero so the loop branch can test
    // the flags set by the add itself, with no separate cmp.
    intptr_t i;
    for(i = -n; i < 0; i += 4) {
        _mm_store_ps(&((float*)dummy)[i+n], v0);
    }
}

With GCC 4.9.2 and GCC 5.3, this produces the following main loop:

.L5:
        vmovaps %xmm0, (%rdi,%rax)
        addq    $16, %rax
        js      .L5

Clang 3.6, however, still generates the cmp:

.LBB0_2:                                # =>This Inner Loop Header: 
        vmovaps %xmm0, 8(%rdi,%rax)
        addq    $4, %rax
        cmpq    $-4, %rax
        jl      .LBB0_2

Clang 3.7 unrolls four times and also uses a cmp.

ICC 13 unrolls twice and uses a cmp, so only GCC manages to do this without the unnecessary cmp instruction.
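A minimal harness to try the function could look like this (just a sketch: it assumes n is a multiple of 4 and allocates a 16-byte-aligned buffer of n floats, since _mm_store_ps is an aligned store):

#include <stdint.h>
#include <stdlib.h>
#include <immintrin.h>

void distance(uint8_t* dummy, size_t n, const __m128 v0);   // defined above

int main(void)
{
    size_t n = 32768;                                     // number of floats
    uint8_t *buf = aligned_alloc(16, n * sizeof(float));  // C11 aligned allocation
    if (!buf) return 1;
    distance(buf, n, _mm_set1_ps(0.0f));
    free(buf);
    return 0;
}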

– Z boson
  • @PaulR, well that's weird now that I fixed it. GCC just uses `memset` with `-O3` now but not with `-O2`. Clang uses `memset` even with `-O2`. – Z boson Dec 18 '15 at 09:02
  • Yes, I was just playing around on [godbolt](http://gcc.godbolt.org) and saw the same thing. – Paul R Dec 18 '15 at 09:06
  • You can get it to work with all versions of gcc and `-O3` if you hide the fact that the vector is zero: http://goo.gl/P1JQbG – Paul R Dec 18 '15 at 09:10
  • @PaulR, Thank you! I used your function in my answer. I hope you don't mind. – Z boson Dec 18 '15 at 09:15
  • That's neat that you can get compilers to make loops without a `cmp`; I'd given up on getting compilers to do that. C says that compilers are allowed to re-order memory operations. Does that include just looping in reverse order, so they can simply count the index down from n-1 to 0? – Peter Cordes Dec 18 '15 at 09:38
  • Also, @paul and ZB: totally off topic, but are you guys aware of any good SO post about why stuff like `lea eax, [rdi + rsi*4 - 15]` works even if there is garbage in the upper32 of the 64bit src regs? I'm in the middle of writing up a Q&A about it, since it wasn't obvious to me. (Since I started to ramble about it in http://stackoverflow.com/questions/34058101/referencing-the-contents-of-a-memory-location-x86-addressing-modes, and wanted to write that part up as a separate answer since I didn't find an existing post about it.) – Peter Cordes Dec 18 '15 at 09:42
  • @Zboson: sure - no problem! – Paul R Dec 18 '15 at 10:23
  • @PeterCordes: can't help with that specific issue, but your "ramblings" made for interesting reading (as always!). – Paul R Dec 18 '15 at 10:26
  • @Paul and ZB: Here's the Q&A I was talking about, now that I've finally posted it: http://stackoverflow.com/questions/34377711/which-2s-complement-integer-operations-can-be-used-without-zeroing-high-bits-in. Hopefully there isn't something like that already written, because I spent many hours keeping it from ballooning to twice as long or something. :P – Peter Cordes Dec 20 '15 at 04:10
  • Very neat trick. I knew about iterating in reverse to compare with 0, but that's not cache friendly. But the programmer should definitely not have to do that. This should be a machine-specific compiler optimization pass for architectures that have register + register addressing. – Yale Zhang Sep 18 '20 at 23:00

gcc 4.9.2 does a really poor job of compiling the version with an int loop counter. gcc 5.1 and later on godbolt make sane loops:

    call    strtol
    mov     edx, eax
   ...
    xor     eax, eax
.L7:
    movups  XMMWORD PTR [rcx+rax], xmm0
    add     rax, 16
    cmp     edx, eax
    jge     .L7        ; while(N >= (idx+=16))

This can run at one iteration per cycle on Intel CPUs (other than L1 cache miss bottlenecks), even if the store doesn't micro-fuse (since cmp/jge macro-fuses into a single uop).

I'm not sure why gcc 4.9.2 makes such a dumb loop. It decides it wants to increment a pointer, but then it subtracts the start address every time to compare against N, instead of computing an end address and using it as the loop condition. It computes i from its pointer into the array using 32-bit operations, which is actually safe since gcc only wants a 32-bit result; the upper 32 bits of the inputs wouldn't affect the low 32 bits of the result even if gcc had done the math in 64 bits.
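Written back as C, gcc 4.9.2's loop is doing roughly this (my paraphrase of the asm in the question, not actual generated code; it assumes N >= 0):

#include <stdint.h>
#include <emmintrin.h>

static void fill_like_gcc49(uint8_t *dummy, int N, __m128i v)
{
    uint8_t *p = dummy;
    int i;
    do {
        _mm_storeu_si128((__m128i *)p, v);  // movups %xmm0, (%rax)
        p += 16;                            // addq   $16, %rax
        i = (int)(p - dummy);               // movl %eax, %edx ; subl %esi, %edx
    } while (i <= N);                       // cmpl %ecx, %edx ; jle .L3
}

Keeping a separate end pointer, or just comparing the offset register directly against N the way gcc 5.1's loop above does, removes the per-iteration subtraction.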

– Peter Cordes
  • It can be even saner. You can set rcx to the end of the array and start rax at -N and then count up to zero and then `jnz` without using the cmp. – Z boson Dec 18 '15 at 08:29
  • @Zboson: yup, if you're going to use an indexed addressing mode anyway, that's a good way to do it. In this case, it should only make an observable difference with hyperthreading, when sharing a core with code that bottlenecks on ALU execution ports, though. Well, and code size / uop cache consumption. – Peter Cordes Dec 18 '15 at 09:12
  • In the OP's example it may not matter, but in other cases it can. I think the OP was looking for a general answer, not just the one for his/her question. – Z boson Dec 18 '15 at 09:17

As far as I'm aware, loop types do not interfere with performance and execution speed; for optimization, the only things that count are:

  1. Don't let the loop run more than required
  2. Break if a specific condition is fulfilled

For filling a 2D array with numbers, if you do the above two things, your execution complexity will be

(number of elements) * (number of commands within loop)

Each line in your loop counts as +1 to the number of commands.

That's optimization from the programming point of view; the only other way to make it faster is to have a better processor that can execute more commands per second, but that's up to the user.

EDIT:

Note that in some cases it's faster to use a pointer to the array and fill the elements within a single loop rather than having two loops, as in the sketch below. C allows a lot of variations of the same algorithm.
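For example, an illustrative sketch of the single-loop idea:

#include <stddef.h>

// Fill a 2D array through one pointer running over its contiguous storage
// instead of two nested index loops (hypothetical helper, for illustration).
void fill2d(int *a, size_t rows, size_t cols, int value)
{
    int *p   = a;
    int *end = a + rows * cols;
    while (p != end)
        *p++ = value;
}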

– Survaf93