Can Compiler Optimize Loop with Variable Length?

Question

Can the compiler optimize loops if the last index of the loops (a and b in the following example) are not known at compile time?

Unoptimized:

int* arr = new int[a*b];
for (i = 0; i < a; ++i){
    for(j = 0; j < b; ++j){
        arr[i*b+j] *= 8;
    }
}

//delete arr after done.

More Optimized: (assuming a and b are large...)

int c = a*b;
int* arr = new int[c];
for (i = 0; i < c; ++i){
        arr[c] *= 8;
}

//delete arr after done.

Theoretically yes, provided that the first example is fixed, and the compiler can determine that none of the variables can otherwise be modified. Practically, a given compiler may or may not do that. No single answer will apply to every compiler. — Sam Varshavchik, Sep 27 '16 at 12:14
And you probably want to initialize the array instead of read uninitialized values. — Jarod42, Sep 27 '16 at 12:15
Your program has undefined behavior. In your first example, let's say `a` and `b` are 1 and 7 respectively. The resulting array will be allocated to hold 7 integers. The first assignment to your array will be `arr[0*1+7] = 8` Which is equivalent to `arr[8] = 0`. You just wrote past the end of your buffer. I went on an edited your question, because I assume you meant `arr[ i * a + j] = 8` — selbie, Sep 27 '16 at 12:25
@selbie In my country `0*1+7` does not equal `8`. Also assigning `8` to an array item is not equivalent to assigning `0` to any item, neither the same nor any other. — CiaPan, Sep 27 '16 at 12:41
It's still a buffer overrun. I meant to say `arr[7] =8` But the valid array indices would be [0..6] of an array allocated of length 7. Do you see what I mean? I'll be happy to explain more. — selbie, Sep 27 '16 at 12:45

score 1 · Answer 1 · edited May 23 '17 at 10:27

1

Yes it probably can, given that the sizes are constant and do not change in your loop, as it happens here. Read Optimize "for" loop for more please.

FYI, in your fist example, this:

arr[j*a+b] *= 8;

should be this:

arr[j*a+i] *= 8;

edited May 23 '17 at 10:27

Community

1
1

answered Sep 27 '16 at 12:15

gsamaras

71,951
46
188
305

...or rather `arr[i*b+j] *= 8l` for the sake of memory access locality. – CiaPan Sep 27 '16 at 12:43
Sure @CiaPan, but let's for first make it right, before optimizing! :) – gsamaras Sep 27 '16 at 12:57
Some how these three loops in following link doesn't give the same binary. https://godbolt.org/g/W4KSlC – hamster on wheels Sep 27 '16 at 21:01
I see one loop in the linked question and one in the accepted answer @rxu.. :/ – gsamaras Sep 27 '16 at 21:05
The compiler sometimes can't fuse loop if extent is not literal. I asked a new question about that. ( http://stackoverflow.com/questions/39758300/loop-is-not-vectorized-when-variable-extent-is-used) – hamster on wheels Sep 28 '16 at 23:50

score 1 · Answer 2 · answered Sep 27 '16 at 12:19

1

Modern compilers can definitely change the order of the two array, to prevent unneeded cache misses, from:

for (i = 0; i < a; ++i){
    for(j = 0; j < b; ++j){
        arr[j*a+i] *= 8;
    }
}

to this:

for(j = 0; j < b; ++j){
    for (i = 0; i < a; ++i){
        arr[j*a+i] *= 8;
    }
}

After this optimizations, the two examples (compared to your manual optimization) shouldn't measurably differ in performance.

answered Sep 27 '16 at 12:19

2501

25,460
4
47
87

but it might not always fuse the two loops into a single loop. – hamster on wheels Sep 28 '16 at 23:52

score 1 · Accepted Answer · answered Sep 27 '16 at 12:22

If you treat the array as linear space, gcc (and presumably others) will optimise even without knowing the extents at compile time.

This code:

void by8(int* arr, int a, int b)
{
  auto extent = a * b;
  for (int i = 0; i < extent; ++i)
  {
    arr[i] *= 8;
  }
}

compiles to this (notice how the inner part of the loop is vectorised)

by8(int*, int, int):
        imull   %esi, %edx
        testl   %edx, %edx
        jle     .L23
        movq    %rdi, %rax
        andl    $31, %eax
        shrq    $2, %rax
        negq    %rax
        andl    $7, %eax
        cmpl    %edx, %eax
        cmova   %edx, %eax
        cmpl    $8, %edx
        jg      .L26
        movl    %edx, %eax
.L3:
        sall    $3, (%rdi)
        cmpl    $1, %eax
        je      .L15
        sall    $3, 4(%rdi)
        cmpl    $2, %eax
        je      .L16
        sall    $3, 8(%rdi)
        cmpl    $3, %eax
        je      .L17
        sall    $3, 12(%rdi)
        cmpl    $4, %eax
        je      .L18
        sall    $3, 16(%rdi)
        cmpl    $5, %eax
        je      .L19
        sall    $3, 20(%rdi)
        cmpl    $6, %eax
        je      .L20
        sall    $3, 24(%rdi)
        cmpl    $7, %eax
        je      .L21
        sall    $3, 28(%rdi)
        movl    $8, %ecx
.L5:
        cmpl    %eax, %edx
        je      .L27
.L4:
        leal    -1(%rdx), %r8d
        movl    %edx, %r9d
        movl    %eax, %r10d
        subl    %eax, %r9d
        subl    %eax, %r8d
        leal    -8(%r9), %esi
        shrl    $3, %esi
        addl    $1, %esi
        leal    0(,%rsi,8), %r11d
        cmpl    $6, %r8d
        jbe     .L7
        leaq    (%rdi,%r10,4), %r10
        xorl    %eax, %eax
        xorl    %r8d, %r8d
.L9:
        vmovdqa (%r10,%rax), %ymm0
        addl    $1, %r8d
        vpslld  $3, %ymm0, %ymm0
        vmovdqa %ymm0, (%r10,%rax)
        addq    $32, %rax
        cmpl    %r8d, %esi
        ja      .L9
        addl    %r11d, %ecx
        cmpl    %r11d, %r9d
        je      .L22
        vzeroupper
.L7:
        movslq  %ecx, %rax
        sall    $3, (%rdi,%rax,4)
        leal    1(%rcx), %eax
        cmpl    %eax, %edx
        jle     .L23
        cltq
        sall    $3, (%rdi,%rax,4)
        leal    2(%rcx), %eax
        cmpl    %eax, %edx
        jle     .L23
        cltq
        sall    $3, (%rdi,%rax,4)
        leal    3(%rcx), %eax
        cmpl    %eax, %edx
        jle     .L23
        cltq
        sall    $3, (%rdi,%rax,4)
        leal    4(%rcx), %eax
        cmpl    %eax, %edx
        jle     .L23
        cltq
        sall    $3, (%rdi,%rax,4)
        leal    5(%rcx), %eax
        cmpl    %eax, %edx
        jle     .L23
        cltq
        addl    $6, %ecx
        sall    $3, (%rdi,%rax,4)
        cmpl    %ecx, %edx
        jle     .L28
        movslq  %ecx, %rcx
        sall    $3, (%rdi,%rcx,4)
        ret
.L22:
        vzeroupper
.L23:
        ret
.L27:
        ret
.L26:
        testl   %eax, %eax
        jne     .L3
        xorl    %ecx, %ecx
        jmp     .L4
.L28:
        ret
.L21:
        movl    $7, %ecx
        jmp     .L5
.L15:
        movl    $1, %ecx
        jmp     .L5
.L16:
        movl    $2, %ecx
        jmp     .L5
.L17:
        movl    $3, %ecx
        jmp     .L5
.L18:
        movl    $4, %ecx
        jmp     .L5
.L19:
        movl    $5, %ecx
        jmp     .L5
.L20:
        movl    $6, %ecx
        jmp     .L5

compiler : gcc 5.4 with command line options: -std=c++14 -O3 -march=native

can it optimize the "unoptimized version" to the "more optimized" version? — hamster on wheels, Sep 28 '16 at 16:29
Probably the compiler can't fuse loop (from the unoptimized to more optimized version). This is surprising. Or I did something wrong. http://stackoverflow.com/questions/39758300/loop-is-not-vectorized-when-variable-extent-is-used — hamster on wheels, Sep 29 '16 at 12:17

score 1 · Answer 4 · answered Sep 27 '16 at 14:29

if you are using the visual studio compiler you can use the /Qvec-report command line argument and it will tell you which loops are/are not being vectorized and give you reason codes as to why they are not

vectorization of loops (unlike unrolling) is where the compiler uses the SIMD (SSE,SSE2,AVX) instructions to break the loop into a series of operation which are performed in parallel

https://msdn.microsoft.com/en-us/library/jj658585.aspx

gcc and clang may have similar capabilities

score 0 · Answer 5 · edited May 23 '17 at 12:10

You can always unroll a for loop. Even if you don't know the number of iterations it should do with a trick called Duff's device

Also see the explanation here on stackoverflow: How does Duff's device work?

You can have an interleaved switch and while loop, and let the while loop process, say, 4 items at once. if you'd like to process 6 items, you can then cheat by jumping to the second last item in the loop processing 2+4=6 items:

int  n = 6;
int it = n / 4;
int check = 0;
switch (n % 4) {
  case 0: do { check += 1;
  case 3:      check += 1;
  case 2:      check += 1;
  case 1:      check += 1;
} while (it--);
}
printf("processed %i items\n", check);

Can Compiler Optimize Loop with Variable Length?

5 Answers5

Linked