9

Is there a way to instruct GCC (I'm using 4.8.4) to completely unroll the while loop in the function below, i.e., peel the loop? The number of loop iterations is known at compile time: 58.

Let me first explain what I have tried.

By checking the GAS output of

gcc -fpic -O2 -S GEPDOT.c

I can see that 12 registers, XMM0 - XMM11, are used. If I pass the flag -funroll-loops to gcc:

gcc -fpic -O2 -funroll-loops -S GEPDOT.c

the loop is only unrolled twice. I checked the GCC optimization options. GCC says that -funroll-loops also turns on -frename-registers, so when GCC unrolls a loop, its first choice for register allocation is to use "left over" registers. But only 4 are left over, XMM12 - XMM15, so GCC can unroll at most twice. Had there been 48 instead of 16 XMM registers available, GCC would have no trouble unrolling the while loop 4 times.
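
For reference, GCC's -Q --help=optimizers switch reports which optimization flags end up enabled for a given command line (the listing may not reflect every late-resolved default), e.g.:

gcc -O2 -funroll-loops -Q --help=optimizers | grep -E 'rename-registers|unroll'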

Yet I did another experiment. I first unrolled the while loop twice manually, obtaining a function GEPDOT_2. Then there is no difference at all between

gcc -fpic -O2 -S GEPDOT_2.c

and

gcc -fpic -O2 -funroll-loops -S GEPDOT_2.c

Since GEPDOT_2 has already used up all the registers, no further unrolling is performed.

GCC does register renaming to avoid introducing potential false dependencies. But I know for sure that there are no such dependencies in my GEPDOT; even if there were, they would not matter. I tried unrolling the loop myself: unrolling 4 times is faster than unrolling 2 times, which in turn is faster than no unrolling. Of course I can manually unroll even more times, but that is tedious. Can GCC do this for me? Thanks.

// C file "GEPDOT.c"
#include <emmintrin.h>
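/* Note: the *, +=, and - operators on __m128d values below rely on GCC's
   vector extensions; with compilers that lack them, one would use
   _mm_mul_pd, _mm_add_pd, and _mm_sub_pd instead. */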

void GEPDOT (double *A, double *B, double *C) {
  __m128d A1_vec = _mm_load_pd(A); A += 2;
  __m128d B_vec = _mm_load1_pd(B); B++;
  __m128d C1_vec = A1_vec * B_vec;
  __m128d A2_vec = _mm_load_pd(A); A += 2;
  __m128d C2_vec = A2_vec * B_vec;
  B_vec = _mm_load1_pd(B); B++;
  __m128d C3_vec = A1_vec * B_vec;
  __m128d C4_vec = A2_vec * B_vec;
  B_vec = _mm_load1_pd(B); B++;
  __m128d C5_vec = A1_vec * B_vec;
  __m128d C6_vec = A2_vec * B_vec;
  B_vec = _mm_load1_pd(B); B++;
  __m128d C7_vec = A1_vec * B_vec;
  A1_vec = _mm_load_pd(A); A += 2;
  __m128d C8_vec = A2_vec * B_vec;
  B_vec = _mm_load1_pd(B); B++;
  int k = 58;
  /* can compiler unroll the loop completely (i.e., peel this loop)? */
  while (k--) {
    C1_vec += A1_vec * B_vec;
    A2_vec = _mm_load_pd(A); A += 2;
    C2_vec += A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    C3_vec += A1_vec * B_vec;
    C4_vec += A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    C5_vec += A1_vec * B_vec;
    C6_vec += A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    C7_vec += A1_vec * B_vec;
    A1_vec = _mm_load_pd(A); A += 2;
    C8_vec += A2_vec * B_vec;
    B_vec = _mm_load1_pd(B); B++;
    }
  C1_vec += A1_vec * B_vec;
  A2_vec = _mm_load_pd(A);
  C2_vec += A2_vec * B_vec;
  B_vec = _mm_load1_pd(B); B++;
  C3_vec += A1_vec * B_vec;
  C4_vec += A2_vec * B_vec;
  B_vec = _mm_load1_pd(B); B++;
  C5_vec += A1_vec * B_vec;
  C6_vec += A2_vec * B_vec;
  B_vec = _mm_load1_pd(B);
  C7_vec += A1_vec * B_vec;
  C8_vec += A2_vec * B_vec;
  /* [write-back] */
  A1_vec = _mm_load_pd(C); C1_vec = A1_vec - C1_vec;
  A2_vec = _mm_load_pd(C + 2); C2_vec = A2_vec - C2_vec;
  A1_vec = _mm_load_pd(C + 4); C3_vec = A1_vec - C3_vec;
  A2_vec = _mm_load_pd(C + 6); C4_vec = A2_vec - C4_vec;
  A1_vec = _mm_load_pd(C + 8); C5_vec = A1_vec - C5_vec;
  A2_vec = _mm_load_pd(C + 10); C6_vec = A2_vec - C6_vec;
  A1_vec = _mm_load_pd(C + 12); C7_vec = A1_vec - C7_vec;
  A2_vec = _mm_load_pd(C + 14); C8_vec = A2_vec - C8_vec;
  _mm_store_pd(C,C1_vec); _mm_store_pd(C + 2,C2_vec);
  _mm_store_pd(C + 4,C3_vec); _mm_store_pd(C + 6,C4_vec);
  _mm_store_pd(C + 8,C5_vec); _mm_store_pd(C + 10,C6_vec);
  _mm_store_pd(C + 12,C7_vec); _mm_store_pd(C + 14,C8_vec);
  }

update 1

Thanks to the comment by @user3386109, I would like to extend this question a little. @user3386109 raises a very good point: I do have some doubts about the compiler's ability to allocate registers optimally when there are so many parallel instructions to schedule.

I personally think that a reliable way is to first code the loop body (which is the key to HPC) in inline assembly, then duplicate it as many times as I want. I had an unpopular post earlier this year: inline assembly. In that code the number of loop iterations, j, was a function argument and hence unknown at compile time, so I could not fully unroll the loop; I only duplicated the assembly code twice and converted the loop into a label and jump. The resulting performance of my hand-written assembly was about 5% higher than the compiler-generated assembly, which might suggest that the compiler fails to allocate registers in the expected, optimal manner.

I was (and still am) a beginner at assembly coding, so that served as a good case study for learning a little x86 assembly. But in the long run I do not intend to code GEPDOT with a large proportion of assembly. There are mainly three reasons:

  1. Inline assembly has been criticized for not being portable, though I don't understand why. Perhaps because different machines have different registers clobbered?
  2. Compilers are also getting better, so I would still prefer algorithmic optimization and better C coding habits to help the compiler generate good output;
  3. The last reason is the most important. The number of iterations may not always be 58. I am developing a high performance matrix factorization subroutine. For a cache blocking factor nb, the number of iterations would be nb-2. I am not going to pass nb as a function argument, as I did in the earlier post; this machine-specific parameter will be defined as a macro, so the number of iterations is known at compile time but may vary from machine to machine. Guess how much tedious manual loop unrolling that would mean for a variety of nb values (see the sketch after this list). So if there is a way to simply instruct the compiler to peel a loop, that would be great.
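
To make point 3 concrete, here is a tiny standalone sketch (the names NB and sum_nb are mine, purely for illustration, not part of GEPDOT): the trip count comes from a build-time macro, so it is a compile-time constant whose value differs per machine. What I would like is for the compiler to peel such a loop completely, whatever NB expands to:

/* NB is a hypothetical machine-specific blocking factor, defined per build */
#ifndef NB
#define NB 60
#endif

double sum_nb (const double *x) {
  double s = 0.0;
  int k = NB - 2;   /* known at compile time, but varies from machine to machine */
  while (k--)
    s += *x++;
  return s;
  }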

I would very much appreciate it if you could also share some experience in producing a high performance yet portable library.

Zheyuan Li
  • Did you try `-funroll-all-loops`? – Nate Eldredge Mar 20 '16 at 05:37
  • So if you manually duplicate the body of that loop 58 times, does GCC do a decent job managing the register usage? I ask because it seems simple enough to write a preprocessor that will unroll the loop. For example, replace the `while` with `preproc__repeat(58)`. Then write a preprocessor that searches for `preproc__repeat`, extracts the number, and duplicates the body the indicated number of times. – user3386109 Mar 20 '16 at 06:15
  • 1) Different processors don't just clobber different registers. They don't even *have* the same registers. And they don't have the same instructions (although _mm_load1_pd is already going to be somewhat processor specific). Also, different compilers treat inline asm instructions differently. Inline asm that works on one compiler can compile, but fail to produce the correct results on another. – David Wohlferd Mar 20 '16 at 08:19
  • Close. Most c compilers seem to have some form of asm inlining. However, there is no standard about their semantics. One compiler might clobber *all* registers after invoking an asm, just to be safe. Gcc's basic asm doesn't clobber ANY. gcc also doesn't perform a [memory clobber](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Clobbers) for basic asm, but other compilers do. This can introduce subtle differences between compilers. I don't know what you mean by 'dominant architecture.' x86 is very common today, but so is ARM. And what happens to your asm if intel adds more xmm regs to i8? – David Wohlferd Mar 20 '16 at 08:49
  • 1
    Why do you think there is a benefit in unrolling this loop completely? The code might be slower if completely unrolled due to µop caching that is present otherwise. – fuz Mar 20 '16 at 09:31
  • @AlphaBetaGamma The paper is **old**; back then, the µop cache didn't exist. See my answer for how to unroll the loop completely. – fuz Mar 20 '16 at 09:43
  • If this is supposed to be a 4×4 dot product for vectors of length 256, the code looks suspect. For example, you increment pointer A by 6 + 58×4 = 238 elements. Have you compared the results against a naive implementation to verify your computation is correct before you've started optimizing it? I'm not being snarky: I **have to** constantly (unit-)test each version of my optimized functions myself, as otherwise I end up with unfixable code as bugs pile on top of each other. – Nominal Animal Mar 20 '16 at 09:51
  • Ok. In case you're interested, I added some example implementations how I'd calculate *c* = *a* × *b* + *c*, where *a* is a 4-row, *n*-column matrix contiguous but transposed in memory (in other words, in column-major order), *b* is a *n*-row, 4-column matrix contiguous in memory (in row-major order), and *c* is a 4-row, 4-column matrix also contiguous in memory (in row-major order). One version is for SSE2/SSE3, and one for AVX/AVX2. – Nominal Animal Mar 20 '16 at 13:14

2 Answers

4

This is not an answer, but might be of interest to others trying to vectorize matrix multiplications with GCC.

Below, I assume c is a 4×4 matrix in row-major order, a is a 4-row, n-column matrix in column-major order (transposed), b is a 4-column, n-row matrix in row-major order, and the operation to calculate is c = a × b + c, where × denotes matrix multiplication.

The naive function to accomplish this is

void slow_4(double       *c,
            const double *a,
            const double *b,
            size_t        n)
{
    size_t row, col, i;

    for (row = 0; row < 4; row++)
        for (col = 0; col < 4; col++)
            for (i = 0; i < n; i++)
                c[4*row+col] += a[4*i+row] * b[4*i+col];
}

GCC generates pretty good code for SSE2/SSE3 using

#if defined(__SSE2__) || defined(__SSE3__)

typedef  double  vec2d  __attribute__((vector_size (2 * sizeof (double))));

void fast_4(vec2d       *c,
            const vec2d *a,
            const vec2d *b,
            size_t       n)
{
    const vec2d *const b_end = b + 2L * n;

    vec2d s00 = c[0];
    vec2d s02 = c[1];
    vec2d s10 = c[2];
    vec2d s12 = c[3];
    vec2d s20 = c[4];
    vec2d s22 = c[5];
    vec2d s30 = c[6];
    vec2d s32 = c[7];

    while (b < b_end) {
        const vec2d b0 = b[0];
        const vec2d b2 = b[1];
        const vec2d a0 = { a[0][0], a[0][0] };
        const vec2d a1 = { a[0][1], a[0][1] };
        const vec2d a2 = { a[1][0], a[1][0] };
        const vec2d a3 = { a[1][1], a[1][1] };
        s00 += a0 * b0;
        s10 += a1 * b0;
        s20 += a2 * b0;
        s30 += a3 * b0;
        s02 += a0 * b2;
        s12 += a1 * b2;
        s22 += a2 * b2;
        s32 += a3 * b2;
        b += 2;
        a += 2;
    }

    c[0] = s00;
    c[1] = s02;
    c[2] = s10;
    c[3] = s12;
    c[4] = s20;
    c[5] = s22;
    c[6] = s30;
    c[7] = s32; 
}

#endif

For AVX, GCC can do even better with

#if defined(__AVX__) || defined(__AVX2__)

typedef  double  vec4d  __attribute__((vector_size (4 * sizeof (double))));

void fast_4(vec4d       *c,
            const vec4d *a,
            const vec4d *b,
            size_t       n)
{
    const vec4d *const b_end = b + n;

    vec4d s0 = c[0];
    vec4d s1 = c[1];
    vec4d s2 = c[2];
    vec4d s3 = c[3];

    while (b < b_end) {
        const vec4d bc = *(b++);
        const vec4d ac = *(a++);
        const vec4d a0 = { ac[0], ac[0], ac[0], ac[0] };
        const vec4d a1 = { ac[1], ac[1], ac[1], ac[1] };
        const vec4d a2 = { ac[2], ac[2], ac[2], ac[2] };
        const vec4d a3 = { ac[3], ac[3], ac[3], ac[3] };
        s0 += a0 * bc;
        s1 += a1 * bc;
        s2 += a2 * bc;
        s3 += a3 * bc;
    }

    c[0] = s0;
    c[1] = s1;
    c[2] = s2;
    c[3] = s3;
}

#endif
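
Before looking at the generated code: as I suggested in a comment, I verify each version against the naive one. A minimal comparison harness might look like the following (a sketch only: it assumes slow_4 and the SSE2/SSE3 fast_4 above are in the same translation unit, and the 1e-9 tolerance is arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 256

/* 32-byte alignment satisfies both the 16-byte (SSE) and 32-byte (AVX) loads. */
static double a[4 * N] __attribute__((aligned(32)));
static double b[4 * N] __attribute__((aligned(32)));
static double c_ref[16] __attribute__((aligned(32)));
static double c_vec[16] __attribute__((aligned(32)));

int main(void)
{
    size_t i;
    double max_diff = 0.0;

    for (i = 0; i < 4 * N; i++) {
        a[i] = (double)rand() / RAND_MAX;
        b[i] = (double)rand() / RAND_MAX;
    }
    for (i = 0; i < 16; i++)
        c_ref[i] = c_vec[i] = (double)rand() / RAND_MAX;

    slow_4(c_ref, a, b, N);
    /* The casts assume GCC tolerates viewing plain double arrays as vec2d;
       in a stricter setting one would build the data in vec2d arrays. */
    fast_4((vec2d *)c_vec, (const vec2d *)a, (const vec2d *)b, N);

    for (i = 0; i < 16; i++)
        max_diff = fmax(max_diff, fabs(c_ref[i] - c_vec[i]));

    printf("maximum absolute difference: %g\n", max_diff);
    return (max_diff < 1e-9) ? EXIT_SUCCESS : EXIT_FAILURE;
}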

The SSE3 version of the generated assembly using gcc-4.8.4 (-O2 -march=x86-64 -mtune=generic -msse3) is essentially

fast_4:
        salq    $5, %rcx
        movapd  (%rdi), %xmm13
        addq    %rdx, %rcx
        cmpq    %rcx, %rdx
        movapd  16(%rdi), %xmm12
        movapd  32(%rdi), %xmm11
        movapd  48(%rdi), %xmm10
        movapd  64(%rdi), %xmm9
        movapd  80(%rdi), %xmm8
        movapd  96(%rdi), %xmm7
        movapd  112(%rdi), %xmm6
        jnb     .L2
.L3:
        movddup (%rsi), %xmm5
        addq    $32, %rdx
        movapd  -32(%rdx), %xmm1
        addq    $32, %rsi
        movddup -24(%rsi), %xmm4
        movapd  %xmm5, %xmm14
        movddup -16(%rsi), %xmm3
        movddup -8(%rsi), %xmm2
        mulpd   %xmm1, %xmm14
        movapd  -16(%rdx), %xmm0
        cmpq    %rdx, %rcx
        mulpd   %xmm0, %xmm5
        addpd   %xmm14, %xmm13
        movapd  %xmm4, %xmm14
        mulpd   %xmm0, %xmm4
        addpd   %xmm5, %xmm12
        mulpd   %xmm1, %xmm14
        addpd   %xmm4, %xmm10
        addpd   %xmm14, %xmm11
        movapd  %xmm3, %xmm14
        mulpd   %xmm0, %xmm3
        mulpd   %xmm1, %xmm14
        mulpd   %xmm2, %xmm0
        addpd   %xmm3, %xmm8
        mulpd   %xmm2, %xmm1
        addpd   %xmm14, %xmm9
        addpd   %xmm0, %xmm6
        addpd   %xmm1, %xmm7
        ja      .L3
.L2:
        movapd  %xmm13, (%rdi)
        movapd  %xmm12, 16(%rdi)
        movapd  %xmm11, 32(%rdi)
        movapd  %xmm10, 48(%rdi)
        movapd  %xmm9, 64(%rdi)
        movapd  %xmm8, 80(%rdi)
        movapd  %xmm7, 96(%rdi)
        movapd  %xmm6, 112(%rdi)
        ret

The AVX version of the generated assembly (-O2 -march=x86-64 -mtune=generic -mavx) is essentially

fast_4:
        salq       $5, %rcx
        vmovapd    (%rdi), %ymm5
        addq       %rdx, %rcx
        vmovapd    32(%rdi), %ymm4
        cmpq       %rcx, %rdx
        vmovapd    64(%rdi), %ymm3
        vmovapd    96(%rdi), %ymm2
        jnb        .L2
.L3:
        addq       $32, %rsi
        vmovapd    -32(%rsi), %ymm1
        addq       $32, %rdx
        vmovapd    -32(%rdx), %ymm0
        cmpq       %rdx, %rcx
        vpermilpd  $0, %ymm1, %ymm6
        vperm2f128 $0, %ymm6, %ymm6, %ymm6
        vmulpd     %ymm0, %ymm6, %ymm6
        vaddpd     %ymm6, %ymm5, %ymm5
        vpermilpd  $15, %ymm1, %ymm6
        vperm2f128 $0, %ymm6, %ymm6, %ymm6
        vmulpd     %ymm0, %ymm6, %ymm6
        vaddpd     %ymm6, %ymm4, %ymm4
        vpermilpd  $0, %ymm1, %ymm6
        vpermilpd  $15, %ymm1, %ymm1
        vperm2f128 $17, %ymm6, %ymm6, %ymm6
        vperm2f128 $17, %ymm1, %ymm1, %ymm1
        vmulpd     %ymm0, %ymm6, %ymm6
        vmulpd     %ymm0, %ymm1, %ymm0
        vaddpd     %ymm6, %ymm3, %ymm3
        vaddpd     %ymm0, %ymm2, %ymm2
        ja         .L3
.L2:
        vmovapd    %ymm5, (%rdi)
        vmovapd    %ymm4, 32(%rdi)
        vmovapd    %ymm3, 64(%rdi)
        vmovapd    %ymm2, 96(%rdi)
        vzeroupper
        ret

The register scheduling is not optimal, I guess, but it does not look atrocious either. I'm personally happy with the above, without trying to hand-optimize it at this point.

On a Core i5-4200U processor (AVX2-capable), the fast versions of the above functions compute the product of two 4×256 matrices in 1843 CPU cycles (median) for SSE3, and 1248 cycles for AVX2. That comes down to 1.8 and 1.22 cycles per matrix entry. The unvectorized slow version takes about 11 cycles per matrix entry, for comparison.

(The cycle counts are median values, i.e. half the tests were faster. I only ran some rough benchmarking with ~ 100k repeats or so, so do take these numbers with a grain of salt.)
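
If someone wants to reproduce this kind of measurement: a rough way to take per-call samples is sketched below (shown for the SSE2/SSE3 signature; the helper name one_sample is just for illustration). It is not the exact harness I used, and it ignores serialization and frequency scaling, so treat the numbers as approximate.

#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

static uint64_t one_sample(vec2d *c, const vec2d *a, const vec2d *b, size_t n)
{
    const uint64_t start = __rdtsc();
    fast_4(c, a, b, n);          /* the call being timed */
    return __rdtsc() - start;    /* take the median over many such samples */
}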

On this CPU, cache effects are such that AVX2 at 4×512 matrix size is still at 1.2 cycles per entry, but at 4×1024, it drops to 1.4, at 4×4096 to 1.5, at 4×8192 to 1.8, and at 4×65536 to 2.2 cycles per entry. The SSE3 version stays at 1.8 cycles per entry up to 4×3072, at which point it starts slowing down; at 4×65536 it too is about 2.2 cycles per entry. I do believe this (laptop!) CPU is cache bandwidth limited at this point.

Nominal Animal
  • @AlphaBetaGamma: :D The approach is a bit different, relying on GCC's vector types, instead of the standard Intel intrinsics (which are well supported by other C compilers, too). – Nominal Animal Mar 20 '16 at 13:21
  • @AlphaBetaGamma: No need; I'm happy enough if it is useful and informative. – Nominal Animal Mar 20 '16 at 18:55
3

Try tweaking the optimizer parameters:

gcc -O3 -funroll-loops --param max-completely-peeled-insns=1000 --param max-completely-peel-times=100

This should do the trick.
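
Applied to the file from the question, that would be something along the lines of (combining these with the flags already used there; the exact thresholds may need tuning for your loop size):

gcc -fpic -O3 -funroll-loops --param max-completely-peeled-insns=1000 --param max-completely-peel-times=100 -S GEPDOT.c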

fuz
  • @AlphaBetaGamma You could try to experiment with the flags. At least `-O1` will be needed for `-funroll-loops` to work if I recall correctly. When I compile with `-mavx`, the register allocation is much better. If you replace it with inline assembly, it should still unroll, but I'm not an expert in how gcc works. – fuz Mar 20 '16 at 10:56
  • 1
    @AlphaBetaGamma With `-mavx`, the compiler emits three-operand instructions instead of two-operand instructions. This eliminates all move instructions. I assume that when no moves are left, register allocation is optimal. – fuz Mar 20 '16 at 11:27
  • @AlphaBetaGamma AVX defines a new instruction encoding with three operands. This also applies to xmm registers but is not backwards compatible, so it has to be explicitly enabled. – fuz Mar 20 '16 at 12:26
  • 1
    @AlphaBetaGamma The µop cache caches decoded instructions, the decoder is among the slowest parts of the the CPU, so skipping it is nice. It works best with very small loops (27 µops or less), loop unrolling defeats it as the loop becomes too large to fit in the cache. The cmp/jnz pair is so cheap, it doesn't affect the performance too much any way. It might even be free. But as always: benchmark when in doubt. – fuz Mar 20 '16 at 12:28
  • @AlphaBetaGamma No! A correctly predicted jump does not flush the pipeline, and the conditional branch in the loop will be predicted correctly (that's what branch prediction is optimized for). The µop cache caches microcode, i.e. already decoded instructions. That's why it's so useful. It's very small though. Loop unrolling isn't as useful an optimization today as it was 20 years ago. – fuz Mar 20 '16 at 13:07
  • @AlphaBetaGamma The L1 cache stores machine code or data. It does not store decoded instructions, but it may store instruction boundaries for parallel decoding. But otherwise, yes. – fuz Mar 20 '16 at 14:51