1

I'm working on improving the runtime speed of a C program while trying to avoid the use of inline assembly. The program is not far from 100x speedup compared to its initial performance, which means improvements are getting harder to come by.

While inspecting the assembly code generated by VS2017 (in Release x64 mode, with /O2 compiler optimizations), it came to my attention that one of my hotspots in the program corresponds to the following code (anonymized):

bool j, k = false;

do
{
    j = false;

    for (i = 0; i < c->a; i++)
    {
        int x, y, z;

        x = c->b[i];

        y = (x/c->d) + 1;
        z = (x % c->d) + 1;

        if (c->e[y][z] == 0)
        {
            c->b[i--] = c->b[--c->a];
            continue;
        }
        else if (c->e[y][z] + c->g[y][z] == c->h[y][z] - '0')
        {
            f(c, y, z);
            k = true;
            j = true;
            c->b[i--] = c->b[--c->a];
            continue;
        }
    }
} while (j);

return k;

In particular, I call attention to the following statements (c->d is an int):

    y = (x/c->d) + 1;
    z = (x % c->d) + 1;

This is a division followed by remainder using the same parameters in both cases. Since the division instruction in x86/x64 returns both the division and the remainder at the same time, I expected this would be compiled to a single division instruction. Yet the compiler's output basically does everything twice, including reloading c->d from memory:

00007FF790061FA0 42 8B 04 1F          mov         eax,dword ptr [rdi+r11]  
00007FF790061FA4 99                   cdq  
00007FF790061FA5 F7 7E 28             idiv        eax,dword ptr [rsi+28h]  
00007FF790061FA8 4C 63 D0             movsxd      r10,eax  
00007FF790061FAB 42 8B 04 1F          mov         eax,dword ptr [rdi+r11]  
00007FF790061FAF 99                   cdq  
00007FF790061FB0 F7 7E 28             idiv        eax,dword ptr [rsi+28h]  
... a few instructions later ...
00007FF790061FC3 4C 63 D2             movsxd      r10,edx

I've tried various transformations to my code to make it emit a single division instruction, unsuccessfully. The closest I got was transforming the above C code block into this:

    y = (x/c->d);
    z = (x % c->d);
    y++;

This led to the following code being emitted:

00007FF7942B1FA0 42 8B 04 17          mov         eax,dword ptr [rdi+r10]  
00007FF7942B1FA4 99                   cdq  
00007FF7942B1FA5 F7 7E 28             idiv        eax,dword ptr [rsi+28h]  
00007FF7942B1FA8 4C 63 F2             movsxd      r14,edx  
00007FF7942B1FAB 44 8D 60 01          lea         r12d,[rax+1]  
00007FF7942B1FAF 48 98                cdqe  
00007FF7942B1FB1 48 8D 14 C5 08 00 00 00 lea         rdx,[rax*8+8]  

Unfortunately, without adding z++ as well, the code is wrong and my program no longer works, as expected. As soon as I add back z++ after y++, the first version of the assembly code is emitted again.

What kind of code transformations or compiler flags could I use to coerce VS2017 to emit efficient code in this case?

swineone
  • 2,296
  • 1
  • 18
  • 32
  • 1
    Can you minimize your code, so that we can try it out? https://godbolt.org/g/HBWbbh seems to do the right things. – mch Jul 23 '18 at 11:38
  • 1
    I wonder whether preceeding with `int d = c->d ;`, then using the simpler expressions `x / d` and `x % d` might allow the compiler to spot the idiom. It would certainly prevent the reloading of `c->d`. – Clifford Jul 23 '18 at 11:39
  • See also the second highest voted answer at https://stackoverflow.com/questions/7070346/c-best-way-to-get-integer-division-and-remainder regarding [`std::div`](https://en.cppreference.com/w/cpp/numeric/math/div) – Clifford Jul 23 '18 at 11:42
  • @mch : not if you add a `struct` in the mix : https://godbolt.org/g/oXrbAC - I suspect strict aliasing might be the culprit. And then @Clifford's suggestion does the trick. – Sander De Dycker Jul 23 '18 at 11:43
  • @SanderDeDycker can be, using a local variable as result like in OP's code solves it: https://godbolt.org/g/ZfXKtL – mch Jul 23 '18 at 11:46
  • 1
    Posting the types used and a stand-alone test bench would help gain more useful feedback. As is, post centers on what OP thinks is the bottleneck. – chux - Reinstate Monica Jul 23 '18 at 13:36
  • Re-architect-ed code may allow `e[0][x]` and skip the `y = (x/c->d) + 1; z = (x % c->d) + 1;`. Yet the overall picture is not shown. "improvements are getting harder to come by." usually means you have to improve the apporach, not just details. – chux - Reinstate Monica Jul 23 '18 at 13:41
  • `c->e` is not a bidimensional array but rather an array of arrays, so the last suggested optimization is not applicable. Strangely enough, I tried to copy this function and its dependencies to the godbolt.org site, but there the generated code uses only one `idiv` instruction. I even tried to copy the command-line switches from my project, thinking one of the switches might have made the difference, but they didn't. So I'm not sure what's going on. – swineone Jul 23 '18 at 16:19

2 Answers2

1

If you explicitly code it to load c->d only once as follows;

   int d = c->d ;
   y = (x / d) + 1;
   z = (x % d) + 1;

a single division operation is generated.

Clifford
  • 88,407
  • 13
  • 85
  • 165
  • I forgot to mention but I tried this as well, but it appears to have made no difference. If I had to guess, it has to do with the users of `y` and `z` later. – swineone Jul 23 '18 at 11:58
  • @swineone : Then you need to add that information to the question. – Clifford Jul 23 '18 at 12:12
  • @swineone : Interesting hacking Sander De Dycker's example https://godbolt.org/g/C3GG8d it works, but my somewhat different test at https://godbolt.org/g/C9r8Qh if does not. YMMV. – Clifford Jul 23 '18 at 12:34
  • @Clifford : the use of pointers in my link probably turned it into a different problem. When using the exact code from the OP, the issue is not observed either : https://godbolt.org/g/e92nHu – Sander De Dycker Jul 23 '18 at 12:40
  • @SanderDeDycker: yeah, even a good compiler (gcc/clang) can't optimize when writing to `&y` could modify `&x`. https://godbolt.org/g/UxGKxc. But assigning the quotient and remainder to local variables before assigning to either reference var allows gcc and MSVC to optimize. (Note that MSVC outputs asm in reverse order that functions appear in the source, unlike gcc.) The OP really needs a [mcve]. – Peter Cordes Jul 23 '18 at 17:22
0

I found one possible, although admittedly very ugly, solution to the issue.

Since incrementing both y and z appears to be what causes the problem in the original snippet:

    x = c->b[i];
    y = (x/c->d) + 1;
    z = (x % c->d) + 1;

I modified x so that y would have the same result, by adding c->d to it, but without the explicit increment, and left the increment in z since a single increment doesn't appear to prevent the generation of the desired code:

    x = c->b[i] + c->d;
    y = x/c->d;
    z = (x % c->d) + 1;

This generates code with a single IDIV for me:

00007FF7ED121FA0 42 8B 04 17          mov         eax,dword ptr [rdi+r10]  
00007FF7ED121FA4 03 46 28             add         eax,dword ptr [rsi+28h]  
00007FF7ED121FA7 99                   cdq  
00007FF7ED121FA8 F7 7E 28             idiv        eax,dword ptr [rsi+28h]  
00007FF7ED121FAB 48 63 E8             movsxd      rbp,eax  
00007FF7ED121FAE 48 8B 46 08          mov         rax,qword ptr [rsi+8]  
00007FF7ED121FB2 4C 63 DA             movsxd      r11,edx  
00007FF7ED121FB5 44 8D 62 01          lea         r12d,[rdx+1]
swineone
  • 2,296
  • 1
  • 18
  • 32