I'm working on improving the runtime speed of a C program while trying to avoid the use of inline assembly. The program is not far from 100x speedup compared to its initial performance, which means improvements are getting harder to come by.
While inspecting the assembly code generated by VS2017 (in Release x64 mode, with /O2 compiler optimizations), it came to my attention that one of my hotspots in the program corresponds to the following code (anonymized):
bool j, k = false;
do
{
j = false;
for (i = 0; i < c->a; i++)
{
int x, y, z;
x = c->b[i];
y = (x/c->d) + 1;
z = (x % c->d) + 1;
if (c->e[y][z] == 0)
{
c->b[i--] = c->b[--c->a];
continue;
}
else if (c->e[y][z] + c->g[y][z] == c->h[y][z] - '0')
{
f(c, y, z);
k = true;
j = true;
c->b[i--] = c->b[--c->a];
continue;
}
}
} while (j);
return k;
In particular, I call attention to the following statements (c->d is an int):
y = (x/c->d) + 1;
z = (x % c->d) + 1;
This is a division followed by remainder using the same parameters in both cases. Since the division instruction in x86/x64 returns both the division and the remainder at the same time, I expected this would be compiled to a single division instruction. Yet the compiler's output basically does everything twice, including reloading c->d from memory:
00007FF790061FA0 42 8B 04 1F mov eax,dword ptr [rdi+r11]
00007FF790061FA4 99 cdq
00007FF790061FA5 F7 7E 28 idiv eax,dword ptr [rsi+28h]
00007FF790061FA8 4C 63 D0 movsxd r10,eax
00007FF790061FAB 42 8B 04 1F mov eax,dword ptr [rdi+r11]
00007FF790061FAF 99 cdq
00007FF790061FB0 F7 7E 28 idiv eax,dword ptr [rsi+28h]
... a few instructions later ...
00007FF790061FC3 4C 63 D2 movsxd r10,edx
I've tried various transformations to my code to make it emit a single division instruction, unsuccessfully. The closest I got was transforming the above C code block into this:
y = (x/c->d);
z = (x % c->d);
y++;
This led to the following code being emitted:
00007FF7942B1FA0 42 8B 04 17 mov eax,dword ptr [rdi+r10]
00007FF7942B1FA4 99 cdq
00007FF7942B1FA5 F7 7E 28 idiv eax,dword ptr [rsi+28h]
00007FF7942B1FA8 4C 63 F2 movsxd r14,edx
00007FF7942B1FAB 44 8D 60 01 lea r12d,[rax+1]
00007FF7942B1FAF 48 98 cdqe
00007FF7942B1FB1 48 8D 14 C5 08 00 00 00 lea rdx,[rax*8+8]
Unfortunately, without adding z++
as well, the code is wrong and my program no longer works, as expected. As soon as I add back z++
after y++
, the first version of the assembly code is emitted again.
What kind of code transformations or compiler flags could I use to coerce VS2017 to emit efficient code in this case?