I have a performance problem: I can't match the speed of the compiler-generated code in a release build at all. My hand-written assembly version is 25% slower. The function is called around 20 million times in my test, so making it run faster will pay off.
The C++ code is pretty simple:
static inline char GetBit(char *data, size_t bit)
{
    return 0 != (data[bit / 8] & (1 << (bit % 8)));
}
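To make the bit numbering concrete, here is a small worked example. It is only a sketch: the sample bytes and the helper name `CheckSampleBits` are made up for illustration. Bit 0 is the least significant bit of `data[0]`, and bit 15 is the most significant bit of `data[1]`.

```cpp
#include <cassert>
#include <stddef.h>

// Same function as in the question.
static inline char GetBit(char *data, size_t bit)
{
    return 0 != (data[bit / 8] & (1 << (bit % 8)));
}

// Hypothetical helper, for illustration only: checks a few sample lookups.
static void CheckSampleBits()
{
    char data[] = { '\x05', '\x80' };   // byte 0 = 0000'0101, byte 1 = 1000'0000

    assert(GetBit(data, 0) == 1);       // byte 0, bit 0 is set
    assert(GetBit(data, 1) == 0);       // byte 0, bit 1 is clear
    assert(GetBit(data, 2) == 1);       // byte 0, bit 2 is set
    assert(GetBit(data, 15) == 1);      // byte 1, bit 7 is set
}
```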
And this is the version I wrote for 64-bit MASM:
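mov rax, rdx                ; rax = bit (second argument)
mov r10, 8h                 ; divisor
xor rdx, rdx                ; clear the high half of the dividend rdx:rax
div r10                     ; rax = bit / 8, rdx = bit % 8
mov al, byte ptr [rax+rcx]  ; al = data[bit / 8]
mov r8b, 1h                 ; r8 is volatile, unlike rbx, so nothing needs saving
mov cl, dl                  ; cl = bit % 8 (shift count)
shl r8b, cl                 ; r8b = 1 << (bit % 8)
and al, r8b                 ; isolate the requested bit
shr al, cl                  ; move it down to bit 0, so al is 0 or 1
ret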
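For comparison, this is roughly what the assembly routine does, transcribed step by step into C++. It is a sketch for illustration only: the name `GetBitSteps` and the helper `CheckAgainstGetBit` are hypothetical, and the comments map each statement back to the instruction it mirrors.

```cpp
#include <cassert>
#include <stddef.h>

// The original C++ version, repeated here so the two can be compared.
static inline char GetBit(char *data, size_t bit)
{
    return 0 != (data[bit / 8] & (1 << (bit % 8)));
}

// Hypothetical transcription of the assembly routine, one step per statement.
static inline char GetBitSteps(char *data, size_t bit)
{
    size_t byteIndex = bit / 8;                                          // div: quotient (rax)
    unsigned char shift = static_cast<unsigned char>(bit % 8);           // div: remainder (dl)
    unsigned char value = static_cast<unsigned char>(data[byteIndex]);   // load data[bit / 8]
    unsigned char mask = static_cast<unsigned char>(1u << shift);        // load 1, then shl by cl
    value &= mask;                                                       // and: isolate the bit
    value >>= shift;                                                     // shr: move it to bit 0
    return static_cast<char>(value);                                     // 0 or 1
}

// Hypothetical check: both versions agree on every bit of a small buffer.
static bool CheckAgainstGetBit()
{
    char data[] = { '\x05', '\x80', '\x3C' };
    for (size_t bit = 0; bit < sizeof(data) * 8; ++bit)
        if (GetBit(data, bit) != GetBitSteps(data, bit))
            return false;
    return true;
}
```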
Well, I'm not much of an assembly guy, but I don't think the compiler can produce code that is 25% faster just by generating better assembly. So the trick is probably in the function call. The compiler respects the inline keyword for the C++ code and generates no call, but I just can't make that work for the asm code:
extern "C" inline char GetBitAsm(char *data, size_t bit);
I've disassembled the code using dumpbin, and I can clearly see my code plus a function call, while no call is generated for the compiler's version:
mov rdx, qword ptr [bit]
mov rcx, qword ptr [data]
call GetBitAsm (013F588EFDh)
mov byte ptr [isbit], al
So there are two additional memory reads and one write, while the compiler-generated code probably has only one read. I read somewhere that the div instruction takes around 20 cycles, while a single memory access costs at least 100 cycles. So replacing the loads of rdx and rcx from memory with values already held in registers in the calling function will, I think, do the trick.
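For reference, a minimal sketch of a timing loop of this kind, which could be used to measure rather than guess where the time goes. This is not the original test: the buffer size, the fill pattern, and the names `CountSetBits` and `RunBenchmark` are assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <stddef.h>
#include <vector>

static inline char GetBit(char *data, size_t bit)
{
    return 0 != (data[bit / 8] & (1 << (bit % 8)));
}

// Hypothetical harness: calls GetBit `calls` times over a buffer of `nbits` bits
// and returns how many probed bits were set. Accumulating the result keeps the
// optimizer from discarding the loop.
static size_t CountSetBits(char *data, size_t nbits, size_t calls)
{
    size_t set = 0;
    for (size_t i = 0; i < calls; ++i)
        set += static_cast<size_t>(GetBit(data, i % nbits));
    return set;
}

// Hypothetical driver: times 20 million calls and prints the elapsed milliseconds.
static size_t RunBenchmark()
{
    std::vector<char> buffer(1024, '\xAA');   // 1010'1010: every odd-numbered bit is set
    const size_t calls = 20000000;

    auto start = std::chrono::steady_clock::now();
    size_t set = CountSetBits(buffer.data(), buffer.size() * 8, calls);
    auto stop = std::chrono::steady_clock::now();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    std::printf("%zu of %zu bits set, %lld ms\n", set, calls, static_cast<long long>(ms));

    assert(set == calls / 2);                 // half of the probed bit indices are odd
    return set;
}
```

Swapping the call to `GetBit` for the assembly version in the same loop would give a like-for-like comparison, as long as both are built with the same release settings.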
Questions:
Is that really the reason it runs so slowly?
How can I make a function written in asm inline in a release build?
How can I further improve my assembly code to make it even faster?