
I've noticed that my inline assembly code is either incredibly slow or hangs entirely, while my equivalent C++ code finishes very quickly. I'm curious why this happens when I call the inline assembler in a separate function, as opposed to having the assembly where the function was called. I tested both ways and found that my program did not freeze when I omitted the separate function.

    __asm {

    push dword ptr[rw] //rw is a C++ floating-point variable
    fld [esp] // Using the stack as temporary storage in order to load the value into the FPU
    add esp, 4 // Restoring the stack pointer

    push dword ptr[lwB]
    fld [esp]
    add esp, 4

    fsubp ST(1), ST(0) // Subtracting rw - lwB

    push dword ptr[sp]
    fld [esp]
    add esp, 4

    fdivp ST(1), ST(0) // Dividing the previous result by the span -> (rw - lwB) / sp

    push dword ptr[dimen]
    fld [esp]
    add esp, 4

    fmulp ST(1), ST(0) // Multiplying the previous result by the dimension -> ((rw - lwB) / sp) * dimen

    sub esp, 4 // Allocating stack space to save the result temporarily to RAM, then to eax, then to a C++ variable
    fstp [esp]
    pop eax
    mov fCord, eax

    }
    return (int)fCord; //fCord is also a floating-point C++ variable

The much faster C++ Code:

    return (int)((rw - lwB) / sp * dimen);
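
For reference, here is a minimal self-contained version of that C++ path (the wrapper function toCoord and the sample values are illustrative, assuming all four variables are float as the comments in the asm block state):

    #include <cstdio>

    // All four inputs assumed to be float, matching the comments in the asm block.
    static int toCoord(float rw, float lwB, float sp, float dimen) {
        return (int)((rw - lwB) / sp * dimen); // truncates toward zero, like the asm version
    }

    int main() {
        printf("%d\n", toCoord(10.0f, 2.0f, 4.0f, 100.0f)); // prints 200
    }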
michael874
    Why not look at the assembly the compiler generates? – Raymond Chen Mar 30 '17 at 05:03
    Today, compilers optimize better than humans realistically can. So give up the idea of writing efficient code in assembler by hand. Trust your optimizing compiler. – Basile Starynkevitch Mar 30 '17 at 05:03
  • Your assembly code is for the x87, which is obsolescent (at best). Your compiler is undoubtedly using at least SSE 4.2, and quite possibly something newer and faster still (e.g., AVX or AVX 2). – Jerry Coffin Mar 30 '17 at 05:06
  • AVX and SSE help in SIMD operations. For scalar linear operation, they may not make much difference. – prashanthns Mar 30 '17 at 05:17
    Your assembly code is pretty bad, lots of unnecessary pushes, so you would expect the compiler to generate better code. For example Microsoft's C++ compiler generates code like this: https://godbolt.org/g/1qNcFA – Ross Ridge Mar 30 '17 at 05:30
    Are the variables float or double? How is the assembler supposed to know the variable size with the instruction `fld [esp]`? Why can't the assembly code use `fld dword ptr [rw]` instead of pushing onto the stack and loading? (A sketch of that variant follows these comments.) – rcgldr Mar 30 '17 at 06:38
    @prashanthns: even when you're not doing SIMD, they can help. Just for example, if you want single precision division, on the x87 you have to change the floating point control word to set the precision to 32 bits, then do the division. With AVX, you have a separate instruction specifically for single precision. Oh, and if you're doing this in a loop, auto-vectorizing is entirely possible as well. – Jerry Coffin Mar 30 '17 at 06:53
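
Here is a sketch of the variant rcgldr suggests, loading each operand directly instead of bouncing it through the stack (illustrative only, assuming all four variables are float; this block is not from the original post):

    __asm {
        fld  dword ptr[rw]    // st(0) = rw
        fsub dword ptr[lwB]   // st(0) = rw - lwB
        fdiv dword ptr[sp]    // st(0) = (rw - lwB) / sp
        fmul dword ptr[dimen] // st(0) = ((rw - lwB) / sp) * dimen
        fstp dword ptr[fCord] // store the result and pop the x87 stack
    }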

2 Answers


Today's compilers are much more advanced than most hand-coded assembly: they apply optimizations such as static branch prediction and reduction of memory operations. This does not mean hand-coded assembly is always bad, but in most cases the compiler does an equally good or better job of optimization when configured with the right flags. In your case, you have used a lot of stack operations, and every one of them is a memory load or store, which is expensive in terms of CPU cycles. This could be the reason for the slower performance. Look at the disassembly of your C++ implementation compiled in release mode to compare your hand-coded assembly with the compiler-generated output, as shown below.
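
One way to see that output with MSVC (the file name coord.cpp here is just an example) is to ask the compiler for an assembly listing:

    cl /O2 /FAs coord.cpp

This writes coord.asm next to the source, with the optimized instructions interleaved with the original C++ lines, making the comparison with the hand-written __asm block straightforward.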

prashanthns

Thanks all. I had a pretty strange issue, but it might just be common: I had the inline assembly in a separate function and called it for the calculations. After moving the assembly into the place where that function was called, the issue is fixed. I'm sure there is a bigger lesson at hand, though.

Obviously, the code is inefficient and the comments/answers are helpful in general, although my problem was a bit different.

For anybody wondering, here is the optimized assembly code that the compiler generated:

    float finCord;
    __asm {
        movss xmm0, dword ptr[rw]      // xmm0 = rw
        subss xmm0, dword ptr[lwB]     // xmm0 = rw - lwB
        divss xmm0, dword ptr[sp]      // xmm0 = (rw - lwB) / sp
        mulss xmm0, dword ptr[dimen]   // xmm0 = ((rw - lwB) / sp) * dimen
        movss dword ptr[finCord], xmm0 // store the single-precision result
    }
    int answer = (int)finCord;
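
For completeness, the (int) cast needs no x87 code either: the compiler typically folds the conversion into the same sequence with a single truncating convert (a sketch, not the exact compiler output):

    cvttss2si eax, xmm0 // truncate the float in xmm0 toward zero into a 32-bit int in eax

so the whole expression, cast included, stays in SSE registers with no temporary stack slots.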
michael874