Extended assembly, floating point division

Question

I'm trying to use GCC extended assembly to divide arrays of 32-bit floats, where each element is a struct holding four floats (a 4-float vector).

This is the compiler error:

sisd.c: In function ‘divSISD’:

sisd.c:182:9: error: ‘asm’ operand has impossible constraints
     asm(
     ^~~

Code:

void divSISD(Vector4f* vecA, Vector4f* vecB, Vector4f* result, int size){
    for(int i = 0; i < size; i++){
        asm(
            "fld %4\n\t"
            "fdiv %5\n\t"
            "fstp %0\n\t"

            "fld %6\n\t"
            "fdiv %7\n\t"
            "fstp %1\n\t"

            "fld %8\n\t"
            "fdiv %9\n\t"
            "fstp %2\n\t"

            "fld %10\n\t"
            "fdiv %11\n\t"
            "fstp %3\n\t"

            : "=m" (result[i].a),
              "=m" (result[i].b),
              "=m" (result[i].c),
              "=m" (result[i].d)
            : "m" (vecA[i].a), "m" (vecB[i].a),
              "m" (vecA[i].b), "m" (vecB[i].b),
              "m" (vecA[i].c), "m" (vecB[i].c),
              "m" (vecA[i].d), "m" (vecB[i].d)
        );
    }
}

It compiles fine if I pass the structs by value instead of through pointers, like this:

void divSISD(Vector4f vecA, Vector4f vecB, Vector4f result, int size){
    asm( 

        "fld %4\n\t"
        "fdiv %5\n\t"
        "fstp %0\n\t"

        "fld %6\n\t"
        "fdiv %7\n\t"
        "fstp %1\n\t"

        "fld %8\n\t"
        "fdiv %9\n\t"
        "fstp %2\n\t"

        "fld %10\n\t"
        "fdiv %11\n\t"
        "fstp %3\n\t"


 :      "=m" (result.a), 
        "=m" (result.b), 
        "=m" (result.c),
        "=m" (result.d)  
 :      "m" (vecA.a), "m" (vecB.a), 
        "m" (vecA.b), "m" (vecB.b), 
        "m" (vecA.c), "m" (vecB.c), 
        "m" (vecA.d), "m" (vecB.d) 

        );
}

I can't understand why the pointer version fails, since the "m" constraint should be satisfiable in both cases.

Just out of curiosity: why are you using x87 instructions when you've got arrays of floats as your input parameters? These are much better suited to SIMD operations. Are you sure that you don't want to use SSE/AVX instructions (in this case, `DIVPS` comes to mind) instead? — Daniel Kamil Kozar, Apr 18 '18 at 23:02
@DanielKamilKozar I'm actually doing both: measuring execution time for SIMD and SISD and then comparing them. That's my uni assignment (not sure if that's allowed here, though I'm not asking for the whole thing directly) — Łukasz Stawikowski, Apr 18 '18 at 23:19
You can ask whatever you want as long as it's interesting / worth answering / might be useful to future readers. [ask]. It's up to you to deal with the ethics of getting help for your homework. See also [How do I ask and answer homework questions?](//meta.stackoverflow.com/q/334822). — Peter Cordes, Apr 18 '18 at 23:27
Did you actually have to use x87 for scalar? `divss %xmm0, %xmm1` does scalar single-precision division on the low element, leaving the rest unmodified. You can use vector shuffles instead of loading/storing 4 elements separately, but make sure to unpack to separate registers and shuffle back together so you don't create serial dependencies between separate `divss` instructions for the same vector. And BTW, on Haswell/Skylake `divss` and 128-bit `divps` have the same high throughput, better than `fdiv`. But 256-bit `vdivps ymm` has worse throughput. http://agner.org/optimize/ — Peter Cordes, Apr 18 '18 at 23:34
Works for me with optimization enabled; https://godbolt.org/g/HsijVm; probably without optimization gcc ran out of registers trying to use a different pointer register for each addressing mode. You wouldn't have this problem if you split it into separate asm statements for each operation (e.g. in a simple wrapper function), or if you wrote the whole loop in asm yourself instead of using inline asm for the loop body by making the compiler generate the loop control instructions. — Peter Cordes, Apr 18 '18 at 23:40
I knew I'd seen this before, turns out it was me that wrote the answer showing the with/without optimization code-gen / register-use choices by gcc in the duplicate I found with lots of `"m"` operands not compiling without optimization :P — Peter Cordes, Apr 18 '18 at 23:44
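
A minimal sketch of the workaround suggested in the comments above (one asm statement per division, hidden in a small wrapper function), assuming an x86 target and GCC or clang. The `Vector4f` layout shown here is an assumption based on the field names used in the question, and `div_x87` is a hypothetical helper name, not something from the original code. Each asm statement now has only three memory operands, so the compiler should not run out of address registers even without optimization. The explicit `s` suffixes (`flds`, `fdivs`, `fstps`) make the 32-bit operand size unambiguous for the assembler.

```c
/* Assumed layout: four 32-bit floats named as in the question. */
typedef struct { float a, b, c, d; } Vector4f;

/* Hypothetical helper: one x87 division per asm statement.
 * Only three memory operands are needed at a time. */
static inline float div_x87(float x, float y)
{
    float q;
    __asm__ ("flds  %1\n\t"   /* push x onto the x87 stack      */
             "fdivs %2\n\t"   /* st(0) = st(0) / y              */
             "fstps %0"       /* store the quotient to q, pop   */
             : "=m" (q)
             : "m" (x), "m" (y)
             : "st(7)");      /* the push needs one free x87 slot */
    return q;
}

/* Same interface as the question's function; the loop itself is plain C. */
void divSISD(const Vector4f *vecA, const Vector4f *vecB,
             Vector4f *result, int size)
{
    for (int i = 0; i < size; i++) {
        result[i].a = div_x87(vecA[i].a, vecB[i].a);
        result[i].b = div_x87(vecA[i].b, vecB[i].b);
        result[i].c = div_x87(vecA[i].c, vecB[i].c);
        result[i].d = div_x87(vecA[i].d, vecB[i].d);
    }
}
```

Without optimization the wrapper adds call and copy overhead, so for timing comparisons it probably makes sense to build both the SISD and SIMD versions with the same optimization level.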
