SSE strangeness with Functions

Question

I've been playing around with D's inline assembler and SSE, but found something I don't understand. When I try to add two float4 vectors immediately after declaration, the calculation is correct. If I put the calculation in a separate function, I get a series of NaNs.

//function contents identical to code section in unittest
float4 add(float4 lhs, float4 rhs)
{
    float4 res;
    auto lhs_addr = &lhs;
    auto rhs_addr = &rhs;
    asm
    {
        mov RAX, lhs_addr;
        mov RBX, rhs_addr;
        movups XMM0, [RAX];
        movups XMM1, [RBX];

        addps XMM0, XMM1;
        movups res, XMM0;
    }
    return res;
}

unittest
{
    float4 lhs = {1, 2, 3, 4};
    float4 rhs = {4, 3, 2, 1};

    println(add(lhs, rhs)); //float4(nan, nan, nan, nan)

    //identical code starts here
    float4 res;
    auto lhs_addr = &lhs;
    auto rhs_addr = &rhs;
    asm
    {
        mov RAX, lhs_addr;
        mov RBX, rhs_addr;
        movups XMM0, [RAX];
        movups XMM1, [RBX];

        addps XMM0, XMM1;
        movups res, XMM0;
    } //end identical code
    println(res); //float4(5, 5, 5, 5)
}

The assembly is functionally identical (as far as I can tell) to this link.

Edit: I'm using a custom float4 struct (for now, it's just an array) because I want to be able to have an add function like float4 add(float4 lhs, float rhs). For the moment, that results in a compiler error like this:

Error: floating point constant expression expected instead of rhs

Note: I'm using DMD 2.071.0

I hope you're not planning to actually use code like that for performance. You're forcing the compiler to store the vectors to memory, and then store those addresses to memory. Then you're writing code that loads the addresses from memory and re-loads the vectors. (In the Windows `vectorcall` ABI, and in the SysV ABI used by all other AMD64 systems, vector args are passed in vector registers). In D, IDK if `lhs_addr` can actually be a register, but it's still a useless reg-reg move at best. Ideally there's a syntax for asking for the vectors in specific regs, like GNU C inline asm. — Peter Cordes, Apr 20 '16 at 23:34
I'm trying to write my own float4 type because `float4 rhs_vec = [rhs, rhs, rhs, rhs];` is disallowed with the error in the question. At the moment I've essentially copy-pasted the code from the link, and did some minor tweaking to (hopefully) make it work in D. At the moment, it just has to work. However, what would you do instead for the section? This is only my third foray into assembly, so any correction would be appreciated. — Straivers, Apr 20 '16 at 23:48
I don't know D, I just saw this question because of the SSE and inline-asm tags. Does D have anything like the Intel C intrinsics? `__m128 my_vec = _mm_add_ps(vec1, vec2);`? If so, you'll probably do better with that. If D's inline asm syntax is limited to passing data through memory, not registers, then [it's only useful if you want to write a whole loop inside it, not as a wrapper for a few instructions](http://stackoverflow.com/questions/3323445/what-is-the-difference-between-asm-and-asm/35959859#3595985). If it supports GNU-C style asm statements, then use that. — Peter Cordes, Apr 20 '16 at 23:59
Well, just tried `__simd(XMM.ADDPS, lhs, rhs)`. I think it added the pointers together... — Straivers, Apr 21 '16 at 00:07
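
For reference, here is a minimal sketch of the intrinsic-style route suggested in the comments above. It assumes the vectors are the built-in `float4` from `core.simd` (as in the answer below) rather than a custom struct, and that the target defines `D_SIMD` (e.g. DMD on x86_64). The name `addVec` is hypothetical and not part of the thread; the sketch relies only on the element-wise `+` that D's vector types provide, so no inline asm or address-taking is involved.

```d
import core.simd;
import std.stdio;

// Hypothetical helper (not from the thread): adds two core.simd vectors
// with the built-in vector operator instead of inline asm.
float4 addVec(float4 lhs, float4 rhs)
{
    return lhs + rhs; // element-wise addition on the vector type
}

void main()
{
    float4 a = [1, 2, 3, 4];
    float4 b = [4, 3, 2, 1];

    float4 r = addVec(a, b);
    writeln(r.array); // print the four lanes as a float[4]
}
```

Whether this fits depends on why the custom struct was needed in the first place; it does not address the `float4 add(float4 lhs, float rhs)` overload mentioned in the question.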

score 2 · Answer 1 · edited Apr 23 '16 at 09:23

Your code is weird. What version of dmd do you use? This works as expected:

import std.stdio;
import core.simd;

float4 add(float4 lhs, float4 rhs)
{
    float4 res;
    auto lhs_addr = &lhs;
    auto rhs_addr = &rhs;
    asm
    {
        mov RAX, lhs_addr;
        mov RBX, rhs_addr;
        movups XMM0, [RAX];
        movups XMM1, [RBX];

        addps XMM0, XMM1;
        movups res, XMM0;
    }
    return res;
}

void main()
{
    float4 lhs = [1, 2, 3, 4];
    float4 rhs = [4, 3, 2, 1];

    auto r = add(lhs, rhs);
    writeln(r.array); //[5, 5, 5, 5]

    //identical code starts here
    float4 res;
    auto lhs_addr = &lhs;
    auto rhs_addr = &rhs;
    asm
    {
        mov RAX, lhs_addr;
        mov RBX, rhs_addr;
        movups XMM0, [RAX];
        movups XMM1, [RBX];

        addps XMM0, XMM1;
        movups res, XMM0;
    } //end identical code
    writeln(res.array); //[5, 5, 5, 5]
}

edited Apr 23 '16 at 09:23 by Max Alibaev
answered Apr 20 '16 at 05:12 by Kozzi11

Thank you for your help, but I just realized I failed to include some things in my question, they've since been added to the question. – Straivers Apr 20 '16 at 19:00
@Straivers , did Kozzi11 actually answer your question or not? – DejanLekic Apr 21 '16 at 16:54
No. Or rather, he answered the original question, which made me realize that I phrased the question incorrectly. An error that has since been corrected. – Straivers Apr 21 '16 at 22:55
@Straivers can you please put your full code somewhere. It would be really helpful. – Kozzi11 Apr 23 '16 at 05:13
