What are the performance implications of an inheritance like this?

Question

I'm working with the DirectXMath (or XNAMath) library (defined in the DirectXMath.h header of the Windows SDK), as it appears to be really performant and offers everything that is needed for physics and rendering. However, I find it quite verbose (using XMStoreFloatX and XMLoadFloatX everywhere is tiring).

I am trying to make it a little easier to operate and came up with the idea to hide the Stores/Loads in assignment operators/conversion operators. As both of these are required to be member functions I came up with this code as an example:

struct Vector2F : public DirectX::XMFLOAT2 {
    inline Vector2F() : DirectX::XMFLOAT2() {}
    inline Vector2F(float x, float y) : DirectX::XMFLOAT2(x, y) {}
    inline Vector2F(float const * pArray) : DirectX::XMFLOAT2(pArray) {}

    inline Vector2F(DirectX::XMVECTOR vector) {
        DirectX::XMStoreFloat2(this, vector);
    }
    inline Vector2F& __vectorcall operator= (DirectX::XMVECTOR vector) {
        DirectX::XMStoreFloat2(this, vector);
        return *this;
    }

    inline __vectorcall operator DirectX::XMVECTOR() const {
        return DirectX::XMLoadFloat2(this);
    }
};

As you can see it replicates the public interface of XMFLOAT2 and adds a constructor, an assignment operator and a conversion for XMVECTOR, which is the SIMD type DirectXMath uses for calculations. I intend to do this for every storage struct DirectXMath offers.

Performance is a really important factor for a math library, thus my question is: What are the performance implications of such an inheritance? Is there any additional code generated (of course assuming full optimization) compared to normal usage of the library?

Intuitively I would say that the generated code should be exactly the same as when I'm using the verbose variant without these convenience operators, as I am essentially just renaming structs and functions. But maybe there are aspects I don't know about?

P.S. I'm a little concerned about the return type of the assignment operator, as it adds additional code. Would it be a good idea to omit returning the reference in order to optimize it?

I think you'll find that in optimised code, the performance penalty will be zero. — Richard Hodges, Sep 17 '16 at 10:44
If the returned value isn't used, the compiler will extremely likely optimize out `return *this;`. This is a common idiom that the compilers are well aware of. — Bo Persson, Sep 17 '16 at 11:09
You have two implicit conversions: 1) From XMVECTOR and 2) To XMVECTOR. That is a great possibility for ambiguities. Do not do that (do not use Vector2F). — Dieter Lücking, Sep 17 '16 at 11:57
@DieterLücking Of course I have the possibility to make those conversions explicit, but the stated intent of this code is to make dealing with DirectXMath less verbose. For anyone slightly used to DirectXMath the possible ambiguity should actually be a non-issue. I will rethink this and decide later whether to make conversions explicit. — LukeG, Sep 17 '16 at 15:00

Chuck Walbourn · Answer 1 · 2016-09-19T06:03:08.657

If you find that DirectXMath is a little too verbose for your tastes, take a look at SimpleMath in the DirectX Tool Kit. In particular, the Vector2 class:

struct Vector2 : public XMFLOAT2
{
    Vector2() : XMFLOAT2(0.f, 0.f) {}
    explicit Vector2(float x) : XMFLOAT2( x, x ) {}
    Vector2(float _x, float _y) : XMFLOAT2(_x, _y) {}
    explicit Vector2(_In_reads_(2) const float *pArray) : XMFLOAT2(pArray) {}
    Vector2(FXMVECTOR V) { XMStoreFloat2( this, V ); }
    Vector2(const XMFLOAT2& V) { this->x = V.x; this->y = V.y; }
    explicit Vector2(const XMVECTORF32& F) { this->x = F.f[0]; this->y = F.f[1]; }

    operator XMVECTOR() const { return XMLoadFloat2( this ); }

    // Comparison operators
    bool operator == ( const Vector2& V ) const;
    bool operator != ( const Vector2& V ) const;

    // Assignment operators
    Vector2& operator= (const Vector2& V) { x = V.x; y = V.y; return *this; }
    Vector2& operator= (const XMFLOAT2& V) { x = V.x; y = V.y; return *this; }
    Vector2& operator= (const XMVECTORF32& F) { x = F.f[0]; y = F.f[1]; return *this; }
    Vector2& operator+= (const Vector2& V);
    Vector2& operator-= (const Vector2& V);
    Vector2& operator*= (const Vector2& V);
    Vector2& operator*= (float S);
    Vector2& operator/= (float S);

    // Unary operators
    Vector2 operator+ () const { return *this; }
    Vector2 operator- () const { return Vector2(-x, -y); }

    // Vector operations
    bool InBounds( const Vector2& Bounds ) const;

    float Length() const;
    float LengthSquared() const;

    float Dot( const Vector2& V ) const;
    void Cross( const Vector2& V, Vector2& result ) const;
    Vector2 Cross( const Vector2& V ) const;

    void Normalize();
    void Normalize( Vector2& result ) const;

    void Clamp( const Vector2& vmin, const Vector2& vmax );
    void Clamp( const Vector2& vmin, const Vector2& vmax, Vector2& result ) const;

    // Static functions
    static float Distance( const Vector2& v1, const Vector2& v2 );
    static float DistanceSquared( const Vector2& v1, const Vector2& v2 );

    static void Min( const Vector2& v1, const Vector2& v2, Vector2& result );
    static Vector2 Min( const Vector2& v1, const Vector2& v2 );

    static void Max( const Vector2& v1, const Vector2& v2, Vector2& result );
    static Vector2 Max( const Vector2& v1, const Vector2& v2 );

    static void Lerp( const Vector2& v1, const Vector2& v2, float t, Vector2& result );
    static Vector2 Lerp( const Vector2& v1, const Vector2& v2, float t );

    static void SmoothStep( const Vector2& v1, const Vector2& v2, float t, Vector2& result );
    static Vector2 SmoothStep( const Vector2& v1, const Vector2& v2, float t );

    static void Barycentric( const Vector2& v1, const Vector2& v2, const Vector2& v3, float f, float g, Vector2& result );
    static Vector2 Barycentric( const Vector2& v1, const Vector2& v2, const Vector2& v3, float f, float g );

    static void CatmullRom( const Vector2& v1, const Vector2& v2, const Vector2& v3, const Vector2& v4, float t, Vector2& result );
    static Vector2 CatmullRom( const Vector2& v1, const Vector2& v2, const Vector2& v3, const Vector2& v4, float t );

    static void Hermite( const Vector2& v1, const Vector2& t1, const Vector2& v2, const Vector2& t2, float t, Vector2& result );
    static Vector2 Hermite( const Vector2& v1, const Vector2& t1, const Vector2& v2, const Vector2& t2, float t );

    static void Reflect( const Vector2& ivec, const Vector2& nvec, Vector2& result );
    static Vector2 Reflect( const Vector2& ivec, const Vector2& nvec );

    static void Refract( const Vector2& ivec, const Vector2& nvec, float refractionIndex, Vector2& result );
    static Vector2 Refract( const Vector2& ivec, const Vector2& nvec, float refractionIndex );

    static void Transform( const Vector2& v, const Quaternion& quat, Vector2& result );
    static Vector2 Transform( const Vector2& v, const Quaternion& quat );

    static void Transform( const Vector2& v, const Matrix& m, Vector2& result );
    static Vector2 Transform( const Vector2& v, const Matrix& m );
    static void Transform( _In_reads_(count) const Vector2* varray, size_t count, const Matrix& m, _Out_writes_(count) Vector2* resultArray );

    static void Transform( const Vector2& v, const Matrix& m, Vector4& result );
    static void Transform( _In_reads_(count) const Vector2* varray, size_t count, const Matrix& m, _Out_writes_(count) Vector4* resultArray );

    static void TransformNormal( const Vector2& v, const Matrix& m, Vector2& result );
    static Vector2 TransformNormal( const Vector2& v, const Matrix& m );
    static void TransformNormal( _In_reads_(count) const Vector2* varray, size_t count, const Matrix& m, _Out_writes_(count) Vector2* resultArray );

    // Constants
    static const Vector2 Zero;
    static const Vector2 One;
    static const Vector2 UnitX;
    static const Vector2 UnitY;
};

// Binary operators
Vector2 operator+ (const Vector2& V1, const Vector2& V2);
Vector2 operator- (const Vector2& V1, const Vector2& V2);
Vector2 operator* (const Vector2& V1, const Vector2& V2);
Vector2 operator* (const Vector2& V, float S);
Vector2 operator/ (const Vector2& V1, const Vector2& V2);
Vector2 operator* (float S, const Vector2& V);

The main reason that DirectXMath is so verbose in the first place is to make it very clear to the programmer when 'spilling to memory' as this tends to negatively impact the performance of SIMD code. When I moved from XNAMath to DirectXMath, I had considered adding something like the implicit conversions I used for "SimpleMath", but I wanted to make sure that any such "C++ magic" was opt-in and never a surprise for a performance-sensitive developer. SimpleMath also acts a bit like training wheels making it easier to port existing code that is not alignment-aware and morph it into something more SIMD-friendly over time.

The real performance issue with SimpleMath (and your wrapper) is that each function implementation has to do an explicit Load & Store around what is otherwise a fairly small amount of SIMD. Ideally in optimized code it would all get merged away, but in debug code they are always there. For any real performance benefit from SIMD, you want to have long runs of in-register SIMD operations between each Load & Store pair.
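To make the Load & Store cost concrete, here is a minimal sketch. It deliberately does not use DirectXMath: `Float2`, `Reg`, `Load`, `Store`, `Add`, `Scale`, and `Vec2` are hypothetical stand-ins for XMFLOAT2, XMVECTOR, XMLoadFloat2, XMStoreFloat2, and a wrapper type, and the counters only model what unoptimized code does. An optimizing compiler may well merge some of these transfers away in the real library.

```cpp
#include <cassert>

// Hypothetical stand-ins, NOT the DirectXMath API:
// Float2 plays the role of XMFLOAT2 (memory), Reg the role of XMVECTOR (register).
struct Float2 { float x, y; };
struct Reg    { float v[4]; };

// Counters that make every memory <-> register transfer visible.
static int g_loads  = 0;
static int g_stores = 0;
inline void ResetCounters() { g_loads = 0; g_stores = 0; }

inline Reg  Load(const Float2& f)          { ++g_loads;  return Reg{{f.x, f.y, 0.f, 0.f}}; }
inline void Store(Float2& f, const Reg& r) { ++g_stores; f.x = r.v[0]; f.y = r.v[1]; }

// "In-register" arithmetic: no loads or stores are counted here.
inline Reg Add(const Reg& a, const Reg& b) {
    Reg r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}
inline Reg Scale(const Reg& a, float s) {
    Reg r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * s;
    return r;
}

// Wrapper style: every operator loads its operands and stores its result.
struct Vec2 : Float2 {
    Vec2(float x_, float y_) : Float2{x_, y_} {}
    explicit Vec2(const Reg& r) : Float2{} { Store(*this, r); }
};
inline Vec2 operator+(const Vec2& a, const Vec2& b) { return Vec2(Add(Load(a), Load(b))); }
inline Vec2 operator*(const Vec2& a, float s)       { return Vec2(Scale(Load(a), s)); }

// p + v*dt + a*(dt^2/2) written with the convenience operators.
inline Vec2 StepWrapper(const Vec2& p, const Vec2& v, const Vec2& a, float dt) {
    return p + v * dt + a * (0.5f * dt * dt);
}

// The same expression as one run of register operations between a single
// group of loads and a single store.
inline Vec2 StepExplicit(const Vec2& p, const Vec2& v, const Vec2& a, float dt) {
    Reg rp = Load(p), rv = Load(v), ra = Load(a);
    Reg r  = Add(Add(rp, Scale(rv, dt)), Scale(ra, 0.5f * dt * dt));
    return Vec2(r);
}

// Usage sketch: same result, different number of transfers.
inline void Demo() {
    Vec2 p(1.f, 2.f), v(3.f, 4.f), a(2.f, 2.f);

    ResetCounters();
    Vec2 w = StepWrapper(p, v, a, 2.f);
    assert(g_loads == 6 && g_stores == 4);   // four operators, each with its own Load/Store

    ResetCounters();
    Vec2 e = StepExplicit(p, v, a, 2.f);
    assert(g_loads == 3 && g_stores == 1);   // one Load per input, one Store for the result

    assert(w.x == e.x && w.y == e.y);
}
```

In this model the operator version performs 6 loads and 4 stores for four arithmetic operations, while the explicit version performs 3 loads and 1 store. That gap is what remains in debug builds, and it is why longer in-register runs pay off.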

Another implication is that passing a wrapper like Vector2 or your Vector2F as a parameter will never be particularly efficient. The whole reason that XMVECTOR is a typedef for __m128 rather than a struct, and the reason FXMVECTOR, GXMVECTOR, HXMVECTOR, and CXMVECTOR exist, is to try to optimize all the possible calling convention scenarios and in the best case get in-register passing behavior (if things don't inline). See MSDN. Really the best you can do with Vector2 is to consistently pass it by const& to minimize temporaries and stack copies.
