2

I am implementing a particle-based fluid simulation. To represent vectors such as velocity, acceleration etc I have defined a class that looks like this

class Vec3f {
public:
    float x, y, z;

    // ... bunch of constructors, operators and utility functions
}

I'm using the library nanoflann for kd-tree searches. To accommodate for arbitrary class designs, nanoflann requires a user-defined adaptor class that the kd-tree class then queries to get info about the particle dataset. One function that the adaptor has to offer, as described in the nanoflann documentation is the following.

// Must return the dim'th component of the idx'th point in the class:
inline T kdtree_get_pt(const size_t idx, int dim) const { ... }

The problem is that this interface does not work seamlessly with the x, y, z representation. Naively, it would need to do something like this

inline float kdtree_get_pt(const size_t idx, int dim) const {
    switch(dim) {
    case 0:
        return particlearray[idx].x;
    case 1:
        return particlearray[idx].y;
    case 2:
        return particlearray[idx].z;
    }
}

Building and querying the kd-tree consumes a significant portion of my app's runtime and kd_tree_get_pt gets queried multiple times in the process so I need it to be optimized. The following solution should be faster.

class Vec3f {
public:
    float values[3];
    // ...
}

// Then inside the adaptor class
inline float kdtree_get_pt(const size_t idx, int dim) const {
    return particlearrray[idx].values[dim];
}

However, I much prefer the x, y, z interface for my equations. Now that the problem is clear, my question is how can I keep the x, y, z notation for my equations without making kdtree_get_pt suboptimal.

Solutions I have considered:

  • Vec3f has member float values[3] and getters in the form of float& x(). The function call should be optimized away completely so this almost works but I do not want to have to add the parentheses in my equations if I can avoid it. eg I want to be able to write vec1.x - vec2.x instead of vec1.x() - vec2.x(). As far as I know, C++ does not offer a way to "disguise" a function call as a member variable, excluding preprocessor macros which I do not consider a safe solution.

  • Vec3f has members float values[3] and float& x, y, z where the latter are initialized to point to the corresponding floats in the array. I thought they would be optimized away as they are known at compile time and obviously cannot change value after initialization, but even with optimizations on, MSVC++ seems to actually store the float&s as can be seen by sizeof(Vec3f) growing by 12 bytes after their addition. This doubles the storage size of my dataset which raises a concern for cache misses when working with arbitrarily large datasets.

  • kdtree_get_pt uses float& values[3] to point to x, y, z. This might eliminate the branching cost, but I don't believe the extra level of indirection, nor the need to initialize all 3 references can be optimized away so it is presumably slower than the return particlearrray[idx][dim]` version.

  • kdtree_get_pt uses reinterpret_cast or pointer magic to directly point into Vec3f's members. Given a Vec3f object's address, I believe x, y, z are guaranteed to be stored in that order with the first one stored at the same address as the one given by the & operator on the Vec3f object, but even so I'm confused as to whether there exists a well-defined way to observe them.

patatahooligan
  • 3,111
  • 1
  • 18
  • 27
  • 1
    Just out of curiosity, how much faster was the implementation using the array? To be clear: You're positive the switch-branching is your bottleneck, not something like cache locality on the `particlearrray[idx]` part? – Mark B Dec 05 '17 at 17:20
  • 2
    "The following solution should be faster." Wait - you're doing this on a hunch? I wouldn't expect to be any faster since the compiler will still have allocated contiguous memory for all 3 variables alligned in the same way however you do it. – UKMonkey Dec 05 '17 at 17:23
  • Maybe using POD unions? – Jean-Baptiste Yunès Dec 05 '17 at 17:27
  • @UKMonkey as it stands, without an answer to my question I cannot switch to the array implementation without rewriting the physics simulation part (because it uses x,y,z) so it is untested. However the contiguous memory is irrelevant. It's the fact that `kdtree_get_p` cannot be implemented as a single array indexing operation that's the problem. – patatahooligan Dec 05 '17 at 17:31
  • You should really test if `particlearrray[idx].values[dim]` would be better then `switch` if you compile your code with optimizations activate. For the shown code, modern compiler should produce the same compiled result. – t.niese Dec 05 '17 at 17:35
  • 2
    @patatahooligan it's the same thing though. ".x" vs array[0] will be accessing exactly the same piece of memory - there will be no performance change. I think you'd have been better off making a small test program to prove this to yourself. Where you might get a performance change is if you're doing a lot of tests with x, then y, then z, and you refactor your code so that all your x values are in contiguous memory (or better yet, sorted in some way so that your branch misses are minimised). – UKMonkey Dec 05 '17 at 17:36
  • I didn't even consider the switch statement being optimized away. I'm on the task of writing an example on godbolt because changing it without breaking my project would take longer. I'll check the assembly for both versions and update. – patatahooligan Dec 05 '17 at 17:56
  • The compiler cannot optimize the switch away if the index is determined at runtime. See [here](https://godbolt.org/g/6Snku8). – Jodocus Dec 05 '17 at 18:04
  • I expect nanoflann to only use compile-time constant indices so that is not a factor. I wrote a contrived godbolt example (https://godbolt.org/g/ezbnDs) to force the compiler to actually access the member variables and it seems that as long as the indices are constants, the switch statement is indeed optimized away. – patatahooligan Dec 05 '17 at 18:28
  • @patatahooligan: If you are willing to accept UB code, then your last option is the way to go. Or use a union, with a `float[3]` member. Both of which are UB, but they work in my experience. There's a question about it: https://stackoverflow.com/questions/47489087/access-members-of-class-with-pointer. If performance is important, and the best solution has UB, but actually works, I choose the performant way, rather than a slow one. Keep the slow version there in an #ifdef, so you can choose between the two versions. – geza Dec 05 '17 at 18:55
  • @geza I'm not willing to rely on UB. However, as far as I know there are well-defined cases of `reinterpret_cast` to inspect the members of a class (for example cppreference.com mentions that casting to the first member of a standard layout struct works). Unions also work for layout-compatible types it seems. The reason I didn't directly go for such a solution is that I'm having a hard time distinguishing well-defined from undefined ways to do this and I am hoping someone more knowledgeable on the subject can post an answer that is provably well-defined. – patatahooligan Dec 05 '17 at 19:01
  • @patatahooligan: as far as I know, there is no well-defined way to express this. The only solution would be the `switch` one, but unfortunately neither gcc or clang optimize the general case (even with `if (dim<0||dim>2) __builtin_unreachable();`) – geza Dec 05 '17 at 19:10
  • @patatahooligan: btw, UB means in this case, that the compiled code works, and fast. The only problem could be if a new/other compiler compiles the code, and it doesn't work any more. But it is unlikely that it will break. For example GCC explicitly supports aliasing through a union. MSVC doesn't do aliasing-based optimization at all (as far as I know). Clang is compatible with GCC in this regard. So, which these 3 compilers, there is nothing to fear of, in my opinion. And even, if this ever break, you can just activate the slow-path in the #ifdef. – geza Dec 05 '17 at 19:13

4 Answers4

4

From a software engineering standpoint, it's best to expose the data through accessor and modifier functions only.

I would suggest:

class Vec3f
{
   public:

      float& operator[](size_t index) { return values[index]; }
      float operator[](size_t index) const { return values[index]; }

      float& x() { return values[0]; }
      float x() const { return values[0]; }

      float& y() { return values[1]; }
      float y() const { return values[1]; }

      float& z() { return values[2]; }
      float z() const { return values[2]; }

   private:
      float values[3];
}

Re: kdtree_get_pt uses reinterpret_cast or pointer magic to directly point into Vec3f's members.

That's a bad idea in general. However, I don't see that being a problem with my suggestion.

R Sahu
  • 204,454
  • 14
  • 159
  • 270
  • 1
    While I agree that setters and getters are good design, they are a bit on the over-engineering side in this case. Vec3f is basically just a struct; it's main job is to hold values. Specifically, I chose not to implement it this way because the extra parentheses in every member access made the simulation code noticeably harder to read. – patatahooligan Dec 05 '17 at 18:38
  • @patatahooligan, you have to assess the trade offs and choose a solution that makes most sense to you. – R Sahu Dec 05 '17 at 18:54
2

You should always check if the switch statements will really introduce branchings in the final compiled output. A tool that might help you there is godbolt.

For both of those code snippets (random and cout have been added to prevent complete removeal of the code):

#include<cstddef>
#include<array>
#include<iostream>
#include <ctime>

class Vec3f {
public:
    float values[3];
};

struct Test {
    std::array<Vec3f,100> particlearray;

    float kdtree_get_pt(const size_t idx, int dim) const {
        return particlearray[idx].values[dim];
    }
};

int main() {
   Test t;

   std::srand(std::time(0));
   int random_variable = std::rand();

   std::cout << t.kdtree_get_pt(random_variable,0);
   std::cout << t.kdtree_get_pt(random_variable,1);
   std::cout << t.kdtree_get_pt(random_variable,2) << std::endl;

   return 0;
}

and

#include<iostream>
#include<array>
#include<ctime>
#include<cstdlib>
#include<cstddef>


class Vec3f {
public:
    float x, y, z;
};

struct Test {
    std::array<Vec3f,100> particlearray;

    float kdtree_get_pt(const size_t idx, int dim) const {
        switch(dim) {
        case 0:
            return particlearray[idx].x;
        case 1:
            return particlearray[idx].y;
        case 2:
            return particlearray[idx].z;
        }
    }
};

int main() {
   Test t;

   std::srand(std::time(0));
   int random_variable = std::rand();

   std::cout << t.kdtree_get_pt(random_variable,0);
   std::cout << t.kdtree_get_pt(random_variable,1);
   std::cout << t.kdtree_get_pt(random_variable,2) << std::endl;


   return 0;
}

The access to x, y and z or values[dim] will be compiled (by gcc 7) to:

cvtss2sd xmm0, DWORD PTR [rsp+rbx]
cvtss2sd xmm0, DWORD PTR [rsp+4+rbx]
cvtss2sd xmm0, DWORD PTR [rsp+8+rbx]

Without any branching.

t.niese
  • 39,256
  • 9
  • 74
  • 101
  • Your indices are known at compile time, so yes, switch most likely to be optimized out. For run-time indices, switch cannot be optimized out. – Severin Pappadeux Dec 05 '17 at 18:14
  • To prevent full inlining, simply use `argc` to add to your indexes. – Yakk - Adam Nevraumont Dec 05 '17 at 18:30
  • I was a bit skeptical of your example because you do not initialize `x, y, z` and the compiler could be smart and not actually read them, but it seems that the switch is optimized away even in this contrived example I made https://godbolt.org/g/ezbnDs – patatahooligan Dec 05 '17 at 18:32
  • @patatahooligan Examine https://godbolt.org/g/vkdxJ7 -- once we stop hardcoding the index, the switch comes back. – Yakk - Adam Nevraumont Dec 05 '17 at 18:40
1

However, I much prefer the x, y, z interface for my equations. Now that the problem is clear, my question is how can I keep the x, y, z

Declare x,y,z as local references before calculation:

auto& [x1, y1, z1] = v1.values;
auto& [x2, y2, z2] = v2.values;
return x1*x2 + y1*y2 + z1*z2;

For pre-C++17, you need more verbose:

auto& x = values[0];
auto& y = values[1];
auto& z = values[2];

The compiler will not need to use any storage for these references.

This of course introduces some repetition; One line (in C++17) per vector per function.

Extra parentheses introduced by your first suggestion is another good way to go. Whether the introduction of parentheses is better or worse than local reference declaration boiler plate depends on the use case and personal preference.


Edit: Another alternative: Define operator[] and use named constants for indices.

namespace axes {
    enum axes {
        x, y, z
    };
}

struct Vec3f {
    float values[3];
    float& operator[](size_t index)       { return values[index]; }
    float  operator[](size_t index) const { return values[index]; }
};

// usage
using namespace axes;
return v1[x]*v2[x] + v1[y]*v2[y] + v1[z]*v2[z];
eerorika
  • 232,697
  • 12
  • 197
  • 326
  • This is a good idea but not the most practical in my case unfortunately. There are multiple vectors in the equations and even in C++17 it would require me to devote multiple lines to unpacking `values` for each vector. – patatahooligan Dec 05 '17 at 17:34
  • @patatahooligan ah yes, multiple vectors exacerbate the repetition. – eerorika Dec 05 '17 at 17:40
1

There is known technique for mixing up access via x, y, z and array indices using union of identical data types. Resolves problem with UB, sizeof() is 12 bytes, access time is as fast as it can be, one could use SIMD vector in very similar fashion. Code below tested with VS2017

#include <iostream>
#include <type_traits>

template <int Size> struct VectorBase {
    float _data[Size];

    float operator[](int Index) {
        return _data[Index];
    }
};

template <typename VectorType, int Index> struct ScalarAccessor {
    VectorType _v;

    operator float() const {
        return _v._data[Index];
    }

    float operator = (float x) {
        _v._data[Index] = x;
        return *this;
    }
};

union uuu {
        VectorBase<3>                    xyz;
        ScalarAccessor<VectorBase<3>, 0> x;
        ScalarAccessor<VectorBase<3>, 1> y;
        ScalarAccessor<VectorBase<3>, 2> z;
};

template <int Size> struct Vector {
    union
    {
        VectorBase<3>                    xyz;
        ScalarAccessor<VectorBase<3>, 0> x;
        ScalarAccessor<VectorBase<3>, 1> y;
        ScalarAccessor<VectorBase<3>, 2> z;
    };

    float operator[](int Index) {
        return xyz[Index];
    }
};

using Vec3f = Vector<3>;

int main() {
    Vec3f a;

    a.x = 1.0f;
    a.y = a.x + 3.0f;
    a.z = a.x * 3.0f;

    std::cout << sizeof(a) << "\n";
    std::cout << a.x << " " << a.y << " " << a.z << "\n";
    std::cout << a[0] << " " << a[1] << " " << a[2] << "\n";

    std::cout << std::is_standard_layout<VectorBase<3>>::value << "\n";
    std::cout << std::is_standard_layout<ScalarAccessor<VectorBase<3>, 0>>::value << "\n";
    std::cout << std::is_standard_layout<ScalarAccessor<VectorBase<3>, 1>>::value << "\n";
    std::cout << std::is_standard_layout<ScalarAccessor<VectorBase<3>, 2>>::value << "\n";
    std::cout << std::is_standard_layout<Vec3f>::value << "\n";
    std::cout << std::is_standard_layout<uuu>::value << "\n";

    return 0;
}

UPDATE

Here some C++ standard reading

I'm relying on the definition of standard-layout type 12.7 Classes

A class S is a standard-layout class if it: (7.1) — has no non-static data members of type non-standard-layout class (or array of such types) or reference, (7.2) — has no virtual functions (13.3) and no virtual base classes (13.1), (7.3) — has the same access control (Clause 14) for all non-static data members, (7.4) — has no non-standard-layout base classes, (7.5) — has at most one base class subobject of any given type ...

It is easy to check if all proposed classes are standard-layout - I've changed the code to check for that.

They are all layout-compatible, I believe

Also If a standard-layout class object has any non-static data members, its address is the same as the address of its first non-static data member.

Union is a standard-layout class as well, so we have classes aligned in union with only data member being array of the same type and size, and looks like standard requires it to be byte-by-byte compatible

Severin Pappadeux
  • 18,636
  • 3
  • 38
  • 64
  • You just added bytes to a 3 tuple of doubles for syntax. Expensive. Maybe doing crtp and ebo you can eliminate that, but the code she be scary. – Yakk - Adam Nevraumont Dec 05 '17 at 18:14
  • @Yakk Really? Where exactly bytes are added? VS2017 returns 12 bytes from sizeof(), `x`, `y` and `z` are always using fixed indices and optimized out (I checked output), so could you please point out what is added? – Severin Pappadeux Dec 05 '17 at 18:17
  • 1
    @SeverinPappadeux Oh it is a union. Then it is just UB isn't it? As accessing `.x` when it isn't the active member isn't permitted (even though the member doesn't use `this`, you have to have a valid `this` to call a member) – Yakk - Adam Nevraumont Dec 05 '17 at 18:21
  • 1
    You say that you resolve UB by using identical types but the types are different classes. Could you elaborate on how that works? – patatahooligan Dec 05 '17 at 18:23
  • @Yakk Wrt UB - it shouldn't be, this is union of the same data type. AFAIK, UB is only if you access in union different data types - say, int and float in the same union – Severin Pappadeux Dec 05 '17 at 18:23
  • 1
    ` a.x = 1.0f` is access of the union value `x` while `x` has not been initialized. Even if legal, It proceeds to make `xyz` the active member. `x` and `xyz` cannot both be active members, and your code requires that which one is active constantly changes, while the inactive member's state remains intact. I get why it works *in practice*, I'm saying that C++ unions have stricter requirements than their typical runtime implementation requires. I mean, a naive union of an `x`/`y`/`z` struct and an array works in most cases at runtime, but is ALSO UB. – Yakk - Adam Nevraumont Dec 05 '17 at 18:32
  • @Yakk What it boils down to is `struct A {float x;};` and `struct B {float y;};` - is union of A and B cause UB if you set x and read y. I grant you if `y` is of type `int` for example it is indeed UB. But when they are of the same type and of standard-layout classes (standard section 9), it is not UB – Severin Pappadeux Dec 05 '17 at 18:39
  • @SeverinPappadeux `struct A{float x; operator float()const{return x;}}` you mean. My problem is that you invoke a method on `A` and `B`, and you cannot invoke a method on anything but an object or it is UB. And how can you justify more than one of the objects in a union existing? I believe the active union member exists, and the others don't, but I could be wrong. The common prefix rules can let you access a common prefix through very very careful means (implementations permit more than the standard does), but I don't think it permits member access here. – Yakk - Adam Nevraumont Dec 05 '17 at 18:41
  • 1
    I think what you're doing here is more akin to `struct{float}` and `struct{struct{float}}` isn't it? Does it still count as layout-compatible if one of them is nested? cppreference.com states that you can read (not write) a member of a non-active struct inside the union if that member is part of the common sequence. It does not specify whether calling a member function is well-defined. So, this might not work, but the idea of an intermediate accessor class that implicitly casts to float might be exactly what I need. – patatahooligan Dec 05 '17 at 19:06
  • @patatahooligan `Does it still count as layout-compatible if one of them is nested?` Yes, I believe so. Please check update, the only data member they have is float[3]. There is still valid question wrt active-nonactive members of the union, but frankly we (me and Yakk) might be both right as it is. It might require filling Defect Report wrt Standard - it is a wild beast, and there are hundreds DR related to wording, clarifications, cross-referencing etc – Severin Pappadeux Dec 05 '17 at 22:14
  • @Yakk added some thoughts why I think it is requires to be layout compatible - please check update – Severin Pappadeux Dec 05 '17 at 22:17
  • I believe unions allow you to access `struct foo` when what is there is `struct bar` and `foo` is the first element of `bar`. But instead we have a `foo` here, and you want to be allowed to pretend there is a `bar` basically. Which I believe is banned not due to address rules (which I agree works) bit aliasing rules and the illegality of accessing a bar that does not exist. – Yakk - Adam Nevraumont Dec 05 '17 at 23:01
  • @Yakk well, access code is dealing with arrays only. There is no inheritance. And those fixed length arrays always (as far as I can see rules for standard-layout types) are at the same address and sharing the same layout in memory. Not sure what is aliased to what - there is no type cast of any kind. The only valid point, IMHO, is wording wrt active/non-active members – Severin Pappadeux Dec 06 '17 at 01:17
  • @SeverinPappadeux I'm not convinced that `struct{float}` is layout compatible with `float` according to the standard. See [layout-compatible](https://timsong-cpp.github.io/cppwp/n4659/basic.types#def:layout-compatible). That's a requirement for `struct{float}` and `struct{struct{float}}` being compatible, which as patatahooligan points out is analogous to your structures. – eerorika Dec 06 '17 at 02:10
  • @user2079303 well, from standard `Each non-static data member is allocated as if it were the sole member of a union.` So there are two non-static members - struct and float. Because union is standard-layout type (SLT), address of union is the same as the address of the first non-static member, which altogether means address of the union is the same as the address of the float and address of the struct. Apply same logic for struct - due to SLT address of the struct is the same as the address of the float inside it, which makes it the same as the address of the union and another float. QED – Severin Pappadeux Dec 06 '17 at 02:33