C++ Small Object Performance

Question

In my program code there are various fairly small objects, ranging from a byte or two up to about 16 bytes. E.g. Vector2 (2 * T), Vector3 (3 * T), Vector4 (4 * T), ColourI32 (4), LightValue16 (2), Tile (2), etc. (byte size in brackets).

I was doing some (sample-based) profiling, which led me to some slower-than-expected functions, e.g.

//4 bits per channel natural light and artificial RGB
class LightValue16
{
...
    explicit LightValue16(uint16_t value);
    LightValue16(const LightValueF &);
    LightValue16(int r, int g, int b, int natural);

    int natural()const;
    void natural(int v);
    int artificialRed()const;
    ...
    uint16_t data;
};
...
LightValue16 World::getLight(const Vector3I &pos)
{ ... }

This function does some maths to look up the value via a couple of arrays, with some default values for above the populated part of the world. The contents are inlined nicely, and the disassembly looks about as good as it can get, with about 100 instructions. However, one thing stood out: at all the return sites it was implemented with something like:

mov eax, dword ptr [ebp + 8]
mov cx, word ptr [ecx + edx * 2] ; or say mov ecx, Fh
mov word ptr [eax], cx
pop ebp
ret 10h

For x64 I saw pretty much the same thing. I didn't check my GCC build, but I suspect it does pretty much the same thing.

I did a little experimenting and found that using a uint16_t return type actually resulted in the World::getLight function getting inlined (it looked like pretty much the same core 80 instructions or so, no cheats with conditionals/loops being different), and the total CPU usage for the outer function I was investigating went from 16.87% to 14.04%. While I can do that on a case-by-case basis (along with trying the force-inline stuff, I suppose), are there any practical ways to avoid such performance issues to start with? Perhaps even get a couple of % faster across the entire code?

The best I can think of just now is to just use the primitive types in such cases ( < 4 or perhaps 8 byte objects) and move all the current member stuff into non member functions, so more like as done in C, just with namespaces.
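A minimal sketch of that C-style alternative, for illustration only. The namespace name `light16`, the function names `make` and `withNatural`, and the bit layout (natural light in the top four bits, then red, green, blue) are all assumptions; the real layout of LightValue16 is not shown above.

```cpp
#include <cstdint>

namespace light16 {

// The value is a plain integer, so any ABI can pass and return it in a register.
typedef std::uint16_t Value;

// Assumed (hypothetical) layout: bits 12-15 natural, 8-11 red, 4-7 green, 0-3 blue.
inline Value make(int r, int g, int b, int natural) {
    return static_cast<Value>(((natural & 0xF) << 12) |
                              ((r & 0xF) << 8) |
                              ((g & 0xF) << 4) |
                              (b & 0xF));
}

// Read the natural-light channel.
inline int natural(Value v) { return (v >> 12) & 0xF; }

// Read the artificial red channel.
inline int artificialRed(Value v) { return (v >> 8) & 0xF; }

// Return a copy of v with the natural-light channel replaced by n.
inline Value withNatural(Value v, int n) {
    return static_cast<Value>((v & 0x0FFF) | ((n & 0xF) << 12));
}

} // namespace light16
```

The obvious cost is type safety: a `light16::Value` converts silently to and from any other integer, which is exactly what the wrapper class was preventing.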

Thinking about this I guess there is also often a cost to stuff like "t foo(const Vector3F &p)" over "t foo(float x, float y, float z)"? And if so, over a program extensively using the const&, could it add up to a significant difference?

well, the difference in your stated case is that you are returning an object with all the associated overhead vs you returning a 16 bit unsigned int. Given that for the former, you have to copy the whole object rather than just the int, I would expect that to consume a little more CPU time even when RVO comes into play. — Timo Geusch, Jul 06 '13 at 15:52
Can allocation of object in stack rather than heap, affect performance in this situation? — huseyin tugrul buyukisik, Jul 06 '13 at 15:54
Timo: why would the full object take more than 2 bytes of memory? The compiler should not put a vtable in there, I would think. — David Grayson, Jul 06 '13 at 16:00
@huseyin LightValue16 was only handled by value in the case there, there is no delete/new in the code I was optimizing, just read/write of existing stuff (World and its components) and stack temporaries. — Fire Lancer, Jul 06 '13 at 16:03
No vtable, its just 2 bytes. Within getLight itself where it did inline to start with LightValue16 basically just lives in one register or another. Its this parameter/return thing, and I really don't want to have to use those types like a C-API with at most a typedef uint16_t LightValue16 — Fire Lancer, Jul 06 '13 at 16:07
The Constructors and Operators for the LightValue16 class could make all the difference, depends on exactly how they are defined and if they are inlined/inlinable. — SoapBox, Jul 06 '13 at 17:04
The assembly showed they were inlined. Added the constructors to the code above. Indeed, in a test I just did, it seems that once a class/struct has a constructor it will not be passed or returned in a register, regardless of size? Why is that? It seems like a massive performance hit for C++, since small wrapper objects are not uncommon (e.g. std::unique_ptr from the standard; I have also seen fixed-point wrappers, and things like mine here where conceptually separate values are packed into a primitive) and are used throughout a code base? — Fire Lancer, Jul 06 '13 at 17:40
A non-POD type can never live in a register (where `this` would point, should you call a member function on it?) So ABI specifies passing non-PODs on the stack only. That's an unfortunate circumstance but we seem to be stuck with it at the moment. — n. m. could be an AI, Jul 06 '13 at 19:27
@n.m. unless you call a non local method, what does that matter? — Yakk - Adam Nevraumont, Jul 07 '13 at 00:24
@Yakk the compiler does not know which methods you call on a LightValue given a signature of getLight. — n. m. could be an AI, Jul 07 '13 at 03:25
@n.m. it does if it has the body and use in view. And if it knows that it can later put it anywhere from a register (requiring construction/generation/etc visible), it could store it in a register until it needs a `this`, and only then give it stack memory. All legal via `as-if`. — Yakk - Adam Nevraumont, Jul 07 '13 at 03:39
@Yakk: you can propose an ABI change for your favourite compiler if you can prove it's safe and beneficial. I can only see it's safe when copy/move constructors are trivial. NOT "in view" because "in view" is different for different TUs. Perhaps if the ctors (or even all member fns) for all recursive bases/members are "in class" one can come up with the safety conditions, but it's not trivial. — n. m. could be an AI, Jul 07 '13 at 05:15
@n.m. What ABI are you talking about that needs changing? I'm saying that the compiler can completely eliminate the use of the class, and operate on equivalent state, if the use of the class is completely in view by the `as-if` rule. And if the use of the class is only initially in-view, it can eliminate the use of the class, and when its use leaves view, it can generate data that would exist `as-if` it had always existed by moving its state-equivalence from a register to the stack. Yes, this would require the compiler realizing that the class was a thin wrapper around a 16 bit integer. — Yakk - Adam Nevraumont, Jul 07 '13 at 13:12
@Yakk: an implementation that compiles C++ to machine code needs to set up rules so that TUs compiled separately could work together. Callers and callees must follow the same calling conventions. ABI is a set of rules that govern calling conventions and other things. "The first integral parameter is passed in EAX" could be one such rule for x86. "The first parameter no larger than 32 bits is passed in EAX" could be another. Except the first one is OK and the second one cannot work in a standard-conforming implementation. "First small *POD* parameter is passed in EAX" is OK though. — n. m. could be an AI, Jul 07 '13 at 14:14
Look, we could say the exact same thing about an `int` -- what if someone called a function that took a pointer to it? If the `int` was in a register, where would that pointer point? Calling conventions *only apply when you cross translation units*, and your compiler lacks cross-translation unit optimization capabilities. If everything is visible within the current translation unit, none of that matters one bit, and is completely and utterly irrelevant. So, given that I have been talking about cases where "everything is visible", why do you keep on going on about ABIs? — Yakk - Adam Nevraumont, Jul 07 '13 at 14:27
*what if someone called a function that took a pointer to it* -- it's not just any function, it's a *copy ctor*. For any function you can just move your int (or a POD) from the register to the memory. You can't just move a non-POD, you need to call a copy ctor, and you need a source and a target that both have addresses for that. *If everything is visible within the current translation unit* -- you can break ABI for functions visible *only* within the current TU (i.e. static file-scope). Other functions must conform because you don't know what other TUs are there. — n. m. could be an AI, Jul 07 '13 at 14:42
What about the return value optimization? Are you sure it is being used? Maybe the compiler doesn't want to skip your explicitly defined copy constructor? Try adding some observable behaviour in the copy-ctor and check it. http://en.wikipedia.org/wiki/Return_value_optimization — lkanab, Jul 30 '13 at 14:30

score 2 · Answer 1 · edited May 23 '17 at 12:13

Take a look at the Itanium C++ ABI. While your computer definitely has no Itanium processor, gcc models its x86 and x86-64 C++ ABI closely on the Itanium ABI. The linked section states that

However, if the return value type has a non-trivial copy constructor or destructor, [return into caller-provided memory happens]

To find out what "non-trivial copy constructor or destructor" means, take a look at What are Aggregates and PODs and how/why are they special?, and peek at the rules for a class to be "trivially copyable". In your case, the problem is the copy constructor you defined. It should not be needed at all: the compiler will synthesize a copy constructor that just copies the data member as needed. If you want to state explicitly that you want a copy constructor, and you are using C++11, you can also write it as a defaulted function, which does not make it non-trivial:

LightValue16(const LightValue16 & other) = default;
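The difference can be checked at compile time. The following is a small sketch, assuming a C++11 compiler whose standard library provides `std::is_trivially_copyable`; the two struct names are hypothetical stand-ins for LightValue16, not the asker's actual class.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in: copy constructor explicitly defaulted, so it stays trivial.
struct DefaultedCopy {
    explicit DefaultedCopy(std::uint16_t v) : data(v) {}
    DefaultedCopy(const DefaultedCopy &) = default;
    std::uint16_t data;
};

// Hypothetical stand-in: same behaviour, but the copy constructor is
// user-provided, which makes it non-trivial.
struct UserProvidedCopy {
    explicit UserProvidedCopy(std::uint16_t v) : data(v) {}
    UserProvidedCopy(const UserProvidedCopy &other) : data(other.data) {}
    std::uint16_t data;
};

static_assert(std::is_trivially_copyable<DefaultedCopy>::value,
              "a defaulted copy constructor keeps the type trivially copyable");
static_assert(!std::is_trivially_copyable<UserProvidedCopy>::value,
              "a user-provided copy constructor makes the type non-trivial");
```

Note that the trait only checks the language-level property. Whether such a type is then actually returned in a register depends on the platform ABI, and the asker's test in the comments suggests that at least one compiler is stricter than the Itanium rule quoted above.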

score 0 · Answer 2 · answered Aug 16 '13 at 18:12

In the comments to this question there has already been a lot of discussion about whether the compiler is allowed to handle class LightValue16 as a simple uint16_t for the function you analyzed.

If your class contains no special magic (like virtual functions) and the whole class is visible to the analyzed function, the compiler can produce code which is 100% as efficient as just using a uint16_t type.

The problem is "can". Although all decent compilers will usually generate code which is 100% as fast as the plain-type version, there will sporadically be situations where some optimization is not applied, or at least the resulting code is different. It might just be that a parameter of a heuristic changes (e.g. inlining is not applied because, due to the class, just a little bit more code remains in some optimization step), or some optimization pass really requires a plain numeric type at this stage, which is not even a real bug in the compiler. For example, if you add a "template < bool NotUsed>" to your class above, this will probably change optimization steps within a compiler, although semantically your program does not change.

So, if you want to be 100% sure, use only ints or doubles directly. But 90% of the time it will be 100% as fast, and only 10% of the time will it reach just 90% of the performance, which should be O.K. for 99% (but not 100%) of all use cases.
