In my program code there are various fairly small objects, ranging from a byte or two up to about 16 bytes, e.g. Vector2 (2 * T), Vector3 (3 * T), Vector4 (4 * T), ColourI32 (4), LightValue16 (2), Tile (2), etc. (size in bytes in brackets).
I was doing some sample-based profiling, which led me to some slower-than-expected functions, e.g.
//4 bits per channel natural light and artificial RGB
class LightValue16
{
...
explicit LightValue16(uint16_t value);
LightValue16(const LightValueF &);
LightValue16(int r, int g, int b, int natural);
int natural()const;
void natural(int v);
int artificialRed()const;
...
uint16_t data;
};
...
LightValue16 World::getLight(const Vector3I &pos)
{ ... }
This function does some maths to look up the value via a couple of arrays, with some default values for positions above the populated part of the world. The contents are inlined nicely, and the disassembly looks about as good as it can get, with about 100 instructions. However, one thing stood out: every return site was implemented with something like:
mov eax, dword ptr [ebp + 8]
mov cx, word ptr [ecx + edx * 2] ; or say mov ecx, Fh
mov word ptr [eax], cx
pop ebp
ret 10h
For x64 I saw pretty much the same thing. I didn't check my GCC build, but I suspect it does pretty much the same thing.
I did a little experimenting and found that switching to a uint16_t return type actually resulted in the World::getLight function getting inlined (it looked like pretty much the same core 80 instructions or so, with no cheats such as different conditionals/loops), and the total CPU usage of the outer function I was investigating went from 16.87% to 14.04%. While I can do that on a case-by-case basis (along with trying the force-inline stuff, I suppose), are there any practical ways to avoid such performance issues to start with? Perhaps even get a couple of % faster across the entire code?
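A minimal sketch of the two return-type variants being compared. Everything here is a simplified assumption for illustration: PackedLight16, kLightTable and both function names are hypothetical stand-ins, and the bit layout (natural light in the low 4 bits) is a guess, not the real LightValue16 or World::getLight.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, cut-down stand-in for LightValue16.
// Assumed layout: natural light in the low 4 bits.
class PackedLight16
{
public:
    explicit PackedLight16(std::uint16_t value) : data(value) {}
    int natural() const { return data & 0xF; }
    std::uint16_t data;
};

// Hypothetical table standing in for the world arrays.
static const std::uint16_t kLightTable[4] = { 0x000F, 0x1234, 0xFFFF, 0x00F0 };

// Variant 1: returns the wrapper class by value. Depending on the calling
// convention, a class with user-declared constructors may be returned through
// a hidden pointer argument instead of a register.
PackedLight16 getLightWrapped(int index)
{
    if (index < 0 || index >= 4)
        return PackedLight16(0x000F); // default outside the populated range
    return PackedLight16(kLightTable[index]);
}

// Variant 2: returns the raw 16-bit value, which is returned in a register.
std::uint16_t getLightRaw(int index)
{
    if (index < 0 || index >= 4)
        return 0x000F; // same default as above
    return kLightTable[index];
}
```

Both variants compute the same bits; the only difference is the declared return type, so any difference in the generated code comes from how the compiler and ABI treat that type.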
The best I can think of just now is to use the primitive types directly in such cases (objects smaller than 4, or perhaps 8, bytes) and move all the current member functions into non-member functions, more like how it is done in C, just with namespaces.
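As a rough sketch of what that C-style alternative could look like for the light value: the namespace name light16, the function names, and the bit layout (natural in bits 0-3, red 4-7, green 8-11, blue 12-15) are all assumptions made up for this example.

```cpp
#include <cstdint>

// Hypothetical namespace replacing the LightValue16 class with free functions
// over a plain 16-bit integer.
namespace light16
{
    using Value = std::uint16_t;

    // Pack four 4-bit channels (assumed layout: natural | r << 4 | g << 8 | b << 12).
    inline Value make(int r, int g, int b, int natural)
    {
        return static_cast<Value>((natural & 0xF) | ((r & 0xF) << 4) |
                                  ((g & 0xF) << 8) | ((b & 0xF) << 12));
    }

    inline int natural(Value v) { return v & 0xF; }
    inline int artificialRed(Value v) { return (v >> 4) & 0xF; }

    // Returns a copy of v with the natural channel replaced.
    inline Value withNatural(Value v, int n)
    {
        return static_cast<Value>((v & ~0xF) | (n & 0xF));
    }
}
```

The obvious downside is the loss of type safety: any uint16_t (a Tile, say) can be passed where a light value is expected, and the compiler will not complain.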
Thinking about this, I guess there is also often a cost to stuff like "t foo(const Vector3F &p)" over "t foo(float x, float y, float z)"? And if so, in a program that uses const& extensively, could it add up to a significant difference?
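To make the comparison concrete, here is a small sketch of the signatures in question. Vec3F and the three function names are hypothetical stand-ins, not the real Vector3F API.

```cpp
// Hypothetical stand-in for Vector3F.
struct Vec3F
{
    float x, y, z;
};

// By const reference: if the call is not inlined, the caller passes an
// address, so the vector has to live in memory at the call site.
float lengthSquaredRef(const Vec3F &p)
{
    return p.x * p.x + p.y * p.y + p.z * p.z;
}

// By value: how the 12-byte struct is passed (registers or stack) depends on
// the calling convention.
float lengthSquaredValue(Vec3F p)
{
    return p.x * p.x + p.y * p.y + p.z * p.z;
}

// As separate scalars: each float is an ordinary floating-point argument.
float lengthSquaredScalars(float x, float y, float z)
{
    return x * x + y * y + z * z;
}
```

When such a call is inlined, I would expect all three to compile to the same arithmetic, so any cost should only show up at call sites that are not inlined.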