The way you have it written, I'd expect B::operator* to run slightly slower. This is because the "under the hood" implementation of A::operator* is something like:
// Pseudocode: you can't actually spell out the `this` parameter in C++.
inline A A::operator*(A* this, A b)
{
    A out;
    out.x = this->x * b.x;
    out.y = this->y * b.y;
    return out;
}
So A passes a pointer to its left-hand-side argument into the function, while B has to make a copy of that argument before calling the function. Both have to make copies of their right-hand-side arguments.
Your code would be much better off, and would probably compile to the same thing for A and B, if you wrote it using references and made it const correct:
struct A
{
    float x, y;
    inline A operator*(const A& b) const
    {
        A out;
        out.x = x * b.x;
        out.y = y * b.y;
        return out;
    }
};

struct B
{
    float x, y;
};

inline B operator*(const B& a, const B& b)
{
    B out;
    out.x = a.x * b.x;
    out.y = a.y * b.y;
    return out;
}
You still want to return objects, not references, since the results are effectively temporaries (you're not returning a modified existing object).
Addendum
However, with const pass-by-reference for both arguments in B, would that make it effectively faster than A, due to the dereferencing?
First off, both involve the same dereferencing when you spell out all the code. (Remember, accessing members through this implies a pointer dereference.)
But even then, it depends on how smart your compiler is. In this case, let's say it looks at your structure and decides it can't stuff it in a register because it's two floats, so it will use pointers to access them. So the dereferenced pointer case (which is what references get implemented as) is the best you'll get. The assembly is going to look something like this (this is pseudo-assembly-code):
// Setup for the function. Usually already done by the inlining.
r1 <- this
r2 <- &result
r3 <- &b
// Actual function.
r4 <- r1[0]
r4 <- r4 * r3[0]
r2[0] <- r4
r4 <- r1[4]
r4 <- r4 * r3[4]
r2[4] <- r4
This is assuming a RISC-like architecture (say, ARM). x86 probably uses fewer steps, but it gets expanded to about this level of detail by the instruction decoder anyway. The point is that it's all fixed-offset dereferences of pointers held in registers, which is about as fast as it gets. The optimizer can try to be smarter and spread the objects across several registers, but that kind of optimizer is a lot harder to write. (Though I have a sneaking suspicion that an LLVM-type compiler/optimizer could do that optimization easily if result were merely a temporary object that is not preserved.)
So, since you're using this, you have an implicit pointer dereference. But what if the object were on the stack? That doesn't help; stack variables turn into fixed-offset dereferences of the stack pointer (or the frame pointer, if one is used). So you're dereferencing a pointer somewhere in the end, unless your compiler is bright enough to take your object and spread it across multiple registers.
Feel free to pass the -S option to gcc to get an assembly listing of the generated code and see what's really happening in your case.